Statistical Machine Learning, Lecture 4
1. What is Classification?
Classification is a type of supervised machine learning where the goal is to predict a categorical or qualitative response. Unlike regression, where you predict a continuous numerical value (like a price or temperature), classification assigns an input to a specific category or class.
Key characteristics:
- Goal: Predict the class of a subject based on input features.
- Output (Response): The output is a category, such as ‘Yes’/‘No’, ‘Spam’/‘Not Spam’, or ‘High’/‘Medium’/‘Low’.
- Applications: Common examples include email spam detectors, medical diagnosis (e.g., virus carrier vs. non-carrier), and fraud detection.

The example used in the slides is the credit card Default dataset. The goal is to predict whether a customer will default (‘Yes’ or ‘No’) on their payments based on their monthly income and account balance.
## Why Not Use Linear Regression?
At first, it might seem possible to use linear regression for classification. For a binary (two-class) problem like the default dataset, you could code the outcomes as numbers, for example:
- Default = ‘No’ => \(y = 0\)
- Default = ‘Yes’ => \(y = 1\)
You could then fit a standard linear regression model: \(Y \approx \beta_0 + \beta_1 X\). In this context, we would interpret the prediction \(\hat{y}\) as the probability of default, so we’d be modeling \(P(Y=1|X) = \beta_0 + \beta_1 X\).
However, this approach has two major problems:

1. The Output Is Not a Probability. A linear model can produce outputs that are less than 0 or greater than 1. This doesn’t make sense for a probability, which must always be between 0 and 1.
The image below is the most important one for understanding this issue. The left plot shows a linear regression line fit to the 0/1 default data. You can see the line goes below 0 and would eventually go above 1 for higher balances. The right plot shows a logistic regression curve, which always stays between 0 and 1.
- Left (Linear Regression): The straight blue line predicts probabilities < 0 for low balances.
- Right (Logistic Regression): The S-shaped blue curve correctly constrains the probability output between 0 and 1.
2. It Doesn’t Work for Multi-Class Problems. If you have more than two categories (e.g., ‘mild’, ‘moderate’, ‘severe’), you might code them as 0, 1, and 2. A linear regression model would incorrectly assume that the “distance” between ‘mild’ and ‘moderate’ is the same as the distance between ‘moderate’ and ‘severe’, which is usually not a valid assumption.
## The Solution: Logistic Regression
Instead of modeling the response \(y\) directly, logistic regression models the probability that \(y\) belongs to a particular class. To solve the issue of the output not being a probability, it uses the logistic function (also known as the sigmoid function).
This function takes any real-valued input and squeezes it into an output between 0 and 1.
The formula for the probability in a logistic regression model is: \[P(Y=1|X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\] This S-shaped function, shown in the right-hand plot above, ensures that the output is always a valid probability. We can then set a threshold (e.g., 0.5) to make the final class prediction. If \(P(Y=1|X) > 0.5\), we predict ‘Yes’; otherwise, we predict ‘No’.
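As a quick numerical illustration, here is a minimal Python sketch of this probability-plus-threshold idea; the coefficient values below are made-up placeholders, not estimates from the Default data:

```python
import numpy as np

def logistic_probability(balance, beta0=-10.0, beta1=0.005):
    """P(Y=1|X) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)); always strictly between 0 and 1."""
    z = beta0 + beta1 * balance
    return np.exp(z) / (1.0 + np.exp(z))

balances = np.array([500.0, 1000.0, 2000.0, 2500.0])
probs = logistic_probability(balances)
labels = np.where(probs > 0.5, "Yes", "No")   # 0.5 threshold for the final class label
print(np.round(probs, 3), labels)
```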
## Data Visualization & Code in Python
The slides use R to visualize the data. The boxplots are particularly important because they show which variable is a better predictor.
Balance vs. Default: The boxplots for balance show a clear difference. The median balance for those who default (‘Yes’) is much higher than for those who do not (‘No’). This suggests balance is a strong predictor.
Income vs. Default: The boxplots for income show a lot of overlap. The median incomes for both groups are very similar. This suggests income is a weak predictor.
Here’s how you could perform similar analysis and modeling in Python using seaborn and scikit-learn.

```python
import pandas as pd
```
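Only the first line of that listing is preserved above, so here is a hedged sketch of the kind of analysis described; the local `Default.csv` file and its column names (`default`, `student`, `balance`, `income`, with Yes/No strings for the first two) are assumptions, not from the slides:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed local export of the Default data.
df = pd.read_csv("Default.csv")
df["default_num"] = (df["default"] == "Yes").astype(int)

# Boxplots: balance separates defaulters from non-defaulters, income barely does.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=df, x="default", y="balance", ax=axes[0])
sns.boxplot(data=df, x="default", y="income", ax=axes[1])
plt.show()

# A simple logistic regression on balance alone.
X_train, X_test, y_train, y_test = train_test_split(
    df[["balance"]], df["default_num"], test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```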
2. The Mathematical Foundation of Logistic Regression
This set of slides explains the mathematical foundation of logistic regression, how its parameters are estimated using Maximum Likelihood Estimation (MLE), and how an iterative algorithm called Newton-Raphson is used to perform this estimation.
2.1 The Logistic Regression Model: From Probabilities to Log-Odds

The core of logistic regression is transforming a linear model into a valid probability. This is done using the logistic function, also known as the sigmoid function.

#### Key Mathematical Formulas
Probability of Class 1: The model assumes the probability of an observation \(\mathbf{x}\) belonging to class 1 is given by the sigmoid function: \[ P(y=1|\mathbf{x}) = \frac{1}{1 + \exp(-\beta^T \mathbf{x})} = \frac{\exp(\beta^T \mathbf{x})}{1 + \exp(\beta^T \mathbf{x})} \] This function always outputs a value between 0 and 1, making it perfect for modeling probabilities.
Odds: The odds are the ratio of the probability of an event happening to the probability of it not happening. \[ \text{Odds} = \frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \exp(\beta^T \mathbf{x}) \]
Log-Odds (Logit): By taking the natural logarithm of the odds, we get a linear relationship with the predictors. This is called the logit transformation. \[ \text{logit}(P(y=1|\mathbf{x})) = \log\left(\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})}\right) = \beta^T \mathbf{x} \] This final equation is the heart of the model. It states that the log-odds of the outcome are a linear function of the predictors. This provides a great interpretation: a one-unit increase in a predictor \(x_j\) changes the log-odds by \(\beta_j\).
2.2 Fitting the Model: Maximum Likelihood Estimation (MLE)

Unlike linear regression, which uses least squares to find the best-fit line, logistic regression uses Maximum Likelihood Estimation (MLE). The goal of MLE is to find the parameter values (the \(\beta\) coefficients) that maximize the probability of observing the actual data that we have.

Likelihood Function: This is the joint probability of observing all the data points in our sample. Assuming each observation is independent, it’s the product of the individual probabilities: \[ L(\beta) = \prod_{i=1}^{n} P(y_i|\mathbf{x}_i) \] A clever way to write this for a binary (0/1) outcome is: \[ L(\beta) = \prod_{i=1}^{n} \frac{\exp(y_i \beta^T \mathbf{x}_i)}{1 + \exp(\beta^T \mathbf{x}_i)} \]

Log-Likelihood Function: Products are difficult to work with mathematically, so we work with the logarithm of the likelihood, which turns the product into a sum. Maximizing the log-likelihood is the same as maximizing the likelihood. \[ \ell(\beta) = \log(L(\beta)) = \sum_{i=1}^{n} \left[ y_i \beta^T \mathbf{x}_i - \log(1 + \exp(\beta^T \mathbf{x}_i)) \right] \]

Key Takeaway: The slides correctly state that there is no explicit formula to solve for the \(\hat{\beta}\) that maximizes this function. We must find it using a numerical optimization algorithm.
2.3 The Algorithm: Newton-Raphson
The slides introduce the Newton-Raphson algorithm as the method to find the optimal \(\hat{\beta}\). It’s an efficient iterative algorithm for finding the roots of a function (i.e., where \(f(x)=0\)).
How does this apply to logistic regression? To maximize the log-likelihood function \(\ell(\beta)\), we need to find the point where its derivative (gradient) is equal to zero. So, Newton-Raphson is used to solve \(\frac{d\ell(\beta)}{d\beta} = 0\).
The General Newton-Raphson Method
The algorithm starts with an initial guess, \(x^{old}\), and iteratively refines it using the following update rule, which is based on a Taylor series approximation: \[ x^{new} = x^{old} - \frac{f(x^{old})}{f'(x^{old})} \] where \(f'(x)\) is the derivative of \(f(x)\). You repeat this step until the value of \(x\) converges.
Important Image: Newton-Raphson Example (\(x^3 - 4 = 0\))
[Image showing iterations of Newton-Raphson]
This slide is a great illustration of the algorithm’s power.

- Goal: Find \(x\) such that \(f(x) = x^3 - 4 = 0\).
- Function: \(f(x) = x^3 - 4\)
- Derivative: \(f'(x) = 3x^2\)
- Update Rule: \(x^{new} = x^{old} - \frac{(x^{old})^3 - 4}{3(x^{old})^2}\)

Starting with a guess of \(x^{old} = 2\), the algorithm converges to the true answer (\(4^{1/3} \approx 1.5874\)) in just 4 steps.
Code Understanding (Python)
The slides show Python code implementing Newton-Raphson. Let’s break down the key function.
```python
import numpy as np
```
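Only the first import of that listing survives in these notes, so here is a minimal, self-contained sketch of the Newton-Raphson iteration applied to the slide’s example \(x^3 - 4 = 0\) (the tolerance and iteration cap are arbitrary choices):

```python
import numpy as np

def newton_raphson(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Generic Newton-Raphson: iterate x <- x - f(x)/f'(x) until the step is tiny."""
    x = x0
    for i in range(max_iter):
        step = f(x) / f_prime(x)
        x_new = x - step
        if abs(x_new - x) < tol:
            return x_new, i + 1
        x = x_new
    return x, max_iter

# Solve x^3 - 4 = 0 as in the slides; the true root is 4**(1/3) ~= 1.5874.
root, n_steps = newton_raphson(lambda x: x**3 - 4, lambda x: 3 * x**2, x0=2.0)
print(root, n_steps)
```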
The slides show that with a good initial guess (`x0 = 0.5`), the algorithm converges quickly. With a bad one (`x0 = 50`), it still converges but takes many more steps. This highlights the importance of the starting point. The slides also show an implementation of Gradient Descent, another popular optimization algorithm, which uses the update rule `x_new = x - learning_rate * gradient`.
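For comparison, here is a minimal gradient descent sketch using that same update rule; as an assumed illustration it minimizes \(g(x) = (x^3 - 4)^2\), whose minimizer is the root from the slide example, and the learning rate and starting point are arbitrary choices:

```python
def gradient_descent(grad, x0, learning_rate=0.005, n_iter=500):
    """Repeatedly apply x_new = x - learning_rate * gradient."""
    x = x0
    for _ in range(n_iter):
        x = x - learning_rate * grad(x)
    return x

# g(x) = (x^3 - 4)^2 has its minimum at the root of x^3 - 4 = 0.
grad_g = lambda x: 6 * x**2 * (x**3 - 4)
print(gradient_descent(grad_g, x0=0.5))   # approaches 4**(1/3) ~= 1.5874
```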
3. The next slides provide a great case study on logistic regression, particularly on the important concept of confounding variables. Here’s a summary covering the math, code, and key insights.

Core Concept: Logistic Regression 📈
Logistic regression is a statistical method used for binary classification, which means predicting an outcome that can only be one of two things (e.g., Yes/No, True/False, 1/0).
In this example, the goal is to predict the probability that a customer will default on a loan (Yes or No) based on factors like their account balance, income, and whether they are a student.
The core of logistic regression is the sigmoid (or logistic) function, which takes any real-valued number and squishes it to a value between 0 and 1, representing a probability.
\[ \hat{P}(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + ... + \beta_p X_p)}} \]
- \(\hat{P}(Y=1|X)\) is the predicted probability of the outcome being “Yes” (e.g., default).
- \(\beta_0\) is the intercept.
- \(\beta_1, ..., \beta_p\) are the coefficients for each input variable (\(X_1, ..., X_p\)). The model’s job is to find the best values for these \(\beta\) coefficients.
3.1 How the Model “Learns” (Mathematical Foundation)
The slides show that the model’s coefficients (\(\beta\)) are found using an algorithm like Newton-Raphson. This is an iterative process to find the values that maximize the log-likelihood function. Think of this as finding the coefficient values that make the observed data most probable.

The key slide for this is the one titled “Newton-Raphson Iterative Algorithm”. It shows the formulas for:

- The Gradient (\(\nabla\ell\)): The direction of the steepest ascent of the log-likelihood function.
- The Hessian (\(H\)): The curvature of the log-likelihood function.

The updating rule is given by: \[ \beta^{new} = \beta^{old} - H^{-1}\nabla\ell \] This formula is used repeatedly until the coefficient values stop changing significantly, meaning the algorithm has converged to the best fit. This process is also referred to as Iteratively Reweighted Least Squares (IRLS).
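Here is a minimal NumPy sketch of this Newton-Raphson/IRLS update for logistic regression; the simulated data at the bottom is an assumption for illustration, not from the slides:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson (IRLS) for logistic regression.
    X: (n, p) design matrix that already includes an intercept column; y: 0/1 labels."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # current fitted probabilities
        gradient = X.T @ (y - p)                   # gradient of the log-likelihood
        W = p * (1 - p)                            # diagonal of the weight matrix
        hessian = -(X.T * W) @ X                   # Hessian of the log-likelihood
        step = np.linalg.solve(hessian, gradient)  # H^{-1} grad
        beta = beta - step                         # beta_new = beta_old - H^{-1} grad
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Tiny illustration with simulated data (assumed).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_beta = np.array([-1.0, 2.0])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)
print(fit_logistic_newton(X, y))   # should be close to [-1, 2]
```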
3.2 The Puzzle: A Tale of Two Models 🕵️♂️
The most important story in these slides is how the effect of being a student changes depending on the model. This is a classic example of a confounding variable.
Model 1: Simple Logistic Regression (Default vs. Student)
When predicting default using only student status, the model is: `default ~ student`

From the slides, the coefficients are:

- Intercept (\(\beta_0\)): -3.5041
- student[Yes] (\(\beta_1\)): 0.4049 (positive)
The equation for the log-odds is: \[ \log\left(\frac{P(\text{default})}{1-P(\text{default})}\right) = -3.5041 + 0.4049 \times (\text{is\_student}) \]
Conclusion: The positive coefficient (0.4049) suggests that students are more likely to default than non-students. The slides calculate the probabilities:

- Student Default Probability: 4.31%
- Non-Student Default Probability: 2.92%
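These percentages follow directly from plugging the quoted coefficients into the sigmoid; a quick check in Python:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Log-odds from the simple model: -3.5041 + 0.4049 * is_student
p_student = sigmoid(-3.5041 + 0.4049 * 1)       # ~0.0431 -> 4.31%
p_non_student = sigmoid(-3.5041 + 0.4049 * 0)   # ~0.0292 -> 2.92%
print(round(p_student, 4), round(p_non_student, 4))
```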
3.3 Model 2: Multiple Logistic Regression (Default vs. All Variables)

When we add balance and income to the model, it becomes: `default ~ student + balance + income`

From the slides, the new coefficients are:

- Intercept (\(\beta_0\)): -10.8690
- balance (\(\beta_1\)): 0.0057
- income (\(\beta_2\)): 0.0030
- student[Yes] (\(\beta_3\)): -0.6468 (negative)

The Shocking Twist! The coefficient for student[Yes] is now negative.
Conclusion: When we control for balance and income, students are actually less likely to default than non-students with the same balance and income.
Why the Change? The Confounding Variable Explained
The key insight, explained on the slide with multi-colored text bubbles, is that students, on average, have higher credit card balances.
- In the simple model, the `student` variable was inadvertently capturing the risk associated with having a high `balance`. The model mistakenly concluded “being a student causes default.”
- In the multiple model, the `balance` variable properly accounts for the risk from a high balance. With that effect isolated, the `student` variable can show its true, underlying relationship with default, which is negative.
This demonstrates why it’s crucial to consider multiple relevant variables to avoid drawing incorrect conclusions. The most important slides are the ones that present this paradox and its explanation.
Code Implementation: R vs. Python
The slides use R’s glm() (Generalized Linear Model) function. Here’s how you would replicate this in Python.
R Code (from slides)
```r
# Simple Model
```
Python Equivalent
We can use two popular libraries: statsmodels (which gives R-style summaries) and scikit-learn (the standard for machine learning).

```python
import pandas as pd
```
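The Python listing above is cut off after its first line, so here is a hedged sketch of the two fits using `statsmodels` formulas; the local `Default.csv` file and its column names are assumptions:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed: Default data in a local CSV with columns 'default', 'student',
# 'balance', 'income' (Yes/No strings for the first two).
df = pd.read_csv("Default.csv")
df["default_num"] = (df["default"] == "Yes").astype(int)
df["student_num"] = (df["student"] == "Yes").astype(int)

# Model 1: default ~ student (mirrors the simple model from the slides)
simple = smf.logit("default_num ~ student_num", data=df).fit()
print(simple.params)

# Model 2: default ~ student + balance + income (the multiple model)
multiple = smf.logit("default_num ~ student_num + balance + income", data=df).fit()
print(multiple.params)   # the sign of student_num typically flips here
```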
4. Making Predictions and the Decision Boundary 🎯

Once the model is trained (i.e., we have the coefficients \(\hat{\beta}\)), we can make predictions.

## Math Behind Predictions

The model outputs the log-odds, which can be converted into a probability. A key concept is the decision boundary, which is the threshold where the model is uncertain (probability = 50%).

The Estimated Odds: The core output of the linear part of the model is the exponential of the linear predictor, which gives the odds of the outcome being ‘Yes’ (or 1). \[ \frac{\hat{P}(y=1|\mathbf{x}_0)}{\hat{P}(y=0|\mathbf{x}_0)} = \exp(\hat{\beta}^\top \mathbf{x}_0) \]
The Decision Rule: We classify a new observation \(\mathbf{x}_0\) by comparing its predicted odds to a threshold \(\delta\).
- Predict \(y=1\) if \(\exp(\hat{\beta}^\top \mathbf{x}_0) > \delta\)
- Predict \(y=0\) if \(\exp(\hat{\beta}^\top \mathbf{x}_0) < \delta\) A common default is \(\delta=1\), which means we predict ‘Yes’ if the probability is greater than 0.5.
The Linear Boundary: The decision boundary itself is where the odds are exactly equal to the threshold. By taking the logarithm, we see that this boundary is a linear equation. This is why logistic regression is called a linear classifier. \[ \hat{\beta}^\top \mathbf{x} = \log(\delta) \] For \(\delta=1\), the boundary is simply \(\hat{\beta}^\top \mathbf{x} = 0\).

This concept is visualized perfectly in the slide titled “Linear Classifier,” which shows a straight line neatly separating two classes of data points.
Visualizing the Confounding Effect
The most important image in this set is Figure 4.3, as it visually explains the confounding puzzle from the first set of slides.
- Right Panel (Boxplots): This shows that students (Yes) tend to have higher credit card balances than non-students (No). This is the source of the confounding.
- Left Panel (Default Rates):
  - The dashed lines show the overall default rates. The orange line (students) is higher than the blue line (non-students). This matches our simple model (`default ~ student`).
  - The solid S-shaped curves show the probability of default as a function of balance. For any given balance, the blue curve (non-students) is slightly higher than the orange curve (students). This means that at the same level of debt, students are less likely to default. This matches our multiple regression model (`default ~ student + balance + income`).

This single figure brilliantly illustrates how a variable can appear to have one effect in isolation but the opposite effect when controlling for a confounding factor.
An Important Edge Case: Perfect Separation ⚠️
What happens if the data can be perfectly separated by a straight line?

One might think this is the ideal scenario, but it causes a problem for the logistic regression algorithm. The model will try to find coefficients that make the probabilities for each class as close to 1 and 0 as possible. To do this, the magnitude of the coefficients (\(\hat{\beta}\)) must grow infinitely large.
The slide “Non-convergence for perfectly separated case” demonstrates this:
- The Code: It generates two distinct, non-overlapping clusters of data points using Python’s scikit-learn.
- Parameter Estimates Graph: It shows the `Intercept`, `Coefficient 1`, and `Coefficient 2` values increasing or decreasing without limit as the algorithm runs through more iterations. They never converge to a stable value.
- Decision Boundary Graph: The decision boundary itself might look reasonable, but the underlying coefficients are unstable.

Key Takeaway: If your logistic regression model fails to converge, the first thing you should check for is perfect separation in your training data.
Code Understanding
The slides provide useful code snippets in both R and Python.
R Code (Plotting Predictions)
This code generates the plot with the two S-shaped curves (one for students, one for non-students) showing the probability of default as balance increases.
```r
# Create a data frame for prediction with a range of balances
```

Python Code (Visualizing the Decision Boundary)

This Python code uses scikit-learn and matplotlib to create the plot showing the linear decision boundary.

```python
# Import necessary libraries
```
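Only the first comment of that listing survives here, so the following is a self-contained sketch of the same idea on simulated data (the simulated data is an assumption; the slides use the Default data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Simulated 2-D data, assumed for illustration.
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=0)
clf = LogisticRegression().fit(X, y)

# The boundary is the line where beta0 + beta1*x1 + beta2*x2 = 0 (log-odds = 0).
b0 = clf.intercept_[0]
b1, b2 = clf.coef_[0]
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = -(b0 + b1 * x1) / b2

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k")
plt.plot(x1, x2, "k--", label="decision boundary (p = 0.5)")
plt.legend()
plt.show()
```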
Other Important Remarks
The “Remarks” slide briefly mentions some key extensions:
Probit Model: An alternative to logistic regression that uses the cumulative distribution function (CDF) of the standard normal distribution instead of the sigmoid function. The results are often very similar.
Softmax Regression: An extension of logistic regression used for multi-class classification (when there are more than two possible outcomes).
5. Here is a summary of the slides on Linear Discriminant Analysis (LDA), including the key mathematical formulas, visual explanations, and how to implement it in Python.
The Main Idea: Classification Using Probabilities
Linear Discriminant Analysis (LDA) is a classification method. For a given input x, it calculates the probability that x belongs to each class and then assigns x to the class with the highest probability.
It does this using Bayes’ Theorem, which provides a formula for the posterior probability \(P(Y=k | X=x)\), or the probability that the class is \(k\) given the input \(x\): \[ p_k(x) = P(Y=k|X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)} \]
- \(p_k(x)\) is the posterior probability we want to maximize.
- \(\pi_k = P(Y=k)\) is the prior probability of class \(k\) (how common the class is overall).
- \(f_k(x) = f(x|Y=k)\) is the class-conditional probability density function of observing input \(x\) if it belongs to class \(k\).
To classify a new observation \(x\), we simply find the class \(k\) that makes \(p_k(x)\) the largest.
Key Assumptions of LDA
LDA’s power comes from a specific, simplifying assumption about the data’s distribution.
Gaussian Distribution: LDA assumes that the data within each class \(k\) follows a p-dimensional multivariate normal (or Gaussian) distribution, denoted as \(X|Y=k \sim \mathcal{N}(\mu_k, \Sigma)\).
Common Covariance: A crucial assumption is that all classes share the same covariance matrix \(\Sigma\). This means that while the classes may have different centers (means, \(\mu_k\)), their shape and orientation (covariance, \(\Sigma\)) are identical.
The probability density function for a class \(k\) is: \[ f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp \left( -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right) \]
The image above (from your slide “Knowing normal distribution”) illustrates this. The two “bells” have different centers (different \(\mu_k\)) but similar shapes. The one on the right is “tilted,” indicating correlation between variables, which is captured in the shared covariance matrix \(\Sigma\).

The Math Behind LDA: The Discriminant Function

Since we only need to find the class \(k\) that maximizes the posterior probability \(p_k(x)\), we can simplify the math. The denominator in Bayes’ theorem is the same for all classes, so we only need to maximize the numerator: \(\pi_k f_k(x)\). Taking the logarithm (which doesn’t change which class is maximal) and removing constant terms gives us the linear discriminant function, \(\delta_k(x)\):
\[ \delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k) \]
This function is linear in \(x\), which is why the method is called Linear Discriminant Analysis. The decision boundary between any two classes, say class \(k\) and class \(l\), is the set of points where \(\delta_k(x) = \delta_l(x)\), which defines a linear hyperplane.

The image above (from your “Graph of LDA” slide) is very important.

- Left: The ellipses show the true 95% probability contours for three Gaussian classes. The dashed lines are the ideal Bayes decision boundaries, which are perfectly linear because the assumption of common covariance holds.
- Right: This shows a sample of data points drawn from those distributions. The solid lines are the LDA decision boundaries calculated from the sample. They are a very good estimate of the ideal boundaries.

Practical Implementation: Estimating the Parameters

In a real-world scenario, we don’t know the true parameters (\(\mu_k\), \(\Sigma\), \(\pi_k\)). Instead, we estimate them from our training data (\(n\) total samples, with \(n_k\) samples in class \(k\)).
- Prior Probability (\(\hat{\pi}_k\)): The proportion of training samples in class \(k\). \[\hat{\pi}_k = \frac{n_k}{n}\]
- Class Mean (\(\hat{\mu}_k\)): The average of the training samples in class \(k\). \[\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i=k} x_i\]
- Common Covariance (\(\hat{\Sigma}\)): A weighted average of the sample covariance matrices for each class. This is often called the “pooled” covariance. \[\hat{\Sigma} = \frac{1}{n-K} \sum_{k=1}^{K} \sum_{i: y_i=k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T\]
We then plug these estimates into the discriminant function to get \(\hat{\delta}_k(x)\) and classify a new observation \(x\) to the class with the largest score.
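To make these estimators concrete, here is a minimal NumPy sketch that computes \(\hat{\pi}_k\), \(\hat{\mu}_k\), and the pooled \(\hat{\Sigma}\), and then applies the discriminant rule; the simulated two-class data is an assumption for illustration:

```python
import numpy as np

def lda_fit(X, y):
    """Estimate LDA parameters: priors pi_k, class means mu_k,
    and the pooled covariance Sigma (with the n - K denominator)."""
    classes = np.unique(y)
    n, p = X.shape
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigma = np.zeros((p, p))
    for k, mu in zip(classes, means):
        diff = X[y == k] - mu
        Sigma += diff.T @ diff
    Sigma /= (n - len(classes))
    return classes, priors, means, Sigma

def lda_predict(X, classes, priors, means, Sigma):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log(pi_k)."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = np.column_stack([
        X @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
        for mu, pi in zip(means, priors)])
    return classes[np.argmax(scores, axis=1)]

# Tiny simulated example (assumed): two Gaussian classes sharing a covariance.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1, size=(100, 2)),
               rng.normal([2, 2], 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
params = lda_fit(X, y)
print("training accuracy:", np.mean(lda_predict(X, *params) == y))
```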
Evaluating Performance
After training the model, we evaluate its performance using a confusion matrix.

This matrix shows the true classes versus the predicted classes.

- Diagonal elements (9644, 81) are correct predictions.
- Off-diagonal elements (23, 252) are errors.

From this matrix, we can calculate key metrics:

- Overall Error Rate: Total incorrect predictions / Total predictions. Example: \((252 + 23) / 10000 = 2.75\%\).
- Sensitivity (True Positive Rate): Correctly predicted positives / Total actual positives. It answers: “Of all the people who actually defaulted, what fraction did we catch?” Example: \(81 / 333 = 24.3\%\); equivalently, the error rate among actual defaulters is 75.7%, so the sensitivity is \(1 - 75.7\% = 24.3\%\).
- Specificity (True Negative Rate): Correctly predicted negatives / Total actual negatives. It answers: “Of all the people who did not default, what fraction did we correctly identify?” Example: \(9644 / 9667 = 99.8\%\); equivalently, the error rate among non-defaulters is 0.24%, so the specificity is \(1 - 0.24\% \approx 99.8\%\).
The example in your slides shows a high error rate for “default” people (75.7%) because the classes are unbalanced—there are far fewer defaulters. This highlights the importance of looking at class-specific metrics, not just the overall error rate.
Python Code Understanding
In Python, you can easily implement LDA using the scikit-learn library. The code conceptually mirrors the steps we discussed.

```python
import numpy as np
```

- `LinearDiscriminantAnalysis()` creates the classifier object.
- `lda.fit(X_train, y_train)` is the core training step where the model learns the \(\hat{\pi}_k\), \(\hat{\mu}_k\), and \(\hat{\Sigma}\) parameters from the data.
- `lda.predict(X_test)` uses the learned discriminant function \(\hat{\delta}_k(x)\) to classify each sample in the test set.
- `confusion_matrix` and `classification_report` are tools to evaluate the results, just like in the slides.
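Since only the first line of the original listing is preserved above, here is a self-contained sketch that follows those same steps; the simulated data and the train/test split are assumptions for illustration:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Assumed: X holds two features (e.g. balance, income) and y the class labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)), rng.normal(1.5, 1, size=(500, 2))])
y = np.array(["No"] * 500 + ["Yes"] * 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lda = LinearDiscriminantAnalysis()   # create the classifier object
lda.fit(X_train, y_train)            # learn pi_k, mu_k, Sigma from the data
y_pred = lda.predict(X_test)         # apply the discriminant rule

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```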
6. Here is a summary of the provided slides on Linear Discriminant Analysis (LDA), focusing on mathematical concepts, Python code interpretation, and key visuals.
Core Concept: LDA for Classification
Linear Discriminant Analysis (LDA) is a classification method that models the probability that an observation belongs to a certain class. It works by finding a linear combination of features that best separates two or more classes.
The decision is based on Bayes’ theorem. For a given observation with features \(X=x\), LDA calculates the posterior probability, \(p_k(x) = Pr(Y=k|X=x)\), for each class \(k\). This is the probability that the observation belongs to class \(k\) given its features.

By default, the Bayes classifier assigns an observation to the class with the highest posterior probability. For a binary (two-class) problem like ‘Yes’ vs. ‘No’, this means:
- Assign to ‘Yes’ if \(Pr(Y=\text{Yes}|X=x) > 0.5\)
- Assign to ‘No’ otherwise
Modifying the Decision Threshold
The default 0.5 threshold isn’t always optimal. In many real-world scenarios, the cost of one type of error is much higher than another. For example, in credit card default prediction:
- False Negative: Incorrectly classifying a person who will default as someone who won’t. (The bank loses money).
- False Positive: Incorrectly classifying a person who won’t default as someone who will. (The bank loses a potential customer).
A bank might decide that missing a defaulter is much worse than denying a good customer. To catch more potential defaulters, they can lower the probability threshold.

A modified rule could be: \[ Pr(\text{default}=\text{Yes}|X=x) > 0.2 \] This makes the model more “sensitive” to flagging potential defaulters, even at the cost of misclassifying more non-defaulters.

This decision leads to a trade-off between two key performance metrics:

- Sensitivity (True Positive Rate): The ability to correctly identify positive cases (e.g., correctly identified defaulters / total actual defaulters).
- Specificity (True Negative Rate): The ability to correctly identify negative cases (e.g., correctly identified non-defaulters / total actual non-defaulters).
Lowering the threshold increases sensitivity but decreases specificity.

## Python Code Explained

The slides show how to implement and adjust LDA using Python’s scikit-learn library.
Basic LDA Implementation
```python
# Import the necessary library
```
This code trains an LDA model and makes predictions using the standard 50% probability boundary.
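A minimal sketch in the same spirit, assuming `X_train`, `y_train`, and `X_test` already exist from an earlier train/test split (those names are assumptions):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumed: X_train, y_train, X_test are defined by an earlier train/test split.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
y_pred = lda.predict(X_test)   # .predict() uses the default 0.5 probability boundary
```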
Adjusting the Prediction Threshold
To use a custom threshold (e.g., 0.2), you don’t use the `.predict()` method. Instead, you get the class probabilities with `.predict_proba()` and apply the threshold manually.

```python
# 1. Get the probabilities for each class
```
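Continuing from the `lda` model and `X_test` of the sketch above (both assumptions), a manual threshold can be applied like this:

```python
import numpy as np

# 1. Get the probabilities for each class (one column per class, ordered as lda.classes_).
probs = lda.predict_proba(X_test)

# 2. Apply a custom threshold instead of the default 0.5.
threshold = 0.2
positive_class = lda.classes_[1]   # assumed to be the "Yes"/default class
y_pred_custom = np.where(probs[:, 1] > threshold, positive_class, lda.classes_[0])

# Lowering the threshold flags more observations as positive:
# higher sensitivity, lower specificity.
print((y_pred_custom == positive_class).sum(), "positives predicted at threshold", threshold)
```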
This is the core technique for tuning the classifier’s behavior to meet specific business needs, as demonstrated on slides 55 and 56 for both LDA and Logistic Regression.
Important Images to Understand
- Confusion Matrix (Slide 49): This table is crucial. It breaks down the model’s predictions into True Positives, True Negatives, False Positives, and False Negatives. All key metrics like error rate, sensitivity, and specificity are calculated from this matrix.
- LDA Decision Boundaries (Slide 51): This plot provides a powerful visual intuition. It shows the data points for two classes and the decision boundary line. The different parallel lines show how changing the threshold from 0.5 to 0.1 or 0.9 shifts the boundary, making the model classify more or fewer points into the minority class.
- Error Rate Tradeoff Curve (Slide 53): This graph is the most important for understanding the business implication of changing the threshold. It clearly shows that as the threshold changes, the error rate for one class goes down while the error rate for the other goes up. The overall error is minimized at a certain point, but that may not be the optimal point from a business perspective.
- ROC Curve (Slides 54 & 55): The Receiver Operating Characteristic (ROC) curve plots Sensitivity vs. (1 - Specificity) for all possible thresholds. An ideal classifier has a curve that “hugs” the top-left corner, indicating high sensitivity and high specificity. It’s a standard way to visualize and compare the overall performance of different classifiers.
7. Here is a summary of the provided slides on Linear and Quadratic Discriminant Analysis, including the key formulas, Python code equivalents, and explanations of the important concepts.
Key Goal: Classification
Both Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are classification algorithms. Their main goal is to find a decision boundary to separate different classes (e.g., “default” vs. “not default”) in the data.
## Linear Discriminant Analysis (LDA)
LDA creates a linear decision boundary between classes.
Core Idea (Fisher’s Interpretation)
Imagine you have data points for different classes in a 3D space. Fisher’s idea is to find the best angle to shine a “flashlight” on the data to project its shadow onto a 2D wall (or a 1D line). The “best” projection is the one where the shadows of the different classes are as far apart from each other as possible, while the shadows within each class are as tightly packed as possible.

- Maximize: The distance between the means of the projected classes (Between-Class Variance).
- Minimize: The spread or variance within each projected class (Within-Class Variance).

This is the most important image for understanding the intuition behind LDA. It shows how projecting the data onto a specific line (defined by vector `w`) can make the two classes clearly separable.
Key Mathematical Formulas
To achieve this, LDA maximizes a ratio called the Rayleigh quotient.

- Within-Class Covariance (\(\hat{\Sigma}_W\)): Measures the spread of data inside each class. \[\hat{\Sigma}_W = \frac{1}{n-K} \sum_{k=1}^{K} \sum_{i: y_i=k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top\]
- Between-Class Covariance (\(\hat{\Sigma}_B\)): Measures the spread between the means of different classes. \[\hat{\Sigma}_B = \sum_{k=1}^{K} n_k (\hat{\mu}_k - \hat{\mu})(\hat{\mu}_k - \hat{\mu})^\top\]
- Objective Function: Find the projection vector \(w\) that maximizes the ratio of between-class variance to within-class variance (a small computational sketch follows this list). \[\max_w \frac{w^\top \hat{\Sigma}_B w}{w^\top \hat{\Sigma}_W w}\]
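As a concrete illustration of this objective, here is a minimal sketch that builds \(\hat{\Sigma}_W\) and \(\hat{\Sigma}_B\) and finds the maximizing direction \(w\) via a generalized eigenvalue problem; the simulated data is an assumption:

```python
import numpy as np
from scipy.linalg import eigh

def fisher_direction(X, y):
    """Maximize w^T S_B w / w^T S_W w by solving the generalized
    eigenvalue problem S_B w = lambda S_W w and taking the top eigenvector."""
    classes = np.unique(y)
    n, p = X.shape
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        S_W += (Xk - mu_k).T @ (Xk - mu_k)
        d = (mu_k - overall_mean).reshape(-1, 1)
        S_B += len(Xk) * (d @ d.T)
    S_W /= (n - len(classes))              # matches the (n - K) scaling above
    eigvals, eigvecs = eigh(S_B, S_W)      # generalized symmetric eigenproblem
    return eigvecs[:, -1]                  # eigenvector with the largest eigenvalue

# Small simulated example (assumed): two elongated Gaussian clouds.
rng = np.random.default_rng(2)
cov = [[3, 1], [1, 1]]
X = np.vstack([rng.multivariate_normal([0, 0], cov, 200),
               rng.multivariate_normal([3, 1], cov, 200)])
y = np.array([0] * 200 + [1] * 200)
print("projection direction w:", fisher_direction(X, y))
```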
LDA’s Main Assumption
The key assumption of LDA is that all classes share the same covariance matrix (\(\Sigma\)). They can have different means (\(\mu_k\)), but their spread and orientation must be identical. This assumption is what results in a linear decision boundary.
## Quadratic Discriminant Analysis (QDA)
QDA is a more flexible extension of LDA that creates a quadratic (curved) decision boundary.

#### Core Idea & Key Assumption

QDA starts with the same principles as LDA but drops the key assumption. QDA assumes that each class has its own unique covariance matrix (\(\Sigma_k\)).

This means each class can have its own spread, shape, and orientation. This additional flexibility allows for a more complex, curved decision boundary.
Key Mathematical Formula
The classification is made using a discriminant function, \(\delta_k(x)\). We assign a data point \(x\) to the class \(k\) for which \(\delta_k(x)\) is largest. The function for QDA is: \[\delta_k(x) = -\frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log(|\Sigma_k|) + \log \pi_k\] The term containing \(x^\top \Sigma_k^{-1} x\) makes this function a quadratic function of \(x\).
## LDA vs. QDA: The Trade-Off
The choice between LDA and QDA is a classic bias-variance trade-off.
Use LDA when:
- The assumption of a common covariance matrix is reasonable (the classes have similar shapes).
- You have a small amount of training data, as LDA is less prone to overfitting.
- Simplicity is preferred. LDA is less flexible (high bias) but has lower variance.
Use QDA when:
- The classes have clearly different shapes and spreads (different covariance matrices).
- You have a large amount of training data to properly estimate the separate covariance matrices for each class.
- QDA is more flexible (low bias) but can have high variance, meaning it might overfit on smaller datasets.
Rule of Thumb: If the class variances are equal or close, LDA is better. Otherwise, QDA is better.
## Code Understanding (Python Equivalent)
The slides show code in R. Here’s how you would perform LDA and evaluate it in Python using the popular scikit-learn library.

```python
import pandas as pd
```
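The listing above is truncated, so here is a hedged sketch of an LDA fit with an ROC curve and AUC; the local `Default.csv` file and its columns are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Assumed: Default data in a local CSV with columns 'default', 'balance', 'income'.
df = pd.read_csv("Default.csv")
X = df[["balance", "income"]].values
y = (df["default"] == "Yes").astype(int).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
scores = lda.predict_proba(X_test)[:, 1]           # estimated P(default = Yes)

fpr, tpr, thresholds = roc_curve(y_test, scores)   # sweeps over all thresholds
print("AUC:", roc_auc_score(y_test, scores))

plt.plot(fpr, tpr, label="LDA")
plt.plot([0, 1], [0, 1], "k--", label="random guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend()
plt.show()
```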
Understanding the ROC Curve
The ROC Curve is another important image. It helps you visualize a classifier’s performance across all possible classification thresholds.
- The Y-axis is the True Positive Rate (Sensitivity): “Of all the actual positives, how many did we correctly identify?”
- The X-axis is the False Positive Rate: “Of all the actual negatives, how many did we incorrectly label as positive?”
- A perfect classifier would have a curve that goes straight up to the top-left corner (100% TPR, 0% FPR). The diagonal line represents a random guess. The Area Under the Curve (AUC) summarizes the model’s performance; a value closer to 1.0 is better.
8. Here is a summary of the provided slides on Quadratic Discriminant Analysis (QDA), including the key formulas, code explanations with Python equivalents, and a guide to the most important images.
## Core Concept: QDA vs. LDA
The main difference between Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) lies in their assumptions about the data.

- LDA assumes that all classes share the same covariance matrix (\(\Sigma\)). It models each class as a normal distribution with a different mean (\(\mu_k\)) but the same shape and orientation. This results in a linear decision boundary between classes.
- QDA is more flexible. It assumes that each class \(k\) has its own, separate covariance matrix (\(\Sigma_k\)). This allows each class’s distribution to have a unique shape, size, and orientation. This flexibility results in a quadratic decision boundary (like a parabola, hyperbola, or ellipse).

Analogy 💡: Imagine you’re drawing boundaries around different clusters of stars. LDA gives you only straight lines to separate the clusters. QDA gives you curved lines (circles, ellipses), which can create a much better fit if the clusters themselves are elliptical and point in different directions.
## The Math Behind QDA
QDA classifies a new observation \(x\) to the class \(k\) that has the highest discriminant score, \(\delta_k(x)\). The formula for this score is what makes the boundary quadratic.
The discriminant function for class \(k\) is: \[\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log(|\Sigma_k|) + \log(\pi_k)\]
Let’s break it down:
- \((x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)\): This is a quadratic term (since it involves \(x^T \Sigma_k^{-1} x\)). It measures the squared Mahalanobis distance from \(x\) to the class mean \(\mu_k\), scaled by that class’s specific covariance \(\Sigma_k\).
- \(\log(|\Sigma_k|)\): A term that penalizes classes with larger variance.
- \(\log(\pi_k)\): The prior probability of class \(k\). This is our initial belief about how likely class \(k\) is, before seeing the data.

Because each class \(k\) has its own \(\Sigma_k\), the quadratic term doesn’t cancel out when comparing scores between classes, leading to a quadratic boundary.

Key Trade-off:
- If the class variances (\(\Sigma_k\)) are truly different, QDA is better.
- If the class variances are similar, LDA is often better because it’s less flexible and less likely to overfit, especially with a small number of training samples.
## Code Implementation: R and Python
The slides provide R code for fitting a QDA model and evaluating it.
Below is an explanation of the R code and its equivalent in Python using
the popular scikit-learn library.
R Code (from the slides)
The code uses the MASS library for QDA and the ROCR library for evaluation.

```r
# ######## QDA ##########
```
Python Equivalent (scikit-learn)

Here’s how you would perform the same steps in Python.

```python
import pandas as pd
```
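Only the first line of that listing survives, so here is a hedged sketch comparing LDA and QDA by test-set AUC; the local `Default.csv` file and its columns are assumptions:

```python
import pandas as pd
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Assumed: Default data in a local CSV with columns 'default', 'balance', 'income'.
df = pd.read_csv("Default.csv")
X = df[["balance", "income"]].values
y = (df["default"] == "Yes").astype(int).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("QDA", QuadraticDiscriminantAnalysis())]:
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name} AUC: {auc:.4f}")   # the slides report ~0.96 for both on this data
```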
## Model Evaluation: ROC and AUC
The slides correctly emphasize using the ROC curve and the Area Under the Curve (AUC) to compare model performance.
ROC Curve (Receiver Operating Characteristic): This plot shows how well a model can distinguish between two classes. It plots the True Positive Rate (y-axis) against the False Positive Rate (x-axis) at all possible classification thresholds. A better model has a curve that is closer to the top-left corner.
AUC (Area Under the Curve): This is a single number that summarizes the entire ROC curve.
- AUC = 1: Perfect classifier.
- AUC = 0.5: A useless classifier (equivalent to random guessing).
- AUC > 0.7: Generally considered an acceptable model.
The slides show that for the Default dataset, LDA’s AUC (0.9647) was slightly higher than QDA’s (0.9639). This suggests that the assumption of a common covariance matrix (LDA) was a slightly better fit for this particular test set, possibly because QDA’s extra flexibility wasn’t needed and it may have slightly overfit the training data.
## Key Takeaways and Important Images
Here’s a ranking of the most important visual aids in your slides:
Slide 68/69 (Model Assumption & Formula): These are the most critical slides. They present the core theoretical difference between LDA and QDA and provide the mathematical foundation (the discriminant function formula). Understanding these is key to understanding QDA.
Slide 73 (ROC Comparison): This is the most important image for practical evaluation. It visually compares the performance of LDA and QDA side-by-side, making it easy to see which one performs better on this specific dataset. The concept of AUC is introduced here as the method for comparison.
Slide 71 (Decision Boundaries with Different Thresholds): This is an excellent conceptual image. It shows how the quadratic decision boundary (the curved lines) separates the data points. It also illustrates how changing the probability threshold (from 0.1 to 0.5 to 0.9) shifts the boundary, trading off between precision and recall.
Here is a summary of the remaining slides, which compare QDA to other popular classification models like Logistic Regression and K-Nearest Neighbors (KNN).
Visualizing the Core Trade-off: LDA vs. QDA
This is the most important concept in these slides. The choice between LDA and QDA depends entirely on the underlying structure of your data.
The slide shows two scenarios:

1. Left Plot (\(\Sigma_1 = \Sigma_2\)): When the true covariance matrices of the classes are the same, the optimal decision boundary (the Bayes classifier) is a straight line. LDA, which assumes equal covariances, creates a linear boundary that approximates this optimal boundary very well. QDA’s flexible, curved boundary is unnecessarily complex and might overfit the training data. In this case, LDA is better.
2. Right Plot (\(\Sigma_1 \neq \Sigma_2\)): When the true covariance matrices are different, the optimal decision boundary is a curve. QDA’s quadratic model can capture this non-linearity much better than LDA’s rigid linear model. In this case, QDA is better.
This perfectly illustrates the bias-variance tradeoff. LDA has higher bias (it’s less flexible) but lower variance. QDA has lower bias (it’s more flexible) but higher variance.
Comparing Performance on the “Default” Dataset
The slides compare four different models on the same classification task. Let’s look at their performance using the Area Under the Curve (AUC), where a higher score is better.
- LDA AUC: 0.9647
- QDA AUC: 0.9639
- Logistic Regression AUC: 0.9645
- K-Nearest Neighbors (KNN): The plot shows test error vs. K. The error is lowest around K=4, but it’s not directly converted to an AUC score in the slides.
Interestingly, for this particular dataset, LDA, QDA, and Logistic Regression perform almost identically. This suggests that the decision boundary for this problem is likely very close to linear, meaning the extra flexibility of QDA isn’t providing much benefit.
Pros and Cons: Which Model to Choose?
The final slide asks for a comparison of the models. Here’s a summary of their key characteristics:
| Model | Type | Decision Boundary | Key Pro | Key Con |
|---|---|---|---|---|
| Logistic Regression | Parametric | Linear | Highly interpretable, no strong assumptions about data distribution. | Inflexible; cannot capture non-linear relationships. |
| Linear Discriminant Analysis (LDA) | Parametric | Linear | More stable than Logistic Regression when classes are well-separated. | Assumes data is normally distributed with equal covariance matrices for all classes. |
| Quadratic Discriminant Analysis (QDA) | Parametric | Quadratic (Curved) | More flexible than LDA; can model non-linear boundaries. | Requires more data to estimate parameters and is more prone to overfitting. Assumes normality. |
| K-Nearest Neighbors (KNN) | Non-Parametric | Highly Non-linear | Extremely flexible; makes no assumptions about the data’s distribution. | Can be slow on large datasets and suffers from the “curse of dimensionality.” Less interpretable. |
Summary of the Comparison:
- Linear Models (Logistic Regression & LDA): Choose these for simplicity, interpretability, and when you believe the relationship between predictors and the class is linear. LDA often outperforms Logistic Regression if its normality assumptions are met.
- Non-Linear Models (QDA & KNN): Choose these when the decision boundary is likely more complex. QDA is a good middle ground, offering more flexibility than LDA without being as completely data-driven as KNN. KNN is the most flexible but requires careful tuning of the parameter K to avoid overfitting or underfitting.
9. Here is a more detailed, slide-by-slide analysis of the presentation.
4.6 Four Classification Methods: Comparison by Simulation
This section (slides 81-87) introduces four classification methods and systematically compares their performance on six different simulated datasets. The goal is to see which method works best under different conditions (e.g., linear vs. non-linear boundaries, normal vs. non-normal data).
The four methods being compared are:

- Logistic Regression: A linear method that models the log-odds as a linear function of the predictors.
- Linear Discriminant Analysis (LDA): Another linear method. It also assumes a linear decision boundary but makes stronger assumptions than logistic regression (e.g., that data within each class is normally distributed with a common covariance matrix).
- Quadratic Discriminant Analysis (QDA): A non-linear method. It assumes the log-odds are a quadratic function, which creates a more flexible, curved decision boundary. It assumes data within each class is normally distributed, but without a common covariance matrix.
- K-Nearest Neighbors (KNN): A non-parametric, highly flexible method. Two versions are tested:
  - KNN-1 (\(K=1\)): A very flexible (high variance) model.
  - KNN-CV: A tuned model where the best \(K\) is chosen via cross-validation.
Analysis of Simulation Scenarios
The performance is measured by the test error rate (lower is better), shown in the boxplots for each scenario.
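Before the scenario-by-scenario results, here is a minimal sketch of how one such comparison could be run for a single simulated dataset; the scenario settings (class means, sample sizes, candidate values of K) are assumptions chosen to resemble a linear-boundary setup, not the slides’ exact design:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# One simulated dataset in the spirit of Scenario 1 (assumed settings):
# two Gaussian classes, uncorrelated predictors, a linear optimal boundary.
rng = np.random.default_rng(0)

def simulate(n_per_class):
    X = np.vstack([rng.normal([0, 0], 1, size=(n_per_class, 2)),
                   rng.normal([1, 1], 1, size=(n_per_class, 2))])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_train, y_train = simulate(50)
X_test, y_test = simulate(1000)

models = {
    "Logistic": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN-1": KNeighborsClassifier(n_neighbors=1),
    "KNN-CV": GridSearchCV(KNeighborsClassifier(),
                           {"n_neighbors": list(range(1, 21))}, cv=5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test error = {1 - model.score(X_test, y_test):.3f}")
```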
- Scenario 1 (Slide 82):
- Setup: A linear decision boundary. Data is normally distributed with uncorrelated predictors.
- Result: LDA and Logistic Regression perform best. Their test error rates are low and similar. This is expected, as the setup perfectly matches their core assumption (linear boundary). QDA is slightly worse because its extra flexibility (being quadratic) is unnecessary. KNN-1 is the worst, as its high flexibility leads to high variance (overfitting).
- Scenario 2 (Slide 83):
- Setup: Same as Scenario 1 (linear boundary, normal data), but now the two predictors have a correlation of 0.5.
- Result: Almost no change from Scenario 1. LDA and Logistic Regression are still the best. This shows that these linear methods are robust to correlation between predictors.
- Scenario 3 (Slide 84):
- Setup: A linear decision boundary, but the data is drawn from a t-distribution (which is non-normal and has “heavy tails,” or more extreme outliers).
- Result: Logistic Regression is the clear winner. LDA’s performance gets worse because its assumption of normality is violated by the t-distribution. QDA’s performance deteriorates significantly due to the non-normality. This highlights a key difference: logistic regression is more robust to violations of the normality assumption.
- Scenario 4 (Slide 85):
- Setup: A quadratic decision boundary. Data is normally distributed with different correlations in each class.
- Result: QDA is the clear winner by a large margin. This setup perfectly matches QDA’s assumption (quadratic boundary from normal data with different covariance structures). All other methods (LDA, Logistic, KNN) are linear or not flexible enough, so they perform poorly.
- Scenario 5 (Slide 86):
- Setup: Another quadratic boundary, but generated in a different way (using a logistic function of quadratic terms).
- Result: QDA performs best again, closely followed by the flexible KNN-CV. The linear methods (LDA, Logistic) have poor performance because they cannot capture the curve.
- Scenario 6 (Slide 87):
- Setup: A complex, non-linear decision boundary (more complex than a simple quadratic curve).
- Result: The flexible KNN-CV method is the winner. Its non-parametric nature allows it to approximate the complex shape. QDA is not flexible enough and performs worse. This slide highlights the bias-variance trade-off: the overly simple KNN-1 is the worst, but the tuned KNN-CV is the best.
4.7 R Example on Smarket Data
This section (slides 88-93) applies Logistic Regression and LDA to the `Smarket` dataset from the `ISLR` package to predict the stock market’s `Direction` (Up or Down).
### Data Preparation (Slides 88, 89, 90)
- Load Data: The `ISLR` library is loaded, and the `Smarket` dataset is explored. It contains daily percentage returns (`Lag1`…`Lag5` for the previous 5 days, `Today`), `Volume`, and the `Year`.
- Explore Data: A correlation matrix (`cor(Smarket[,-9])`) is computed, and a plot of `Volume` over time is generated.
- Split Data: The data is split into a training set (Years 2001-2004) and a test set (Year 2005).
  - `train <- (Year<2005)`
  - `Smarket.2005 <- Smarket[!train,]`
  - `Direction.2005 <- Direction[!train]`
  - The test set has 252 observations.
Model 1: Logistic Regression (All Predictors) (Slide 90)
- Model: A logistic regression model is fit on the training data using all predictors.
  - `glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=Smarket, family=binomial, subset=train)`
- Prediction: The model is used to predict the direction for the 2005 test data.
  - `glm.probs <- predict(glm.fit, Smarket.2005, type="response")`
  - A threshold of 0.5 is used to classify: if \(P(\text{Up}) > 0.5\), predict “Up”.
- Results:
- Test Error Rate: 0.5198 (or 48.0% accuracy).
- Conclusion: This is “not good!”; it’s worse than flipping a coin. This suggests the model is either too complex or the predictors are not useful. (A hedged Python equivalent of this fit is sketched below.)
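As flagged above, here is a hedged Python equivalent of this first fit, assuming a local CSV export of the ISLR Smarket data (columns Year, Lag1..Lag5, Volume, Today, Direction); note that scikit-learn regularizes by default, so a large C is used to approximate R’s unpenalized glm():

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumed: a local CSV export of the ISLR Smarket data with columns
# Year, Lag1..Lag5, Volume, Today, Direction.
smarket = pd.read_csv("Smarket.csv")
train = smarket["Year"] < 2005
predictors = ["Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume"]

# A large C approximates R's unpenalized glm(); sklearn regularizes by default.
clf = LogisticRegression(C=1e6).fit(smarket.loc[train, predictors],
                                    smarket.loc[train, "Direction"])
pred_2005 = clf.predict(smarket.loc[~train, predictors])
acc = accuracy_score(smarket.loc[~train, "Direction"], pred_2005)
print("2005 test accuracy:", round(acc, 4), "| test error:", round(1 - acc, 4))
```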
Model 2: Logistic Regression (Lag1 & Lag2) (Slide 91)
- Model: Based on the poor results, a simpler model is tried, using only `Lag1` and `Lag2`.
  - `glm.fit <- glm(Direction ~ Lag1 + Lag2, data=Smarket, family=binomial, subset=train)`
- Prediction: Predictions are made on the 2005 test set.
- Results:
- Test Error Rate: 0.4404 (or 55.95% accuracy). This is an improvement.
- Confusion Matrix:

| | True Down | True Up |
| :--- | :--- | :--- |
| Pred Down | 77 | 69 |
| Pred Up | 35 | 71 |
- ROC and AUC: The ROC (Receiver Operating Characteristic) curve is plotted, and the AUC (Area Under the Curve) is calculated.
- AUC Value: 0.5584. This is very close to 0.5 (which represents a random-chance model), indicating that the model has very weak predictive power, even though its accuracy is above 50%.
Model 3: LDA (Lag1 & Lag2) (Slide 92)
- Model: LDA is now performed using the same setup: `Lag1` and `Lag2` as predictors, trained on the 2001-2004 data.
  - `library(MASS)`
  - `lda.fit <- lda(Direction ~ Lag1 + Lag2, data=Smarket, subset=train)`
- Prediction: Predictions are made on the 2005 test set.
  - `lda.pred <- predict(lda.fit, Smarket.2005)`
- Results:
- Test Error Rate: 0.4404 (or 55.95% accuracy).
- Confusion Matrix:

| | True Down | True Up |
| :--- | :--- | :--- |
| Pred Down | 77 | 69 |
| Pred Up | 35 | 71 |
- Observation: The confusion matrix and accuracy are identical to the logistic regression model.
Final Comparison (Slide 93)
- ROC and AUC for LDA: The ROC curve for the LDA model is plotted.
- AUC Value: 0.5584.
- Main Conclusion: As highlighted in the green box, “LDA has identical performance as Logistic regression!” In this specific practical example, using these two predictors, both linear methods produce the exact same confusion matrix, the same accuracy (56%), and the same AUC (0.558). This reinforces the theoretical idea that both are fitting a linear boundary.
4.7 R Example on Smarket Data (Continued)
The previous slides showed that Logistic Regression and Linear Discriminant Analysis (LDA) had identical performance on the `Smarket` dataset (using `Lag1` and `Lag2`), both achieving 56% test accuracy and an AUC of 0.558. The analysis now tests a more flexible method, QDA.
Model 3: QDA (Lag1 & Lag2) (Slides 94-95)
- Model: A Quadratic Discriminant Analysis (QDA) model is fit on the same training data (2001-2004) using only the `Lag1` and `Lag2` predictors.
  - `qda.fit <- qda(Direction ~ Lag1 + Lag2, data=Smarket, subset=train)`
- Prediction: The model is used to predict the market direction for the 2005 test set.
- Results:
- Test Accuracy: The model achieves a test accuracy of 0.5992 (or 60%).
- AUC: The Area Under the Curve (AUC) for the QDA model is 0.562.
- Conclusion: As the slide highlights, “QDA has better test performance than LDA and Logistic regression!”
Smarket Example Summary
| Method | Model Type | Test Accuracy | AUC |
|---|---|---|---|
| Logistic Regression | Linear | ~56% | 0.558 |
| LDA | Linear | ~56% | 0.558 |
| QDA | Quadratic | ~60% | 0.562 |
This practical example reinforces the lessons from the simulations (Section 4.6). The two linear methods (LDA, Logistic) had identical performance. The more flexible, non-linear QDA model performed better, suggesting that the true decision boundary between “Up” and “Down” (based on `Lag1` and `Lag2`) is not perfectly linear.
4.8 Kernel LDA
This new section introduces an even more advanced non-linear method, Kernel LDA.
The Problem: Linear Inseparability (Slide 97)
The section starts with a clear visual example. A dataset of two concentric circles (a “donut” shape) is linearly inseparable. It is impossible to draw a single straight line to separate the inner (purple) class from the outer (yellow) class.
The Solution: The Kernel Trick (Slides 97, 99)
- Nonlinear Transformation: The data is “lifted” into a higher-dimensional feature space using a nonlinear transformation, \(x \mapsto \phi(x)\). In the example on the slide, the 2D data is transformed, and in this new space, the two classes become linearly separable.
- The “Kernel Trick”: The main idea (from slide 99) is that we don’t need to explicitly compute this complex transformation \(\phi(x)\). LDA (based on Fisher’s approach) only requires inner products of the data points. The “kernel trick” allows us to replace the inner product in the high-dimensional feature space (\(x_i^T x_j\)) with a simple kernel function, \(k(x_i, x_j)\), computed in the original, low-dimensional space.
  - An example of such a kernel is the Gaussian (RBF) kernel: \(k(x_i, x_j) \propto e^{-\|x_i - x_j\|^2 / \sigma^2}\) (a small sketch of computing this kernel matrix follows this list).
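As flagged above, here is a small sketch of what that kernel matrix looks like on the concentric-circles example; `make_circles` and the value of `sigma` are illustrative choices, not from the slides:

```python
import numpy as np
from sklearn.datasets import make_circles

# The "donut" example: two concentric circles, not linearly separable in 2-D.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

def rbf_kernel_matrix(X, sigma=1.0):
    """Gaussian (RBF) kernel: k(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / sigma**2)

K = rbf_kernel_matrix(X)
print(K.shape)   # (200, 200): one feature-space inner product per pair of points
```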
Academic Foundations (Slide 98)
This method is based on foundational academic papers that generalized linear methods using kernels:

- Fisher discriminant analysis with kernels (Mika, 1999)
- Generalized Discriminant Analysis Using a Kernel Approach (Baudat, 2000)
- Kernel principal component analysis (Schölkopf, 1997)
In short, Kernel LDA is an extension of LDA that uses the kernel trick to find a linear boundary in a high-dimensional feature space, which corresponds to a highly non-linear boundary in the original space.