1. Linear Model Selection and Regularization
Summary of Core Concepts
Chapter 6: Linear Model Selection and Regularization, focusing specifically on Section 6.1: Subset Selection.
The Problem: You have a dataset with many potential predictor variables (features). If you include all of them (like Model 1 with \(p\) predictors in slide ...221320.png), you risk including "noise" variables. These irrelevant features can decrease model accuracy (overfitting) and make the model difficult to interpret.
The Goal: Identify a smaller subset of variables that are truly related to the response. This creates a simpler, more interpretable, and often more accurate model (like Model 2 with \(q\) predictors).
The Main Method Discussed: Best Subset Selection
This is an exhaustive search algorithm. It checks every possible combination of predictors to find the "best" model. With \(p\) variables, this means checking \(2^p\) total models.
The algorithm (from slide ...221333.png) works in three steps:
Step 1: Fit the "null model" \(M_0\), which has no predictors (it just predicts the average of the response).
Step 2: For each \(k\) (from 1 to \(p\)): Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors (e.g., fit all models with 1 predictor, then all models with 2 predictors, etc.). From this group, select the single best model for that size \(k\). This "best" model is the one with the highest \(R^2\) (or lowest RSS, the Residual Sum of Squares) on the training data. Call this model \(M_k\).
Step 3: You now have \(p+1\) models: \(M_0, M_1, \dots, M_p\). You must select the single best one from this list. To do this, you cannot use training \(R^2\) (as it will always pick the biggest model \(M_p\)). Instead, you must use a metric that estimates test error, such as:
Cross-Validation (CV) (this is what the Python code uses)
AIC (Akaike Information Criterion)
BIC (Bayesian Information Criterion)
Adjusted \(R^2\)
Key Takeaway: The slides show this "subset selection" concept can be applied beyond linear models. The Python code demonstrates this by applying best subset selection to a K-Nearest Neighbors (KNN) Regressor, a non-linear model.
This section directly answers the questions posed on your slides.
How to compare which model
is better?
(From slides ...221320.png and
...221326.png)
You cannot use training error (like \(R^2\) or RSS) to compare models with different numbers of predictors. A model with more predictors will almost always have a better training score, even if those extra predictors are just noise. This is called overfitting.
To compare models of different sizes (like Model 1 vs. Model 2, or \(M_2\) vs. \(M_5\)), you must use a method that estimates test error (how the model performs on new, unseen data). The slides mention:
Cross-Validation (CV): This is the gold standard. You split your data into "folds," train the model on some folds, and test it on the remaining fold. You repeat this and average the test scores. The model with the best (e.g., lowest) average CV error is chosen.
AIC & BIC: These are mathematical adjustments to the training error (like RSS) that add a penalty for having more predictors. They balance model fit with model complexity.
Why use \(R^2\) in Step 2?
(From slide ...221333.png)
In Step 2, you are only comparing models of the same size (i.e., all models that have exactly \(k\) predictors). For models with the same number of predictors, a higher \(R^2\) (or lower RSS) on the training data directly corresponds to a better fit. You don't need to penalize for complexity because all models being compared have the same complexity.
Why can’t we use
training error in Step 3?
(From slide ...221333.png)
In Step 3, you are comparing models of different sizes (\(M_0\) vs. \(M_1\) vs. \(M_2\), etc.). As you add predictors, the training \(R^2\) will always go up (or stay the same), and the training RSS will always go down (or stay the same). If you used \(R^2\) to pick the best model in Step 3, you would always pick the most complex model \(M_p\), which is almost certainly overfit.
Therefore, you must use a metric that estimates test error (like CV) or penalizes for complexity (like AIC, BIC, or Adjusted \(R^2\)) to find the right balance between fit and simplicity.
Code Analysis
The Python code (slides ...221249.jpg and
...221303.jpg) implements the Best Subset
Selection algorithm using KNN Regression.
Key Functions
main():
Loads Data: Reads the Credit.csv file.
Preprocesses Data: Converts categorical features ('Gender', 'Student', 'Married', 'Ethnicity') into numerical ones (dummy variables). Creates the feature matrix X and target variable y ('Balance'). Scales the features using StandardScaler, which is crucial for KNN because KNN is sensitive to the scale of the features.
Adds Noise (in the second example): Slide ...221303.jpg shows code that adds 20 new "noisy" columns to the data. This tests whether the selection algorithm is smart enough to ignore them.
Runs Selection: Calls best_subset_selection_parallel to do the main work.
Prints Results: Finds the best subset (lowest error) and prints the top 20 best-performing subsets.
Final Evaluation: Re-trains a KNN model on only the best subset and calculates the final cross-validated RMSE.
evaluate_subset(subset, ...):
This is the "worker" function. It is called for every single possible subset.
It takes a subset (a list of feature names, e.g., ['Income', 'Limit']).
It creates a new X_subset containing only those columns.
It runs 5-fold cross-validation (cross_val_score) on a KNN model using this X_subset.
It uses 'neg_mean_squared_error' as the metric. This is negative MSE; a higher score (closer to 0) is better.
It returns the subset and its average CV score.
best_subset_selection_parallel(model, ...):
This is the "manager" function.
It iterates from k=1 up to the total number of features.
For each k, it generates all combinations of features of that size (this is the \(\binom{p}{k}\) part).
It uses Parallel and delayed (from joblib) to run evaluate_subset for all these combinations in parallel, speeding up the process significantly.
It collects all the results and returns them.
Analysis of the Output
Slide ...221255.png (Original Data):
The code runs subset selection on the original dataset.
The “Top 20 Best Feature Subsets” are shown. The CV scores are
negative (they are neg_mean_squared_error), so the scores
closest to zero (smallest magnitude) are best.
The Best feature subset is found to be
('Income', 'Limit', 'Rating', 'Student').
The final cross-validated RMSE for this model is
105.41.
Slide ...221309.png (Data with 20 Noisy Variables):
The code is re-run after adding 20 useless "Noisy" features.
The algorithm still works: it correctly identifies that the "Noisy" variables are useless.
The best feature subset is now ('Income', 'Limit', 'Student'). (Note: 'Rating' was dropped, likely because it is highly correlated with 'Limit', and the noisy data made the simpler model perform slightly better in CV.)
The final RMSE is 114.94. This is higher than the original 105.41, which is expected: the presence of so many noise variables makes the selection problem harder. But the final model is still good and, most importantly, it successfully excluded all 20 noisy features.
Conceptual Overview: The "Why"
The slides cover Chapter 6: Linear Model Selection and Regularization, which is all about a fundamental trade-off in machine learning: the bias-variance trade-off.
The Problem (Slide ...221320.png): Imagine you have a dataset with 50 predictors (\(p=50\)). You want to predict a response \(y\).
Model 1 (Full Model): You use all 50 predictors. This model is very flexible. It will fit the training data extremely well, resulting in a low bias. However, it is highly likely that many of those 50 predictors are just "noise" (random, unrelated variables). By fitting to this noise, the model will be overfit. When you show it new, unseen data (the test data), it will perform poorly. This is called high variance.
Model 2 (Subset Model): You intelligently select only the 3 predictors (\(q=3\)) that are actually related to \(y\). This model is less flexible. It won't fit the training data as perfectly as Model 1 (it has higher bias). But, because it is not fitting the noise, it will generalize much better to new data. It will have a much lower variance, and thus a lower overall test error.
The Goal: The goal is to find the model that has the lowest test error. We need a formal method to find the best subset (like Model 2) without just guessing.
Two Main Strategies (Slide ...221314.png):
Subset Selection (Section 6.1): This is what we're focused on. It is an "all-or-nothing" approach: you either keep a variable in the model or you discard it completely. The "Best Subset Selection" algorithm is the most extreme, "brute-force" way to do this.
Shrinkage/Regularization (Section 6.2): This is a more subtle approach (e.g., Ridge Regression, LASSO). Instead of discarding variables, you keep all \(p\) variables but add a penalty to the model that "shrinks" the coefficients (\(\beta\)) of the useless variables towards zero.
Questions 🎯
Q1: “How to compare
which model is better?”
(From slides ...221320.png and
...221326.png)
This is the most important question. You cannot use metrics based on training data (like \(R^2\) or RSS, the Residual Sum of Squares) to compare models with different numbers of predictors.
The Trap: A model with more predictors will always have a higher \(R^2\) (or lower RSS) on the data it was trained on. \(R^2\) will always increase as you add variables, even if they are pure noise. If you used \(R^2\) to compare a 3-predictor model to a 10-predictor model, the 10-predictor model would always look better on paper, even if it is terribly overfit.
The Correct Way: You must use a metric that estimates the test error. The slides and code show two ways (see the sketch after this list):
Cross-Validation (CV): This is the method used in your Python code. It works by:
Splitting your training data into \(k\) "folds" (e.g., 5 folds).
Training the model on 4 folds and testing it on the 5th fold.
Repeating this 5 times, so each fold gets to be the test set once.
Averaging the 5 test errors.
This gives you a robust estimate of how your model will perform on unseen data. You then choose the model with the best (lowest) average CV error.
Mathematical Adjustments (AIC, BIC, Adjusted \(R^2\)): These are formulas that take the training error (like RSS) and add a penalty for each predictor (\(k\)) you add.
\(AIC \approx RSS + 2k\sigma^2\)
\(BIC \approx RSS + \log(n)k\sigma^2\)
A model with more predictors (larger \(k\)) gets a bigger penalty. To be chosen, a more complex model must reduce the RSS by enough to overcome this penalty.
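To make the comparison concrete, here is a minimal cross-validation sketch (my own illustration, not the slide code): it scores a small model against a larger, noisier model on synthetic data, where the column names and numbers are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
# Synthetic data: only x1-x3 actually drive y; x4-x10 are pure noise.
X = pd.DataFrame(rng.normal(size=(n, 10)), columns=[f"x{i}" for i in range(1, 11)])
y = 3 * X["x1"] - 2 * X["x2"] + X["x3"] + rng.normal(size=n)

def cv_mse(cols):
    """Average 5-fold cross-validated MSE for a linear model on the given columns."""
    scores = cross_val_score(LinearRegression(), X[cols], y,
                             scoring="neg_mean_squared_error", cv=5)
    return -scores.mean()  # flip the sign so that lower is better

print("3-predictor model CV MSE :", cv_mse(["x1", "x2", "x3"]))
print("10-predictor model CV MSE:", cv_mse(list(X.columns)))
# Training R^2 would always favour the 10-predictor model; CV error typically will not.
```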
Q2: “Why using \(R^2\) for step 2?”
(From slide ...221333.png)
Step 2 of the "Best Subset Selection" algorithm says: "For \(k = 1, \dots, p\): Fit all \(\binom{p}{k}\) models… Pick the best model, that with the largest \(R^2\), … and call it \(M_k\)."
The Reason: In Step 2, you are only comparing models of the same size. For example, when \(k=3\), you are comparing all possible 3-predictor models:
Model A: (\(X_1, X_2, X_3\))
Model B: (\(X_1, X_2, X_4\))
Model C: (\(X_1, X_3, X_5\))
…and so on.
Since all these models have the exact same complexity (they all have \(k=3\) predictors), there is no risk of unfairly favoring a more complex model. Therefore, you are free to use a training metric like \(R^2\) (or RSS). The model with the highest \(R^2\) is, by definition, the one that best fits the training data for that specific size \(k\).
Q3: "Cannot use training error in Step 3." Why not?
(From slide ...221333.png)
Step 3 says: "Select a single best model from \(M_0, M_1, \dots, M_p\) by cross validation, AIC, or BIC."
The Reason: In Step 3, you are now comparing models of different sizes. You are comparing the best 1-predictor model (\(M_1\)) vs. the best 2-predictor model (\(M_2\)) vs. the best 3-predictor model (\(M_3\)), and so on, all the way up to \(M_p\).
As explained in Q1, if you used a training error metric like \(R^2\) here, the \(R^2\) would just keep going up, and you would always select the largest, most complex model, \(M_p\). This completely defeats the purpose of model selection.
Therefore, in Step 3, you must use a method that estimates test error (like Cross-Validation) or one that penalizes for complexity (like AIC or BIC) to find the "sweet spot" model that balances fit and simplicity.
Mathematical Deep Dive 🧮
\(Y = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p + \epsilon\): The full linear model. The goal of subset selection is to find the subset of \(X_j\)'s for which \(\beta_j \neq 0\) and set all other \(\beta\)'s to 0.
\(2^p\) combinations: (Slide ...221333.png) This is the total number of models you have to check. For each of the \(p\) variables, you have two choices: either it is IN the model or it is OUT.
Example: \(p=3\) (variables \(X_1, X_2, X_3\))
The \(2^3 = 8\) possible models are:
{} (The null model, \(M_0\))
{ \(X_1\) }
{ \(X_2\) }
{ \(X_3\) }
{ \(X_1, X_2\) }
{ \(X_1, X_3\) }
{ \(X_2, X_3\) }
{ \(X_1, X_2, X_3\) } (The full model, \(M_3\))
This is why this method is called an "exhaustive search": it literally checks every single one. For \(p=20\), \(2^{20}\) is over a million models!
\(\binom{p}{k} = \frac{p!}{k!(p-k)!}\): (Slide ...221333.png) This is the "combinations" formula. It tells you how many models you fit in Step 2 for a specific \(k\).
Example: \(p=10\) total predictors.
For \(k=1\): You fit \(\binom{10}{1} = 10\) models.
For \(k=2\): You fit \(\binom{10}{2} = \frac{10 \times 9}{2 \times 1} = 45\) models.
For \(k=3\): You fit \(\binom{10}{3} = \frac{10 \times 9 \times 8}{3 \times 2 \times 1} = 120\) models.
…and so on. The sum of all these \(\binom{p}{k}\) from \(k=0\) to \(k=p\) equals \(2^p\).
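As a quick sketch of this counting (not from the slides), itertools.combinations can enumerate the subsets and confirm that the sizes add up to \(2^p\):

```python
from itertools import combinations
from math import comb

predictors = ["X1", "X2", "X3"]  # p = 3
p = len(predictors)

all_subsets = []
for k in range(0, p + 1):
    subsets_k = list(combinations(predictors, k))
    assert len(subsets_k) == comb(p, k)  # exactly "p choose k" models of size k
    all_subsets.extend(subsets_k)

print(len(all_subsets))  # 8 == 2**3, counting the empty (null) model M_0
```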
Detailed Code Analysis 💻
Your slides show Python code that applies the Best Subset
Selection algorithm to a KNN Regressor. This
is a great example of how the selection algorithm is
independent of the model type (as mentioned in slide
...221314.png).
Key Functions
main()
Load & Preprocess: Reads
Credit.csv. The most important step here is converting
categorical text (like ‘Male’/‘Female’) into numbers (1/0).
Scale Data: scaler = StandardScaler() and X_scaled = scaler.fit_transform(X).
WHY? This is CRITICAL for KNN. KNN
works by measuring distance. If ‘Income’ (e.g., 50,000) is on a vastly
different scale than ‘Cards’ (e.g., 3), the ‘Income’ feature will
completely dominate the distance calculation, making ‘Cards’ irrelevant.
Scaling resizes all features to have a mean of 0 and standard deviation
of 1, so they all contribute fairly.
Handle Noisy Data (Slide
...221303.jpg): This version of the code
intentionally adds 20 columns of useless, random numbers. This
is a test to see if the algorithm is smart enough to ignore them.
Run Selection: results_df = best_subset_selection_parallel(...). This function does all the heavy lifting (explained next).
Find Best Model: results_df.sort_values(by='CV_Score', ascending=False).
WHY ascending=False? The code uses the metric 'neg_mean_squared_error'. This is MSE, but negative (e.g., -15000). A better model has an error closer to 0 (e.g., -10000). Since -10000 is greater than -15000, you sort in descending (high-to-low) order to put the best models at the top.
Final Evaluation (Step 3): final_scores = cross_val_score(knn, X_best, y, ...)
This is the implementation of Step 3. It takes only the single best subset (X_best) and runs a new cross-validation on it. This gives a final, unbiased estimate of how good that one model is.
Print RMSE: final_rmse = np.sqrt(-final_scores). It converts the negative MSE back into a positive RMSE (Root Mean Squared Error), which is in the same units as the target \(y\) (in this case, 'Balance' in dollars).
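A minimal sketch of that final step (my reconstruction, not the verbatim slide code; it assumes X_scaled is a DataFrame of the scaled features, y is the 'Balance' target, and the winning subset is the one reported on slide ...221255.png):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

best_subset = ['Income', 'Limit', 'Rating', 'Student']  # best subset reported on the slide
X_best = X_scaled[best_subset]                          # keep only the winning columns

knn = KNeighborsRegressor()  # n_neighbors assumed; the slides' exact setting is not shown here
final_scores = cross_val_score(knn, X_best, y,
                               scoring='neg_mean_squared_error', cv=5)
final_rmse = np.sqrt(-final_scores).mean()              # per-fold RMSE, averaged
print(f"Final cross-validated RMSE: {final_rmse:.2f}")
```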
best_subset_selection_parallel(model, ...)
This is the “manager” function. It implements the loop from Step
2.
for k in range(1, n_features + 1): This is the loop
“For \(k = 1, \dots, p\)”.
subsets = list(combinations(feature_names, k)): This
generates the \(\binom{p}{k}\)
combinations for the current \(k\).
results = Parallel(n_jobs=n_jobs)(...): This line is not part of the algorithm itself; it is purely a speed-up. It uses the joblib library to run the evaluations on all your computer's CPU cores at once (in parallel). Without this, checking millions of models would take days.
subset_scores = ... [delayed(evaluate_subset)(...) ...]
This line farms out the actual work to the
evaluate_subset function for every single subset.
evaluate_subset(subset, ...)
This is the “worker” function. It gets called thousands or millions
of times.
Its job is to evaluate one single subset (e.g.,
('Income', 'Limit', 'Student')).
X_subset = X[list(subset)]: It slices the data to get
only these columns.
scores = cross_val_score(model, X_subset, ...):
This is the most important line. It takes the subset
and performs a full 5-fold cross-validation on it.
return (subset, np.mean(scores)): It returns the subset
and its average CV score.
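A condensed reconstruction of how the manager and worker fit together (a sketch under the assumptions above: X is a DataFrame of scaled features and y the target; it is not the verbatim slide code):

```python
from itertools import combinations

from joblib import Parallel, delayed
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

def evaluate_subset(model, X, y, subset):
    """Worker: mean 5-fold CV score (negative MSE) for one feature subset."""
    X_subset = X[list(subset)]
    scores = cross_val_score(model, X_subset, y,
                             scoring='neg_mean_squared_error', cv=5)
    return subset, scores.mean()

def best_subset_selection_parallel(model, X, y, n_jobs=-1):
    """Manager: evaluate every subset of every size k = 1..p in parallel."""
    feature_names = list(X.columns)
    results = []
    for k in range(1, len(feature_names) + 1):
        subsets = list(combinations(feature_names, k))
        results += Parallel(n_jobs=n_jobs)(
            delayed(evaluate_subset)(model, X, y, s) for s in subsets)
    return results  # list of (subset, mean CV score); scores closest to 0 are best

# Example call (names assumed from the description above):
# results = best_subset_selection_parallel(KNeighborsRegressor(), X_scaled, y)
```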
Summary
of Outputs (Slides ...221255.png &
...221309.png)
Original Data (Slide ...221255.png):
Best Subset: ('Income', 'Limit', 'Rating', 'Student')
Final RMSE: ~105.4
Data with 20 “Noisy” Variables (Slide
...221309.png):
Best Subset: ('Income', 'Limit', 'Student')
Result: The algorithm successfully
identified that all 20 “Noisy” variables were useless and
excluded every single one of them from the best
models.
Final RMSE: ~114.9
Key Takeaway: The RMSE is slightly higher, which
makes sense because the selection problem was much harder. But the
method worked perfectly. It filtered all the “noise” and found
a simple, powerful model, just as the theory on slide
...221320.png predicted.
2. The Core Problem: Training Error vs. Test Error
The central theme of these slides is finding the "best" model. The problem is that a model with more predictors (more complex) will always fit the data it was trained on better. This is a trap.
Training Error: How well the model fits the data we used to build it. \(R^2\) and \(RSS\) measure this.
Test Error: How well the model predicts new, unseen data. This is what we actually care about. A model that is too complex (e.g., has 10 predictors when only 3 are useful) will have low training error but very high test error. This is called overfitting.
The goal is to choose a model that has the lowest test error. The metrics below (Adjusted \(R^2\), AIC, BIC) are all attempts to estimate this test error without having to actually collect new data. They do this by adding a penalty for complexity.
Basic Metrics (Measures of Fit)
These formulas from slide 13 describe how well a model fits the training data.
Residual: \(e_i = y_i - \hat{y}_i\). Concept: This is the most basic building block. It is the difference between the actual observed value (\(y_i\)) and the value your model predicted (\(\hat{y}_i\)). It is the "error" for a single data point.
RSS: \(RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\). Concept: This is the overall measure of model error. You square all the individual errors (residuals) to make them positive and then add them all up.
Goal: The entire process of linear regression (called "Ordinary Least Squares") is designed to find the \(\hat{\beta}\) coefficients that make this RSS value as small as possible.
The Flaw: \(RSS\) will always decrease (or stay the same) as you add more predictors (\(p\)). A model with all 10 predictors will have a lower \(RSS\) than a model with 9, even if that 10th predictor is useless. Therefore, \(RSS\) is useless for choosing between models of different sizes.
\(R^2 = 1 - \frac{SS_{error}}{SS_{total}}\). Concept: This metric reframes \(RSS\) into a more interpretable percentage.
\(SS_{total}\) (the denominator) represents the total variance of the data. It is the error you would get if your "model" just guessed the average value (\(\bar{y}\)) for every single observation.
\(SS_{error}\) (the \(RSS\)) is the error after using your model.
\(R^2\) is the "proportion of total variance explained by the model." An \(R^2\) of 0.75 means your model can explain 75% of the variation in the response variable.
The Flaw: Just like \(RSS\), \(R^2\) will always increase (or stay the same) as you add more predictors. This is visually confirmed in Figure 6.1, where the red line for \(R^2\) only goes up. It will always pick the most complex model.
Advanced Metrics (For Model Selection)
These metrics "fix" the flaw of \(R^2\) by including a penalty for the number of predictors.
Adjusted \(R^2\)
Mathematical Concept: \(\text{Adjusted } R^2 = 1 - \frac{MS_{error}}{MS_{total}}\). This formula replaces the "Sum of Squares" (\(SS\)) with "Mean Squares" (\(MS\)):
\(MS_{error} = \frac{RSS}{n-p-1}\)
\(MS_{total} = \frac{SS_{total}}{n-1}\)
The "Penalty" Explained: The penalty comes from the degrees of freedom.
\(n\) = number of data points.
\(p\) = number of predictors.
The term \(n-p-1\) is the degrees of freedom for the residuals. You start with \(n\) data points, but you "use up" one degree of freedom to estimate the intercept (\(\hat{\beta}_0\)) and \(p\) more to estimate the \(p\) slopes.
How it Works:
When you add a new predictor (increase \(p\)), \(RSS\) goes down, which makes \(MS_{error}\) smaller.
…But increasing \(p\) also shrinks the denominator \(n-p-1\) inside \(MS_{error}\), which makes \(MS_{error}\) larger.
This creates a "tug-of-war." If the new predictor is useful, it will drop \(RSS\) a lot, and Adjusted \(R^2\) will increase. If the new predictor is useless, \(RSS\) will barely change, the penalty wins, and Adjusted \(R^2\) will decrease.
Goal: You select the model with the highest Adjusted \(R^2\).
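A tiny numeric sketch of that tug-of-war (my own toy numbers, not from the slides):

```python
def adjusted_r2(rss, ss_total, n, p):
    """Adjusted R^2 = 1 - (RSS / (n - p - 1)) / (SS_total / (n - 1))."""
    return 1 - (rss / (n - p - 1)) / (ss_total / (n - 1))

n, ss_total = 100, 10_000.0
print(adjusted_r2(rss=2_000, ss_total=ss_total, n=n, p=4))  # baseline 4-predictor model
print(adjusted_r2(rss=1_990, ss_total=ss_total, n=n, p=5))  # useless 5th predictor: value drops
print(adjusted_r2(rss=1_500, ss_total=ss_total, n=n, p=5))  # useful 5th predictor: value rises
```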
Akaike Information Criterion (AIC)
General Formula: \(AIC = -2 \log \ell(\hat{\theta}) + 2d\)
Concept Breakdown:
\(\ell(\hat{\theta})\): This is the Maximized Likelihood Function.
The Likelihood Function \(\ell(\theta)\) asks: "Given a set of model parameters \(\theta\), how probable is the data we observed?"
The Maximum Likelihood Estimate (MLE) \(\hat{\theta}\) is the specific set of parameters (the \(\hat{\beta}\)'s) that maximizes this probability.
\(\log \ell(\hat{\theta})\): The log-likelihood. This is just a number that represents the best possible fit the model can achieve for the data. A higher number is a better fit.
\(-2 \log \ell(\hat{\theta})\): This is the Deviance. Since a higher log-likelihood is better, a lower deviance is better. This term measures poorness-of-fit.
\(d\): The number of parameters estimated by the model (e.g., \(p\) predictors + 1 intercept).
\(2d\): This is the Penalty Term.
How it Works: \(AIC = (\text{Poorness-of-Fit}) + (\text{Complexity Penalty})\). As you add predictors, the fit gets better (the deviance term goes down), but the penalty term (\(2d\)) goes up.
Goal: You select the model with the lowest AIC.
Bayesian Information Criterion (BIC)
General Formula: \(BIC = -2 \log \ell(\hat{\theta}) + \log(n)d\)
Concept: This is mathematically identical to AIC, but the penalty term is different.
AIC Penalty: \(2d\)
BIC Penalty: \(\log(n)d\)
Comparison:
\(n\) is the number of observations in your dataset.
As long as your dataset has 8 or more observations (\(n \ge 8\)), \(\log(n)\) will be greater than 2.
This means BIC applies a much harsher penalty for complexity than AIC.
Consequence: BIC will tend to choose simpler models (fewer predictors) than AIC.
Goal: You select the model with the lowest BIC.
The Deeper Theory: Why AIC
Works
Slide 27 (“Understanding AIC”) gives the deep mathematical
justification.
Goal: We have a true, unknown process
\(p\) that generates our data. We are
creating a model \(\hat{p}_j\). We want
our model to be as “close” to the truth as possible.
Kullback-Leibler (K-L) Distance: This is a function
\(K(p, \hat{p}_j)\) that measures the
“information lost” when you use your model \(\hat{p}_j\) to approximate the truth \(p\). You want to minimize this
distance.
This splits into: \(K(p, \hat{p}_j) =
\underbrace{\int p(y) \log(p(y)) dy}_{\text{Constant}} -
\underbrace{\int p(y) \log(\hat{p}_j(y)) dy}_{\text{This is what we need
to maximize}}\)
The Problem: We can’t calculate that second term
because it requires knowing the true function \(p\).
Akaike's Insight: Akaike proved that the log-likelihood we can calculate, \(\log \ell(\hat{\theta})\), is a biased estimator of that target: on average it overestimates it by approximately \(d\), the number of estimated parameters.
The Solution: An (approximately) unbiased estimate of the target is therefore \(\log \ell(\hat{\theta}) - d\).
Final Step: For historical and statistical reasons,
he multiplied this by \(-2\) to create
the final AIC formula.
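Putting those steps together in one line (a compact restatement of the argument above, not additional theory):
\[
\mathbb{E}\big[\log \ell(\hat{\theta})\big] \;\approx\; \underbrace{\textstyle\int p(y)\,\log \hat{p}_j(y)\,dy}_{\text{the target}} \;+\; d
\quad\Longrightarrow\quad
AIC = -2\big(\log \ell(\hat{\theta}) - d\big) = -2\log \ell(\hat{\theta}) + 2d.
\]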
Conclusion: AIC is not just a random formula. It is
a carefully derived estimate of how much information your model loses
compared to the “truth” (i.e., its expected performance on new
data).
AIC/BIC for Linear Regression
Slide 26 shows how these general formulas simplify for linear regression (assuming normal, Gaussian errors).
General Formula: \(AIC = -2 \log \ell(\hat{\theta}) + 2d\)
Linear Regression Formula: \(AIC = \frac{1}{n\hat{\sigma}^2}(RSS + 2d\hat{\sigma}^2)\)
Key Insight: For linear regression, the "poorness-of-fit" term (\(-2 \log \ell(\hat{\theta})\)) is directly proportional to the \(RSS\).
This makes it much easier to understand. You can just think of the formulas as:
AIC \(\approx RSS + 2d\hat{\sigma}^2\)
BIC \(\approx RSS + \log(n)d\hat{\sigma}^2\)
(Here \(\hat{\sigma}^2\) is an estimate of the error variance, which can often be treated as a constant.)
This clearly shows the trade-off: We want a model with a low
\(RSS\) (good fit) and
a low \(d\) (low
complexity). These two goals are in direct competition.
Mallow’s \(C_p\):
The slide notes that \(C_p\) is
equivalent to AIC for linear regression. The \(C_p\) formula is \(C_p = \frac{1}{n}(RSS +
2d\hat{\sigma}^2_{full})\), where \(\hat{\sigma}^2_{full}\) is the error
variance estimated from the full model. Since \(n\) and \(\hat{\sigma}^2_{full}\) are constants,
minimizing \(C_p\) is mathematically
identical to minimizing \(RSS +
2d\hat{\sigma}^2_{full}\), which is the same logic as AIC.
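As a toy illustration of that trade-off (my own invented numbers, using the approximate forms above with \(\hat{\sigma}^2\) treated as known):

```python
import numpy as np

def aic_approx(rss, d, sigma2):
    """Approximate AIC up to constant scaling: RSS + 2*d*sigma2."""
    return rss + 2 * d * sigma2

def bic_approx(rss, d, sigma2, n):
    """Approximate BIC up to constant scaling: RSS + log(n)*d*sigma2."""
    return rss + np.log(n) * d * sigma2

n, sigma2 = 400, 100.0
# A 4-predictor model vs. a 10-predictor model whose extra variables barely reduce RSS:
print(aic_approx(50_000, d=4, sigma2=sigma2), aic_approx(49_500, d=10, sigma2=sigma2))
print(bic_approx(50_000, d=4, sigma2=sigma2, n=n), bic_approx(49_500, d=10, sigma2=sigma2, n=n))
# Both criteria prefer the smaller model here; BIC's log(n) penalty widens the gap further.
```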
3. Variable Selection
Core Concept:
The Problem of Variable Selection
In regression, we want to model a response variable \(Y\) using a set of \(p\) predictor variables \(X_1, X_2, ..., X_p\).
The “Kitchen Sink” Problem: A common temptation
is to include all available predictors in the model: \[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... +
\beta_pX_p + \epsilon\] This often leads to
overfitting. The model may fit the training data well
but will perform poorly on new, unseen data. It’s also hard to interpret
a model with dozens of predictors.
The Solution: Subset Selection. The goal is to
find a smaller subset of the predictors that builds a model that is:
Accurate: Has low prediction error.
Parsimonious: Uses the fewest predictors
necessary.
Interpretable: Is simple enough for a human to
understand.
Your slides present two main methods to achieve this: Best
Subset Selection and Forward Stepwise
Selection.
Method 1: Best Subset
Selection (BSS)
This is the “brute force” approach. It considers every single
possible model.
Conceptual Algorithm
Fit all models with \(k=1\)
predictor (there are \(p\) of these).
Find the best one (lowest RSS) and call it \(M_1\).
Fit all models with \(k=2\)
predictors (there are \(\binom{p}{2}\)
of these). Find the best one and call it \(M_2\).
…
Fit the one model with \(k=p\)
predictors (the full model), \(M_p\).
You now have a list of \(p\) “best”
models: \(M_1, M_2, ..., M_p\).
Use a selection criterion (like Adjusted \(R^2\), BIC,
AIC, or \(C_p\)) to choose the single best
model from this list.
For each predictor, there are two possibilities: it’s either
IN the model or OUT.
With \(p\) predictors, the total
number of models to test is \(2 \times 2
\times ... \times 2\) (\(p\)
times).
Total Models = \(2^p\)
This is a “combinatorial explosion.” As the slide notes, if \(p=20\), \(2^{20}
= 1,048,576\) models. This is computationally infeasible for
large \(p\).
Method 2: Forward
Stepwise Selection (FSS)
This is a “greedy” algorithm. It’s an efficient alternative to BSS
that does not test every model.
Step 1: Start with the null
model, \(M_0\), which has no
predictors. \[M_0: Y = \beta_0 +
\epsilon\] The prediction is just the sample mean of \(Y\).
Step 2 (Iterative):
For \(k=0\) (to get \(M_1\)): Fit all \(p\) models that add one predictor to \(M_0\). Choose the best one (lowest RSS or highest \(R^2\)). This is \(M_1\). Let's say it contains \(X_1\).
For \(k=1\) (to get \(M_2\)): Keep \(X_1\) in the model. Fit all \(p-1\) models that add one more predictor to \(M_1\) (e.g., \(M_1+X_2\), \(M_1+X_3\), …). Choose the best of these. This is \(M_2\).
Repeat: Continue this process, adding one variable
at a time, until all \(p\) predictors
are in the model \(M_p\).
Step 3: You now have a sequence of \(p+1\) models: \(M_0, M_1, ..., M_p\). Choose the single
best model from this sequence using Adjusted \(R^2\), AIC,
BIC, or \(C_p\).
As the slide notes, if \(p=20\),
this is only \(1 + 20(21)/2 = 211\)
models. This is vastly more efficient than BSS.
Key weakness: The method is “greedy.” If it adds
\(X_1\) in Step 1, it can
never be removed. It’s possible the true best 2-variable model
is \((X_2, X_3)\), but if FSS chose
\(X_1\) as the best 1-variable model,
it will never find \((X_2, X_3)\).
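A minimal sketch of the greedy forward pass (my own illustration; it ranks candidates by training RSS at each step and leaves the final choice among the resulting models to CV, AIC, or BIC as described above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def rss(model, X, y):
    """Training residual sum of squares for a fitted model."""
    resid = y - model.predict(X)
    return float(np.sum(resid ** 2))

def forward_stepwise(X: pd.DataFrame, y):
    """Return the greedy path M_1, ..., M_p as lists of column names."""
    remaining, selected, path = list(X.columns), [], []
    while remaining:
        # Try adding each remaining variable to the current model; keep the best one.
        scores = {v: rss(LinearRegression().fit(X[selected + [v]], y),
                         X[selected + [v]], y)
                  for v in remaining}
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        path.append(list(selected))
    return path
```

Because each step only extends the previous selection, the models are nested, which is exactly why a variable chosen early can never be swapped out later.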
4. How to Choose the
“Best” Model: The Criteria
You can’t use RSS or \(R^2\) to compare models with
different numbers of predictors (\(k\)). This is because RSS always decreases
(and \(R^2\) always increases) as you
add more variables. You must use a criterion that penalizes
complexity.
RSS (Residual Sum of Squares): Goal is to
minimize. \[RSS =
\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] Good for comparing models
of the same size \(k\).
Adjusted R-squared (\(Adj.
R^2\)): Goal is to maximize. \[Adj. R^2 = 1 -
\frac{(1-R^2)(n-1)}{n-p-1}\] This “adjusts” \(R^2\) by adding a penalty for having more
predictors (\(p\)). Adding a useless
predictor will make \(Adj. R^2\) go
down.
Mallows' \(C_p\): Goal is to minimize. \[C_p \approx \frac{1}{n}(RSS + 2p\hat{\sigma}^2)\] Here, \(\hat{\sigma}^2\) is an estimate of the error variance from the full model (with all \(p\) predictors). (Under the classical scaling \(C_p = RSS/\hat{\sigma}^2 - n + 2p\), a good model has \(C_p \approx p\); both versions rank models identically.)
AIC (Akaike Information Criterion) & BIC (Bayesian
Information Criterion): Goal is to minimize.
\[AIC = 2p - 2\ln(\hat{L})\]\[BIC = p\ln(n) - 2\ln(\hat{L})\] Here,
\(\hat{L}\) is the maximized likelihood
of the model. You don’t need to calculate this by hand; software
provides it.
Key difference: BIC’s penalty for \(p\) is \(p\ln(n)\), while AIC’s is \(2p\). Since \(\ln(n)\) is almost always \(> 2\) (for \(n>7\)), BIC applies a much
heavier penalty for complexity.
This means BIC tends to choose smaller, more parsimonious
models than AIC or \(Adj.
R^2\).
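For the statsmodels workflow shown in the next section, these criteria can be read directly off a fitted OLS result. A sketch (the DataFrame Credit and the column names follow the slides; the helper name is mine):

```python
import statsmodels.api as sm

def fit_and_score(df, predictors, target='Balance'):
    """Fit OLS on the given predictors and return the model-selection criteria."""
    X = sm.add_constant(df[list(predictors)])
    res = sm.OLS(df[target], X).fit()
    return {
        'subset': tuple(predictors),
        'RSS': float((res.resid ** 2).sum()),
        'Adj_R_squared': res.rsquared_adj,
        'AIC': res.aic,  # statsmodels' likelihood-based AIC
        'BIC': res.bic,  # likelihood-based BIC (heavier log(n) penalty)
    }

# Example (assuming Credit is the preprocessed DataFrame from the slides):
# fit_and_score(Credit, ['Income', 'Limit', 'Cards', 'Student'])
```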
5. Python Code Analysis
(Slide 225546.jpg)
This slide shows the Python code for Best Subset
Selection (BSS).
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from itertools import combinations  # <-- This is the BSS engine
pd.read_csv: Reads the data into a
pandas DataFrame.
.map(): This is a crucial
preprocessing step. Regression models require numbers, not text like
‘Yes’ or ‘Male’. This line converts those strings into 1s
and 0s.
sns.pairplot: A powerful visualization
from the seaborn library. The resulting plot (right side of
the slide) is a grid.
Diagonal plots (kde): Show the distribution (Kernel
Density Estimate) of a single variable (e.g., ‘Balance’ is skewed
right).
Off-diagonal plots (scatter): Show the relationship
between two variables (e.g., ‘Limit’ and ‘Rating’ are almost perfectly
linear). This helps you visually spot potentially strong
predictors.
# 3. Best Subset Selection
# (This code is incomplete on the slide; the logic is filled in below.)

# Define target and predictors
target = 'Balance'
predictors = [col for col in Credit.columns if col != target]
nvmax = 10  # Max number of predictors to test (up to 10)

# Initialize lists to store model statistics
model_stats = []

# Iterate over number of predictors from 1 to nvmax
for k in range(1, nvmax + 1):
    # Generate all possible combinations of predictors of size k
    # This is the core of BSS
    for subset in list(combinations(predictors, k)):
        # Get the design matrix (X)
        X_subset = Credit[list(subset)]
        # Add a constant (intercept) term to the model
        # Y = B0 + B1*X1 -> statsmodels needs B0 to be added manually
        X_subset_const = sm.add_constant(X_subset)
        # Get the target variable (y)
        y_target = Credit[target]
        # Fit the Ordinary Least Squares (OLS) model
        model = sm.OLS(y_target, X_subset_const).fit()
        # Calculate RSS
        RSS = ((model.resid) ** 2).sum()
        # (The full code would also calculate R-squared, Adj. R-sq, BIC, etc. here)
        # model_stats.append({'k': k, 'subset': subset, 'RSS': RSS, ...})
for k in range(1, nvmax + 1): This is
the outer loop that iterates from \(k=1\) (1 predictor) to \(k=10\) (10 predictors).
list(combinations(predictors, k)):
This is the inner loop and the most important
line. The itertools.combinations function is a
highly efficient way to generate all unique subsets.
When \(k=1\), it returns
[('Income',), ('Limit',), ('Rating',), ...].
When \(k=2\), it returns
[('Income', 'Limit'), ('Income', 'Rating'), ('Limit', 'Rating'), ...].
This is what generates the \(2^p\)
(or in this case, \(\sum_{k=1}^{10}
\binom{p}{k}\)) models to test.
sm.add_constant(X_subset): Your
regression equation is \(Y = \beta_0 +
\beta_1X_1\). The \(X_1\) is
your X_subset. The sm.add_constant function
adds a column of 1s to your data, which allows the
statsmodels library to estimate the \(\beta_0\) (intercept) term.
sm.OLS(y_target, X_subset_const).fit():
This fits the Ordinary Least Squares (OLS) model, which finds the \(\beta\) coefficients that minimize
the RSS.
model.resid: This attribute of the
fitted model contains the residuals (\(e_i =
y_i - \hat{y}_i\)) for each data point.
((model.resid) ** 2).sum(): This line
is the direct code implementation of the formula \(RSS = \sum e_i^2\).
Synthesizing the Results
(The Plots)
After running the BSS code, you get the data used in the plots and
the table.
Image 225550.png (Adjusted
R-squared)
Goal: Maximize.
What it shows: The gray dots are all the
models tested for each \(k\). The red
line connects the single best model for each \(k\).
Conclusion: The plot shows a sharp “elbow.” The
\(Adj. R^2\) increases dramatically up
to \(k=4\), then increases very slowly.
The maximum is around \(k=6\) or \(k=7\), but the gain after \(k=4\) is minimal.
Image 225554.png (BIC)
Goal: Minimize.
What it shows: BIC heavily penalizes
complexity.
Conclusion: The plot shows a very clear minimum.
The BIC value plummets from \(k=2\) to
\(k=3\) and hits its lowest point at
\(k=4\). After \(k=4\), the penalty for adding more
variables is larger than the benefit in model fit, so the BIC
score starts to rise. This is a very strong vote for the 4-predictor
model.
Image 225635.png (Mallow’s \(C_p\))
Goal: Minimize.
What it shows: A very similar story to BIC.
Conclusion: The \(C_p\) value drops significantly and hits
its minimum at \(k=4\).
Image 225638.png (Summary
Table)
This is the most important image for the final
conclusion. It summarizes the red line from all the plots.
Look at the row for Num_Predictors = 4. The predictors
are (Income, Limit, Cards, Student).
Now look at the columns for BIC and Cp.
BIC: 4841.615607. This is the lowest value in the entire BIC column (the value at \(k=3\) is 4865.352851).
Cp: 7.122228. This is also the lowest value in the Cp column.
The Adj_R_squared at \(k=4\) is 0.953580, which is
very close to its maximum of ~0.954 at \(k=7-10\).
Final Conclusion: All three “penalized” criteria
(Adjusted \(R^2\), BIC, and \(C_p\)) point to the same conclusion. While
\(Adj. R^2\) is a bit ambiguous,
BIC and \(C_p\) provide a clear
signal that the best, most parsimonious model is the 4-predictor model
using Income, Limit, Cards, and
Student.
4. Subset Selection
Summary of Subset Selection
These slides introduce subset selection, a process
in statistical learning used to identify the best subset of predictors
(variables) for a regression model. The goal is to find a model that has
low prediction error and avoids overfitting by excluding irrelevant
variables.
The slides cover two main “greedy” (stepwise) algorithms and the
criteria used to select the final best model.
Stepwise Selection
Algorithms
Instead of testing all \(2^p\)
possible models (which is “best subset selection” and computationally
unfeasible), stepwise methods build a single path of models.
Forward Stepwise Selection
This is an additive (bottom-up) approach:
Start with the null model (no predictors).
Find the best 1-variable model (the one that gives
the lowest Residual Sum of Squares, or RSS).
Add the single variable that, when added to the
current model, results in the new best model (lowest RSS).
Repeat this process until all \(p\) predictors are in the model.
This generates a sequence of \(p+1\) models, from \(\mathcal{M}_0\) to \(\mathcal{M}_p\).
Backward Stepwise Selection
This is a subtractive (top-down) approach:
Start with the full model containing all \(p\) predictors.
Find the best \((p-1)\)-variable model by removing
the single variable that results in the lowest RSS (or highest
\(R^2\)). This variable is considered
the least significant.
Remove the next variable that, when removed from
the current best model, gives the new best model.
Repeat until only the null model remains.
This also generates a sequence of \(p+1\) models.
Pros and Cons (Backward
Selection)
Pro: Computationally efficient compared to best
subset. It fits \(1 + \sum_{k=0}^{p-1}(p-k) =
\mathbf{1 + p(p+1)/2}\) models, which is much less than \(2^p\). (e.g., for \(p=20\), it’s 211 models vs. >1
million).
Con: Cannot be used if \(p > n\) (more predictors than observations), because the initial full model cannot be fit.
Con (for both): These methods are
greedy. A variable added in forward selection is
never removed, and a variable removed in backward selection is
never added back. This means they are not guaranteed to find
the true best model.
Choosing the Final Best
Model
Both forward and backward selection give you a set of candidate
models (e.g., the best 1-variable model, best 2-variable model, etc.).
You must then choose the single best one. The slides show two
main approaches:
A. Direct Error Estimation
Use a validation set or cross-validation (CV) to estimate the test
error for each model (e.g., the 1-variable, 2-variable… models).
Choose the model with the lowest estimated test
error.
B. Adjusted
Metrics (Penalizing for Complexity)
Standard RSS and \(R^2\) will always
improve as you add variables, leading to overfitting. Instead, use
metrics that penalize the model for having too many
predictors.
Mallows’ \(C_p\): An estimate of test Mean
Squared Error (MSE). \[C_p = \frac{1}{n} (RSS
+ 2d\hat{\sigma}^2)\] (where \(d\) is the number of predictors, and \(\hat{\sigma}^2\) is an estimate of the
error variance). You want to find the model with the
minimum\(C_p\).
BIC (Bayesian Information Criterion):\[BIC = \frac{1}{n} (RSS +
\log(n)d\hat{\sigma}^2)\] BIC’s penalty \(\log(n)\) is stronger than \(C_p\)’s (or AIC’s) penalty of \(2\), so it tends to select smaller
(more parsimonious) models. You want to find the model with the
minimum BIC.
Adjusted \(R^2\):\[R^2_{adj} = 1 -
\frac{RSS/(n-d-1)}{TSS/(n-1)}\] (where \(TSS\) is the Total Sum of Squares). Unlike
\(R^2\), this metric can decrease if
adding a variable doesn’t help enough. You want to find the
model with the maximum Adjusted \(R^2\).
Python Code Understanding
The slides use the regsubsets() function from the leaps package in R; the Python code below reproduces the same idea with scikit-learn's SequentialFeatureSelector.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split

# Assume 'Credit' is a pandas DataFrame with 'Balance' as the target
X = Credit.drop('Balance', axis=1)
y = Credit['Balance']

# Initialize the linear regression estimator
model = LinearRegression()

# --- Forward Selection ---
# direction='forward' starts with 0 features and adds them.
# To get the best 4-variable model, for example:
sfs_forward = SequentialFeatureSelector(
    model,
    n_features_to_select=4,
    direction='forward',
    cv=None  # None falls back to the default 5-fold CV; use e.g. cv=10 for more folds
)
sfs_forward.fit(X, y)
print("Forward selection best 4 features:")
print(sfs_forward.get_feature_names_out())

# --- Backward Selection ---
# direction='backward' starts with all features and removes them
sfs_backward = SequentialFeatureSelector(
    model,
    n_features_to_select=4,
    direction='backward',
    cv=None
)
sfs_backward.fit(X, y)
print("\nBackward selection best 4 features:")
print(sfs_backward.get_feature_names_out())

# Note: To replicate the plots, you would loop this process,
# changing 'n_features_to_select' from 1 to p,
# record the model scores (e.g., RSS, AIC, BIC) at each step,
# and then plot the results.
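One possible implementation of that loop (a sketch, not the slide code: it scores each model size by 10-fold CV error, mirroring the CV plot, and assumes X and y from the block above):

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

cv_errors = {}
# SequentialFeatureSelector requires n_features_to_select < total number of features,
# so the full model (all predictors) would be scored separately if needed.
for k in range(1, X.shape[1]):
    sfs = SequentialFeatureSelector(LinearRegression(),
                                    n_features_to_select=k,
                                    direction='forward')
    sfs.fit(X, y)
    cols = list(sfs.get_feature_names_out())
    scores = cross_val_score(LinearRegression(), X[cols], y,
                             scoring='neg_mean_squared_error', cv=10)
    cv_errors[k] = -scores.mean()

best_k = min(cv_errors, key=cv_errors.get)  # the bottom of the "U"-shaped CV curve
print(best_k, cv_errors[best_k])
```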
What they are: These \(2
\times 2\) plot grids are the most important visuals. They show
Residual Sum of Squares (RSS), Adjusted \(R^2\), BIC, and
Mallows’ \(C_p\)
plotted against the Number of Variables.
Why they’re important: They are the
decision-making tool. You use these plots to choose the
best model.
You look for the “elbow” or minimum value for BIC
and \(C_p\).
You look for the “peak” or maximum value for
Adjusted \(R^2\).
(RSS is not used for selection as it always decreases).
Slide ...230040.png (Find the best
model):
What it is: This slide shows a close-up of the
\(C_p\), BIC, and Adjusted \(R^2\) plots, with the “best” model (the
min/max) marked with a blue ‘x’.
Why it’s important: It explicitly states the
selection criteria. The text highlights that BIC suggests a 4-variable
model, while the other two are “rather flat” after 4, making the choice
less obvious but pointing to a simple model.
Slide ...230045.png (BIC vs. Validation
vs. CV):
What it is: This shows three plots for selecting
the best model using different criteria: BIC, Validation Set Error, and
Cross-Validation Error.
Why it’s important: It shows that different
selection criteria can lead to different “best” models. Here,
BIC (a mathematical adjustment) picks a 4-variable model, while
validation and CV (direct error estimation) both pick a 6-variable
model.
The slides use the Credit dataset to demonstrate two key
tasks: 1. Running different subset selection algorithms
(forward, backward, best). 2. Using various statistical
metrics (BIC, \(C_p\), CV error) to
choose the single best model.
Comparing Selection
Algorithms (The Path)
This part of the example compares the sequence of models
selected by “Forward Stepwise” selection versus “Best Subset”
selection.
Key Result (from Table 6.1):
This table is the most important result for comparing the
algorithms.
| Variables | Best Subset | Forward Stepwise |
|---|---|---|
| one | rating | rating |
| two | rating, income | rating, income |
| three | rating, income, student | rating, income, student |
| four | cards, income, student, limit | rating, income, student, limit |
Summary of this result:
Identical for 1, 2, and 3 variables: Both methods
agree on the best one-variable model (rating), the best
two-variable model (rating, income), and the
best three-variable model (rating, income,
student).
They Diverge at 4 variables:
Forward selection is greedy. It started
with rating, income, student and
was “stuck” with them. It then added limit, as that was the
best variable to add to its existing 3-variable model.
Best subset selection is not greedy. It
tests all possible 4-variable combinations. It discovered that the model
cards, income, student,
limit has a slightly lower RSS than the model forward
selection found.
Main Takeaway: This demonstrates the limitation of
a greedy algorithm. Forward selection missed the “true” best 4-variable
model because it was locked into its previous choices and couldn’t “swap
out” rating for cards.
Choosing the
Single Best Model (The Destination)
This is the most critical part of the analysis. After running a
selection algorithm (like forward, backward, or best subset), you get a
list of the “best” models for each size (best 1-variable, best
2-variable, etc.). Now you must decide: is the best model the
4-variable one, the 6-variable one, or another?
The slides show several plots to help make this decision, all plotted
against the “Number of Predictors.”
Summary of Plot Results:
Here’s what each plot tells you:
Residual Sum of Squares (RSS) (e.g., in slide
...230014.png, top-left)
What it shows: RSS always decreases as you
add more variables. It drops sharply until 4 variables, then flattens
out.
Conclusion: This plot is not useful for
picking the best model because it will always pick the full
model, which is overfit. It’s only used to see the diminishing returns
of adding new variables.
Adjusted \(R^2\)
(e.g., in slide ...230040.png, right)
What it shows: This metric penalizes adding useless
variables. The plot rises quickly, then flattens, peaking at its
maximum value around 6 or 7 variables.
Conclusion: This metric suggests a 6 or
7-variable model.
Mallows’ \(C_p\)
(e.g., in slide ...230040.png, left)
What it shows: This is an estimate of test error.
We want the model with the minimum \(C_p\). The plot drops to a low
value at 4 variables and stays low, with its absolute minimum around
6 or 7 variables.
Conclusion: This metric also suggests a 6
or 7-variable model.
BIC (Bayesian Information Criterion) (e.g., in
slide ...230040.png, center)
What it shows: This is another estimate of test
error, but it has a stronger penalty for model complexity. The
plot shows a clear “U” shape, reaching its minimum value at 4
variables and then increasing afterward.
Conclusion: This metric strongly suggests a
4-variable model.
Validation Set & Cross-Validation (CV) Error
(Slide ...230045.png)
What it shows: These plots show the direct
estimate of test error (not a mathematical adjustment like BIC or \(C_p\)). Both the validation set error and
the 10-fold CV error show a “U” shape.
Conclusion: Both methods reach their
minimum error at 6 variables. This is considered a very
reliable result.
Final Summary of Results
The analysis of the Credit dataset reveals two strong
candidates for the “best” model, depending on your goal:
The 6-Variable Model: This model is supported by
the Adjusted \(R^2\),
Mallows’ \(C_p\), and
(most importantly) the Validation Set and
10-fold Cross-Validation results. These metrics all
indicate that the 6-variable model has the lowest prediction
error on new data.
The 4-Variable Model: This model is supported by
BIC. Because BIC penalizes complexity more heavily, it
selects a simpler (more parsimonious) model.
Overall Conclusion: If your primary goal is
maximum predictive accuracy, you should choose the
6-variable model. If your goal is a simpler,
more interpretable model that is still very good (and avoids
any risk of overfitting), the 4-variable model is an
excellent choice.
5.
Two main strategies for controlling model complexity in linear
regression
This presentation covers two main strategies for controlling model
complexity in linear regression: Subset Selection
(choosing which variables to include) and Shrinkage
Methods (keeping all variables but reducing the impact
of their coefficients).
Subset Selection
This method involves selecting a subset of the \(p\) total predictors to use in the
model.
Key Concepts & Formulas
The Model: The standard linear regression model
is represented in matrix form: \[\mathbf{y} =
\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\] The goal
of subset selection is to find a coefficient vector \(\boldsymbol{\beta}\) that is
sparse, meaning it has many zero entries.
Forward Selection: This is a greedy
algorithm that starts with an empty model and iteratively adds the
single predictor that most improves the fit.
Theoretical Guarantee: Can forward selection
find the true sparse set of variables?
Yes, if the predictors are not strongly correlated.
This is quantified by the Mutual Coherence
Condition. Assuming the predictors \(\mathbf{x}_i\) are normalized, the method
is guaranteed to work if: \[\mu = \max_{i
\neq j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle| < \frac{1}{2s -
1}\] where \(s\) is the number
of true non-zero coefficients and \(\langle
\mathbf{x}_i, \mathbf{x}_j \rangle\) represents the correlation
between predictors.
Practical
Application: Finding the Best Model Size
How do you know whether to choose a model with 3, 4, or 5 variables?
You use Cross-Validation (CV).
Important Image: The plot titled “10-fold CV”
(from the first slide) is the most important visual. It plots the
estimated test error (CV Error) on the y-axis against the number of
variables in the model on the x-axis.
The “One Standard Deviation Rule”: Looking at
the plot, the error drops sharply and then flattens. The absolute
minimum error might be at 6 variables, but it’s only slightly better
than the 3-variable model.
Find the model with the lowest CV error.
Calculate the standard error for that error estimate.
Select the simplest model (fewest variables) whose
error is within one standard deviation of the minimum.
This follows Occam’s razor: choose the simplest
explanation (model) that fits the data well enough. In the example
given, this rule selects the 3-variable model.
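A small numerical sketch of this rule (the error and standard-error arrays below are made up for illustration; in practice they would come from your own CV loop, one entry per model size):

import numpy as np

cv_errors = np.array([25.1, 18.4, 15.2, 15.0, 14.9, 14.7, 14.8, 15.0, 15.1, 15.3])
cv_se     = np.array([ 1.2,  1.0,  0.9,  0.9,  0.9,  0.8,  0.8,  0.9,  0.9,  1.0])

best = int(np.argmin(cv_errors))              # model size with the minimum CV error
threshold = cv_errors[best] + cv_se[best]     # minimum error + one standard error
# smallest model whose CV error is within the tolerance
chosen_size = int(np.flatnonzero(cv_errors <= threshold)[0]) + 1
print(f"Minimum-error size: {best + 1}, one-standard-error rule picks: {chosen_size}")

With these illustrative numbers the minimum sits at 6 variables, but the rule picks the 3-variable model, mirroring the slide's example.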
Code Interpretation (R
vs. Python)
The R code in the first slide performs this 10-fold CV manually for
forward selection:
It loops from p = 1 to 10 (model
sizes).
Inside the loop, it identifies the p variables chosen
by a pre-computed forward selection model
(regfit.fwd).
It fits a new model (glm.fit) using only those
p variables.
It runs 10-fold CV (cv.glm) on that specific
model to get its test error.
It stores the error in CV10.err[p].
Finally, it plots the results.
In Python (with scikit-learn): This
entire process is often automated.
You would use sklearn.feature_selection.RFECV
(Recursive Feature Elimination with Cross-Validation).
RFECV automatically performs cross-validation to find
the optimal number of features, effectively producing the same plot and
result as the R code.
# Conceptual Python equivalent for finding the best model size
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_regression

# X, y = load_your_data()
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=42)

estimator = LinearRegression()
# RFECV will test models with 1 feature, 2 features, etc.,
# and use cross-validation (cv=10) to find the best number.
selector = RFECV(estimator, step=1, cv=10, scoring='neg_mean_squared_error')
selector = selector.fit(X, y)

print(f"Optimal number of features: {selector.n_features_}")
# You can plot selector.cv_results_['mean_test_score'] to get the CV curve
Shrinkage Methods
(Regularization)
Instead of explicitly removing variables, shrinkage methods keep all
\(p\) variables but shrink
their coefficients \(\beta_j\) towards
zero.
Ridge Regression
Ridge regression is a prime example of a shrinkage method.
Objective Function: It finds the coefficients
\(\boldsymbol{\beta}\) that minimize a
new quantity: \[\underbrace{\sum_{i=1}^{n}
(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2}_{\text{RSS (Goodness
of Fit)}} + \underbrace{\lambda \sum_{j=1}^{p}
\beta_j^2}_{\text{$\ell_2$ Penalty (Shrinkage)}}\]
The \(\lambda\) Tuning
Parameter: This parameter controls the strength of the
penalty:
If \(\lambda =
0\): The penalty term disappears. Ridge regression is
identical to standard Ordinary Least Squares (OLS).
If \(\lambda \to
\infty\): The penalty is “infinitely” strong. To
minimize the function, all coefficients \(\beta_j\) (for \(j=1...p\)) are forced to be zero. The model
becomes an intercept-only model.
Note: The intercept \(\beta_0\) is not penalized.
The Bias-Variance Trade-off: This is the core
concept of regularization.
Standard OLS has low bias but can have high variance (it
overfits).
Ridge regression adds a small amount of bias (the
coefficients are “wrong” on purpose) to significantly reduce the
model’s variance.
This trade-off often leads to a model with a lower overall test
error.
Matrix Solution: The discussion slide asks “What
is the solution?”. While OLS has the solution \(\hat{\boldsymbol{\beta}} =
(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\), the Ridge
solution is: \[\hat{\boldsymbol{\beta}}^R =
(\mathbf{X}^T\mathbf{X} + \lambda
\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\] where \(\mathbf{I}\) is the identity matrix. The
\(\lambda \mathbf{I}\) term adds a
“ridge” to the diagonal, making the matrix invertible even if \(\mathbf{X}^T\mathbf{X}\) is singular (which
happens if \(p > n\) or predictors
are collinear).
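A quick numerical sketch of this closed form on simulated data (illustrative only; the predictors are standardized and the response centered so the unpenalized intercept can be handled separately):

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=n)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize predictors
yc = y - y.mean()                           # center the response

lam = 1.0
beta_ols   = np.linalg.solve(Xs.T @ Xs,                   Xs.T @ yc)
beta_ridge = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
print("OLS:  ", np.round(beta_ols, 3))
print("Ridge:", np.round(beta_ridge, 3))    # every entry is pulled toward zero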
An Essential Step:
Standardization
Problem: The \(\ell_2\) penalty \(\lambda \sum \beta_j^2\) is applied equally
to all coefficients. If predictor \(x_1\) (e.g., house size in sq-ft) is on a
much larger scale than \(x_2\) (e.g.,
number of rooms), its coefficient \(\beta_1\) will naturally be much smaller
than \(\beta_2\). The penalty will
unfairly punish \(\beta_2\) more.
Solution: You must standardize
your inputs before fitting a Ridge model.
Formula: For each predictor \(X_j\), all its observations \(x_{ij}\) are rescaled: \[\tilde{x}_{ij} = \frac{x_{ij} -
\bar{x}_j}{\sigma_j}\] (where \(\bar{x}_j\) is the mean of the predictor
and \(\sigma_j\) is its standard
deviation). This puts all predictors on a common scale (mean=0,
std=1).
In Python (with scikit-learn):
You use sklearn.preprocessing.StandardScaler to
standardize your data.
You use sklearn.linear_model.Ridge to fit the
model.
You use sklearn.linear_model.RidgeCV to automatically
find the best value for \(\lambda\)
(called alpha in scikit-learn) using cross-validation.
# Conceptual Python code for Ridge Regression
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X, y = load_your_data()

# Create a pipeline that first standardizes the data,
# then fits a Ridge model.
# RidgeCV tests a range of alphas (lambdas) automatically.
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0], scoring='neg_mean_squared_error')
)
This section is about choosing which predictors (variables)
to include in your linear model. The main idea is to find a “sparse”
model (one with few variables) that performs well.
The Model and The Goal
Slide: “Forward selection in Linear
Regression”
Formula: The standard linear regression model is
\(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} +
\boldsymbol{\epsilon}\)
\(\mathbf{y}\) is the \(n \times 1\) vector of outcomes.
\(\mathbf{X}\) is the \(n \times (p+1)\) matrix of predictors (with
a leading column of 1s for the intercept).
\(\boldsymbol{\beta}\) is the \((p+1) \times 1\) vector of coefficients
(\(\beta_0, \beta_1, ...,
\beta_p\)).
\(\boldsymbol{\epsilon}\) is the
\(n \times 1\) vector of irreducible
error.
Key Question: “If \(\boldsymbol{\beta}\) is sparse with at most
\(s\) non-zero entries, can forward
selection find those variables?”
Sparse means most coefficients are zero.
Forward Selection is a greedy algorithm:
Start with no variables.
Add the one variable that gives the best fit.
Add the next best variable to the existing model.
Repeat until you have a model with \(s\) variables.
The slide suggests the answer is yes, but only
under certain conditions.
The Condition for Success
Slide: “Orthogonal Matching Pursuit”
Key Concept: Forward selection can provably find
the correct variables if those variables are not strongly
correlated.
Formula: This is formalized by the Mutual
Coherence Condition: \[\mu = \max_{i
\neq j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle| < \frac{1}{2s -
1}\]
What it means:
Assuming the \(\mathbf{x}_i\)'s are normalized means we have scaled each predictor to have length (Euclidean norm) 1.
\(\langle \mathbf{x}_i, \mathbf{x}_j
\rangle\) is the dot product, which is just their
correlation since they are normalized.
\(\mu\) (mu) is the largest
absolute correlation you can find between any two
different predictors.
\(s\) is the true number of
important variables.
In English: If the maximum correlation between any
of your predictors is less than this threshold, the greedy forward
selection algorithm is guaranteed to find the true, sparse set of
variables.
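A small sketch of checking this condition numerically (the design matrix here is simulated, and s is an assumed sparsity level):

import numpy as np

rng = np.random.default_rng(1)
n, p, s = 100, 8, 3                     # s = assumed number of true non-zero coefficients
X = rng.normal(size=(n, p))
Xn = X / np.linalg.norm(X, axis=0)      # normalize each column to unit length

G = np.abs(Xn.T @ Xn)                   # absolute pairwise inner products
np.fill_diagonal(G, 0.0)                # ignore the diagonal (self-correlation = 1)
mu = G.max()                            # mutual coherence

print(f"mu = {mu:.3f}, threshold 1/(2s - 1) = {1 / (2 * s - 1):.3f}")
print("Guarantee holds" if mu < 1 / (2 * s - 1) else "Guarantee does not hold")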
How to Choose the Model
Size (Practice)
The theory is nice, but in practice, you don’t know \(s\). How many variables should you
pick?
Slide: “10-fold CV Errors”
This is the most important practical slide for this
section.
What the plot shows:
X-axis: “Number of Variables” (from 1 to 10).
Y-axis: “CV Error” (the 10-fold cross-validated
Mean Squared Error).
The Curve: The error drops very fast as we add the
first 2-3 variables. Then, it flattens out. Adding more than 3 variables
doesn’t really help much.
Slide: “The one standard deviation
rule”
This rule helps you pick the “best” model from the CV plot.
Find the model with the absolute minimum CV error (in the
plot, this looks to be around 6 or 7 variables).
Calculate the standard error of that minimum CV error.
Draw a “tolerance” line at
(minimum error) + (one standard error).
Choose the simplest model (fewest variables) whose
CV error is below this tolerance line.
The slide states that this rule selects the 3-variable model for
this example. This is because the 3-variable model is much simpler than
the 6-variable one, and its error is “good enough” (within one standard
deviation of the minimum). This is an application of Occam’s
razor.
Code: R vs. Python
The R code on the “10-fold CV Errors” slide generates that exact
plot.
R Code Explained:
library(boot): Loads the cross-validation library.
CV10.err=rep(0,10): Creates an empty vector to store
the 10 error scores.
for(p in 1:10): A loop that will test model sizes from
1 to 10.
x<-which(summary(regfit.fwd)$which[p,]): Gets the
names of the \(p\) variables
chosen by a pre-run forward selection (regfit.fwd).
glm.fit=glm(Balance~.,data=newCred): Fits a model using
only those \(p\)
variables.
cv.err=cv.glm(newCred,glm.fit,K=10): Performs 10-fold
CV on that specific \(p\)-variable
model.
CV10.err[p]<-cv.err$delta[1]: Stores the CV
error.
plot(...): Plots the 10 errors against the 10 model
sizes.
Python Equivalent (Conceptual):
In scikit-learn, this process is often automated. You
wouldn’t write the CV loop yourself.
You would use sklearn.feature_selection.RFECV
(Recursive Feature Elimination with Cross-Validation). This tool
automatically wraps a model (like LinearRegression),
performs cross-validation, and finds the optimal number of features,
effectively producing the same plot and result.
# --- Python equivalent for 6.1 ---
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
# Assume X and y are your data

# 1. Scale the predictors first.
# (RFECV needs an estimator that exposes coef_, so we scale the data
# up front instead of wrapping the scaler and model in a pipeline.)
X_scaled = StandardScaler().fit_transform(X)

# 2. Create the RFECV (Recursive Feature Elimination w/ CV) object.
# This is an *alternative* to forward selection, but serves the same purpose:
# it will test models with 1, 2, 3, ... features using 10-fold CV.
feature_selector = RFECV(
    estimator=LinearRegression(),
    min_features_to_select=1,
    step=1,
    cv=10,
    scoring='neg_mean_squared_error'  # we want to minimize error
)

# 3. Fit it
feature_selector.fit(X_scaled, y)

print(f"Optimal number of features found: {feature_selector.n_features_}")

# You could then plot feature_selector.cv_results_['mean_test_score']
# to replicate the R plot.
Shrinkage Methods by
Regularization
This is a different approach. Instead of removing variables,
we keep all \(p\) variables but
shrink their coefficients \(\beta_j\) towards 0.
Term 1: \(\text{RSS}\)
(Residual Sum of Squares). This is the original OLS “goodness
of fit” term. We want this to be small.
Term 2: \(\lambda \sum
\beta_j^2\). This is the \(\ell_2\) penalty or “shrinkage
penalty”. It adds a “cost” for having large coefficients.
The \(\lambda\) (lambda)
Parameter:
This is the tuning parameter that controls the
trade-off between fit and simplicity.
If \(\lambda = 0\): No penalty. The objective is just to minimize RSS. The solution \(\hat{\boldsymbol{\beta}}^R\) is identical to the OLS solution \(\hat{\boldsymbol{\beta}}\).
If \(\lambda = \infty\): Infinite penalty. The only way to minimize the cost is to make all \(\beta_j = 0\) (for \(j \ge 1\)). The model becomes an intercept-only model.
Large \(\lambda\): Heavy penalty, more shrinkage.
Crucial Note: The intercept \(\beta_0\) is not
penalized. This is because \(\beta_0\) just represents the mean of \(y\) when all \(x\)’s are 0; shrinking it makes no
sense.
The Need for Standardization
Slide: “Standardize the inputs”
Problem: The penalty \(\lambda \sum \beta_j^2\) is applied to all
coefficients. But what if \(x_1\) is
“house size in sq-ft” (values 1000-5000) and \(x_2\) is “number of bedrooms” (values 1-5)?
The coefficient \(\beta_1\) for
house size will naturally be tiny, while the coefficient \(\beta_2\) for bedrooms will be
large, even if they are equally important.
Ridge regression would unfairly and heavily penalize \(\beta_2\) while barely touching \(\beta_1\).
Solution: You must standardize all
predictors before fitting a Ridge model.
Formula: For each observation \(i\) of each predictor \(j\): \[\tilde{x}_{ij} = \frac{x_{ij} -
\bar{x}_j}{\sqrt{(1/n) \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}\]
This formula rescales every predictor to have a mean of 0 and a
standard deviation of 1.
Now, all coefficients \(\beta_j\)
are on a “level playing field” and can be penalized fairly.
Answering the Discussion
Questions
Slide: “DISCUSSION”
What is the solution of Ridge regression?
What is the bias and the variance?
1. What is the
solution of Ridge regression?
The solution can be written in matrix form, which is very
elegant.
Standard OLS Solution: The coefficients \(\hat{\boldsymbol{\beta}}^{\text{OLS}}\)
that minimize RSS are found by: \[\hat{\boldsymbol{\beta}}^{\text{OLS}} =
(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
Ridge Regression Solution: The coefficients
\(\hat{\boldsymbol{\beta}}^{R}\) that
minimize the Ridge objective are: \[\hat{\boldsymbol{\beta}}^{R} =
(\mathbf{X}^T\mathbf{X} + \lambda
\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]
Explanation:
\(\mathbf{I}\) is the
identity matrix (a matrix of 1s on the diagonal, 0s
everywhere else).
By adding \(\lambda\mathbf{I}\), we
are adding a positive value \(\lambda\)
to the diagonal of the \(\mathbf{X}^T\mathbf{X}\) matrix.
This addition stabilizes the matrix. \(\mathbf{X}^T\mathbf{X}\) might not be
invertible (if \(p > n\) or if
predictors are perfectly collinear), but \((\mathbf{X}^T\mathbf{X} + \lambda
\mathbf{I})\) is always invertible for \(\lambda > 0\).
This addition is what mathematically “shrinks” the coefficients
toward zero.
2. What is the bias and the
variance?
This is the most important concept in
regularization. It’s the bias-variance trade-off.
Standard OLS (where \(\lambda=0\)):
Bias: Low. The OLS estimator is
unbiased, meaning that if you took many samples and fit
many OLS models, their average \(\hat{\boldsymbol{\beta}}\) would be the
true \(\boldsymbol{\beta}\).
Variance: High. The OLS solution can be
highly sensitive to the training data. If you change a few data
points, the coefficients can swing wildly. This is especially true if
\(p\) is large or predictors are
correlated. This “sensitivity” is high variance, which leads to
overfitting.
Ridge Regression (where \(\lambda > 0\)):
Bias: High(er). Ridge regression is a
biased estimator. By adding the penalty, we are
purposefully pulling the coefficients away from the OLS
solution and towards zero. The average \(\hat{\boldsymbol{\beta}}^R\) from many
samples will not equal the true \(\boldsymbol{\beta}\). We have
introduced bias into our model.
Variance: Low(er). In exchange for this bias, we
get a massive reduction in variance. The \(\lambda\mathbf{I}\) term stabilizes the
solution. The coefficients won’t change wildly even if the training data
changes. The model is more robust and less sensitive.
The Trade-off: The total expected test error of a
model is: \(\text{Error} = \text{Bias}^2 +
\text{Variance} + \text{Irreducible Error}\)
By using Ridge regression, we increase the \(\text{Bias}^2\) term a little, but we
decrease the \(\text{Variance}\) term a lot. The goal is
to find a \(\lambda\) where the
total error is minimized. Ridge regression reduces variance
at the cost of increased bias.
# --- Python equivalent for 6.2 ---
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Assume X and y are your data

# 1. Create a pipeline that AUTOMATICALLY:
#    - standardizes the data
#    - fits a Ridge Regression model
#    - uses cross-validation to find the BEST lambda (alpha in scikit-learn)
alphas_to_test = [0.01, 0.1, 1.0, 10.0, 100.0]

# RidgeCV handles everything for us
pipeline = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=alphas_to_test, scoring='neg_mean_squared_error', cv=10)
)

# 2. Fit the pipeline
pipeline.fit(X, y)

# 3. Get the results
best_lambda = pipeline.named_steps['ridgecv'].alpha_
ridge_coefficients = pipeline.named_steps['ridgecv'].coef_
intercept = pipeline.named_steps['ridgecv'].intercept_

print(f"Best lambda (alpha) found by CV: {best_lambda}")
print(f"Model intercept (beta_0): {intercept}")
print(f"Model coefficients (beta_j): {ridge_coefficients}")
6. The “Why” of Ridge
Regression
Core Concepts: The
“Why” of Ridge Regression
Your slides explain that ridge regression is a “shrinkage method”
designed to solve a major problem with standard Ordinary Least Squares
(OLS) regression: high variance.
The Bias-Variance Tradeoff
(Slide 3)
This is the most important theoretical concept. In prediction, the
total error (Mean Squared Error, or MSE) of a model is composed of three
parts: \(\text{Error} = \text{Variance} +
\text{Bias}^2 + \text{Irreducible Error}\)
Ordinary Least Squares (OLS): Aims to be unbiased
(low bias). However, when you have many predictors (\(p\)), especially if they are correlated, or
if \(p\) is large compared to the
number of samples \(n\) (\(p \approx n\) or \(p > n\)), the OLS model becomes highly
unstable. A small change in the training data can cause the
coefficients to change wildly. This is high variance.
(See Slide 6, “Remarks”).
Ridge Regression: By adding a penalty, ridge
intentionally introduces a small amount of
bias (it pulls coefficients away from their “true” OLS
values). In return, it achieves a massive reduction in
variance.
As Slide 3 shows:
The green line (Variance) starts very high for low
\(\lambda\) (left side) and drops
quickly.
The black line (Squared Bias) starts at zero (for
OLS at \(\lambda=0\)) and slowly
increases as \(\lambda\) grows.
The purple line (Test MSE) is the sum of the two.
It’s U-shaped. The goal of ridge is to find the \(\lambda\) (marked by the ‘x’) at the
bottom of this “U,” which gives the lowest possible total
error.
Why Is It
Called “Ridge”? The 3D Spatial Meaning (Slide 5)
This slide explains the problem of collinearity and
the origin of the name.
Left Plot (Least Squares): Imagine a model with two
correlated predictors, \(\beta_1\) and
\(\beta_2\). The y-axis (SS1) is the
error (RSS). Because the predictors are correlated, there isn’t one
single “point” that is the minimum. Instead, there’s a long, flat
valley or trough (marked “unstable”). Many different
combinations of \(\beta_1\) and \(\beta_2\) along this valley give a
similarly low error. The OLS solution is unstable because it can pick
any point in this flat-bottomed valley.
Right Plot (Ridge): The ridge objective function
adds a penalty term: \(\lambda(\beta_1^2 +
\beta_2^2)\). This penalty term, by itself, is a perfect circular
bowl centered at (0,0). When you add this “bowl” to the OLS “valley,” it
stabilizes the function. It pulls the minimum towards (0,0) and
creates a single, stable, well-defined minimum.
The “Ridge” Name: The penalty \(\lambda\mathbf{I}\) (from the matrix
formula) adds a “ridge” of values to the diagonal of the \(\mathbf{X}^T\mathbf{X}\) matrix, which
geometrically turns the unstable flat valley into a stable bowl.
Mathematical Formulas
The key difference between OLS and Ridge is the function they try to
minimize.
OLS Objective Function: Minimize the Residual
Sum of Squares (RSS). \[\text{RSS} =
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}
\right)^2\]
Ridge Objective Function (Slide 6): Minimize the
RSS plus an L2 penalty term. \[\text{Minimize: } \left[ \sum_{i=1}^{n} \left(
y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \right] +
\lambda \sum_{j=1}^{p} \beta_j^2\]
\(\lambda\) is the tuning
parameter controlling the penalty strength.
\(\sum_{j=1}^{p} \beta_j^2\) is the
L2-norm (squared) of the coefficients. It penalizes
large coefficients.
L2 Norm (Slide 1): The L2 norm of a vector \(\mathbf{a}\) is its standard Euclidean
length. The plot on Slide 1 uses this to show the total
magnitude of the ridge coefficients. \[\|\mathbf{a}\|_2 = \sqrt{\sum_{j=1}^p
a_j^2}\]
Matrix Solution (Slide 6): This is the
“closed-form” solution for the ridge coefficients \(\hat{\beta}^R\). \[\hat{\beta}^R = (\mathbf{X}^T\mathbf{X} +
\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]
\(\mathbf{I}\) is the identity
matrix.
The term \(\lambda\mathbf{I}\) is
what stabilizes the \(\mathbf{X}^T\mathbf{X}\) matrix, making it
invertible even if it’s singular (due to \(p
> n\) or collinearity).
Walkthrough
of the “Credit Data” Example (All Slides)
Here is the logical story of the R code, from start to finish.
Step 1: Data Preparation (Slide
8)
x=scale(model.matrix(Balance~., Credit)[,-1])
model.matrix(...) creates the predictor matrix
x.
scale(...) is critically important. It
standardizes all predictors to have a mean of 0 and a standard deviation
of 1. This is necessary because the ridge penalty \(\lambda \sum \beta_j^2\) is
unit-dependent. If Income (in 10,000s) and
Cards (1-10) were unscaled, the penalty would unfairly
crush the Income coefficient. Scaling puts all predictors
on a level playing field.
y=Credit$Balance
This sets the y (target) variable.
Step 2: Fit the Ridge Model
(Slide 8)
grid=10^seq(4,-2,length=100)
This creates a grid of 100 \(\lambda\) values to test, ranging from
\(10^4\) (a huge penalty) down to \(10^{-2}\) (a tiny penalty).
ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
This is the main command. It fits a separate ridge model
for every single \(\lambda\)
in the grid.
alpha=0 is the specific command that tells
glmnet to perform Ridge Regression.
(Setting alpha=1 would be LASSO).
coef(ridge.mod)[,50]
This inspects the model. It pulls out the vector of coefficients for
the 50th \(\lambda\) in the grid (which
is \(\lambda=10.72\)).
These plots all show the same thing: how the coefficients change as
\(\lambda\) changes.
Slide 9 Plot: This plots the standardized
coefficients for 4 predictors (Income, Limit,
Rating, Student) against the index (1
to 100). Index 1 (left) is the largest \(\lambda\), and index 100 (right) is the
smallest \(\lambda\) (closest to OLS).
You can see the coefficients “grow” from 0 as the penalty (\(\lambda\)) gets smaller.
Slide 1 (Left Plot): This is the same plot
as Slide 9, but more professional. It plots the coefficients against
\(\lambda\) on a log scale. You can
clearly see all coefficients (gray lines) being “shrunk” toward zero as
\(\lambda\) increases (moves right).
The key predictors (Income, Rating, etc.) are
highlighted.
Slide 1 (Right Plot): This is the exact same
data again, but with a different x-axis: \(\|\hat{\beta}_\lambda^R\|_2 /
\|\hat{\beta}\|_2\).
1.0 on the right means \(\lambda=0\). The ratio of the ridge norm to
the OLS norm is 1 (they are the same).
0.0 on the left means \(\lambda=\infty\). The ridge coefficients
are all 0, so their norm is 0.
This axis shows the “fraction” of the full OLS coefficient magnitude
that the model is using.
Slide 4 Plot: This plots the total L2 norm
of all coefficients (\(\|\hat{\beta}_\lambda^R\|_2\)) against the
index. As the index goes from 1 to 100 (i.e., \(\lambda\) gets smaller), the total
magnitude of the coefficients gets larger, which is exactly what we
expect.
Step 4: Find the Best \(\lambda\) using Cross-Validation (Slides 4 & 7)
We have 100 models. Which one is best?
The “Manual” Way (Slide 4):
The code splits the data into a train and
test set.
It fits a model only on the train set.
It tests two \(\lambda\) values:
s=4: Gives a test MSE of 10293.33.
s=10: Gives a test MSE of 168981.1 (much
worse!).
This shows that \(\lambda=4\) is
better than \(\lambda=10\), but we
don’t know if it’s the best.
The “Automatic” Way (Slide 7):
cv.out=cv.glmnet(x[train,], y[train], alpha=0)
This runs 10-fold Cross-Validation on the training
set. It automatically splits the training set into 10 “folds,” trains on
9, tests on 1, and repeats this 10 times for every \(\lambda\).
The Plot: The plot on this slide is the result. It
shows the average MSE (y-axis) for each \(\log(\lambda)\) (x-axis). This is the
real-data version of the theoretical purple curve from Slide
3.
bestlam=cv.out$lambda.min
This command finds the \(\lambda\)
at the very bottom of the U-shaped curve. The output shows
bestlam is 41.6.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# --- 1. Load and Prepare Data (like Slide 8) ---
# Assuming 'Credit' is a pandas DataFrame
# X = Credit.drop('Balance', axis=1)
# y = Credit['Balance']
# ... (need to handle categorical variables first, e.g., with pd.get_dummies) ...
# For this example, let's assume X and y are already loaded and numeric.

# Standardize the predictors, like scale() in the R code
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- 2. Train/Test Split (like Slide 4) ---
# test_size=0.5 and random_state=1 mimic the R code
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.5, random_state=1
)

# --- 3. Find Best Lambda (alpha) with Cross-Validation (like Slide 7) ---
# Create the same log-spaced grid of lambdas (sklearn calls it 'alpha')
lambda_grid = np.logspace(4, -2, 100)

# RidgeCV performs cross-validation to find the best alpha.
# store_cv_values=True is needed to plot the CV error curve, and it only
# works with the default efficient leave-one-out CV (cv=None), so we use
# that here instead of 10-fold CV. (Newer scikit-learn versions rename it
# to store_cv_results / cv_results_.)
cv_model = RidgeCV(alphas=lambda_grid, store_cv_values=True)
cv_model.fit(X_train, y_train)

# Get the best lambda found
best_lambda = cv_model.alpha_
print(f"Best lambda (alpha) found by CV: {best_lambda}")

# Plot the CV error curve (like Slide 7 plot)
# cv_model.cv_values_ has shape (n_samples, n_alphas);
# we average over the samples for each alpha
mse_path = np.mean(cv_model.cv_values_, axis=0)
plt.figure()
plt.plot(np.log10(lambda_grid), mse_path, marker='o')
plt.xlabel("Log(lambda)")
plt.ylabel("Mean Squared Error")
plt.title("Cross-Validation Error Path")
plt.show()

# --- 4. Evaluate on Test Set (like Slide 7) ---
# 'cv_model' is already refit on the full training set using the best lambda
test_pred = cv_model.predict(X_test)
final_test_mse = mean_squared_error(y_test, test_pred)
print(f"Final Test MSE with best lambda: {final_test_mse}")

# --- 5. Get Final Coefficients (like Slide 7, bottom) ---
# The coefficients from the CV-trained model (X is assumed to be a
# DataFrame, so X.columns gives the feature names):
print(f"Intercept: {cv_model.intercept_}")
print("Coefficients:")
for coef, feature in zip(cv_model.coef_, X.columns):
    print(f"  {feature}: {coef}")

# --- 6. Plot the Solution Path (like Slide 1) ---
# To do this, we fit a Ridge model for each lambda and store the coefficients
coefs = []
for lam in lambda_grid:
    model = Ridge(alpha=lam)
    model.fit(X_scaled, y)   # fit on all data
    coefs.append(model.coef_)
These slides cover Shrinkage Methods, also known as
Regularization, which are techniques used to improve on
the standard least squares model, particularly when dealing with many
variables or multicollinearity. The main focus is on
LASSO regression.
Key Mathematical Formulas
The slides present two main, but equivalent, ways to formulate these
methods.
1. Penalized Formulation (Slide
1)
This is the most common formulation. The goal is to minimize a
function that is a combination of the Residual Sum of Squares
(RSS) and a penalty term. The penalty
discourages large coefficients.
LASSO (Least Absolute Shrinkage and Selection
Operator): The goal is to find coefficients (\(\beta_0, \beta_j\)) that minimize: \[\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p}
\beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} |\beta_j|\]
Penalty: The \(L_1\) norm (\(\|\beta\|_1\)), which is the sum of the
absolute values of the coefficients.
Key Property: This penalty can force some
coefficients to be exactly zero, effectively performing
automatic variable selection.
2. Constrained Formulation
(Slide 2)
This alternative formulation minimizes the RSS subject to a
constraint (a “budget”) on the size of the coefficients.
For Lasso: Minimize RSS subject to: \[\sum_{j=1}^{p} |\beta_j| \le s\] (The sum
of the absolute values of the coefficients must be less than some budget
\(s\).)
For Ridge: Minimize RSS subject to: \[\sum_{j=1}^{p} \beta_j^2 \le s\] (The sum
of the squares of the coefficients (\(L_2\) norm) must be less than \(s\).)
Equivalence (Slide 3): For any penalty value \(\lambda\) used in the first formulation,
there is a corresponding budget \(s\)
in the second formulation that will give the exact same set of
coefficients. \(\lambda\) and \(s\) are inversely related: a large \(\lambda\) (high penalty) corresponds to a
small \(s\) (small budget).
Important Plots and
Interpretation
Your slides show the two most important plots for understanding and
using LASSO.
1. The Cross-Validation
(CV) Plot (Slide 5)
This plot is crucial for choosing the best tuning parameter
(\(\lambda\)).
X-axis:\(\text{Log}(\lambda)\). This is the penalty
strength.
Right side (high \(\lambda\)): High penalty, simple
model (many coefficients are 0), high bias, high Mean-Squared Error
(MSE).
Left side (low \(\lambda\)): Low penalty, complex
model (like standard linear regression), high variance, MSE starts to
increase (overfitting).
Y-axis: Mean-Squared Error (MSE) from
cross-validation.
Goal: Find the \(\lambda\) at the bottom of the “U”
shape, which gives the lowest MSE. This is the optimal
trade-off between bias and variance. The top axis shows how many
variables are included in the model at each \(\lambda\).
2. The Coefficient Path Plot
(Slide 6)
This plot is the best visualization for understanding what
LASSO does.
Left Plot (vs. \(\lambda\)):
X-axis: The penalty strength \(\lambda\).
Y-axis: The standardized value of each
coefficient.
How to read it: Start from the
right (high \(\lambda\)). All coefficients are 0. As you
move left, \(\lambda\) decreases, and the penalty is relaxed. Variables “enter” the
model one by one (their coefficients become non-zero). You can see that
‘Rating’, ‘Income’, and ‘Student’ are the most important variables, as
they are the first to become non-zero.
Right Plot (vs. \(L_1\)
Norm Ratio):
This shows the exact same information as the left plot, but the
x-axis is reversed and rescaled. An axis value of 0.0 means full penalty
(all \(\beta=0\)), and 1.0 means no
penalty.
Code Understanding (R to
Python)
The slides use the glmnet package in R. The equivalent
and most popular library in Python is scikit-learn.
1. Finding the Best \(\lambda\) (CV)
The R code cv.out=cv.glmnet(x[train,],y[train],alpha=1)
performs cross-validation to find the best \(\lambda\).
Python Equivalent: Use LassoCV. It
does the same thing: tests many \(\lambda\) values (called
alphas in scikit-learn) and picks the best one.
from sklearn.linear_model import LassoCV

# Create the LassoCV object
# cv=5 means 5-fold cross-validation
lasso_cv = LassoCV(cv=5, random_state=0)

# Fit the model to the training data
lasso_cv.fit(X_train, y_train)

# Get the best lambda (called alpha_ in sklearn)
best_lambda = lasso_cv.alpha_
print(f"Best lambda (alpha): {best_lambda}")

# Get the MSEs
# This is what's plotted in the CV plot
print(lasso_cv.mse_path_)
2.
Fitting with the Best \(\lambda\) and
Getting Coefficients
The R code
lasso.coef=predict(out,type="coefficients",s=bestlam) gets
the coefficients for the best \(\lambda\).
Python Equivalent: The LassoCV object
is already refitted on the full training data using the best
\(\lambda\). You can also fit a new
Lasso model with that specific \(\lambda\).
Formulation 1: The Penalized Method (Slide 1)
LASSO finds the coefficients that minimize \[\sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \beta)^2 + \lambda \|\beta\|_1.\] The objective has two terms:
\(\sum (y_i - \mathbf{x}_i^T \beta)^2\): This is the normal Residual Sum of Squares (RSS). We want to make this small (fit the data well).
\(\lambda
\|\beta\|_1\): This is the \(L_1\) penalty.
\(\|\beta\|_1 = \sum_{j=1}^{p}
|\beta_j|\) is the sum of the absolute values of the
coefficients.
\(\lambda\) (lambda) is a tuning
parameter. Think of it as a “penalty knob”.
How to think about \(\lambda\):
If \(\lambda =
0\): There is no penalty. This is just standard Ordinary
Least Squares (OLS) regression. The model will likely overfit.
If \(\lambda\) is
small: There’s a small penalty. Coefficients will
shrink a little bit.
If \(\lambda\) is very
large: The penalty is severe. The only way to
make the penalty term small is to make the coefficients (\(\beta\)) themselves small. The model will
eventually shrink all coefficients to exactly 0.
Formulation 2:
The Constrained Method (Slides 2 & 3)
This says: “Find the best-fitting model (minimize RSS) but
you have a limited ‘budget’ \(s\) for the total size of your
coefficients.”
If \(s\) is very
large: The budget is huge. This constraint does nothing.
You get the standard OLS solution.
If \(s\) is
small: The budget is tight. You must shrink
your coefficients to stay under the budget \(s\). To get the best fit, the model will be
forced to set unimportant coefficients to 0 and only “spend” its budget
on the most important variables.
The Equivalence: These two forms are equivalent. For
any \(\lambda\) you pick, there’s a
corresponding budget \(s\) that gives
the exact same solution.
High \(\lambda\) (strong penalty)
\(\iff\) Small \(s\) (tight budget)
Low \(\lambda\) (weak penalty)
\(\iff\) Large \(s\) (loose budget)
This equivalence is why you see plots with both \(\lambda\) and \(L_1\) Norm on the x-axis. They are just two
different ways of looking at the same “penalty” spectrum.
Detailed Plot & Code
Analysis
Let’s look at the plots and code, which answer the practical
questions: (1) How do we pick the best \(\lambda\)? and (2) What
does LASSO do to the coefficients?
Question 1: How
to pick the best \(\lambda\)? (Slide
5)
This is the Cross-Validation (CV) Plot. Its one and
only job is to help you find the optimal \(\lambda\).
R Code:cv.out=cv.glmnet(x[train,],y[train],alpha=1)
cv.glmnet: This R function automatically does
K-fold cross-validation. alpha=1 explicitly tells it to use
LASSO (alpha=0 would be Ridge).
It tries a whole range of \(\lambda\) values, calculates the
Mean-Squared Error (MSE) for each, and stores the results in
cv.out.
Plot Analysis:
X-axis:\(\text{Log}(\lambda)\). The penalty
strength. Right = High Penalty (simple model),
Left = Low Penalty (complex model).
Y-axis: Mean-Squared Error (MSE). Lower is
better.
Red Dots: The average MSE for each \(\lambda\).
Gray Bars: The error bars (standard error).
The “U” Shape: This is the classic
bias-variance trade-off.
Right Side (High \(\lambda\)): The model is too
simple (too many coefficients are 0). It’s “underfitting.” The
error is high (high bias).
Left Side (Low \(\lambda\)): The model is too complex (low penalty, like OLS). It’s “overfitting” the training data. The error on new data is high (high variance).
Bottom of the “U”: This is the “sweet spot.” The
\(\lambda\) at the very bottom (marked
by the left vertical dotted line) gives the lowest possible
MSE. This is lambda.min.
Answer: You pick the \(\lambda\) that corresponds to the lowest
point on this graph.
Question 2: What
does LASSO do? (Slides 5, 6, 7)
These slides all show the effect of LASSO.
A. The Coefficient Path Plots (Slides 5 & 6)
These plots visualize how coefficients change. They show the same
information just with different x-axes.
Left Plot (Slide 6) vs. \(\lambda\):
How to read: Read from RIGHT to
LEFT.
At the far right (\(\lambda\) is
large), all coefficients are 0.
As you move left, \(\lambda\) gets
smaller, and the penalty is relaxed. Variables “enter” the model one by
one as their coefficients become non-zero.
You can see ‘Rating’ (red-dashed), ‘Student’ (black-solid), and
‘Income’ (blue-dotted) are the first to enter, suggesting they are the
most important predictors.
Right Plot (Slide 6) vs. \(L_1\) Norm Ratio:
This is the same plot, just flipped and rescaled. The
x-axis is \(\|\hat{\beta}_\lambda\|_1 /
\|\hat{\beta}_{OLS}\|_1\).
How to read: Read from LEFT to
RIGHT.
At 0.0: This is a “0% budget” (like \(s=0\) or \(\lambda=\infty\)). All coefficients are
0.
At 1.0: This is a “100% budget” (like \(s=\infty\) or \(\lambda=0\)). This is the full OLS
model.
This view clearly shows the coefficients “growing” from 0 as their
“budget” (\(L_1\) Norm) increases.
B. The Code Output (Slide 7) - This is the most important
“answer”
This slide explicitly demonstrates variable selection by
comparing the coefficients from two different \(\lambda\) values.
First Block (The “Optimal” Model):
bestlam.cv <- cv.out$lambda.min: This gets the \(\lambda\) from the bottom of the “U” in the
CV plot.
lasso.conf <- predict(out,type="coefficients",s=bestlam.cv)[1:12,]:
This gets the coefficients using that best \(\lambda\).
lasso.conf[lasso.conf!=0]: This R command filters the
list to show only the non-zero coefficients.
Result: The optimal model still keeps 10
variables (‘Income’, ‘Limit’, ‘Rating’, etc.). It has shrunk them,
but it hasn’t set many to 0.
Second Block (The “High Penalty” Model):
The slide text says “if we choose a larger regularization
parameter.” Here, they’ve picked an arbitrary larger value,
s=10. (Note: R’s predict.glmnet can be
confusing; s=10 here means \(\lambda=10\)).
lasso.conf <- predict(out,type="coefficients",s=10)[1:12,]:
This gets the coefficients using a stronger penalty (\(\lambda=10\)).
lasso.conf[lasso.conf!=0]: Again, show only the
non-zero coefficients.
Result: Look! The list is much shorter. The
coefficients for ‘Age’, ‘Education’, ‘GenderFemale’, ‘MarriedYes’, and
‘Ethnicity’ are all gone (shrunk to 0.000000). The model has
decided these are not important enough to “spend” budget on.
Conclusion: LASSO performs automatic
variable selection. By increasing \(\lambda\), you create a
sparser (simpler) model. Slide 7 is the concrete
proof.
Python Equivalents (in more
detail)
Here is how you would replicate the entire workflow from the
slides in Python.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV, lasso_path
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# --- Assume X_train, y_train, X_test, y_test are loaded ---
# Example:
# data = pd.read_csv('Credit.csv')
# X = pd.get_dummies(data.drop(['ID', 'Balance'], axis=1), drop_first=True)
# y = data['Balance']
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# It's CRITICAL to scale data before regularization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
feature_names = X_train.columns   # X_train is assumed to be a DataFrame

# 1. Replicate the CV Plot (Slide 5: ...000200.png)
# LassoCV does what cv.glmnet does: finds the best lambda (alpha)
print("Running LassoCV to find best lambda (alpha)...")
# 'alphas' is the list of lambdas to try. We can let it choose automatically.
# cv=10 means 10-fold cross-validation.
lasso_cv = LassoCV(cv=10, random_state=1, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)

# The best lambda found
best_lambda = lasso_cv.alpha_
print(f"Best lambda (alpha) found: {best_lambda}")

# --- Plotting the CV (MSE vs. Log(Lambda)) ---
# This recreates the R plot
plt.figure(figsize=(10, 6))
# lasso_cv.mse_path_ is a (n_alphas, n_folds) array of MSEs;
# we take the mean across the folds (axis=1)
mean_mses = np.mean(lasso_cv.mse_path_, axis=1)
log_lambdas = np.log10(lasso_cv.alphas_)

plt.plot(log_lambdas, mean_mses, 'r.-')
plt.xlabel('Log(Lambda / Alpha)')
plt.ylabel('Mean-Squared Error')
plt.title('LASSO Cross-Validation Path (Replicating R Plot)')
# Plot a vertical line at the best lambda
plt.axvline(np.log10(best_lambda), linestyle='--', color='k',
            label=f'Best Lambda (alpha) = {best_lambda:.2f}')
plt.legend()
plt.gca().invert_xaxis()  # High lambda is on the right in the R plot
plt.show()

# 2. Replicate the Coefficient Path Plot (Slide 6: ...000206.png)
# We can use the lasso_path function, or just use the CV object.
# The lasso_cv object already knows the alpha grid it used:
coefs = lasso_cv.path(X_train_scaled, y_train, alphas=lasso_cv.alphas_)[1].T

plt.figure(figsize=(10, 6))
for i in range(X_train_scaled.shape[1]):
    plt.plot(log_lambdas, coefs[:, i], label=feature_names[i])

# 3. Replicate the Code Output (Slide 7: ...000202.png)
print("\n--- Replicating R Output ---")

# --- First Block: Coefficients with BEST lambda ---
print(f"Coefficients using best lambda (alpha = {best_lambda:.4f}):")
# The lasso_cv object is already fitted with the best lambda
best_coefs = lasso_cv.coef_
coef_series_best = pd.Series(best_coefs, index=feature_names)
# This is like R's `lasso.conf[lasso.conf != 0]`
print(coef_series_best[coef_series_best != 0])

# --- Second Block: Coefficients with a LARGER lambda ---
# Let's pick a larger lambda, e.g., 10 (like the slide)
large_lambda = 10
lasso_high_penalty = Lasso(alpha=large_lambda)
lasso_high_penalty.fit(X_train_scaled, y_train)

print(f"\nCoefficients using larger lambda (alpha = {large_lambda}):")
high_pen_coefs = lasso_high_penalty.coef_
coef_series_high = pd.Series(high_pen_coefs, index=feature_names)
# This is the second R command: `lasso.conf[lasso.conf != 0]`
print(coef_series_high[coef_series_high != 0])

# --- Final Prediction ---
# This is R's `mean((lasso.pred-y.test)^2)`
y_pred = lasso_cv.predict(X_test_scaled)
test_mse = mean_squared_error(y_test, y_pred)
print(f"\nTest MSE using best lambda: {test_mse:.2f}")
The “Game” of Regularization
First, let’s understand what these plots are showing. This is a “map”
of a constrained optimization problem.
The Red Ellipses (RSS Contours): Think of these as
contour lines on a topographic map.
The Center (\(\hat{\beta}\)): This point is the
“bottom of the valley.” It represents the perfect,
unconstrained solution—the standard Ordinary Least Squares (OLS)
coefficients. This point has the lowest possible Residual Sum of Squares
(RSS), or error.
The Lines: Every point on a single red ellipse has
the exact same RSS. As the ellipses get bigger (moving away
from the center \(\hat{\beta}\)), the
error gets higher.
The Blue Shaded Area (Constraint Region): This is
the “rule” of the game.
This is our “budget.” We are only allowed to pick a
solution (\(\beta_1, \beta_2\)) from
inside or on the boundary of this blue shape.
LASSO: The constraint is \(|\beta_1| + |\beta_2| \le s\). This
equation forms a diamond (or a rotated square).
Ridge: The constraint is \(\beta_1^2 + \beta_2^2 \le s\). This
equation forms a circle.
The Goal: Find the “best” point that is inside
the blue area.
The “best” point is the one with the lowest possible error
(RSS).
Geometrically, this means we start at the center (\(\hat{\beta}\)) and expand our ellipse
outward. The very first point where the ellipse
touches the blue constraint region is our
solution.
Why LASSO
Performs Variable Selection (The Diamond) 🎯
This is the most important concept. Look at the LASSO diagrams.
The Shape: The LASSO constraint is a
diamond.
The Key Feature: This diamond has sharp
corners (vertices). And most importantly, these corners lie
exactly on the axes.
The top corner is at \((\beta_1=0,
\beta_2=s)\).
The right corner is at \((\beta_1=s,
\beta_2=0)\).
The “Collision”: Now, imagine the red ellipses
(representing our error) expanding from the OLS solution (\(\hat{\beta}\)). They will almost always
“hit” the blue diamond at one of its sharp corners.
Look at your textbook diagram (slide ...000304.png).
The ellipse clearly makes contact with the diamond at the top corner,
where \(\beta_1 = 0\).
Look at your example (slide ...000259.jpg). The center
of the ellipses is at (4, 0.1). The closest point on the diamond that
the expanding ellipses will hit is the corner at (2, 0). At this
solution, \(y\) is exactly
0.
Conclusion: Because the \(L_1\) “diamond” has corners on the axes,
the optimal solution is very likely to land on one of them. When it
does, the coefficient for the other axis is set to
exactly zero. This is the variable selection
property.
Why Ridge
Regression Only Shrinks (The Circle) 🤏
Now, look at the Ridge regression diagram.
The Shape: The Ridge constraint is a
circle.
The Key Feature: A circle is perfectly smooth and
has no corners.
The “Collision”: Imagine the same ellipses
expanding and hitting the blue circle. The contact point will be a
tangent point.
Because the circle is round, this tangent point can be
anywhere on its circumference.
It is extremely unlikely that the contact point will be
exactly on an axis (e.g., at \((\beta_1=0,
\beta_2=s)\)). This would only happen if the OLS solution \(\hat{\beta}\) was already
perfectly aligned with that axis.
Conclusion: The Ridge solution will find a point
where both \(\beta_1\) and
\(\beta_2\) are non-zero. The
coefficients are “shrunk” (pulled in from \(\hat{\beta}\) towards the origin), but they
never become zero. This is why Ridge is called a
“shrinkage” method, but not a “variable selection” method.
Summary: Diamond vs. Circle
Constraint shape: LASSO (\(L_1\) norm) uses a diamond (hyper-rhombus); Ridge (\(L_2\) norm) uses a circle (hypersphere).
Key feature: the LASSO diamond has sharp corners on the axes; the Ridge circle is a smooth curve with no corners.
Geometric solution: the RSS ellipses hit the diamond at a corner; they hit the circle at a smooth tangent point.
Result: LASSO forces some coefficients to exactly 0; Ridge shrinks all coefficients towards 0.
Name: LASSO performs variable selection; Ridge performs shrinkage.
The “space meaning” is that the sharp corners of the \(L_1\) diamond are what make variable
selection possible. The smooth circle of the \(L_2\) norm does not have these corners and
thus cannot force coefficients to zero.
8. Shrinkage Methods (Lasso
vs. Ridge)
Core Concept: Shrinkage
Methods
Both Ridge (L2) and Lasso (L1) are
regularization techniques used to improve upon standard Ordinary
Least Squares (OLS) regression.
Their main goal is to manage the bias-variance
tradeoff. OLS often has low bias but very high variance,
especially when you have many predictors (\(p\)) or when predictors are correlated.
Ridge and Lasso improve prediction accuracy by shrinking the
regression coefficients towards zero. This adds a small amount of bias
but significantly reduces the variance, leading to a lower
overall Test Mean Squared Error (MSE).
The Key Difference:
Math & How They Shrink
The slides show that the two methods use different penalties, which
leads to very different mathematical forms and practical outcomes.
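The slide formulas themselves are not reproduced here; in the special case often used to illustrate them (an orthonormal design, with least squares estimates \(\hat{\beta}_j^{LSE}\)), the Ridge estimate reduces to \[\hat{\beta}_j^{R} = \frac{\hat{\beta}_j^{LSE}}{1 + \lambda}.\]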
What this means: Ridge shrinks every least
squares coefficient by a proportional amount. It will make coefficients
smaller, but it will never set them to exactly
zero (unless \(\lambda\) is
\(\infty\)).
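In the same orthonormal special case, the Lasso estimate is \[\hat{\beta}_j^{L} = \text{sign}\bigl(\hat{\beta}_j^{LSE}\bigr)\left(\bigl|\hat{\beta}_j^{LSE}\bigr| - \frac{\lambda}{2}\right)_{+}.\]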
What this means: This is a “soft-thresholding”
operator.
If the original coefficient \(\hat{\beta}_j^{LSE}\) is small (its
absolute value is less than \(\lambda/2\)), Lasso sets it to
exactly zero.
If the coefficient is large, Lasso subtracts \(\lambda/2\) from its absolute value,
shrinking it towards zero.
Key Property: Because of this, Lasso performs
automatic feature selection by eliminating
predictors.
Important Images Explained
Most Important: Figure 6.10
(Slide 82)
This is the best visual for understanding the mathematical
difference from the formulas above.
Left (Ridge): The red line shows the Ridge estimate
vs. the OLS estimate. It’s a straight, diagonal line with a slope less
than 1. It shrinks everything proportionally.
Right (Lasso): The red line shows the Lasso
estimate. It’s “flat” at zero for a range, showing it sets small
coefficients to zero. Then, it slopes up, but it’s shifted (it
shrinks the large coefficients by a fixed amount).
Scenario 1: Figure 6.8 (Slide
76)
This plot shows what happens when all 45 predictors are truly
related to the response.
Result (Slide 77):Ridge performs slightly
better (has a lower minimum MSE, shown by the dotted purple
line).
Why: Lasso’s assumption (that some coefficients are
zero) is wrong in this case. By forcing some relevant
predictors to zero, it adds too much bias. Ridge, by just
shrinking all of them, finds a better balance.
Scenario 2: Figure 6.9 (Slide
78)
This plot shows the opposite scenario: only 2 out of
45 predictors are truly related (a “sparse” model).
Result:Lasso performs much better
(its solid purple line has a much lower minimum MSE).
Why: Lasso’s assumption is correct. It
successfully sets the 43 “noise” predictors to zero, which dramatically
reduces variance, while correctly keeping the 2 important ones.
Python & Code
Understanding
The slides don’t contain Python code, but they describe the exact
concepts you would use, primarily in scikit-learn.
Implementing Ridge & Lasso:
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# It's crucial to scale data before regularization
# alpha is the same as the λ (lambda) in your slides

# --- Ridge ---
# The math for Ridge is a "closed-form solution" (Slide 80)
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# --- Lasso ---
# Lasso has no closed form and is solved iteratively (coordinate descent)
lasso_model = make_pipeline(StandardScaler(), Lasso(alpha=1.0))
The Soft-Thresholding Formula: The math from
Slide 80, \(\text{sign}(y)(|y| -
\lambda/2)_+\), is the core operation in the “coordinate descent”
algorithm used to solve Lasso. You could write it in Python/Numpy:
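A minimal sketch of that operator (the function name soft_threshold is just illustrative):

import numpy as np

def soft_threshold(beta_lse, lam):
    # sign(b) * max(|b| - lam/2, 0), applied element-wise
    return np.sign(beta_lse) * np.maximum(np.abs(beta_lse) - lam / 2.0, 0.0)

# Small coefficients are set to exactly zero; large ones are shifted toward zero
print(soft_threshold(np.array([2.5, 0.3, -1.0, -0.1]), lam=1.0))
# gives approximately [2.0, 0.0, -0.5, -0.0]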
Choosing \(\lambda\)
(alpha): Slide 79 says to “Use cross validation to determine
which one has better prediction.” In scikit-learn, this is
done for you with RidgeCV and LassoCV, which
automatically test a range of alpha values.
Summary: Lasso vs. Ridge
Penalty: Ridge uses the \(L_2\) norm, \(\lambda \sum \beta_j^2\); Lasso uses the \(L_1\) norm, \(\lambda \sum |\beta_j|\).
Coefficient shrinkage: Ridge shrinks all coefficients proportionally but never to exactly zero; Lasso applies soft-thresholding and can force coefficients to be exactly zero.
Feature selection: Ridge, no; Lasso, yes (this is its main advantage).
Interpretability: Ridge is less interpretable (keeps all \(p\) variables); Lasso is more interpretable (produces a “sparse” model with fewer variables).
Best used when: Ridge, when most predictors are useful (e.g., Slide 76: 45/45 relevant); Lasso, when many predictors are “noise” and only a few are strong (e.g., Slide 78: 2/45 relevant).
These slides introduce shrinkage methods, also known
as regularization, a technique used in regression (like
linear regression) to improve model performance. The main idea is to add
a penalty to the model’s loss function to “shrink” the size of
the coefficients. This helps to reduce model variance and prevent
overfitting, especially when you have many features.
The two main methods discussed are Ridge Regression
(\(L_2\) penalty) and
LASSO (\(L_1\)
penalty).
Key Mathematical Formulas
Standard Linear Model: The problem starts with
the standard linear regression model (from slide 1):
\[\mathbf{y} = \mathbf{X}\beta + \epsilon\]
\(\mathbf{y}\) is the \(n \times 1\) vector of observed outcomes.
\(\mathbf{X}\) is the \(n \times p\) matrix of \(p\) predictor features for \(n\) observations.
\(\beta\) is the \(p \times 1\) vector of coefficients (what
we want to find).
\(\epsilon\) is the \(n \times 1\) vector of random errors.
The goal of standard “Ordinary Least Squares” (OLS) regression is to
find the \(\beta\) that minimizes the
loss: \(\|\mathbf{X}\beta -
\mathbf{y}\|^2_2\).
LASSO (L1 Regularization): LASSO (Least Absolute
Shrinkage and Selection Operator) adds a penalty based on the
absolute value of the coefficients (the \(L_1\)-norm). This is the key formula from
slide 1:
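Reconstructed in the notation used just above (the slide itself is not reproduced here), the LASSO objective is \[\hat{\beta}^{L} = \arg\min_{\beta}\; \|\mathbf{X}\beta - \mathbf{y}\|_2^2 + \lambda \|\beta\|_1, \qquad \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|.\]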
\(\lambda\) (lambda) is the
tuning parameter that controls the strength of the
penalty. A larger \(\lambda\) means
more shrinkage.
Key Property (Variable Selection): The \(L_1\) penalty can force some coefficients
(\(\beta_j\)) to become exactly
zero. This means LASSO simultaneously performs feature
selection by automatically removing irrelevant predictors.
Support (Slide 1): The question “Can it recover the
support of \(\beta\)?” is asking if
LASSO can correctly identify the set of true non-zero coefficients
(defined as \(S := \{j : \beta_j \neq
0\}\)).
Ridge Regression (L2 Regularization): Ridge
regression (mentioned on slide 2, shown on slide 3) adds a penalty based
on the squared value of the coefficients (the \(L_2\)-norm).
Key Property (Shrinkage): The \(L_2\) penalty shrinks coefficients
towards zero but never sets them to
exactly zero (unless \(\lambda =
\infty\)). It is effective at handling multicollinearity.
Important Images & Concepts
The most important images are the plots from slides 3 and 4. They
illustrate the two most critical concepts: how to choose \(\lambda\) and what the
penalty does to the coefficients.
Tuning
Parameter Selection (Slides 3 & 4, Left Plots)
Problem: How do you find the best value
for \(\lambda\)?
Solution: Cross-Validation (CV).
The slides show 10-fold CV.
What the Plots Show: The left plots on slides 3 and
4 show the Cross-Validation Error (like MSE) for
different values of the penalty.
The x-axis represents the penalty strength (either \(\lambda\) itself or a related measure like
the shrinkage ratio \(\|\hat{\beta}_\lambda\|_1 /
\|\hat{\beta}\|_1\)).
The y-axis is the prediction error.
The curve is typically U-shaped. The vertical
dashed line marks the minimum of this curve. This
minimum point corresponds to the optimal \(\lambda\), which provides the best
balance between bias and variance, leading to the best-performing model
on unseen data.
Coefficient Paths
(Slides 3 & 4, Right Plots)
These “trace” plots are crucial for understanding the difference
between Ridge and LASSO. They show how the value of each coefficient
(y-axis) changes as the penalty strength (x-axis) changes.
Slide 3 (Ridge): As \(\lambda\) increases (moving right), all
coefficient values are smoothly shrunk towards zero, but none
of them actually hit zero.
Slide 4 (LASSO): As the penalty increases (moving
from right to left, as the ratio \(s\)
goes from 1.0 to 0.0), you can see coefficients “drop off” and become
exactly zero one by one. The model with the optimal
\(\lambda\) (vertical line) has
selected only a few non-zero coefficients (the pink and teal lines),
while all the grey lines have been set to zero. This is feature
selection in action.
Key Discussion Points (Slide
2)
Non-linear models: You can apply these methods to
non-linear models by first creating non-linear features (e.g., \(x_1^2\), \(x_2^2\), \(x_1
\cdot x_2\)) and then feeding them into a LASSO or Ridge model.
The regularization will then select which of these linear or
non-linear terms are important.
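As a rough illustration of that idea (not from the slides), here is a sketch that expands two generic features into polynomial terms and lets Lasso decide which terms survive; the data, feature names, and alpha value below are made up:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline

# Made-up data: y depends on x1^2 only, not on x2 or the interaction
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

# Expand to [x1, x2, x1^2, x1*x2, x2^2], scale, then let Lasso pick the terms
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Lasso(alpha=0.1),
)
model.fit(X, y)
names = model.named_steps["polynomialfeatures"].get_feature_names_out(["x1", "x2"])
print(dict(zip(names, np.round(model.named_steps["lasso"].coef_, 2))))
# Typically only the x1^2 term keeps a sizable coefficient; the rest shrink to ~0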
Correlated Features (Multicollinearity): The
question “If \(x_j \approx x_k\), how
does LASSO behave?” is a key weakness of LASSO.
LASSO: Tends to arbitrarily select one of
the correlated features and set the others to zero. This can make the
model unstable.
Ridge: Tends to shrink the coefficients of
correlated features together, giving them similar (but smaller)
values.
Elastic Net (not shown) is a hybrid of Ridge and
LASSO that is often used to get the best of both worlds: it can select
groups of correlated variables.
Python Code
Understanding (using scikit-learn)
Here is how you would implement these concepts in Python.
# Import necessary libraries
import numpy as np
from sklearn.linear_model import Lasso, Ridge, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# --- Assume you have your data ---
# X: your feature matrix (e.g., shape 100, 20)
# y: your target vector (e.g., shape 100,)
# X, y = ... load your data ...

# 1. It's crucial to scale your data before regularization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Find the optimal lambda (alpha) using Cross-Validation
# scikit-learn uses 'alpha' instead of 'lambda' for the tuning parameter.

# --- For LASSO ---
# LassoCV automatically performs cross-validation (e.g., cv=10)
# to find the best alpha.
lasso_cv_model = LassoCV(cv=10, random_state=0)
lasso_cv_model.fit(X_scaled, y)

# Get the best alpha (lambda)
best_alpha_lasso = lasso_cv_model.alpha_
print(f"Optimal alpha (lambda) for LASSO: {best_alpha_lasso}")

# Get the final coefficients
lasso_coeffs = lasso_cv_model.coef_
print(f"LASSO coefficients: {lasso_coeffs}")
# You will see that many of these are exactly 0.0

# --- For Ridge ---
# RidgeCV works similarly. It's often good to test alphas on a log scale.
ridge_alphas = np.logspace(-3, 3, 100)  # 100 values from 0.001 to 1000
ridge_cv_model = RidgeCV(alphas=ridge_alphas, store_cv_values=True)
ridge_cv_model.fit(X_scaled, y)

# Get the best alpha (lambda)
best_alpha_ridge = ridge_cv_model.alpha_
print(f"Optimal alpha (lambda) for Ridge: {best_alpha_ridge}")

# Get the final coefficients
ridge_coeffs = ridge_cv_model.coef_
print(f"Ridge coefficients: {ridge_coeffs}")
# You will see these are small, but not exactly zero.
Bias-variance tradeoff
Key Mathematical Formulas
& Concepts
LASSO: Sign Consistency
This is the “ideal” scenario for LASSO. Sign consistency means that,
with enough data, the LASSO model not only selects the correct
set of features (it recovers the “support” \(S\)) but also correctly identifies the
sign (positive or negative) of their coefficients.
\[
\left\| \mathbf{X}_{S^c}^\top \mathbf{X}_S \left(\mathbf{X}_S^\top \mathbf{X}_S\right)^{-1} \right\|_\infty < 1
\]
Plain English: This condition (the "irrepresentable condition," written here in its standard form) is a complex way of saying: the irrelevant features (\(\mathbf{X}_{S^c}\)) cannot be too strongly correlated with the true, relevant features (\(\mathbf{X}_S\)).
If an irrelevant feature is very similar (highly correlated) to a
true feature, LASSO can get “confused” and might pick the wrong one, or
its estimate will be unstable. This condition fails.
Ridge Regression: The Ridge objective uses the squared penalty instead:
\[
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \|\mathbf{X}\beta - \mathbf{y}\|_2^2 + \lambda \|\beta\|_2^2
\]
(Note: This is the \(L_2\) penalty, so \(\|\beta\|_2^2 = \sum \beta_j^2\).)
The Problem it Solves: Collinearity (Slide 2)
When features are strongly correlated (e.g., \(x_i \approx x_j\)), regular methods
fail:
LSE (OLS): Fails because the matrix \(\mathbf{X}^\top \mathbf{X}\) is
“non-invertible” (or singular), so the math for the solution \(\hat{\beta} = (\mathbf{X}^\top
\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) breaks down.
LASSO: Fails because the Irrepresentable
Condition is violated. LASSO will tend to arbitrarily
pick one of the correlated features and set the others to zero.
The Ridge Solution (Slide 3):
Always has a solution: Adding the \(\lambda\) penalty makes the matrix math
work, even if \(\mathbf{X}^\top
\mathbf{X}\) is non-invertible.
Groups variables: This is the key takeaway. Instead
of arbitrarily picking one feature, Ridge tends to shrink the
coefficients of collinear variables together.
Bias-Variance Tradeoff: Ridge introduces
bias into the estimates (they are “wrong” on purpose) to
massively reduce variance (they are more stable and less
sensitive to the specific training data). This trade-off usually leads
to a much lower overall error (Mean Squared Error).
Important Images & Key
Takeaways
Slide 2 (Collinearity Failures): This is the
most important “problem” slide. It clearly explains why you
can’t always use standard LSE or LASSO. The fact that all three methods
(LSE, LASSO, Forward Selection) fail with strong collinearity motivates
the need for Ridge.
Slide 3 (Ridge Properties): This is the most
important “solution” slide. The two most critical points are:
Always unique solution for λ > 0
Collinear variables tend to be grouped! (This is the
“fix” for the problem on Slide 2).
Python Code Understanding
Let’s demonstrate the key difference (Slide 3) in
how LASSO and Ridge handle collinear features.
We will create two features, x1 and x2,
that are nearly identical.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# 1. Create a dataset with 2 strongly correlated features
np.random.seed(0)
n_samples = 100
# x1: a standard feature
x1 = np.random.randn(n_samples)
# x2: almost identical to x1
x2 = x1 + 0.01 * np.random.randn(n_samples)

# Combine into our feature matrix X
X = np.c_[x1, x2]

# y: The target variable (let's say y = 2*x1 + 2*x2)
y = 2 * x1 + 2 * x2 + np.random.randn(n_samples)

# 2. Fit LASSO (alpha is the same as lambda)
# We use a moderate alpha
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X, y)

# 3. Fit Ridge (alpha is the same as lambda)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X, y)

# 4. Compare the coefficients
print("--- Results for Correlated Features ---")
print("True Coefficients: [2.0, 2.0]")
print(f"LASSO Coefficients: {np.round(lasso_model.coef_, 2)}")
print(f"Ridge Coefficients: {np.round(ridge_model.coef_, 2)}")

Example Output:
--- Results for Correlated Features ---
True Coefficients: [2.0, 2.0]
LASSO Coefficients: [3.89 0.  ]
Ridge Coefficients: [1.95 1.94]
Code Explanation:
LASSO: As predicted by the slides, LASSO failed to
find the true model. It arbitrarily picked x1,
gave it a large coefficient, and set x2 to
zero. This is unstable and not what we wanted.
Ridge: As predicted by Slide 3, Ridge handled the
collinearity perfectly. It identified that both x1 and
x2 were important and “grouped” them by
assigning them nearly identical, stable coefficients (1.95 and 1.94),
which are very close to the true values of 2.0.
10. Elastic Net
Overall Summary
These slides introduce Elastic Net, a modern
regression method that solves the major weaknesses of its two
predecessors, Ridge and LASSO
regression.
Ridge is good for collinearity
(correlated features) but can’t do variable selection
(it can’t set any feature’s coefficient to exactly zero).
LASSO is good for variable
selection (it creates sparse models by setting
coefficients to zero) but behaves unstably when
features are correlated (it tends to randomly pick one and discard the
others).
Elastic Net combines the L1 penalty of LASSO and the
L2 penalty of Ridge. The result is a single, flexible model that:
Performs variable selection (like LASSO).
Handles correlated features stably by grouping them
together (like Ridge).
Can select more features than samples (\(p
> n\)), which LASSO cannot do.
Slide 1:
The Definition and Formula (File: ...020245.png)
This slide explains why Elastic Net was created and defines
it mathematically.
The Problem: It states the exact trade-off:
“Ridge regression can handle collinearity, but cannot perform
variable selection;”
“LASSO can perform variable selection, but performs poorly when
collinearity;”
The Solution (The Formula): The core of the method
is this optimization formula: \[\hat{\beta}_{eNet}(\lambda, \alpha) \leftarrow
\arg \min_{\beta} \left( \underbrace{\|\mathbf{y} -
\mathbf{X}\beta\|^2}_{\text{Loss}} + \lambda \left(
\underbrace{\alpha\|\beta\|_1}_{\text{L1 Penalty}} +
\underbrace{\frac{1-\alpha}{2}\|\beta\|_2^2}_{\text{L2 Penalty}} \right)
\right)\]
Breaking Down the Formula:
\(\|\mathbf{y} -
\mathbf{X}\beta\|^2\): This is the standard “Residual
Sum of Squares” (RSS). We want to find coefficients (\(\beta\)) that make the model’s predictions
(\(X\beta\)) as close as possible to
the true values (\(y\)).
\(\lambda\)
(Lambda): This is the master knob for
total regularization strength. A larger \(\lambda\) means a bigger penalty, which
“shrinks” all coefficients more.
\(\alpha\)
(Alpha): This is the mixing parameter that
balances L1 and L2. This is the key innovation.
\(\alpha\|\beta\|_1\): This is the
L1 (LASSO) part. It forces weak coefficients to become
exactly zero, thus selecting variables.
\(\frac{1-\alpha}{2}\|\beta\|_2^2\):
This is the L2 (Ridge) part. It shrinks all
coefficients and, crucially, encourages correlated features to have
similar coefficients (the grouping effect).
The Special Cases:
If \(\alpha = 0\),
the L1 term vanishes, and the model becomes pure Ridge
Regression.
If \(\alpha = 1\),
the L2 term vanishes, and the model becomes pure LASSO
Regression.
If \(0 < \alpha <
1\), you get Elastic Net, which
“encourages grouping of correlated variables” and “can perform
variable selection.”
Slide
2: The Intuition and The Grouping Effect (File:
...020249.jpg)
This slide gives you the visual intuition and the
practical proof of why Elastic Net works. It has two parts.
Part 1: The Three
Graphs (Geometric Intuition)
These graphs show the constraint region (the shaded shape)
for each penalty. The model tries to find the best coefficients (\(\theta_{opt}\)), and the final solution
(the green dot) is the first point where the cost function (the blue
ellipses) “touches” the constraint region.
L1 Norm (LASSO): The region is a
diamond. Because of its sharp corners,
the ellipses are very likely to hit a corner first. At a corner, one of
the coefficients (e.g., \(\theta_1\))
is zero. This is a visual explanation of how LASSO creates
sparsity (variable selection).
L2 Norm (Ridge): The region is a
circle. It has no corners. The
ellipses will hit a “smooth” point on the circle, shrinking both
coefficients (\(\theta_1\) and \(\theta_2\)) but not setting either to zero.
This is weight sharing.
L1 + L2 (Elastic Net): The region is a
“rounded square”. It’s the perfect compromise.
It has “corners” (like LASSO) so it can still set coefficients to
zero.
It has “curved edges” (like Ridge) so it’s more stable and handles
correlated variables by finding a solution on an edge rather than a
single sharp corner.
Part 2: The Formula (The
Grouping Effect)
The text at the bottom explains Elastic Net’s “grouping effect.”
The Implication: “If \(x_j \approx x_k\), then \(\hat{\beta}_j \approx
\hat{\beta}_k\).”
Meaning: If two features (\(x_j\) and \(x_k\)) are highly correlated (their values
are very similar), Elastic Net will force their coefficients
(\(\hat{\beta}_j\) and \(\hat{\beta}_k\)) to also be very
similar.
Why this is good: This is the opposite of
LASSO. LASSO would be unstable and might arbitrarily set \(\hat{\beta}_j\) to a large value and \(\hat{\beta}_k\) to zero. Elastic Net
“groups” them: it will either keep both in the model with
similar importance, or it will shrink both of them out of the
model together. This is a much more stable and realistic result.
The Warning: “LASSO may be unstable in this case!”
This directly highlights the problem that Elastic Net solves.
Slide
3: The Feature Comparison Table (File: ...020255.png)
This table is your “cheat sheet” for choosing the right model. It
compares Ridge, LASSO, and Elastic Net on all their key properties.
Penalty: Shows the L2, L1, and combined
penalties.
Sparsity: Can the model set coefficients to 0?
Ridge: No ❌
LASSO: Yes ✅
Elastic Net: Yes ✅
Variable Selection: This is a crucial row.
LASSO: Yes ✅, BUT it has a major limitation: if
you have more features than samples (\(p >
n\)), LASSO can select at most \(n\) features.
Elastic Net: Yes ✅, and it can select more
than \(n\) variables. This
makes it the clear choice for “wide” data problems (e.g., in genomics,
where \(p=20,000\) features and \(n=100\) samples).
Grouping Effect: How does it handle correlated
features?
Ridge: Strong ✅
LASSO: Weak ❌ (it “picks one”)
Elastic Net: Strong ✅
Solution Uniqueness: Is the answer stable?
Ridge: Always ✅
LASSO: No ❌ (not if \(X\) is “rank-deficient,” e.g., \(p > n\) or correlated features)
Elastic Net: Always ✅ (as long as \(\alpha < 1\), the Ridge component
guarantees a unique, stable solution).
Use Case: When should you use each?
Ridge: For prediction, especially with
multicollinearity.
LASSO: For interpretability and
creating sparse models (when you think only a few
features matter).
Elastic Net: The best all-arounder. Use it for
correlated predictors, when \(p \gg n\), or when you need both
sparsity + stability.
Code Understanding
(Python scikit-learn)
When you use this in Python, be aware of a common confusion in the
parameter names:
| Concept (from your slides) | scikit-learn Parameter | Description |
|---|---|---|
| \(\lambda\) (Lambda) | alpha | The overall strength of regularization. |
| \(\alpha\) (Alpha) | l1_ratio | The mixing parameter between L1 and L2. |
Example: An l1_ratio of 0
is Ridge. An l1_ratio of 1 is LASSO. An
l1_ratio of 0.5 is a 50/50 mix.
from sklearn.linear_model import ElasticNet, ElasticNetCV

# 1. Initialize a specific model
# l1_ratio=0.5 plays the role of the slide's alpha (the L1/L2 mix);
# alpha=0.1 plays the role of the slide's lambda (the overall strength)
model = ElasticNet(alpha=0.1, l1_ratio=0.5)

# 2. A much better way: Find the best parameters automatically
# This will test l1_ratios of 0.1, 0.5, and 0.9
# and automatically find the best 'alpha' (strength) for each.
cv_model = ElasticNetCV(
    l1_ratio=[.1, .5, .9],
    cv=5,  # 5-fold cross-validation
)

# 3. Fit the model to your data (X_train, y_train)
# cv_model.fit(X_train, y_train)

# 4. See the best parameters it found
# print(f"Best l1_ratio (slide's alpha): {cv_model.l1_ratio_}")
# print(f"Best alpha (slide's lambda): {cv_model.alpha_}")
11. High-Dimensional Data
Analysis
The Core Problem: Large \(p\), Small \(n\)
The slides introduce the challenge of high-dimensional data, which is
defined by having many more features (predictors) \(p\) than observations (samples) \(n\). This is often written as
\(p \gg n\).
Example: Predicting blood pressure (the response
\(y\)) using millions of genetic
markers (SNPs) as features \(X\), but
only having data from a few hundred patients.
Troubles:
Overfitting: Models become “too flexible” and learn
the noise in the training data, rather than the true underlying
pattern.
Non-Unique Solution: When \(p > n\), the standard least squares
linear regression model doesn’t even have a unique solution.
Misleading Metrics: This leads to a common symptom:
a very small training error (or high \(R^2\)) but a very large test
error.
Most
Important Image: The Overfitting Trap (Figure 6.23)
Figure 6.23 (from the first uploaded image) is the most critical
visual for understanding the problem. It shows what happens
when you add features (variables) that are completely unrelated
to the outcome.
Left Plot (R²): The \(R^2\) on the training data increases
towards 1. This looks like a perfect fit.
Center Plot (Training MSE): The Mean Squared Error
on the training data decreases to 0. This also looks
perfect.
Right Plot (Test MSE): The Mean Squared Error on
the test data (new, unseen data) explodes. This reveals the
model is garbage and has just memorized the training set.
⚠️ This is the key takeaway: In high dimensions,
\(R^2\) and training MSE are
useless and misleading metrics for
model quality.
The Solution:
Regularization & Model Selection
To combat overfitting, we must use less flexible
models. The main strategy is regularization
(also called shrinkage), which involves adding a penalty term to the
cost function to “shrink” the model coefficients (\(\beta\)).
Mathematical Formulas &
Python Code 🐍
The standard Least Squares cost function you try to
minimize is: \[\text{RSS} = \sum_{i=1}^n
\left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 \quad
\text{or} \quad \|y - X\beta\|^2_2\] This fails when \(p > n\). The solutions modify this:
A. Ridge Regression (\(L_2\) Penalty)
Concept: Shrinks all coefficients towards zero, but
never to zero. It’s good when many features are related to the
outcome.
Ridge minimizes \(\text{RSS} + \lambda \sum_{j=1}^p \beta_j^2\); the \(\lambda \sum_{j=1}^p \beta_j^2\) term is the \(L_2\) penalty.
\(\lambda\) (lambda) is a
tuning parameter that controls the penalty strength. A larger
\(\lambda\) means more shrinkage.
Python (Scikit-learn):
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# alpha is the lambda (λ) tuning parameter
# We find the best alpha using cross-validation
ridge_model = Ridge(alpha=1.0)

# Fit the model
# (assumes X_train, y_train, X_test, y_test come from an earlier train/test split)
ridge_model.fit(X_train, y_train)

# Evaluate using test error (e.g., MSE on test set)
# NOT with training R-squared
test_score = ridge_model.score(X_test, y_test)
B. The Lasso (\(L_1\) Penalty)
Concept: This is a very important method. The \(L_1\) penalty can force coefficients to be
exactly zero. This means Lasso performs
automatic feature selection, creating a sparse
model.
The Lasso minimizes \(\text{RSS} + \lambda \sum_{j=1}^p |\beta_j|\); the \(\lambda \sum_{j=1}^p |\beta_j|\) term is the \(L_1\) penalty.
Again, \(\lambda\) is the tuning
parameter.
Python (Scikit-learn):
from sklearn.linear_model import Lasso

# alpha is the lambda (λ) tuning parameter
lasso_model = Lasso(alpha=0.1)

# Fit the model (X_train, y_train assumed from an earlier split)
lasso_model.fit(X_train, y_train)

# The model automatically selects features
# Coefficients that are zero were 'dropped'
print(lasso_model.coef_)
C. Other Methods
The slides also mention:
Forward Stepwise Selection: A different approach
where you start with no features and add them one by one, picking the
one that improves the model most (based on a criterion like
cross-validation error).
Principal Components Regression (PCR): A
dimensionality reduction technique.
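The slides don't give code for these, but as a rough scikit-learn analogue (not the textbook's exact procedures): SequentialFeatureSelector is a greedy forward selector similar in spirit to forward stepwise selection, and PCR can be sketched as PCA followed by linear regression. The dataset and the number of selected features/components below are arbitrary:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data: 30 features, only 5 of which actually matter
X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=1.0, random_state=0)

# Forward selection: greedily add one feature at a time, scored by cross-validation
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("Selected feature indices:", sfs.get_support(indices=True))

# Principal Components Regression: compress to a few components, then regress on them
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
pcr.fit(X, y)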
The Curse of
Dimensionality (Figure 6.24)
This example (Figures 6.24 and its description) shows a more subtle
problem.
Setup: A model with \(n=100\) observations and 20 true
features.
Plots: They test Lasso by adding more and more
irrelevant features:
\(p=20\) (Left):
Lasso performs well. The lowest test MSE is found with minimal
regularization.
\(p=50\) (Center):
Lasso still works well, but it needs more regularization (a smaller
“Degrees of Freedom”) to filter out the 30 junk features.
\(p=2000\)
(Right): This is the curse of dimensionality.
Even with a good method like Lasso, the 1,980 irrelevant features add so
much noise that the model performs poorly regardless of
the tuning parameter. The true signal is “lost in the noise.”
Summary: Cautions for \(p > n\)
The final slide gives the most important rules to follow:
Beware Extreme Multicollinearity: When \(p > n\), your features are
mathematically guaranteed to be linearly related, which breaks standard
regression.
Don’t Overstate Results: A model you find (e.g.,
with Lasso) is just one of many potentially good models.
🚫 DO NOT USE training \(R^2\), \(p\)-values, or training MSE to justify your
model. As Figure 6.23 showed, they are misleading.
✅ DO USE test error and
cross-validation error to choose your model and assess
its performance.
The Core Problem:
\(p \gg n\) (The “Troubles” Slide)
This slide (filename: ...020259.png) sets up the entire
problem. The issue isn’t just “overfitting”; it’s a fundamental
mathematical breakdown of standard methods.
“Large \(p\) makes our
linear regression model too flexible”: This is an
understatement. It leads to a problem called an underdetermined
system.
“If \(p > n\), the LSE
is not even uniquely determined”: This is the most important
technical point.
Mathematical Reason: The standard solution for
Ordinary Least Squares (OLS) is \(\hat{\beta}
= (X^T X)^{-1} X^T y\).
\(X\) is the data matrix with \(n\) rows (observations) and \(p\) columns (features).
The matrix \(X^T X\) has dimensions
\(p \times p\).
When \(p > n\), the \(X^T X\) matrix is
singular, which means its determinant is zero and it
cannot be inverted. The \((X^T X)^{-1}\) term does not exist.
“Extreme multicollinearity” (from slide
...020744.png) is the direct cause. When \(p > n\), the columns of \(X\) (the features) are guaranteed
to be linearly dependent. There are infinite combinations of the
features that can explain the data.
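A quick NumPy check of this point (synthetic numbers, chosen only for illustration): when \(p > n\), the \(p \times p\) matrix \(X^T X\) cannot have full rank, so the OLS formula has no unique solution.

import numpy as np

# With p > n, the p x p matrix X^T X has rank at most n, so it is singular
n, p = 10, 50
rng = np.random.default_rng(0)
X = rng.normal(size=(n, p))

XtX = X.T @ X                       # shape (50, 50)
print(np.linalg.matrix_rank(XtX))   # at most n = 10, far below p = 50
# Inverting XtX would fail or produce a numerically meaningless result,
# which is exactly why the OLS formula breaks down here.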
The Simplest Example: \(n=2\) (Figure 6.22)
This slide (filename: ...020728.png) is the
perfect illustration of the “not uniquely determined”
problem.
Left Plot (Low-D): Many points (\(n\)), only two parameters (\(p=2\): intercept \(\beta_0\) and slope \(\beta_1\)). The line is a “best fit” that
balances the errors. The training error (RSS) is non-zero.
Right Plot (High-D): We have \(n=2\) observations and \(p=2\) parameters.
You have two equations (one for each point) and two unknowns (\(\beta_0\) and \(\beta_1\)).
The model has exactly enough flexibility to pass
perfectly through both points.
The result is zero training error.
This “perfect” fit is an illusion. If you got a new data
point, this line would almost certainly be a terrible predictor. This is
the essence of overfitting.
The Consequence:
Misleading Metrics (Figure 6.23)
This slide (filename: ...020730.png) scales up the
problem from \(n=2\) to \(n=20\) and shows why you must be
cautious.
The Setup:\(n=20\) observations. We start with 1
feature and add more and more irrelevant, junk features.
Left Plot (\(R^2\)): The \(R^2\) on the training data steadily
increases towards 1 as we add features. This is because, by pure chance,
each new junk feature can explain a tiny bit more of the noise in the
training set.
Center Plot (Training MSE): The training error
drops to 0. This is the same as the \(n=2\) plot. Once the number of features
(\(p\)) gets close to the number of
observations (\(n=20\)), the model can
perfectly fit the 20 data points, even if the features are random
noise.
Right Plot (Test MSE): This is the “truth.” The
actual error on new, unseen data gets worse and worse. By
adding noise features, we are just “memorizing” the training set, and
our model’s ability to generalize is destroyed.
Key Lesson: (from slide ...020744.png)
This is why you must “Avoid using… \(p\)-values, \(R^2\), or other traditional measures of
model on training as evidence of good fit.” They are guaranteed
to lie to you when \(p > n\).
The Solutions (The “Deal
with…” Slide)
This slide (filename: ...020734.png) lists the
strategies to fix this. The core idea is regularization
(or shrinkage). We add a “penalty” to the cost function to stop the
\(\beta\) coefficients from getting too
large or too numerous.
A. Ridge Regression (\(L_2\) Penalty)
Concept: Keeps all \(p\) features, but shrinks their
coefficients. It’s excellent for handling multicollinearity.
B. The Lasso (\(L_1\) Penalty)
Concept: The \(\lambda \sum |\beta_j|\) term is the \(L_1\) penalty. This absolute value penalty is what allows coefficients to become exactly 0.
Benefit: The final model is sparse (e.g.,
it might say “out of 2,000 features, only these 15 matter”).
C. Tuning Parameter
Choice (The Real Work)
How do you pick the best \(\lambda\)? You must use the data you have.
The slides mention this and “cross validation error” (from
...020744.png).
Python Code (Scikit-learn): You don’t just guess
alpha (which is \(\lambda\) in scikit-learn). You use a tool
like LassoCV or GridSearchCV to find the best
one.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression

# Create a high-dimensional dataset
X, y = make_regression(n_samples=100, n_features=500, n_informative=10, noise=0.1)

# LassoCV automatically performs cross-validation to find the best alpha (lambda)
# cv=10 means 10-fold cross-validation
lasso_cv_model = LassoCV(cv=10, random_state=0, max_iter=10000)

# Fit the model
lasso_cv_model.fit(X, y)

# This is the best lambda (alpha) it found:
print(f"Best alpha (lambda): {lasso_cv_model.alpha_}")

# You can now see the coefficients
# Most of the 500 coefficients will be 0.0
print(f"Number of non-zero features: {np.sum(lasso_cv_model.coef_ != 0)}")
A Final
Warning: The Curse of Dimensionality (Figure 6.24)
This final set of slides (filenames: ...020738.png and
...020741.jpg) provides a crucial, subtle warning:
Regularization is not magic.
The Setup:\(n=100\) observations. There are 20
real features that truly affect the response.
The Experiment: They run Lasso three times, adding
more and more noise features:
Left Plot (\(p=20\)): All 20 features are real.
The lowest test MSE is found with minimal regularization (high “Degrees
of Freedom,” meaning many non-zero coefficients). This makes sense; you
want to keep all 20 real features.
Center Plot (\(p=50\)): Now we have 20 real
features + 30 noise features. Lasso still works! The best model is found
with more regularization (fewer “Degrees of Freedom”). Lasso
successfully “zeroed out” many of the 30 noise features.
Right Plot (\(p=2000\)): This is the
curse of dimensionality. We have 20 real features +
1980 noise features. The noise has completely overwhelmed the
signal. Lasso fails. The test MSE is high
no matter what tuning parameter you choose. The model cannot
distinguish the 20 real features from the 1980 junk ones.
Final Takeaway: Even with advanced methods like
Lasso, if your \(p \gg n\) problem is
too extreme (i.e., the signal-to-noise ratio is too low), it may
be impossible to build a good predictive model.
The Goal: “Collaborative
Filtering”
The first slide (...021218.png) uses the term
Collaborative Filtering. This is the key concept. The
model “collaborates” by using the ratings of all users to fill
in the blanks for a single user.
How it works: The model assumes your “taste”
(vector \(\mathbf{u}_i\)) can be
described as a combination of \(r\)
“latent features” (e.g., \(r=3\): %
action, % comedy, % drama). It also assumes each movie (vector
\(\mathbf{v}_j\)) has a profile on
these same features.
Your predicted rating for a movie is the dot product of your taste
vector and the movie’s feature vector.
The model finds the best “taste” vectors \(\mathbf{U}\) and “movie” vectors \(\mathbf{V}\) that explain all the known
ratings simultaneously. It’s collaborative because Lee’s
ratings help define the features of “Bullet Train” (\(\mathbf{v}_2\)), which in turn helps
predict Yang’s rating for that same movie.
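A tiny NumPy sketch of that dot-product idea; the latent-factor numbers below are completely made up for illustration:

import numpy as np

# Hypothetical latent factors with r = 3 features (say: action, comedy, drama)
U = np.array([[0.9, 0.1, 0.0],    # Lee's "taste" vector u_1
              [0.2, 0.7, 0.1]])   # Yang's "taste" vector u_2
V = np.array([[0.8, 0.2, 0.0],    # movie 1's feature vector v_1
              [0.1, 0.9, 0.0]])   # movie 2's ("Bullet Train") feature vector v_2

# Predicted rating matrix: entry (i, j) is the dot product u_i . v_j
R_hat = U @ V.T
print(R_hat)          # all predicted ratings at once
print(R_hat[1, 1])    # Yang's predicted score for "Bullet Train"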
The Hard Problem (and its 2
Flavors)
The second slide (...021222.png) presents the intuitive,
but computationally very hard, way to frame the problem.
Detail 1: Noise vs. No Noise
The slide shows \(\mathbf{Y} = \mathbf{M} + \mathbf{E}\). This is critical.
* \(\mathbf{M}\) is the "true," "clean," underlying low-rank matrix of everyone's "true" preferences.
* \(\mathbf{E}\) is a matrix of random noise (e.g., your true rating is 4.3, but you entered a 4; or you were in a bad mood and rated a 3).
* \(\mathbf{Y}\) is the noisy data we actually observe.
Because of this noise, we don’t expect to find a matrix \(\mathbf{N}\) that perfectly
matches our data. Instead, we try to find a low-rank \(\mathbf{N}\) that is as close as
possible. This leads to the formula: \[\underset{\text{rank}(\mathbf{N}) \le
r}{\text{minimize}} \quad \left\| \mathcal{P}_{\mathcal{O}}(\mathbf{Y} -
\mathbf{N}) \right\|_{\text{F}}^2\] This says: “Find a matrix
\(\mathbf{N}\) (of rank \(r\) or less) that minimizes the sum of
squared errors only on the ratings we observed (\(\mathcal{O}\)).”
Detail
2: Why is \(\text{rank}(\mathbf{N}) \le
r\) a “Non-convex constraint”?
This is the “difficult to optimize” part. A convex problem is
(simplistically) one with a single valley, making it easy to find the
single lowest point. A non-convex problem has many local valleys, and an
algorithm can get stuck in a “pretty good” valley instead of the “best”
one.
The rank constraint is non-convex. For example, the average of two
rank-1 matrices is not necessarily a rank-1 matrix (it could be
rank-2). This lack of a “smooth valley” property makes the problem
NP-hard.
Detail 3: The Number
of Parameters: \(r(d_1 + d_2)\)
The slide asks, "how many entries are needed?" The answer is based on the number of unknown parameters.
* A rank-\(r\) matrix \(\mathbf{M}\) can be factored into \(\mathbf{U}\) (which is \(d_1 \times r\)) and \(\mathbf{V}^T\) (which is \(r \times d_2\)).
* The number of entries in \(\mathbf{U}\) is \(d_1 \times r\).
* The number of entries in \(\mathbf{V}\) is \(d_2 \times r\).
* Total "unknowns" to solve for: \(d_1 r + d_2 r = r(d_1 + d_2)\).
* This means we must have at least \(r(d_1 + d_2)\) observed ratings to have any hope of uniquely solving for \(\mathbf{U}\) and \(\mathbf{V}\). If our number of observations \(|\mathcal{O}|\) is less than this, the problem is hopelessly underdetermined.
The “Magic” Solution:
Convex Relaxation
The final slide (...021225.png) presents the
groundbreaking solution from Candès and Recht. This solution cleverly
changes the problem to one that is convex and solvable.
Detail
1: The L1-Norm Analogy (This is the most important concept)
This is the key to understanding why this works.
In Vectors (Lasso):
Hard Problem: Find the sparsest vector
\(\beta\) (fewest non-zeros). This is
\(L_0\) norm, \(\text{minimize } \|\beta\|_0\). This is
non-convex.
Easy Problem: Minimize the \(L_1\) norm, \(\text{minimize } \|\beta\|_1 = \sum
|\beta_j|\). This is convex, and it’s a “relaxation” that
also produces sparse solutions.
In Matrices (Matrix Completion):
Hard Problem: Find the lowest-rank matrix
\(\mathbf{X}\). Rank is the number of
non-zero singular values. This is \(\text{minimize } \text{rank}(\mathbf{X})\).
This is non-convex.
Easy Problem: Minimize the Nuclear
Norm, \(\text{minimize }
\|\mathbf{X}\|_* = \sum \sigma_i(\mathbf{X})\) (where \(\sigma_i\) are the singular values). This
is convex, and it’s the “matrix equivalent” of the \(L_1\) norm. It’s a relaxation that
also produces low-rank solutions.
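As a small sanity check of the definition (not from the slides), the nuclear norm of a matrix is just the sum of its singular values, which NumPy can compute directly:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3)) @ rng.normal(size=(3, 8))   # a rank-3, 5x8 matrix

singular_values = np.linalg.svd(A, compute_uv=False)
print(np.round(singular_values, 3))        # only 3 values are (numerically) non-zero
print(round(singular_values.sum(), 3))     # the nuclear norm ||A||_*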
Detail 2: Noiseless
vs. Noisy (Again)
Notice the constraint in this new problem: \[\text{Minimize } \quad \|\mathbf{X}\|_*\]\[\text{Subject to } \quad X_{ij} = M_{ij},
\quad (i, j) \in \mathcal{O}\]
This formulation is for the noiseless case. It
assumes the \(M_{ij}\) we observed are
perfectly accurate. It demands that our solution \(\mathbf{X}\) exactly matches the
known ratings. This is different from the optimization problem on the
previous slide, which just tried to get close to the noisy data
\(\mathbf{Y}\).
(In practice, you solve a noisy-aware version that combines both
ideas, but the slide shows the original, “exact completion”
problem.)
Detail
3: The Guarantee (What the math at the bottom means)
\[
\text{If } \mathcal{O} \text{ is randomly sampled and } |\mathcal{O}| \gg r(d_1+d_2)\log(d_1+d_2), \text{ then (with high probability) the solution is unique and equals } \mathbf{M}.
\]
This is the punchline. The Candès paper proved that if you
have enough (but still very few) randomly sampled
ratings, solving this easy convex problem (minimizing the nuclear norm)
will magically give you the exact, true, low-rank matrix \(\mathbf{M}\).
\(|\mathcal{O}| \gg
r(d_1+d_2)\): This part makes sense. We need at
least as many observations as our \(r(d_1+d_2)\) degrees of freedom.
\(\log(d_1+d_2)\):
This “log” factor is the “price” we pay for not knowing where
the information is. It’s an astonishingly small price.
Example: For a 1,000,000 user x 10,000 movie matrix
(like Netflix) with \(r=10\), you don’t
need \(\approx 10^{10}\) ratings. You
need a number closer to \(10 \times (10^6 +
10^4) \times \log(\dots)\), which is dramatically
smaller. This is why this method is practical.
Resampling is a statistical tool for assessing the accuracy of models. Its main goal is to estimate the test error (a model's performance on new, unseen data), because the training error is overly optimistic due to overfitting.
Resampling: The process of repeatedly drawing
samples from a dataset. The two main types mentioned are
Cross-validation (to estimate model test error) and
Bootstrap (to quantify the uncertainty of estimates).
从数据集中反复抽取样本的过程。主要提到的两种类型是交叉验证(用于估计模型测试误差)和自举(用于量化估计的不确定性)。
Data Splitting (Ideal Scenario): In a “data-rich”
situation, you split your data into three parts:
在“数据丰富”的情况下,您可以将数据拆分为三部分:
Training Data: Used to fit and train the parameters
of various models.用于拟合和训练各种模型的参数。
Validation Data: Used to assess the trained models,
tune hyperparameters (e.g., choose the polynomial degree), and select
the best model. This helps prevent
overfitting.用于评估已训练的模型、调整超参数(例如,选择多项式的次数)并选择最佳模型。这有助于防止过度拟合。
Test Data: Used only once on the final,
selected model to get an unbiased estimate of its real-world
performance.
在最终选定的模型上仅使用一次,以获得其实际性能的无偏估计。
Validation vs. Test Data: The slides emphasize this
difference (Slide 7). The validation set is part of the
model-building and selection process. The test set is
kept separate and is only used for the final report card after all
decisions are
made.验证集是模型构建和选择过程的一部分。测试集是独立的,仅在所有决策完成后用于最终报告。
The Validation Set Approach
This is the simplest cross-validation
method.这是最简单的交叉验证方法。
Split: The total dataset is randomly divided into
two parts: a training set and a validation
set (often a 50/50 or 70/30
split).将整个数据集随机分成两部分:训练集和验证集(通常为
50/50 或 70/30 的比例)。
Train: Various models are fit only on the
training
set.各种模型仅在训练集上进行拟合。
Validate: The performance of each trained model is
evaluated using the validation set.
使用验证集评估每个训练模型的性能。
Select: The model with the best performance (e.g.,
the lowest error) on the validation set is chosen as the final model.
选择在验证集上性能最佳(例如,误差最小)的模型作为最终模型。
Important Image: Schematic
(Slide 10)
This diagram clearly shows a set of \(n\) observations being randomly split into
a training set (blue, with observations 7, 22, 13) and a validation set
(beige, with observation 91). The model learns from the blue set and is
tested on the beige set. 此图清晰地展示了一组 \(n\)
个观测值被随机分成训练集(蓝色,观测值编号为
7、22、13)和验证集(米色,观测值编号为
91)。模型从蓝色数据集进行学习,并在米色数据集上进行测试。
Example: Auto Data (Formulas
& Code)
The slides use the Auto dataset to decide the best
polynomial degree to predict mpg from
horsepower.
Mathematical Models
The models being compared are polynomials of different degrees. For example, the quadratic (degree-2) model is:
\[
\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \beta_2 \cdot \text{horsepower}^2 + \epsilon
\]
The performance metric used is the Mean Squared Error
(MSE) on the validation set:
使用的性能指标是验证集上的均方误差 (MSE): \[MSE_{val} = \frac{1}{n_{val}} \sum_{i \in val}
(y_i - \hat{f}(x_i))^2\] where \(n_{val}\) is the number of observations in
the validation set, \(y_i\) is the true
mpg value, and \(\hat{f}(x_i)\) is the model’s prediction
for the \(i\)-th observation in the
validation set. 其中 \(n_{val}\)
是验证集中的观测值数量, \(y_i\)
是真实的 mpg 值,\(\hat{f}(x_i)\) 是模型对验证集中第 \(i\) 个观测值的预测。
Important Image: Polynomial Fits (Slide 8) 多项式拟合(幻灯片 8)
This plot is crucial. It shows the Auto data with linear (red), quadratic (green), and cubic (blue) regression lines.
* The linear fit is clearly poor.
* The quadratic and cubic fits follow the data's curve much better.
* The inset box shows the MSE calculated on the full dataset (this is training MSE): Linear MSE ~26.42, Quadratic MSE ~21.60, Cubic MSE ~21.51.
This suggests a non-linear fit is necessary, but it doesn't tell us which one will generalize better.
1. Python Code (Slide 9): Model Selection
Criteria
What it does: This Python code (using
pandas and statsmodels) does not
implement the validation set approach. Instead, it fits polynomial
models (degrees 1 through 5) to the entire dataset.
How it works: It calculates statistical criteria
like BIC, Mallow’s \(C_p\), and Adjusted \(R^2\). These are mathematical
adjustments to the training error that estimate the test error
without needing a validation set.
它计算统计标准,例如BIC、Mallow 的
\(C_p\)** 和调整后的 \(R^2\)。这些是对训练误差的数学调整,无需验证集即可估算测试误差。
Key line (logic):sm.OLS(y, X).fit()
is used to fit the model, and then metrics like model.bic
and model.rsquared_adj are extracted.
Result: The table shows that the model with
[horsepower, horsepower2] (quadratic) has the lowest BIC
and \(C_p\) values, suggesting it’s the
best model according to these criteria.
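A minimal sketch of that criteria-based comparison, assuming an Auto.csv file with mpg and horsepower columns is available (the cleaning and standardizing steps are added assumptions, and Mallow's \(C_p\) is not reported directly by statsmodels, so only BIC and adjusted \(R^2\) are printed):

import numpy as np
import pandas as pd
import statsmodels.api as sm

Auto = pd.read_csv("Auto.csv")  # path assumed
hp = pd.to_numeric(Auto["horsepower"], errors="coerce")
keep = hp.notna()
x = hp[keep].values
x = (x - x.mean()) / x.std()    # standardize to keep the higher powers numerically tame
y = Auto.loc[keep, "mpg"].values

for degree in range(1, 6):
    # Design matrix: [1, x, x^2, ..., x^degree]
    X = sm.add_constant(np.column_stack([x ** d for d in range(1, degree + 1)]))
    fit = sm.OLS(y, X).fit()
    print(degree, round(fit.bic, 1), round(fit.rsquared_adj, 4))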
2. R Code (Slides 14 & 15): The Validation Set
Approach
What it does: This R code directly
implements the validation set approach described on Slide 13.
How it works:
set.seed(...): Sets a random seed to make the split
reproducible.
train=sample(392, 196): Randomly selects 196 indices
(out of 392) to be the training set.
lm.fit=lm(mpg~poly(horsepower, 2), ..., subset=train):
Fits a quadratic model only using the train
data.
mean((mpg-predict(lm.fit,Auto))[-train]^2): This is the
key calculation.
predict(lm.fit, Auto): Predicts mpg for
all data.
[-train]: Selects only the predictions for the
validation set (the data not in
train).
mean(...): Calculates the MSE on the validation
set.
Result: The code is run three times with different
seeds (1, 2022, 1997).
Seed 1: Quadratic MSE (18.71) is lowest.
Seed 2022: Quadratic MSE (19.70) is lowest.
Seed 1997: Quadratic MSE (19.08) is lowest.
Main Takeaway: In all random splits, the
quadratic model gives the lowest validation set MSE.
This provides evidence that the quadratic model is the best choice for
generalizing to new data. The fact that the MSE values change with each
seed also highlights a key disadvantage of this simple method:
the results can be variable depending on the random split.
主要结论:在所有随机拆分中,**二次模型的验证集 MSE
最低。这证明了二次模型是推广到新数据的最佳选择。MSE
值随每个种子变化的事实也凸显了这种简单方法的一个关键缺点:结果可能会因随机拆分而变化。
2. The Validation Set
Approach 验证集方法
This method is a simple way to estimate a model’s performance on new,
unseen data (the “test error”).
这种方法是一种简单的方法,用于评估模型在新的、未见过的数据(“测试误差”)上的性能。
The core idea is to randomly split your available data
into two parts:
其核心思想是将可用数据随机拆分为两部分: 1.
Training Set: Used to fit (or “train”) your model.
用于拟合(或“训练”)模型。 2. Validation Set (or Test
Set): Used to evaluate the trained model’s performance. You
calculate the error (like Mean Squared Error) on this set.
用于评估训练后的模型性能。计算此集合的误差(例如均方误差)。
Python Code Explained (Slide
1)
The first slide shows a Python example using the Auto
dataset to predict mpg from horsepower.
Setup & Data Loading:
import statements load libraries like
pandas (for data),
sklearn.model_selection.train_test_split (the key function
for this method), and
sklearn.linear_model.LinearRegression.
Auto = pd.read_csv(...) loads the data.
X = Auto['horsepower'].values and
y = Auto['mpg'].values select the variables of
interest.
The Split:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=7)
This is the most important line for this method. It
splits the data X and y into training and
testing (validation) sets.
train_size=0.5 means 50% of the data is for training
and 50% is for validation.
random_state=7 ensures the split is "random" but "reproducible" (using the same seed will always produce the same split). Note that the slide writes the seed as 007, which is not valid Python 3 syntax, since integer literals cannot have leading zeros.
Model Fitting & Evaluation:
The code fits three different polynomial models, but it only
uses the training data (X_train,
y_train) to do so.
Linear (Degree 1): A simple
LinearRegression.
Quadratic (Degree 2): Uses
PolynomialFeatures(2) to create \(x\) and \(x^2\) terms, then fits a linear model to
them.
Cubic (Degree 3): Uses
PolynomialFeatures(3) to create \(x\), \(x^2\), and \(x^3\) terms.
It then calculates the Mean Squared Error (MSE) for
all three models using the test data
(X_test, y_test).
Results (from the text on the slide):
Linear MSE: \(\approx 23.3\)
Quadratic MSE: \(\approx 19.4\)
Cubic MSE: \(\approx 19.4\)
Conclusion: The quadratic model gives a
significantly lower error than the linear model. The cubic model does
not offer any real improvement over the quadratic one.
结果(来自幻灯片上的文字):
线性均方误差:约 23.3
二次均方误差:约 19.4
三次均方误差:约 19.4
结论:二次模型的误差显著低于线性模型。三次模型与二次模型相比并没有任何实质性的改进。
Key Images: The
Problem with a Single Split
The most important images are on slide 9 (labeled
“Figure” and “Page 20”).
Plot on the Left (Single Split): This graph shows
the validation MSE for polynomial degrees 1 through 10, based on the
single random split from the R code (slide 2). Just like the
Python example, it shows that the MSE drops sharply from degree 1 to 2,
and then stays relatively low. Based on this one chart, you
might pick degree 2 (quadratic) as the best model.
Plot on the Right (Ten Splits): This is the
most critical plot. It shows the results of
repeating the entire process 10 times, each with a new random
split (from R code on slide 3).
You can see 10 different error curves.
While they all agree that degree 1 (linear) is bad, they do
not agree on the best model. Some curves suggest degree 2 is
best, others suggest 3, 4, or even 6.
这是最关键的图表**。它显示了重复整个过程 10
次的结果,每次都使用新的随机分割(来自幻灯片 3 上的 R 代码)。
The slides repeatedly emphasize the two main drawbacks of this simple
validation set approach:
High Variability 高变异性: The estimated test
MSE can be highly variable, depending on which
observations happen to land in the training set versus the validation
set. The plot with 10 curves (slide 9, right) proves this perfectly.
估计的测试 MSE
可能高度变异,具体取决于哪些观测值恰好落在训练集和验证集中。包含
10 条曲线的图表(幻灯片 9,右侧)完美地证明了这一点。
Overestimation of Test Error 高估测试误差:
The model is only trained on a subset (e.g., 50%)
of the available data. The validation data is “wasted” and not used for
model building.
Statistical methods tend to perform worse when trained on fewer
observations.
Therefore, the model trained on just the training set is likely
worse than a model trained on the entire dataset.
This “worse” model will have a higher error rate on the
validation set. This means the validation set MSE tends to
overestimate the true test error you would get from a model
trained on all your data.
The slides introduce Cross-Validation (CV) as the
method to overcome these drawbacks. The core idea is to use all
data points for both training and validation, just at different times.
交叉验证
(CV),以此来克服这些缺点。其核心思想是将所有数据点用于训练和验证,只是使用的时间不同。
This is the first type of CV introduced (slide 10, page 26). For a
dataset with \(n\) data points:
Hold out the 1st data point (this is your
validation set).
保留第一个数据点(这是你的验证集)。
Train the model on the other \(n-1\) data points. 使用其他 \(n-1\)
个数据点训练模型。
Calculate the error (e.g., \(\text{MSE}_1\)) using only that 1st
held-out point. 仅使用第一个保留点计算误差(例如,\(\text{MSE}_1\))。
Repeat this \(n\)
times, holding out the 2nd point, then the 3rd, and so on, until every
point has been used as the validation set exactly once.
重复此操作 \(n\)
次,保留第二个点,然后是第三个点,依此类推,直到每个点都作为验证集使用一次。
Your final test error estimate is the average of all \(n\) errors.
最终的测试误差估计是所有 \(n\)
个误差的平均值。
Key Formula (from Slide 10)
The formula for the \(n\)-fold LOOCV
error estimate is: \(n\) 倍 LOOCV
误差估计公式为: \[\text{CV}_{(n)} =
\frac{1}{n} \sum_{i=1}^{n} \text{MSE}_i\]
Where: * \(n\) is the total number
of data points. 是数据点的总数。 * \(\text{MSE}_i\) is the Mean Squared Error
calculated on the \(i\)-th data point
when it was held out. 是保留第 \(i\)
个数据点时计算的均方误差。
3. What is LOOCV
(Leave-One-Out Cross Validation)
Leave-One-Out Cross Validation (LOOCV) is a method for estimating the
test error of a model. For a dataset with \(n\) observations, you: 留一交叉验证 (LOOCV)
是一种估算模型测试误差的方法。对于包含 \(n\) 个观测值的数据集,您需要:
Fit the model \(n\) times.
对模型进行 \(n\) 次拟合
For each fit \(i\) (from \(1\) to \(n\)), you train the model on all data
points except for observation \(i\). 对于每个拟合 \(i\) 个样本(从 \(1\) 到 \(n\)),您需要在除观测值 \(i\) 之外的所有数据点上训练模型。
You then use this trained model to make a prediction for the single
observation \(i\) that was left out.
然后,您需要使用这个训练好的模型对被遗漏的单个观测值 \(i\) 进行预测。
The final LOOCV error is the average of the \(n\) prediction errors (typically the Mean
Squared Error, or MSE). 最终的 LOOCV 误差是 \(n\)
个预测误差的平均值(通常为均方误差,简称 MSE)。
This process is shown visually in the slide titled "LOOCV" (slide 27), which is a key image for understanding the concept.
Pros & Cons (from slide 28):
* Pro: It has low bias because the training set (\(n-1\) samples) is almost identical to the full dataset. 由于训练集(\(n-1\) 个样本)与完整数据集几乎完全相同,因此偏差较低。
* Pro: It produces a stable, non-random error estimate (unlike \(k\)-fold CV, which depends on the random fold assignments). 它能产生稳定的非随机误差估计(不同于 k 倍交叉验证,后者依赖于随机折叠分配)。
* Con: It can be extremely computationally expensive, as the model must be refit \(n\) times. 由于模型必须重新拟合 \(n\) 次,计算成本极其高昂。
* Con: The \(n\) error estimates can be highly correlated, which can sometimes lead to high variance in the final \(CV\) estimate. 这 \(n\) 个误差估计可能高度相关,有时会导致最终 \(CV\) 估计值出现较大方差。
Key Mathematical Formulas
The main challenge of LOOCV (being computationally expensive) has a
very efficient solution for linear models. LOOCV
的主要挑战(计算成本高昂)对于线性模型来说,有一个非常有效的解决方案。
1. The Standard (Slow) Formula
As defined on slide 33, the LOOCV estimate of the MSE is:
\[
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i^{(i)} \right)^2
\]
\(y_i\) is the true value of the
\(i\)-th observation. 是第 \(i\) 个观测值的真实值。
\(\hat{y}_i^{(i)}\) is the
predicted value for \(y_i\) from a
model trained on all data except observation \(i\). 是使用除观测值 \(i\) 之外的所有数据训练的模型对 \(y_i\) 的预测值。
Calculating \(\hat{y}_i^{(i)}\)
requires refitting the model \(n\)
times. 计算 \(\hat{y}_i^{(i)}\)
需要重新拟合模型 \(n\) 次。
2. The Shortcut (Fast) Formula
Slide 34 provides a much simpler formula that only requires fitting the model once on the entire dataset: 只需对整个数据集进行一次模型拟合:
\[
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2
\]
\(\hat{y}_i\) is the prediction for
\(y_i\) from the model trained on
all \(n\) data points.
是使用所有 \(n\)
个数据点训练的模型对 \(y_i\)
的预测值。
\(h_i\) is the
leverage of the \(i\)-th observation. 是第 \(i\)
个观测值的杠杆率。
3. What is Leverage (\(h_i\))?
Slide 35 defines leverage:
Hat Matrix (\(\mathbf{H}\)): In a linear model,
the fitted values \(\hat{\mathbf{y}}\)
are related to the true values \(\mathbf{y}\) by the hat matrix: \(\hat{\mathbf{y}} =
\mathbf{H}\mathbf{y}\).
Formula: The hat matrix is defined as \(\mathbf{H} =
\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\).
Leverage (\(h_i\)): The leverage for the \(i\)-th observation is simply the \(i\)-th diagonal element of the hat matrix,
\(h_{ii}\) (often just written as \(h_i\)).
Meaning: Leverage measures how “influential” an
observation’s \(x_i\) value is in
determining its own predicted value \(\hat{y}_i\). A high leverage score means
that point has a lot of influence on the model’s fit.
This shortcut formula is extremely important because it makes LOOCV
as fast to compute as a single model
fit.这个快捷公式非常重要,因为它使得 LOOCV
的计算速度与单个模型拟合一样快。
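A small numerical check of this claim on synthetic data (not from the slides): the leverage-based shortcut gives exactly the same value as brute-force refitting.

import numpy as np

rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one feature
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=n)

# One fit on all the data
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix
h = np.diag(H)                            # leverages h_i
cv_fast = np.mean((resid / (1 - h)) ** 2)

# Brute-force LOOCV: refit n times, each time leaving one observation out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    errs.append((y[i] - X[i] @ b) ** 2)
cv_slow = np.mean(errs)

print(cv_fast, cv_slow)   # identical up to floating-point error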
Python Code Explained (Slide
29)
This slide shows how to use LOOCV to select the best polynomial
degree for predicting mpg from horsepower.
Imports: It imports standard libraries
(pandas, matplotlib) and key modules from
sklearn:
LinearRegression: The model to be fit.
PolynomialFeatures: A tool to create polynomial terms
(e.g., \(x, x^2, x^3\)).
LeaveOneOut: The LOOCV cross-validation strategy
object.
cross_val_score: A function that automatically runs a
cross-validation test.
Setup:
It loads the Auto.csv data.
It defines \(X\)
(horsepower) and \(y\)
(mpg).
It creates a LeaveOneOut object:
loo = LeaveOneOut().
Looping through Degrees:
The code loops degree from 1 to 10.
make_pipeline: For each degree, it
creates a model using make_pipeline. This
pipeline is a crucial concept:
It first runs PolynomialFeatures(degree) to transform
\(X\) into \([X, X^2, ..., X^{\text{degree}}]\).
It then feeds those features into LinearRegression() to
fit the model.
cross_val_score: This is the most
important line.
scores = cross_val_score(model, X, y, cv=loo, scoring='neg_mean_squared_error')
This function automatically does the entire LOOCV process.
It takes the model (the pipeline), the data \(X\) and \(y\), and the CV strategy
(cv=loo).
In practice, cross_val_score refits the pipeline once per split, so LOOCV here really does mean \(n\) separate fits; the "fast" leverage shortcut from slide 34 is a separate analytical identity for linear models and is not something scikit-learn applies automatically.
It uses scoring='neg_mean_squared_error' because the
scoring function assumes “higher is better.” By calculating
the negative MSE, the best model will have the highest score
(i.e., closest to 0).
Storing Results: It calculates the mean of the
scores (which is the \(CV_{(n)}\)) and
stores it.
Visualization:
The code then plots the final cv_errors (after flipping
the sign back to positive) against the degree.
The resulting plot (also on slide 32) shows the test MSE, allowing
you to visually pick the best degree (where the error is
minimized).
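For reference, a minimal reconstruction of the workflow described above; it assumes an Auto.csv file with mpg and horsepower columns, and the cleaning step for non-numeric horsepower values is an added assumption:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_score

Auto = pd.read_csv("Auto.csv")  # path assumed
Auto["horsepower"] = pd.to_numeric(Auto["horsepower"], errors="coerce")
Auto = Auto.dropna(subset=["horsepower", "mpg"])
X = Auto[["horsepower"]].values
y = Auto["mpg"].values

loo = LeaveOneOut()
cv_errors = []
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=loo, scoring="neg_mean_squared_error")
    cv_errors.append(-scores.mean())  # flip the sign back to a positive MSE

best_degree = int(np.argmin(cv_errors)) + 1
print(best_degree, np.round(cv_errors, 2))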
Slide 27 (.../103628.png): This is
the best conceptual image. It visually demonstrates how
LOOCV splits the data \(n\) times, with
each observation getting one turn as the validation set.
这是最佳概念图**。它直观地展示了 LOOCV 如何将数据拆分
\(n\)
次,每个观察值都会被旋转一次作为验证集。
Slide 34 (.../103711.png): This
slide presents the most important formula: the “Easy
formula” or shortcut, \(CV_{(n)} = \frac{1}{n}
\sum (\frac{y_i - \hat{y}_i}{1 - h_i})^2\). This is the key
takeaway for computing LOOCV efficiently in linear models.
这张幻灯片展示了最重要的公式**:“简单公式”或简称,\(CV_{(n)} = \frac{1}{n} \sum (\frac{y_i -
\hat{y}_i}{1 - h_i})^2\)。这是在线性模型中高效计算 LOOCV
的关键要点。
Slide 32 (.../103701.jpg): This is
the key results image. It contrasts the LOOCV error
curve (left) with the 10-fold CV error curves (right). It clearly shows
that LOOCV produces a single, stable error curve, while 10-fold CV
results vary slightly each time it’s run due to the random data splits.
这是关键结果图**。它将 LOOCV 误差曲线(左)与 10 倍 CV
误差曲线(右)进行了对比。它清楚地表明,LOOCV
产生了单一、稳定的误差曲线,而由于数据分割的随机性,10 倍 CV
的结果每次运行时都会略有不同。
4. Cross-Validation Overview
These slides explain Cross-Validation (CV), a method
used to estimate the test error of a model, helping to select the best
level of flexibility (e.g., the best polynomial degree). It’s an
improvement over a single validation set because it uses all the data
for both training and validation at different times.
这是一种用于估算模型测试误差的方法,有助于选择最佳的灵活性(例如,最佳多项式次数)。它比单个验证集有所改进,因为它在不同时间使用所有数据进行训练和验证。
The two main types discussed are K-fold CV and
Leave-One-Out CV (LOOCV). 主要讨论的两种类型是K
折交叉验证和留一法交叉验证 (LOOCV)。
K-Fold Cross-Validation K
折交叉验证
This is the most common method.
The Process
As shown in the slides, the K-fold CV process is:
1. Divide the dataset randomly into \(K\) non-overlapping groups (or "folds"), usually of equal size. Common choices are \(K=5\) or \(K=10\). 将数据集随机划分为 \(K\) 个不重叠的组(或"折"),通常大小相等。常见的选择是 \(K=5\) 或 \(K=10\)。
2. Iterate \(K\) times: In each iteration \(i\), use the \(i\)-th fold as the validation set and all other \(K-1\) folds combined as the training set. 迭代 \(K\) 次:在每次迭代 \(i\) 中,使用第 \(i\) 个样本集作为验证集,并将所有其他 \(K-1\) 个样本集合并作为训练集。
3. Calculate the Mean Squared Error (\(MSE_i\)) on the validation fold. 计算验证集的均方误差 (\(MSE_i\))。
4. Average all \(K\) error estimates to get the final CV score. 平均所有 \(K\) 个误差估计值,得到最终的 CV 分数。
Key Formula
The final K-fold CV error estimate is the average of the errors from each fold: 最终的 K 折 CV 误差估计值是每个样本集误差的平均值:
\[
CV_{(K)} = \frac{1}{K} \sum_{i=1}^{K} MSE_i
\]
Important Image: The Concept
The diagram in slide 104145.png is the most important
for understanding the concept of K-fold CV. It shows a dataset
split into 5 folds (\(K=5\)). The
process is repeated 5 times, with a different fold (in beige) held out
as the validation set in each run, while the rest (in blue) is used for
training. 它展示了一个被分成 5 个样本集 (\(K=5\)) 的数据集。该过程重复 5
次,每次运行都会保留一个不同的折叠(米色)作为验证集,其余折叠(蓝色)用于训练。
Leave-One-Out
Cross-Validation (LOOCV)
LOOCV is just a special case of K-fold CV where \(K = n\) (the total number of observations). LOOCV 只是 K 折交叉验证的一个特例,其中 \(K = n\)(观测值总数)。
* You create \(n\) "folds," each containing just one data point. 创建 \(n\) 个"折叠",每个折叠仅包含一个数据点。
* You train the model \(n\) times, each time leaving out a single different observation and then calculating the error for that one point. 对模型进行 \(n\) 次训练,每次都省略一个不同的观测值,然后计算该点的误差。
Key Formulas
Standard Definition: The LOOCV error is the
average of the \(n\) squared errors:
\[CV = \frac{1}{N} \sum_{i=1}^{N}
e_{[i]}^2\] where \(e_{[i]} = y_i -
\hat{y}_{[i]}\) is the prediction error for the \(i\)-th observation, calculated from a model
that was trained on all data except the \(i\)-th observation. This looks
computationally expensive. LOOCV 误差是 \(n\) 个平方误差的平均值: \[CV = \frac{1}{N} \sum_{i=1}^{N}
e_{[i]}^2\] 其中 \(e_{[i]} = y_i -
\hat{y}_{[i]}\) 是第 \(i\)
个观测值的预测误差,该误差由一个使用除第 \(i\)
个观测值以外的所有数据训练的模型计算得出。这看起来计算成本很高。
Fast Computation (for Linear Regression): A key
point from the slides is that for linear regression, you don’t need to
re-fit the model \(N\) times. You can
fit the model once on all \(N\) data points and use the following
shortcut: \[CV = \frac{1}{N} \sum_{i=1}^{N}
\left( \frac{e_i}{1 - h_i} \right)^2\]
\(e_i = y_i - \hat{y}_i\) is the
standard residual (from the model fit on all data).
\(h_i\) is the leverage
statistic for the \(i\)-th
observation (the \(i\)-th diagonal
entry of the “hat matrix” \(H\)). This
makes LOOCV as fast to compute as a single model fit.
对于线性回归,您无需重新拟合模型 \(N\)
次。您可以对所有 \(N\)
个数据点一次性地拟合模型,并使用以下快捷方式: \[CV = \frac{1}{N} \sum_{i=1}^{N} \left(
\frac{e_i}{1 - h_i} \right)^2\]
The Python code in slide 104156.jpg shows how to use
10-fold CV to find the best polynomial degree for a model.
Code Understanding (Slide
104156.jpg)
Here’s a breakdown of the key sklearn parts:
from sklearn.pipeline import make_pipeline:
This is used to chain steps. The pipeline
make_pipeline(PolynomialFeatures(degree), LinearRegression())
first creates polynomial features (like \(x\), \(x^2\), \(x^3\)) and then fits a linear model to
them.
from sklearn.model_selection import KFold:
This object is used to define the \(K\)-fold split strategy.
kf = KFold(n_splits=10, shuffle=True, random_state=1)
creates a 10-fold splitter that shuffles the data first.
from sklearn.model_selection import cross_val_score:
This is the most important function.
scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
This one function does all the work: it takes the model
(the pipeline), the data X and y, and the CV
splitter kf. It automatically trains and evaluates the
model 10 times and returns an array of 10 scores (one for each
fold).
scoring='neg_mean_squared_error' is used because
cross_val_score expects a higher score to be
better. Since we want to minimize MSE, we use
negative MSE.
avg_mse = -scores.mean(): The code
averages the 10 scores and flips the sign back to positive to get the
final CV (MSE) estimate for that polynomial degree.
Important Image: The Results
The plots in slides 104156.jpg (Python) and
104224.png (R) show the key result.
X-axis: Degree of Polynomial (model
complexity).多项式的次数(模型复杂度)。
Interpretation: The plot shows a clear “U” shape.
The error is high for degree 1 (a simple line), drops to its minimum at
degree 2 (a quadratic \(ax^2
+ bx + c\)), and then starts to rise again for higher degrees.
This rise indicates overfitting—the more complex models
are fitting the training data’s noise, leading to worse performance on
unseen validation data. 该图呈现出清晰的“U”形。1
次(一条简单的直线)时误差较大,在2 次(二次 \(ax^2 + bx +
c\))时降至最小,然后随着次数的增加,误差再次上升。这种上升表明过拟合——更复杂的模型会拟合训练数据的噪声,导致在未见过的验证数据上的性能下降。
Conclusion: The 10-fold CV analysis suggests that a
quadratic model (degree 2) is the best choice, as it
provides the lowest estimated test error. 10 倍 CV
分析表明二次模型(2
次)是最佳选择,因为它提供了最低的估计测试误差。
Let’s dive into the details of that proof.
Detailed
Summary: The “Fast Computation of LOOCV” Proof
The most mathematically dense and important part of your slides is
the proof (spanning slides 104126.jpg,
104132.png, and 104136.png) that LOOCV, which
seems computationally very expensive, can be calculated quickly for
linear regression. LOOCV
虽然计算成本看似非常高,但对于线性回归来说,它可以快速计算。
### The Goal
The goal is to prove that the LOOCV statistic, which is defined as:
\[CV = \frac{1}{N} \sum_{i=1}^{N} e_{[i]}^2
\quad \text{where } e_{[i]} = y_i - \hat{y}_{[i]}\] (Here, \(\hat{y}_{[i]}\) is the prediction for \(y_i\) from a model trained on all data
except point \(i\)).(其中,\(\hat{y}_{[i]}\) 表示基于除点 \(i\) 之外的所有数据训练的模型对 \(y_i\) 的预测)。
…can be computed without re-fitting the model \(N\) times, using this “fast” formula:
无需重新拟合模型 \(N\)
次即可计算,使用以下“快速”公式: \[CV =
\frac{1}{N} \sum_{i=1}^{N} \left( \frac{e_i}{1 - h_i} \right)^2\]
(Here, \(e_i\) is the standard
residual and \(h_i\) is the
leverage, both from a single model fit on all
data).
The entire proof boils down to showing one identity: \(e_{[i]} = e_i / (1 - h_i)\).
Key
Definitions (The Matrix Algebra Setup) (矩阵代数设置)
Model 模型:\(\mathbf{Y}
= \mathbf{X}\beta + \mathbf{e}\)
Full Data Estimate 完整数据估计 (\(\hat{\beta}\)):\(\hat{\beta} =
(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\)
Hat Matrix 帽子矩阵 (\(\mathbf{H}\)):\(\mathbf{H} =
\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)
Full Data Residual 完整数据残差 (\(e_i\)):\(e_i = y_i - \hat{y}_i = y_i -
\mathbf{x}_i^T\hat{\beta}\)
Leverage (\(h_i\)) 杠杆
(\(h_i\)): The \(i\)-th diagonal element of \(\mathbf{H}\). \(h_i =
\mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\)
The proof’s “trick” is to relate the “full data” matrix \((\mathbf{X}^T\mathbf{X})\) to the
“leave-one-out” matrix \((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})\).
证明的“技巧”是将“全数据”矩阵 \((\mathbf{X}^T\mathbf{X})\) 与“留一法”矩阵
\((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})\)
关联起来。
Step 1: Relating the Full and Leave-One-Out Matrices
The full sum-of-squares matrix is just the leave-one-out matrix plus the one observation’s contribution: 完整的平方和矩阵就是留一法矩阵加上一个观测值的贡献: \[\mathbf{X}^T\mathbf{X} = \mathbf{X}_{[i]}^T\mathbf{X}_{[i]} + \mathbf{x}_i\mathbf{x}_i^T\]
This means: \(\mathbf{X}_{[i]}^T\mathbf{X}_{[i]} = \mathbf{X}^T\mathbf{X} - \mathbf{x}_i\mathbf{x}_i^T\)
Step 2: The Key
Matrix Trick (Slide 104132.png)
We need the inverse \((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1}\)
to calculate \(\hat{\beta}_{[i]}\).
Finding this inverse directly is hard. Instead, we use the
Sherman-Morrison-Woodbury formula, which tells us how
to find the inverse of a matrix that’s been “updated” (in this case, by
subtracting \(\mathbf{x}_i\mathbf{x}_i^T\)).
The slide applies this formula to get: \[(\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_i}\]
This is the most complex step, but it’s a standard matrix identity. It’s crucial because it expresses the “leave-one-out” inverse in terms of the “full data” inverse \((\mathbf{X}^T\mathbf{X})^{-1}\), which we already have.
Step 3: A New Formula for \(\hat{\beta}_{[i]}\)
Now we can write a new formula for \(\hat{\beta}_{[i]}\) by substituting the
result from Step 2. We also note that \(\mathbf{X}_{[i]}^T\mathbf{Y}_{[i]} =
\mathbf{X}^T\mathbf{Y} - \mathbf{x}_i y_i\).
The slide then shows the algebra to simplify this big expression.
When you expand and simplify everything, you get a much cleaner
result:
\[\hat{\beta}_{[i]} = \hat{\beta} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \frac{e_i}{1 - h_i}\]
This is a beautiful result! It says the LOOCV coefficient vector is just the full coefficient vector minus a small adjustment term related to the \(i\)-th observation’s residual (\(e_i\)) and leverage (\(h_i\)). 这是一个非常棒的结果!它表明 LOOCV 系数向量就是完整的系数向量减去一个与第 \(i\) 个观测值的残差 (\(e_i\)) 和杠杆率 (\(h_i\)) 相关的小调整项。
Step 4: Finding \(e_{[i]}\) (Slide
104136.png)
This is the final step. We use the definition of \(e_{[i]}\) and the result from Step 3.
这是最后一步。我们使用 \(e_{[i]}\)
的定义和步骤 3 的结果。
Start with the definition: \(e_{[i]} = y_i - \mathbf{x}_i^T\hat{\beta}_{[i]}\). Substituting the Step 3 expression for \(\hat{\beta}_{[i]}\) and using \(h_i = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\): \[e_{[i]} = y_i - \mathbf{x}_i^T\hat{\beta} + \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \frac{e_i}{1 - h_i} = e_i + \frac{h_i e_i}{1 - h_i}\]
This gives the final, simple relationship: \[e_{[i]} = \frac{e_i}{1 - h_i}\]
Conclusion
By proving this identity, the slides show that to get all \(N\) of the “leave-one-out” errors, you only need to:
1. Fit one linear regression model on all the data.
2. Calculate the standard residuals \(e_i\) and the leverage values \(h_i\) for all \(N\) points.
3. Apply the formula \(e_i / (1 - h_i)\) for each point.
This turns a procedure that looked like it would take \(N\) times the work into a procedure that
takes only 1 model fit. This is why LOOCV is a
practical and efficient method for linear regression.
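A small numpy sketch of this shortcut, on assumed toy data, with a brute-force leave-one-out loop included purely to check that the two computations agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])          # design matrix with an intercept
y = 1.0 + 2.0 * x + rng.normal(size=n)

# One fit on all the data
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat                          # ordinary residuals e_i

# Leverages h_i = diagonal of the hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Fast LOOCV: average of (e_i / (1 - h_i))^2
loocv_fast = np.mean((e / (1 - h)) ** 2)

# Brute-force check: actually leave each point out and refit
loocv_slow = 0.0
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    loocv_slow += (y[i] - X[i] @ b) ** 2
loocv_slow /= n

print(loocv_fast, loocv_slow)   # the two values agree up to rounding
```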
The central purpose of cross-validation is to estimate the
true test error of a machine learning model. This is crucial
for:
Model Assessment: Evaluating how well a model will
perform on new, unseen data. 评估模型在新的、未见过的数据上的表现。
Model Selection: Choosing the best level of model
flexibility (e.g., the degree of a polynomial or the value of \(K\) in KNN) to avoid
overfitting.
选择最佳的模型灵活性水平(例如,多项式的次数或 KNN 中的 \(K\)
值),以避免过拟合。
As the slides show, training error (the error on the
data the model was trained on) consistently decreases as model
complexity increases. However, the test error follows a
U-shape: it first decreases (as the model learns the true signal) and
then increases (as the model starts fitting the noise, or
“overfitting”). CV helps find the minimum point of this U-shaped test
error curve.
训练误差(模型训练数据的误差)随着模型复杂度的增加而持续下降。然而,测试误差呈现
U
形:它先下降(当模型学习真实信号时),然后上升(当模型开始拟合噪声,即“过拟合”时)。交叉验证有助于找到这条
U 形测试误差曲线的最小值。
Important Images 🖼️
The most important image is on Slide 61.
These two plots perfectly illustrate the concept:
Blue Line (Training Error): Always goes down.
Brown Line (True Test Error): Forms a “U” shape.
This is what we want to find the minimum of, but it’s unknown
in practice.
Black Line (10-fold CV Error): This is our
estimate of the test error. Notice how closely it tracks the
brown line. The minimum of the CV curve (marked with an ‘x’) is very
close to the minimum of the true test error.
This shows why CV works: it provides a reliable estimate to
guide our choice of model (e.g., polynomial degree 3-4 for logistic
regression, or \(K \approx 10\) for
KNN).
For regression, we often use Mean Squared Error (MSE). For
classification, the slides introduce the classification error
rate.
For Leave-One-Out Cross-Validation (LOOCV), the error for a single observation \(i\) is: \[Err_i = I(y_i \neq \hat{y}_i^{(i)})\]
* \(y_i\) is the true label for observation \(i\).
* \(\hat{y}_i^{(i)}\) is the model’s prediction for observation \(i\) when the model was trained on all other observations except \(i\).
* \(I(\dots)\) is an indicator function: it’s \(1\) if the condition is true (prediction is wrong) and \(0\) if false (prediction is correct).
The total CV error is simply the average of these
individual errors, which is the overall fraction of incorrect
classifications: \[CV_{(n)} = \frac{1}{n}
\sum_{i=1}^{n} Err_i\] The slides also show examples using
Log Loss (Slide 64), which is another common and
sensitive metric for classification. The logistic regression model
itself is defined by: \[P(Y=1 | X) =
\frac{1}{1 + \exp(-\beta_0 - \beta_1 X_1 - \beta_2 X_2 -
\dots)}\]
The slides provide two key Python examples. Both manually implement
K-fold cross-validation to show how it works.
1. KNN Regression (Slide 52)
KNN 回归
Goal: Find the best n_neighbors (K)
for a KNeighborsRegressor. 为
KNeighborsRegressor 找到最佳的 n_neighbors
(K)。
Logic:
It creates a KFold object to split the data into 10
folds (n_splits=10). 创建一个 KFold
对象,将数据拆分成 10 个折叠(n_splits=10)。
It has an outer loop that iterates through
different values of \(K\) (from 1 to
10). 它有一个 外循环,迭代不同的 \(K\) 值(从 1 到 10)。
It has an inner loop that iterates through the 10
folds (for train_index, test_index in kfold.split(X)).
它有一个 内循环,迭代这 10
个折叠(for train_index, test_index in kfold.split(X))。
Inside the inner loop:
It trains a KNeighborsRegressor on the 9 training folds
(X_train, y_train).
It makes predictions on the 1 held-out test fold
(X_test).
It calculates the mean squared error for that fold and stores
it.
After the inner loop: It averages the 10 error
scores (one from each fold) to get the final CV error for that specific
\(K\). 对 10
个误差分数(每个集一个)求平均值,得到该特定 \(K\) 的最终 CV 误差。
The final plot shows this CV error vs. \(K\), allowing us to pick the \(K\) with the lowest error. 最终图表显示了
CV 误差与 \(K\)
的关系,使我们能够选择误差最小的 \(K\)。
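A hedged sketch of that nested-loop structure (the toy data, the neighbor range 1–10, and the seed are assumptions; the slide's own dataset isn't reproduced here):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
neighbors = range(1, 11)
cv_errors = []

for n_k in neighbors:                                  # outer loop over K in KNN
    mse_errors_k = []
    for train_index, test_index in kfold.split(X):     # inner loop over the 10 folds
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        knn = KNeighborsRegressor(n_neighbors=n_k).fit(X_train, y_train)
        mse_errors_k.append(mean_squared_error(y_test, knn.predict(X_test)))
    cv_errors.append(np.mean(mse_errors_k))            # CV error for this K

best_k = list(neighbors)[int(np.argmin(cv_errors))]
print(cv_errors, best_k)
```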
2.
Logistic Regression with Polynomials (Slide 64)
使用多项式的逻辑回归
Goal: Find the best degree for
PolynomialFeatures used with
LogisticRegression.
Logic: This is very similar to the KNN example but
uses a different model and error metric.
It sets up a 10-fold split (kf = KFold(...)).
An outer loop iterates through the
degree\(d\) (from 1 to
10).
An inner loop iterates through the 10 folds.
Inside the inner loop:
It creates PolynomialFeatures of degree \(d\).
It transforms the 9 training folds (X_train) into
polynomial features (X_train_poly).
It trains a LogisticRegression model on
X_train_poly.
It transforms the 1 held-out test fold (X_test) using
the same polynomial transformer.
It calculates the log_loss on the test fold.
After the inner loop: It averages the 10
log_loss scores to get the final CV error for that
degree.
The plot shows CV error vs. degree, and the minimum is clearly at
degree=3.
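A comparable sketch for this second example, again on assumed toy data (the degree range, the simulated class boundary, and the `max_iter` setting are assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=300) > 1).astype(int)

kf = KFold(n_splits=10, shuffle=True, random_state=1)
cv_errors = []

for d in range(1, 6):                                   # outer loop over degree
    fold_losses = []
    for train_index, test_index in kf.split(X):         # inner loop over folds
        poly = PolynomialFeatures(degree=d)
        X_train_poly = poly.fit_transform(X[train_index])
        X_test_poly = poly.transform(X[test_index])      # same transformer as training
        clf = LogisticRegression(max_iter=1000).fit(X_train_poly, y[train_index])
        fold_losses.append(log_loss(y[test_index], clf.predict_proba(X_test_poly)))
    cv_errors.append(np.mean(fold_losses))               # CV log loss for this degree

print(cv_errors)
```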
The
Bias-Variance Trade-off in CV CV 中的偏差-方差权衡
This is a key theoretical point from Slide 54 that
answers the questions on Slide 65. It compares LOOCV (\(K=n\)) with K-fold CV (\(K=5\) or \(10\)). 这是幻灯片
54中的一个关键理论点,它回答了幻灯片 65 中的问题。它比较了
LOOCV(K=n)和 K 倍 CV(K=5 或 10)。
LOOCV (K=n):
Bias: Very low. The model is
trained on \(n-1\) samples, which is
almost the full dataset. The resulting error estimate is nearly unbiased
for the true test error. 该模型基于 \(n-1\)
个样本进行训练,这几乎是整个数据集。得到的误差估计对于真实测试误差几乎没有偏差。
Variance: Very high. You are
training \(n\) models that are
almost identical to each other (they only differ by one data
point). Averaging these highly correlated error estimates doesn’t reduce
the variance much, making the CV estimate unstable.
非常高。您正在训练 \(n\)
个彼此几乎相同的模型(它们仅相差一个数据点)。对这些高度相关的误差估计求平均值并不能显著降低方差,从而导致
CV 估计不稳定。
K-Fold CV (K=5 or 10):
Bias: Slightly higher than LOOCV.
The models are trained on, for example, 90% of the data. Since they are
trained on less data, they might perform slightly worse. This
means K-fold CV tends to slightly overestimate the true test
error (Slide 66).
Variance: Much lower than LOOCV.
The 10 models are trained on more different “chunks” of data (they
overlap less), so their error estimates are less correlated. Averaging
less-correlated estimates significantly reduces the overall
variance.
Conclusion: We generally prefer 10-fold
CV over LOOCV. It gives a much more stable (low-variance)
estimate of the test error, even if it’s slightly more biased
(overestimating the error, which is a safe/conservative estimate).
我们通常更喜欢10 倍交叉验证而不是
LOOCV。它能给出更稳定(低方差)的测试误差估计值,即使它的偏差略大(高估了误差,这是一个安全/保守的估计值)。
The Core Problem &
Scenarios (Slides 47-51)
These slides use three scenarios to show why we need
cross-validation (CV). The goal is to pick the right level of
model flexibility (e.g., the degree of a polynomial or
the complexity of a spline) to minimize the Test MSE
(Mean Squared Error), which we can’t see in real life.
这些幻灯片使用了三种场景来说明为什么我们需要交叉验证
(CV)。目标是选择合适的模型灵活性(例如,多项式的次数或样条函数的复杂度),以最小化测试均方误差(Mean
Squared Error),而这在现实生活中是无法观察到的。
The Curves (Slide 47): This slide is
central.
True Test MSE (Blue) 真实测试均方误差(蓝色):
This is the real error on new data. It has a
U-shape. Error is high for simple models (high bias),
drops as the model fits the data, and rises again for overly complex
models (high variance, or overfitting).
这是新数据的真实误差。它呈 U 形。对于简单模型(高偏差),误差较高;随着模型拟合数据的深入,误差会下降;对于过于复杂的模型(高方差或过拟合),误差会再次上升。
LOOCV (Black Dashed) & 10-Fold CV (Orange)
LOOCV(黑色虚线)和 10 倍 CV(橙色): These are our
estimates of the true test MSE. Notice how closely they track
the blue curve. The ‘x’ marks the minimum of the CV curve, which is our
best guess for the model with the minimum test MSE.
这些是我们对真实测试 MSE
的估计。请注意它们与蓝色曲线的吻合程度。“x”标记 CV
曲线的最小值,这是我们对具有最小测试 MSE
的模型的最佳猜测。
Scenario 1 (Slide 48): The true relationship is
non-linear. The right-hand plot shows that the test MSE (red curve) is
high for the simple linear model (blue square), but lower for the more
flexible smoothing splines (teal squares). CV helps us find the “sweet
spot.”
真实的关系是非线性的。右侧图表显示,对于简单的线性模型(蓝色方块),测试
MSE(红色曲线)较高,而对于更灵活的平滑样条函数(蓝绿色方块),测试 MSE
较低。CV 帮助我们找到“最佳点”。
Scenario 2 (Slide 49): The true relationship is
linear. Here, the test MSE (red curve) is
lowest for the simplest model (the linear one, blue square). CV
correctly identifies this, and its error estimate (blue square) is
lowest for that model.
真实的关系是线性的。在这里,对于最简单的模型(线性模型,蓝色方块),测试
MSE(红色曲线)最低。CV
正确地识别了这一点,并且其误差估计(蓝色方块)是该模型中最低的。
Scenario 3 (Slide 50): The true relationship is
highly non-linear. The linear model (orange) is a very
poor fit. The test MSE (red curve) is minimized by the most flexible
model (teal square). CV again finds this.
真实的关系是高度非线性的。线性模型(橙色)拟合度很差。测试
MSE(红色曲线)被最灵活的模型(蓝绿色方块)最小化。CV
再次发现了这一点。
Key Takeaway (Slide 51): We use CV to find the
tuning parameter (like polynomial degree) that
minimizes the test error. We care less about the actual value
of the CV error and more about where its minimum is. 我们使用
CV
来找到最小化测试误差的调整参数(例如多项式次数)。我们不太关心
CV 误差的实际值,而更关心它的最小值。
CV for Classification
(Slides 55-61)
This section shifts from regression (predicting a number, using MSE)
to classification (predicting a category, like “blue” or “orange”).
本节从回归(使用 MSE
预测数字)转向分类(预测类别,例如“蓝色”或“橙色”)。
New Error Metric (Slide 55): We can’t use MSE. A
natural choice is the classification error rate.
我们不能使用 MSE。一个自然的选择是分类错误率。
\(Err_i = I(y_i \neq
\hat{y}_i^{(i)})\)
This is an indicator function: it is
1 if the prediction for the \(i\)-th data point (when trained
without it) is wrong, and 0 if it’s correct.
如果对第 \(i\)
个数据点的预测(在没有它的情况下训练时)错误,则为
1;如果正确,则为 0。
The final CV error is just the average of these 0s and 1s, giving
the total fraction of misclassified points: \(CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n}
Err_i\) 最终的 CV 误差就是这些 0 和 1
的平均值,即错误分类点的总比例:\(CV_{(n)} =
\frac{1}{n} \sum_{i=1}^{n} Err_i\)
The Example (Slides 56-61):
Slides 56-58: We are shown a “true” (but unknown)
non-linear boundary (purple dashed line) separating two classes. We then
try to estimate this boundary using logistic regression with
different polynomial degrees (degree 1, 2, 3, 4).
我们看到了一条“真实”(但未知)的非线性边界(紫色虚线),它将两个类别分开。然后,我们尝试使用不同次数(1、2、3、4
次)的逻辑回归来估计这条边界。
Slides 59-60: This is a crucial point. In this
simulated example, we do know the true test error
rates. The true errors are [0.201, 0.197, 0.160,
0.162]. The lowest error is for the 3rd-degree polynomial. But in a
real-world problem, we can never know these true
errors.
这一点至关重要。在这个模拟示例中,我们确实知道真实的测试错误率。真实误差为
[0.201, 0.197, 0.160,
0.162]。最小误差出现在三次多项式中。但在实际问题中,我们永远无法知道这些真实误差。
Slide 61 (The Solution): This is the most important
image. It shows how CV solves the problem from slide 60.展示了
CV 如何解决幻灯片 60 中的问题。
Brown Curve (Test Error): This is the
true test error (from slide 59). We can’t see this in practice.
Its minimum is at degree 3. 这是真实的测试误差(来自幻灯片
59)。我们在实践中看不到它。它的最小值在 3 次方处。
Black Curve (10-fold CV Error): This is what we
can calculate. It’s our estimate of the test error.
Crucially, its minimum is also at degree 3.
This proves that CV successfully found the best model (degree 3)
without ever seeing the true test error. The same logic is
shown for the KNN classifier on the right.
The slides show how to manually implement K-fold CV. This is
great for understanding, even though libraries like
GridSearchCV can do this automatically.
KNN Regression (Slide 52):
kfold = KFold(n_splits=10, ...): Creates an object that
knows how to split the data into 10 folds.
for n_k in neighbors:: This is the outer
loop to test different \(K\)
values (e.g., \(K\)=1, 2, 3…).
for train_index, test_index in kfold.split(X):: This is
the inner loop. For a single \(K\), it loops 10 times.
Inside the inner loop:
It splits the data into a 9-fold training set (X_train)
and a 1-fold test set (X_test).
It trains a KNeighborsRegressor on
X_train.
It makes predictions on X_test and calculates the error
(mean_squared_error).
cv_errors.append(np.mean(mse_errors_k)): After the
inner loop finishes 10 runs, it averages the 10 error scores for that
\(K\) and stores it.
The final plot shows cv_errors
vs. neighbors, letting you pick the \(K\) with the lowest average error.
This code is almost identical, but with three key differences:
The model is LogisticRegression.
It uses PolynomialFeatures to create new features
(\(X^2, X^3,\) etc.) inside
the loop.
The error metric is log_loss (a common, more sensitive
metric than the simple 0/1 error rate).
The plot on slide 64 shows the 10-fold CV error (using Log Loss)
vs. the Degree of the Polynomial. The minimum is clearly at
Degree = 3, matching the finding from slide 61.
Answering the Key
Questions (Slides 54 & 65)
Slide 65 asks two critical questions, which are answered directly by
the concepts on Slide 54 (Bias and variance
trade-off).
Q1:
How does K affect the bias and variance of the CV error?
This refers to \(K\) in K-fold CV
(not to be confused with \(K\) in KNN).
K 如何影响 CV 误差的偏差和方差?
Bias:
LOOCV (K = n): This has very low
bias. The model is trained on \(n-1\) samples, which is almost the
full dataset. So, the error estimate \(CV_{(n)}\) is an almost-unbiased estimate
of the true test error. 它的偏差非常低。该模型基于
\(n-1\)
个样本进行训练,这几乎是整个数据集。因此,误差估计 \(CV_{(n)}\)
是对真实测试误差的几乎无偏估计。
K-Fold (K < n, e.g., K=10): This has
slightly higher bias. The models are trained on, for
example, 90% of the data. Because they are trained on less data, they
might perform slightly worse than a model trained on 100% of
the data. This “pessimism” is the source of the bias.
偏差略高。例如,这些模型是基于 90%
的数据进行训练的。由于它们基于较少的数据进行训练,因此它们的性能可能会比基于
100% 数据进行训练的模型略差。这种“悲观”正是偏差的根源。
Variance:
LOOCV (K = n): This has very high
variance. You are training \(n\) models that are almost
identical (they only differ by one data point). Averaging \(n\) highly-correlated error estimates
doesn’t reduce the variance much. This makes the final \(CV_{(n)}\) estimate unstable.
这种模型的方差非常高。您正在训练 \(n\)
个几乎相同的模型(它们只有一个数据点不同)。对 \(n\)
个高度相关的误差估计取平均值并不能显著降低方差。这使得最终的 \(CV_{(n)}\) 估计值不稳定。
K-Fold (K < n, e.g., K=10): This has
much lower variance. The 10 models are trained on more
different “chunks” of data (they overlap less). Their error estimates
are less correlated, and averaging 10 less-correlated numbers gives a
much more stable (low-variance) final estimate.
这种模型的方差非常低。这 10
个模型基于更多不同的数据“块”进行训练(它们重叠较少)。它们的误差估计值相关性较低,对
10
个相关性较低的数取平均值可以得到更稳定(低方差)的最终估计值。
Conclusion (The Trade-off): We prefer K-fold
CV (K=5 or 10) over LOOCV. It gives a much more stable
(low-variance) estimate, and we are willing to accept a tiny increase in
bias to get it. 我们更喜欢 K 折交叉验证(K=5 或 10),而不是 LOOCV。它能给出更稳定(低方差)的估计值,并且我们愿意接受偏差的轻微增加来获得它。
Q2:
Does Cross Validation over-estimate or under-estimate the true test
error?
交叉验证会高估还是低估真实测试误差?
Based on the bias discussion above:
Cross-validation (especially K-fold) generally over-estimates
the true test error. 交叉验证(尤其是 K
倍交叉验证)通常会高估真实测试误差。
Reasoning: 1. The “true test error” is the error of
a model trained on the entire dataset (\(n\) samples). 2. K-fold CV trains its
models on subsets of the data (e.g., \(n \times (K-1)/K\) samples). 3. Since these
models are trained on less data, they are (on average) slightly
worse than the final model trained on all the data. 4. Because the CV
models are slightly worse, their error rates will be slightly
higher. 5. Therefore, the final CV error score is a slightly
“pessimistic” or high estimate. This is considered a good thing, as it’s
a conservative estimate of how our model will perform.
理由: 1.
“真实测试误差”是指在整个数据集(\(n\) 个样本)上训练的模型的误差。 2. K
折交叉验证 (K-fold CV) 在数据子集上训练其模型(例如,\(n \times (K-1)/K\) 个样本)。 3.
由于这些模型基于较少的数据进行训练,因此它们(平均而言)比基于所有数据训练的最终模型略差。
4. 由于 CV 模型略差,其错误率会略高。 5. 因此,最终的 CV
错误率是一个略微“悲观”或偏高的估计。这被认为是一件好事,因为它是对模型性能的保守估计。
6. Summary of Bootstrap
Bootstrap is a resampling technique used to estimate
the uncertainty (like standard error or confidence
intervals) of a statistic. Its key idea is to treat your
original data sample as a proxy for the true population. It
then simulates the process of drawing new samples by instead
sampling with replacement from your original
sample. Bootstrap
是一种重采样技术,用于估计统计数据的不确定性(例如标准误差或置信区间)。其核心思想是将原始数据样本视为真实总体的替代样本。然后,它通过从原始样本中进行有放回的抽样来模拟抽取新样本的过程。
The Problem
You have a single data sample (e.g., \(n=100\) people) and you calculate a
statistic, like the sample mean (\(\bar{x}\)) or a regression coefficient
(\(\hat{\beta}\)). You want to know how
accurate this statistic is. How much would it vary if you could
repeat your experiment many times? This variation is measured by the
standard error (SE). 您有一个数据样本(例如,\(n=100\)
人),并计算一个统计数据,例如样本均值 (\(\bar{x}\)) 或回归系数 (\(\hat{\beta}\))。您想知道这个统计数据的准确度。如果可以多次重复实验,它会有多少变化?这种变化可以用标准误差
(SE) 来衡量。
The Bootstrap Solution
Since you can’t re-run the whole experiment, you simulate it
using the one sample you have.
由于您无法重新运行整个实验,因此您可以使用现有的一个样本进行“模拟”。
The Process:
1. Original Sample (\(Z\)) 原始样本 (\(Z\)): You have your one dataset with \(n\) observations.
2. Bootstrap Sample (\(Z^{*1}\)) Bootstrap 样本 (\(Z^{*1}\)): Create a new dataset of size \(n\) by randomly pulling observations from your original sample with replacement. (This means some original observations will be picked multiple times, and some not at all).
3. Calculate Statistic (\(\hat{\theta}^{*1}\)) 计算统计量 (\(\hat{\theta}^{*1}\)): Calculate your statistic of interest (e.g., the mean, \(\hat{\alpha}\), regression coefficients) on this new bootstrap sample.
4. Repeat 重复: Repeat steps 2 and 3 a large number of times (\(B\), e.g., \(B=1000\)). This gives you \(B\) bootstrap statistics: \(\hat{\theta}^{*1}, \hat{\theta}^{*2}, ..., \hat{\theta}^{*B}\).
5. Analyze the Bootstrap Distribution 分析自举分布: This collection of \(B\) statistics is your “bootstrap distribution.”
   * Standard Error 标准误差: The standard deviation of this bootstrap distribution is your estimate of the standard error of your original statistic.
   * Confidence Interval 置信区间: A 95% confidence interval can be found by taking the 2.5th and 97.5th percentiles of this bootstrap distribution.
Why use it? It’s powerful because it doesn’t rely on
strong theoretical assumptions (like data being normally distributed).
It can be applied to almost any statistic, even very complex
ones (like the prediction from a KNN model), for which a simple
mathematical formula for standard error doesn’t exist.
它非常强大,因为它不依赖于严格的理论假设(例如数据服从正态分布)。它几乎可以应用于任何统计数据,即使是非常复杂的统计数据(例如
KNN 模型的预测),因为这些统计数据没有简单的标准误差数学公式。
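As a minimal illustration of the recipe above, here is a sketch that bootstraps the standard error and a 95% percentile interval for a sample median, a statistic with no simple SE formula. The exponential toy data and \(B = 1000\) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.exponential(scale=2.0, size=100)   # the one observed sample (n = 100)

B = 1000
boot_stats = np.empty(B)
for b in range(B):
    # Resample n observations WITH replacement from the original sample
    z_star = rng.choice(z, size=z.size, replace=True)
    boot_stats[b] = np.median(z_star)      # statistic of interest on the bootstrap sample

se_boot = boot_stats.std(ddof=1)                    # bootstrap standard error
ci_95 = np.percentile(boot_stats, [2.5, 97.5])      # 95% percentile confidence interval
print(np.median(z), se_boot, ci_95)
```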
Mathematical Understanding
The core idea is to use the empirical distribution
(your sample) as an estimate for the true population
distribution.
其核心思想是使用经验分布(你的样本)来估计真实的总体分布。
Example: Estimating \(\alpha\)
Your slides provide an example of finding the \(\alpha\) that minimizes the variance of a
portfolio, \(var(\alpha X +
(1-\alpha)Y)\). 用于计算使投资组合方差最小化的 \(\alpha\),即 \(var(\alpha X + (1-\alpha)Y)\)。
True Population Parameter (\(\alpha\)) 真实总体参数 (\(\alpha\)): The true \(\alpha\) is a function of the population variances and covariance: 真实 \(\alpha\)
是总体方差和协方差的函数: \[\alpha
= \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 -
2\sigma_{XY}}\] We can never know this value exactly unless we
know the entire population.
除非我们了解整个总体,否则我们永远无法准确知道这个值。
Sample Statistic (\(\hat{\alpha}\)) 样本统计量 (\(\hat{\alpha}\)): We estimate \(\alpha\) using our sample, creating the statistic \(\hat{\alpha}\) by plugging in our sample variances and covariance: 我们使用样本估计 \(\alpha\),通过代入样本方差和协方差来创建统计量
\(\hat{\alpha}\): \[\hat{\alpha} = \frac{\hat{\sigma}_Y^2 -
\hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 -
2\hat{\sigma}_{XY}}\] This \(\hat{\alpha}\) is just one number
from our single sample. How confident are we in it? We need its standard
error, \(SE(\hat{\alpha})\). 这个 \(\hat{\alpha}\)
只是我们单个样本中的一个数字。我们对它的置信度有多高?我们需要它的标准误差,\(SE(\hat{\alpha})\)。
Bootstrap Statistic (\(\hat{\alpha}^*\)) 自举统计量 (\(\hat{\alpha}^*\)): We apply the
bootstrap process:
Create a bootstrap sample (by resampling with replacement).
创建一个自举样本(通过放回重采样)。
Calculate \(\hat{\alpha}^*\) using
the sample (co)variances of this new bootstrap sample.
使用这个新自举样本的样本(协)方差计算 \(\hat{\alpha}^*\)。
Repeat \(B\) times to get \(B\) values: \(\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, ...,
\hat{\alpha}^{*B}\). 重复 \(B\)
次,得到 \(B\) 个值:\(\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, ...,
\hat{\alpha}^{*B}\)。
Estimating the Standard Error 估算标准误差: The
standard error of our original estimate \(\hat{\alpha}\) is estimated by the
standard deviation of all our bootstrap estimates: 我们原始估计值 \(\hat{\alpha}\)
的标准误差是通过所有自举估计值的标准差来“估算”的: \[SE_{boot}(\hat{\alpha}) = \sqrt{\frac{1}{B-1}
\sum_{j=1}^{B} (\hat{\alpha}^{*j} - \bar{\alpha}^*)^2}\] where
\(\bar{\alpha}^*\) is the average of
all \(B\) bootstrap estimates. \(\bar{\alpha}^*\) 是所有 \(B\) 个自举估计值的平均值。
The slides (p. 73, 77-78) show this visually. The “sampling from
population” histogram (left) is the true sampling distribution,
which we can only create in a simulation. The “Bootstrap” histogram
(right) is the bootstrap distribution created from one sample.
They look very similar, which shows the method works.
“从总体抽样”直方图(左图)是真实的抽样分布,我们只能在模拟中创建它。“自举”直方图(右图)是从一个样本创建的自举分布。它们看起来非常相似,这表明该方法有效。
Code Analysis
R: \(\alpha\) Example (Slides 75 & 77)
Slide 75 (The R code): This is a SIMULATION,
not Bootstrap.
for(i in 1:m){...}: This loop runs m=1000
times.
returns <- rmvnorm(...): Inside the
loop, it draws a brand new sample from the true
population every time.
alpha[i] <- ...: It calculates \(\hat{\alpha}\) for each new sample.
Purpose: This code shows the true sampling
distribution of \(\hat{\alpha}\)
(the “Histogram of alpha”). You can only do this if you know the true
population, as in a simulation.
Slide 77 (The R code): This IS
Bootstrap.
returns <- rmvnorm(...): Outside the
loop, this is done only once to get one
original sample.
for(i in 1:B){...}: This is the bootstrap loop.
sample(1:nrow(returns), n, replace = T): This
is the key line. It randomly selects row numbers with
replacement from the single returns
dataset.
returns_boot <- returns[sample(...), ]: This creates
the bootstrap sample.
alpha_bootstrap[i] <- ...: It calculates \(\hat{\alpha}^*\) on the
returns_boot sample.
Purpose: This code generates the bootstrap
distribution (the “Bootstrap” histogram on slide 78) to
estimate the true sampling distribution.
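For readers working in Python, a rough analogue of the slide-77 bootstrap might look like the sketch below. The simulated returns and the covariance values are stand-ins, since the slide's data isn't available here; only the resample-with-replacement loop mirrors the R code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stand-in for the slides' (X, Y) returns: one observed sample of size n
n = 100
cov = np.array([[1.0, 0.5],
                [0.5, 1.25]])
returns = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

def alpha_hat(data):
    """Plug-in estimate of the variance-minimising portfolio weight."""
    sx2, sy2 = data[:, 0].var(ddof=1), data[:, 1].var(ddof=1)
    sxy = np.cov(data[:, 0], data[:, 1], ddof=1)[0, 1]
    return (sy2 - sxy) / (sx2 + sy2 - 2 * sxy)

B = 1000
alpha_boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)          # sample row indices with replacement
    alpha_boot[b] = alpha_hat(returns[idx])   # alpha-hat* on the bootstrap sample

print(alpha_hat(returns), alpha_boot.std(ddof=1))   # estimate and its bootstrap SE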
R: Linear Regression
Example (Slides 79 & 81)
Slide 79:
boot.fn <- function(data, index){ ... }: Defines a
function that the boot package needs. It takes data and an
index vector.
lm(mpg~horsepower, data=data, subset=index): This is
the core. It fits a linear model only on the data points
specified by the index. The boot function will
automatically supply this index as a
resampled-with-replacement vector.
boot(Auto, boot.fn, R=1000): This runs the bootstrap.
It calls boot.fn 1000 times, each time with a new resampled
index, and collects the coefficients.
Slide 81:
summary(lm(...)): Shows the standard output. The “Std.
Error” column (e.g., 0.860, 0.006) is calculated using mathematical
theory.
boot.res: Shows the bootstrap output. The “std. error”
column (e.g., 0.841, 0.007) is the standard deviation of the
1000 bootstrap estimates.
Main Point: The standard errors from the bootstrap
are very close to the theoretical ones. This confirms the uncertainty.
If the model assumptions were violated, the bootstrap SE would be more
trustworthy.
The histograms show the bootstrap distributions for the intercept
(t1*) and the slope (t2*). The arrows show the
95% percentile confidence interval.
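A hedged Python counterpart of the boot.fn/boot workflow could look like this; the synthetic horsepower/mpg arrays stand in for the Auto data, which isn't loaded here.

```python
import numpy as np

# Assumed stand-in for Auto$horsepower and Auto$mpg
rng = np.random.default_rng(0)
n = 392
horsepower = rng.uniform(50, 230, size=n)
mpg = 40 - 0.15 * horsepower + rng.normal(scale=4.0, size=n)

def coef_fn(x, y, index):
    """Fit mpg ~ horsepower on the rows in `index`; return (intercept, slope)."""
    X = np.column_stack([np.ones(index.size), x[index]])
    return np.linalg.lstsq(X, y[index], rcond=None)[0]

B = 1000
boot_coefs = np.empty((B, 2))
for b in range(B):
    idx = rng.integers(0, n, size=n)            # resampled-with-replacement index
    boot_coefs[b] = coef_fn(horsepower, mpg, idx)

# Bootstrap standard errors = std. dev. of the B coefficient estimates
print(boot_coefs.std(axis=0, ddof=1))
```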
Python: KNN Regression
Example (Slide 80)
This shows how to get a confidence interval for a single
prediction.
for i in range(n_bootstraps):: The bootstrap loop.
indices = np.random.choice(train_samples.shape[0], train_samples.shape[0], replace=True):
This is the key line in Python (like
sample in R). It gets a new set of indices with
replacement.
X_boot, y_boot = ...: Creates the bootstrap
sample.
model.fit(X_boot, y_boot): A new KNN model is
trained on this bootstrap sample.
bootstrap_preds.append(model.predict(predict_point)):
The model (trained on \(Z^{*i}\)) makes
a prediction for the same fixed point. This is repeated 1000
times.
Result: You get a distribution of
predictions for that one point. The 2.5th and 97.5th percentiles of
this distribution give you a 95% confidence interval for that
specific prediction. 你会得到该点的预测分布。该分布的 2.5
和 97.5 百分位数为该特定预测提供了 95% 的置信区间。
Python: KNN on Auto data
(Slide 82)
BE CAREFUL: This slide does NOT show
Bootstrap. It shows K-Fold Cross-Validation
(CV).
Purpose: The goal here is not to find
uncertainty. The goal is to find the best
hyperparameter (the best value for \(k\), the number of neighbors).
Method:
kf = KFold(n_splits=10): Splits the data into 10 chunks
(“folds”).
for train_index, test_index in kf.split(X):: It loops
10 times. Each time, it trains on 9 chunks and tests on 1 chunk.
Key Difference for Exam:
Bootstrap: Samples with replacement to
estimate uncertainty/standard error.
Cross-Validation: Splits data without
replacement into \(K\) folds to
estimate model performance/prediction error and tune
hyperparameters.
自举法:使用有放回的样本来估计不确定性/标准误差。
交叉验证:将数据无放回地分成 \(K\)
份,以估计模型性能/预测误差并调整超参数。
7.
The mathematical theory of Bootstrap and the extension to
Cross-Validation (CV).
1. Code
Analysis: Bootstrap for a KNN Prediction (Slide 85)
This Python code shows a different use of bootstrap: finding
the confidence interval for a single prediction, not for a
model coefficient.
Goal: To estimate the uncertainty of a KNN model’s
prediction for a specific new data point
(predict_point).
Process:
Train Full Model: A KNN model (knn) is
first trained on the entire dataset. It makes one prediction
(knpred) for predict_point. This is our \(\hat{f}(x_0)\).
Bootstrap Loop
(for i in range(n_bootstraps)):
indices = np.random.choice(...): This is the
core bootstrap step. It creates a new list of indices by
sampling with replacement from the original data.
X_boot, y_boot = ...: This creates the new bootstrap
dataset (\(Z^{*i}\)).
km.fit(X_boot, y_boot): A new KNN model
(km) is trained only on this bootstrap
sample.
bootstrap_preds.append(km.predict(predict_point)): This
newly trained model makes a prediction for the same predict_point. This value is \(\hat{f}^{*i}(x_0)\).
Analyze Distribution: After 1000 loops,
bootstrap_preds contains 1000 different predictions for the
same point.
Confidence Interval:
np.percentile(bootstrap_preds, [2.5, 97.5]): This finds
the 2.5th and 97.5th percentiles of the 1000 bootstrap predictions.
The resulting [lower_bound, upper_bound] (e.g.,
[13.70, 15.70]) forms the 95% confidence interval for the
prediction.
Histogram Plot: The plot on the right visually
confirms this. It shows the distribution of the 1000 bootstrap
predictions, with the 95% confidence interval marked by the red dashed
lines.
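A self-contained sketch of that loop, with toy data and an assumed n_neighbors value replacing the slide's setup:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = X[:, 0] ** 1.2 + rng.normal(scale=1.0, size=200)

predict_point = np.array([[5.0]])             # the fixed point x_0 we care about

knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)
point_pred = knn.predict(predict_point)       # \hat f(x_0) from the full data

n_bootstraps = 1000
bootstrap_preds = np.empty(n_bootstraps)
for i in range(n_bootstraps):
    indices = np.random.choice(X.shape[0], X.shape[0], replace=True)
    X_boot, y_boot = X[indices], y[indices]                 # bootstrap sample Z*
    km = KNeighborsRegressor(n_neighbors=10).fit(X_boot, y_boot)
    bootstrap_preds[i] = km.predict(predict_point)[0]       # \hat f^{*i}(x_0)

lower, upper = np.percentile(bootstrap_preds, [2.5, 97.5])
print(point_pred[0], (lower, upper))          # prediction and its 95% interval
```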
2.
Mathematical Understanding: Why Does Bootstrap Work? (Slides 87-88)
This is the theoretical justification for the entire method. It’s
based on an analogy. 这是整个方法的理论依据。它基于一个类比。
The “True” World (Slide 87,
Top)
Population: There is a true, unknown population
distribution \(F\).
存在一个真实的、未知的总体分布 \(F\)。
Parameter: We want to know a true parameter,
\(\theta\), which is a function of
\(F\) (e.g., the true population mean).
我们想知道一个真实的参数 \(\theta\),它是 \(F\)
的函数(例如,真实的总体均值)。
Sample: We get one sample \(X_1, ..., X_n\) from \(F\). 我们从 \(F\) 中获取一个样本 \(X_1, ..., X_n\)。
Statistic: We calculate our best estimate \(\hat{\theta}\) from our sample. (e.g., the
sample mean \(\bar{x}\)). \(\hat{\theta}\) is our proxy for \(\theta\). 我们从样本中计算出最佳估计值
\(\hat{\theta}\)。(例如,样本均值
\(\bar{x}\))。\(\hat{\theta}\) 是 \(\theta\) 的替代值。
The Problem: We want to know the accuracy of
\(\hat{\theta}\). How much would \(\hat{\theta}\) vary if we could draw many
samples? We want the sampling distribution of \(\hat{\theta}\) around \(\theta\), specifically the distribution of
the error: \((\hat{\theta} - \theta)\).
我们想知道 \(\hat{\theta}\)
的准确率。如果我们可以抽取多个样本,\(\hat{\theta}\) 会有多少变化?我们想要 \(\hat{\theta}\) 围绕 \(\theta\) 的
抽样分布,具体来说是误差的分布:\((\hat{\theta} - \theta)\)。
CLT: The Central Limit Theorem states that \(\sqrt{n}(\hat{\theta} - \theta)
\xrightarrow{\text{dist}} N(0, Var_F(\theta))\).
The Catch: This is UNKNOWN
because we don’t know \(F\).这是未知的,因为我们不知道
\(F\)。
The “Bootstrap” World
(Slide 87, Bottom)
Population: We pretend our original sample
is the population. We call its distribution the “empirical
distribution,” \(\hat{F}_n\).
我们假设原始样本就是总体。我们称其分布为“经验分布”,即
\(\hat{F}_n\)。
Parameter: In this new world, the “true” parameter
is our original statistic, \(\hat{\theta}\) (which is a function of
\(\hat{F}_n\)).
在这个新世界中,“真实”参数是我们原始的统计量 \(\hat{\theta}\)(它是 \(\hat{F}_n\) 的函数)。
Sample: We draw many bootstrap samples \(X_1^*, ..., X_n^*\) from \(\hat{F}_n\) (i.e., sampling with replacement from our original sample). 我们从 \(\hat{F}_n\) 中抽取许多自举样本 \(X_1^*, ..., X_n^*\)(即从原始样本中进行有放回抽样)。
Statistic: From each bootstrap sample, we calculate
a bootstrap statistic, \(\hat{\theta}^*\).
从每个自举样本中,我们计算一个 自举统计量,即 \(\hat{\theta}^*\)。
The Solution: We can now empirically find
the distribution of \(\hat{\theta}^*\)
around \(\hat{\theta}\). We look at the
distribution of the bootstrap error: \((\hat{\theta}^* - \hat{\theta})\).
我们现在可以 凭经验 找到 \(\hat{\theta}^*\) 围绕 \(\hat{\theta}\)
的分布。我们来看看自举误差的分布:\((\hat{\theta}^* - \hat{\theta})\)。
CLT: The CLT also states that \(\sqrt{n}(\hat{\theta}^* - \hat{\theta})
\xrightarrow{\text{dist}} N(0, Var_{\hat{F}_n}(\theta))\).
The Power: This distribution is
ESTIMABLE! We just run the bootstrap \(B\) times and we get \(B\) values of \(\hat{\theta}^*\). We can then calculate
their variance, standard deviation, and percentiles directly.
这个分布是可估计的!我们只需运行 \(B\) 次自举程序,就能得到 \(B\) 个 \(\hat{\theta}^*\)
值。然后我们可以直接计算它们的方差、标准差和百分位数。
The Core Approximation (Slide
88)
The entire method relies on the assumption that the
(knowable) bootstrap distribution is a good approximation of the
(unknown) true sampling distribution.
整个方法依赖于以下假设:(已知的)自举分布能够很好地近似(未知的)真实抽样分布。
The distribution of the bootstrap error approximates the
distribution of the true error.
自举误差的分布近似于真实误差的分布。
\[\text{distribution of }
\sqrt{n}(\hat{\theta}^* - \hat{\theta}) \approx \text{distribution of }
\sqrt{n}(\hat{\theta} - \theta)\]
This is why:
* The standard deviation of the \(\hat{\theta}^*\) values is our estimate for the standard error of \(\hat{\theta}\). \(\hat{\theta}^*\) 值的标准差是我们对 \(\hat{\theta}\) 的标准误差的估计值。
* The percentiles of the \(\hat{\theta}^*\) distribution (e.g., 2.5th and 97.5th) can be used to build a confidence interval for the true parameter \(\theta\). \(\hat{\theta}^*\) 分布的百分位数(例如,第 2.5 个和第 97.5 个)可用于为真实参数 \(\theta\) 建立置信区间。
3. Extension:
Cross-Validation (CV) Analysis
CV for
Hyperparameter Tuning (Slide 84) 超参数调优的 CV
This plot is the result of the 10-fold CV code shown in the previous set of slides (slide 82).
* Purpose: To find the optimal hyperparameter \(k\) (number of neighbors) for the KNN model.
* X-axis: Number of Neighbors (\(k\)).
* Y-axis: CV Error (Mean Squared Error).
* Analysis:
  * Low \(k\) (e.g., \(k=1, 2\)): High error. The model is too complex and overfitting to the training data.
  * High \(k\) (e.g., \(k>40\)): Error slowly increases. The model is too simple and underfitting (e.g., averaging too many neighbors).
  * Optimal \(k\): The “sweet spot” is at the bottom of the “U” shape, around \(k \approx 20-30\), which gives the lowest CV error.
This is a subtle but important theoretical point.
* Our Goal: We want to know the test error of our final model (\(\hat{f}^{\text{full}}\)), which we will train on the full dataset (all \(n\) observations). 我们想知道最终模型 (\(\hat{f}^{\text{full}}\)) 的测试误差,我们将在完整数据集(所有 \(n\) 个观测值)上训练该模型。
* What CV Measures: \(k\)-fold CV does not test the final model. It tests \(k\) different models (\(\hat{f}^{(k)}\)), each trained on a smaller dataset (of size \(\frac{k-1}{k} \times n\)). \(k\) 倍 CV 不测试最终模型。它测试了 \(k\) 个不同的模型 (\(\hat{f}^{(k)}\)),每个模型都基于一个较小的数据集(大小为 \(\frac{k-1}{k} \times n\))进行训练。
The Logic:
Models trained on less data generally perform
worse than models trained on more data.
基于较少数据训练的模型通常比基于较多数据训练的模型表现更差。
The CV error is the average error of models trained on \(\frac{k-1}{k} n\) observations. CV
误差是使用 \(\frac{k-1}{k} n\)
个观测值训练的模型的平均误差。
The “true test error” is the error of the model trained on \(n\) observations. “真实测试误差”是使用
\(n\) 个观测值训练的模型的误差。
Conclusion: Since the CV models are trained on
smaller datasets, they will, on average, have a slightly higher error
than the final model. Therefore, the CV error score is a
slightly pessimistic estimate (it over-estimates) the true test
error of the final model. 由于 CV
模型是在较小的数据集上训练的,因此它们的平均误差会略高于最终模型。因此,CV
误差分数是一个略微悲观的估计(它高估了)最终模型的真实测试误差。
Correction of CV Error
(Slides 90-91)
Theory (Slide 91): Advanced theory suggests the
expected test error \(R(n)\) behaves
like \(R(n) = R^* + c/n\), where \(R^*\) is the irreducible error and \(n\) is the sample size. This formula
mathematically confirms that error decreases as sample size
\(n\) increases.
高级理论表明,预期测试误差 \(R(n)\)
的行为类似于 \(R(n) = R^* + c/n\),其中
\(R^*\) 是不可约误差,\(n\)
是样本量。该公式从数学上证实了误差会随着样本量 \(n\) 的增加而减小。
R Code (Slide 90): The cv.glm
function from the boot library automatically provides
this.
cv.err$delta: This output vector contains two
values.
[1] 24.23151 (Raw CV Error): This is the standard
Leave-One-Out CV (LOOCV) error.
[2] 24.23114 (Adjusted CV Error): This is a
bias-corrected estimate that accounts for the overestimation problem.
It’s slightly lower, representing a more accurate guess for the error of
the final model trained on all \(n\) data points.
# The “Correction of CV Error” extension.
Summary
This section provides a deeper mathematical look at why
k-fold cross-validation (CV) slightly over-estimates
the true test error. 本节从数学角度更深入地阐述了 为什么 k
折交叉验证 (CV) 会略微高估真实测试误差。
The Overestimation 高估: CV trains on \(\frac{k-1}{k}\) of the data, which is
less than the full dataset (size \(n\)). Models trained on less data are
generally worse. Therefore, the average error from CV (\(CV_k\)) is slightly higher (more
pessimistic) than the true error of the final model trained on all \(n\) data (\(R(n)\)). CV 训练的数据为 \(\frac{k-1}{k}\),小于完整数据集(大小为
\(n\))。使用较少数据训练的模型通常更差。因此,CV
的平均误差 (\(CV_k\))
略高于(更悲观地)基于所有 \(n\)
个数据训练的最终模型的真实误差 (\(R(n)\))。
A Simple Correction 简单修正: A mathematical
formula, \(\tilde{CV_k} = \frac{k-1}{k} \cdot
CV_k\), is proposed to “correct” this overestimation.
The Critical Flaw 关键缺陷: This correction is
derived assuming the irreducible error (\(R^*\)) is
zero.此修正是在假设不可约误差 (\(R^*\))
为零的情况下得出的。
The Takeaway 要点 (Code Analysis): The Python
code demonstrates a real-world scenario where there is noise
(noise_std = 0.5), meaning \(R^*
> 0\). In this case, the simple correction
fails—it produces an error (0.217) that is less
accurate and further from the true error (0.272) than the
original raw CV error (0.271).
Exam Conclusion: For most real-world problems (which
have noise), the raw \(k\)-fold
CV error is a better and more reliable estimate of the true
test error than the simple (and flawed) correction.
对于大多数实际问题(包含噪声),原始 \(k\) 倍 CV
误差比简单(且有缺陷的)修正方法更能准确、可靠地估计真实测试误差。
Mathematical Understanding
This section explains the theory of why\(CV_k > R(n)\) and derives the simple
correction. 本节解释了为什么 \(CV_k >
R(n)\),并推导出简单的修正方法。
Assumed Error Behavior 假设误差行为: We assume
the test error \(R(n)\) for a model
trained on \(n\) data points behaves
like: 我们假设基于 \(n\)
个数据点训练的模型的测试误差 \(R(n)\)
的行为如下: \[R(n) = R^* +
\frac{c}{n}\]
\(R^*\): The irreducible
error (the “noise floor” you can never beat).
不可约误差(即你永远无法克服的“本底噪声”)。
\(c/n\): The model variance, which
decreases as sample size \(n\)increases. 模型方差,随着样本量 \(n\) 的增加而减小。
Test Error vs. CV Error 测试误差 vs. CV
误差:
Test Error of Interest: This is the error of our
final model trained on all \(n\) points: \[R(n) = R^* + \frac{c}{n}\]
感兴趣的测试误差:这是我们在所有 \(n\)
个点上训练的最终模型的误差:
k-fold CV Error: This is the average error of \(k\) models, each trained on a smaller
sample of size \(n' =
(\frac{k-1}{k})n\).
The Overestimation 高估: Let’s compare \(CV_k\) and \(R(n)\): \[CV_k
\approx R^* + \left(\frac{k}{k-1}\right) \frac{c}{n}\]\[R(n) = R^* + \left(\frac{k-1}{k-1}\right)
\frac{c}{n}\] Since \(k >
(k-1)\), the factor \(\left(\frac{k}{k-1}\right)\) is
greater than 1. This means the \(CV_k\) error term is larger than the \(R(n)\) error term. Thus: \(CV_k > \text{Test error of interest }
R(n)\) 由于 \(k >
(k-1)\),因子 \(\left(\frac{k}{k-1}\right)\)大于
1。这意味着 \(CV_k\)
误差项大于 \(R(n)\) 误差项。因此:
\(CV_k > \text{目标测试误差 }
R(n)\)
Deriving the (Flawed) Correction
推导(有缺陷的)修正: This correction makes a strong
assumption: \(R^* \approx 0\)
(the model is perfectly specified, and there is no noise).
此修正基于一个强假设:\(R^* \approx
0\)(模型完全正确,且无噪声)。
If \(R^* = 0\), then \(R(n) \approx \frac{c}{n}\)
If \(R^* = 0\), then \(CV_k \approx \frac{ck}{(k-1)n}\)
Now, look at the ratio between them: \[\frac{R(n)}{CV_k} \approx \frac{c/n}{ck/((k-1)n)}
= \frac{c}{n} \cdot \frac{(k-1)n}{ck} = \frac{k-1}{k}\]
This gives us the correction formula by isolating \(R(n)\): 通过分离 \(R(n)\),我们得到了校正公式: \[R(n) \approx \left(\frac{k-1}{k}\right) \cdot
CV_k\] This corrected version is denoted \(\tilde{CV_k}\).这个校正版本表示为 \(\tilde{CV_k}\)。
Code Analysis (Slides 92-93)
The Python code is an experiment designed to test the
correction formula.
Goal: Compare the “Raw CV Error” (\(CV_k\)), the “Corrected CV Error” (\(\tilde{CV_k}\)), and the “True Test Error”
(\(R(n)\)) in a realistic
setting.
Key Setup:
def f(x): Defines the true, underlying function \(y = x^2 + 15\sin(x)\).
noise_std = 0.5: This is the most important
line. It adds significant random noise to the data. This
ensures that the irreducible error \(R^*\) is large and \(R^* > 0\).
y = f(...) + np.random.normal(...): Creates the noisy
training data (the blue dots).
CV Calculation (Standard K-Fold):
kf = KFold(...): Sets up 5-fold CV (\(k=5\)).
for train_index, val_index in kf.split(x):: This is the
standard loop. It trains on 4 folds and validates on 1 fold.
cv_error = np.mean(cv_mse_list): Calculates the
raw \(CV_5\) error.
This is the first result (e.g., 0.2715).
Correction Calculation:
correction_factor = (k_splits - 1) / k_splits: This is
\(\frac{k-1}{k}\), which is \(4/5 = 0.8\).
corrected_cv_error = correction_factor * cv_error: This
applies the flawed formula from the math section (\(0.2715 \times 0.8\)). This is the second
result (e.g., 0.2172).
“True” Test Error Calculation:
knn.fit(x, y): Trains the final model on the
entire noisy dataset.
n_test = 1000: Creates a new, large test set
to estimate the true error.
true_test_error = mean_squared_error(...): Calculates
the error of the final model on this new test set. This is our best
estimate of \(R(n)\) (e.g.,
0.2725).
Analysis of Results (Slide 93):
Raw 5-Fold CV MSE: 0.2715
True test error: 0.2725
Corrected 5-Fold CV MSE: 0.2172
The Raw CV Error (0.2715) is an excellent estimate
of the True Test Error (0.2725). The Corrected Error (0.2172) is
much worse. This experiment proves that when noise
(\(R^*\)) is present, the simple
correction formula should not be used.
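A hedged reconstruction of that experiment is sketched below. The sample size, KNN settings, and random seed are assumptions, so the printed numbers will not match the slide's 0.2715 / 0.2172 / 0.2725, but the qualitative pattern (raw CV close to the true test error, corrected CV too low) typically reappears.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def f(x):
    return x ** 2 + 15 * np.sin(x)            # true underlying function

rng = np.random.default_rng(0)
n, noise_std = 200, 0.5
x = rng.uniform(-3, 3, size=(n, 1))
y = f(x[:, 0]) + rng.normal(scale=noise_std, size=n)   # noisy data, so R* > 0

knn_params = dict(n_neighbors=5)              # assumed model settings
k_splits = 5
kf = KFold(n_splits=k_splits, shuffle=True, random_state=0)

# Raw k-fold CV error (CV_k)
cv_mse_list = []
for train_index, val_index in kf.split(x):
    knn = KNeighborsRegressor(**knn_params).fit(x[train_index], y[train_index])
    cv_mse_list.append(mean_squared_error(y[val_index], knn.predict(x[val_index])))
cv_error = np.mean(cv_mse_list)

# The flawed correction: (k-1)/k * CV_k
corrected_cv_error = (k_splits - 1) / k_splits * cv_error

# "True" test error: final model on all data, evaluated on a large fresh test set
knn = KNeighborsRegressor(**knn_params).fit(x, y)
x_test = rng.uniform(-3, 3, size=(1000, 1))
y_test = f(x_test[:, 0]) + rng.normal(scale=noise_std, size=1000)
true_test_error = mean_squared_error(y_test, knn.predict(x_test))

print(cv_error, corrected_cv_error, true_test_error)
```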
Classification is a type of supervised machine
learning where the goal is to predict a
categorical or qualitative response. Unlike regression
where you predict a continuous numerical value (like a price or
temperature), classification assigns an input to a specific category or
class.
分类是一种监督式机器学习,其目标是预测分类或定性响应。与预测连续数值(例如价格或温度)的回归不同,分类将输入分配到特定的类别或类别。
Key characteristics:
Goal: Predict the class of a subject based on
input features.
Output (Response): The output is a category,
such as ‘Yes’/‘No’, ‘Spam’/‘Not Spam’, or
‘High’/‘Medium’/‘Low’.
Applications: Common examples include email spam
detectors, medical diagnosis (e.g., virus carrier vs. non-carrier), and
fraud detection.
应用:常见示例包括垃圾邮件检测器、医学诊断(例如,病毒携带者与非病毒携带者)和欺诈检测。
The example used in the slides is a credit card Default
dataset. The goal is to predict whether a customer will
default (‘Yes’ or ‘No’) on their payments based on
their monthly income and account
balance.
## Why Not Use Linear Regression? 为什么不使用线性回归?
At first, it might seem possible to use linear regression for
classification. For a binary (two-class) problem like the default
dataset, you could code the outcomes as numbers, for example:
Default = ‘No’ => \(y = 0\)
Default = ‘Yes’ => \(y =
1\)
You could then fit a standard linear regression model: \(Y \approx \beta_0 + \beta_1 X\). In this
context, we would interpret the prediction \(\hat{y}\) as the probability of
default, so we’d be modeling \(P(Y=1|X) =
\beta_0 + \beta_1 X\).
However, this approach has two major problems: 然而,这种方法有两个主要问题:
1. The Output Is Not a Probability. A linear model can produce outputs that are less than 0 or greater than 1. This doesn’t make sense for a probability, which must always be between 0 and 1.
The image below is the most important one for understanding this
issue. The left plot shows a linear regression line fit to the 0/1
default data. You can see the line goes below 0 and would eventually go
above 1 for higher balances. The right plot shows a logistic regression
curve, which always stays between 0 and 1.
Left (Linear Regression): The straight blue line
predicts probabilities < 0 for low balances.
Right (Logistic Regression): The S-shaped blue
curve correctly constrains the probability output between 0 and 1.
2. It Doesn’t Work for Multi-Class Problems. If you have more than two categories (e.g., ‘mild’, ‘moderate’, ‘severe’), you might code them as 0, 1, and 2. A linear regression model would incorrectly assume that the “distance” between ‘mild’ and ‘moderate’ is the same as the distance between ‘moderate’ and ‘severe’, which is usually not a valid assumption.
## The Solution: Logistic Regression
Instead of modeling the response \(y\) directly, logistic regression models
the probability that \(y\) belongs to a particular class. To solve
the issue of the output not being a probability, it uses the
logistic function (also known as the sigmoid
function).
This function takes any real-valued input and squeezes it into an
output between 0 and 1.
The formula for the probability in a logistic regression model is:
\[P(Y=1|X) = \frac{e^{\beta_0 + \beta_1 X}}{1
+ e^{\beta_0 + \beta_1 X}}\] This S-shaped function, shown in the
right-hand plot above, ensures that the output is always a valid
probability. We can then set a threshold (e.g., 0.5) to make the final
class prediction. If \(P(Y=1|X) >
0.5\), we predict ‘Yes’; otherwise, we predict ‘No’.
## 解决方案:逻辑回归
逻辑回归不是直接对响应 \(y\)
进行建模,而是对 \(y\)
属于特定类别的概率进行建模。为了解决输出不是概率的问题,它使用了逻辑函数(也称为
S 型函数)。
The slides use R to visualize the data. The boxplots are particularly
important because they show which variable is a better predictor.
Balance vs. Default: The boxplots for balance
show a clear difference. The median balance for those who default
(‘Yes’) is much higher than for those who do not (‘No’). This suggests
balance is a strong predictor.
Income vs. Default: The boxplots for income show
a lot of overlap. The median incomes for both groups are very similar.
This suggests income is a weak predictor.
余额
vs. 违约:余额的箱线图显示出明显的差异。违约者(“是”)的余额中位数远高于未违约者(“否”)。这表明余额是一个强有力的预测指标。
收入
vs. 违约:收入的箱线图显示出很大的重叠。两组的收入中位数非常相似。这表明收入是一个弱的预测指标。
Here’s how you could perform similar analysis and modeling in Python
using seaborn and scikit-learn.
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assume 'default_data.csv' has columns: 'default' (Yes/No), 'balance', 'income'
df = pd.read_csv('default_data.csv')

# --- 1. Data Visualization (like the slides) ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('Predictor Analysis for Default')

# Boxplot for Balance
sns.boxplot(ax=axes[0], x='default', y='balance', data=df)
axes[0].set_title('Balance vs. Default Status')

# Boxplot for Income
sns.boxplot(ax=axes[1], x='default', y='income', data=df)
axes[1].set_title('Income vs. Default Status')

plt.show()

# --- 2. Logistic Regression Modeling ---

# Convert categorical 'default' column to 0s and 1s
df['default_encoded'] = df['default'].apply(lambda x: 1 if x == 'Yes' else 0)

# Define features (X) and target (y)
X = df[['balance', 'income']]
y = df['default_encoded']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on new data
# For example, a person with a $2000 balance and $50,000 income
new_customer = [[2000, 50000]]
predicted_prob = model.predict_proba(new_customer)
prediction = model.predict(new_customer)

print("Customer data: Balance=2000, Income=50000")
print(f"Probability of No Default vs. Default: {predicted_prob}")  # [[P(No), P(Yes)]]
print(f"Final Prediction (0=No, 1=Yes): {prediction}")
```
2. The Mathematical Foundation of Logistic Regression
This set of slides explains the mathematical foundation of logistic
regression, how its parameters are estimated using Maximum Likelihood
Estimation (MLE), and how an iterative algorithm called Newton-Raphson
is used to perform this estimation.
2.1
The Logistic Regression Model: From Probabilities to
Log-Odds 逻辑回归模型:从概率到对数几率
The core of logistic regression is transforming a linear model into a
valid probability. This is done using the logistic
function, also known as the sigmoid function.
逻辑回归的核心是将线性模型转换为有效的概率。这可以通过逻辑函数(也称为 S 型函数)来实现。
#### Key Mathematical Formulas
Probability of Class 1: The model assumes the
probability of an observation \(\mathbf{x}\) belonging to class 1 is given
by the sigmoid function: \[
P(y=1|\mathbf{x}) = \frac{1}{1 + \exp(-\beta^T \mathbf{x})} =
\frac{\exp(\beta^T \mathbf{x})}{1 + \exp(\beta^T \mathbf{x})}
\] This function always outputs a value between 0 and 1, making
it perfect for modeling probabilities.
Odds: The odds are the ratio of the probability
of an event happening to the probability of it not happening. \[
\text{Odds} = \frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \exp(\beta^T
\mathbf{x})
\]
Log-Odds (Logit): By taking the natural
logarithm of the odds, we get a linear relationship with the predictors.
This is called the logit transformation. \[
\text{logit}(P(y=1|\mathbf{x})) =
\log\left(\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})}\right) = \beta^T
\mathbf{x}
\] This final equation is the heart of the model. It states that
the log-odds of the outcome are a linear function of the predictors.
This provides a great interpretation: a one-unit increase in a predictor
\(x_j\) changes the log-odds by \(\beta_j\).
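For instance (the coefficient value here is hypothetical, chosen only for illustration), a coefficient of \(\beta_j = 0.7\) means that a one-unit increase in \(x_j\) multiplies the odds by \[e^{0.7} \approx 2.01,\] i.e., it roughly doubles the odds of the outcome, holding the other predictors fixed.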
2.2
Fitting the Model: Maximum Likelihood Estimation (MLE)
拟合模型:最大似然估计 (MLE)
Unlike linear regression, which uses least squares to find the
best-fit line, logistic regression uses Maximum Likelihood
Estimation (MLE). The goal of MLE is to find the parameter
values (the \(\beta\) coefficients)
that maximize the probability of observing the actual data that we have.
与使用最小二乘法寻找最佳拟合线的线性回归不同,逻辑回归使用最大似然估计
(MLE)。MLE
的目标是找到使观测到实际数据的概率最大化的参数值(\(\beta\) 系数)。
Likelihood Function: This is the joint
probability of observing all the data points in our sample. Assuming
each observation is independent, it’s the product of the individual
probabilities:
1.似然函数:这是观测到样本中所有数据点的联合概率。假设每个观测值都是独立的,它是各个概率的乘积:
\[
L(\beta) = \prod_{i=1}^{n} P(y_i|\mathbf{x}_i)
\] A clever way to write this for a binary (0/1) outcome is:
\[
L(\beta) = \prod_{i=1}^{n} \frac{\exp(y_i \beta^T \mathbf{x}_i)}{1 +
\exp(\beta^T \mathbf{x}_i)}
\]
Log-Likelihood Function: Products are difficult
to work with mathematically, so we work with the logarithm of the
likelihood, which turns the product into a sum. Maximizing the
log-likelihood is the same as maximizing the likelihood.
对数似然函数:乘积在数学上很难处理,所以我们使用似然的对数,将乘积转化为和。最大化对数似然与最大化似然相同。
\[
\ell(\beta) = \log(L(\beta)) = \sum_{i=1}^{n} \left[ y_i \beta^T
\mathbf{x}_i - \log(1 + \exp(\beta^T \mathbf{x}_i)) \right]
\]
Key Takeaway: The slides correctly state that
there is no explicit formula to solve for the \(\hat{\beta}\) that maximizes this function.
We must find it using a numerical optimization algorithm.
没有明确的公式来求解最大化该函数的\(\hat{\beta}\)。我们必须使用数值优化算法来找到它。
2.3 The
Algorithm: Newton-Raphson 算法:牛顿-拉夫森算法
The slides introduce the Newton-Raphson algorithm as
the method to find the optimal \(\hat{\beta}\). It’s an efficient iterative
algorithm for finding the roots of a function (i.e., where \(f(x)=0\)).
How does this apply to logistic regression? To
maximize the log-likelihood function \(\ell(\beta)\), we need to find the point
where its derivative (gradient) is equal to zero. So, Newton-Raphson is
used to solve \(\frac{d\ell(\beta)}{d\beta} =
0\).
The algorithm starts with an initial guess, \(x^{old}\), and iteratively refines it using
the following update rule, which is based on a Taylor series
approximation: \[
x^{new} = x^{old} - \frac{f(x^{old})}{f'(x^{old})}
\] where \(f'(x)\) is the
derivative of \(f(x)\). You repeat this
step until the value of \(x\)
converges.
Important
Image: Newton-Raphson Example (\(x^3 - 4 =
0\))
[Image showing iterations of Newton-Raphson]
This slide is a great illustration of the algorithm’s power.
* Goal: Find \(x\) such that \(f(x) = x^3 - 4 = 0\).
* Function: \(f(x) = x^3 - 4\)
* Derivative: \(f'(x) = 3x^2\)
* Update Rule: \(x^{new} = x^{old} - \frac{(x^{old})^3 - 4}{3(x^{old})^2}\)
Starting with a guess of \(x^{old} = 2\), the algorithm converges to the true answer (\(4^{1/3} \approx 1.5874\)) in just 4 steps.
```python
# The example from the text: f(x) = x^3 - 4, so f'(x) = 3x^2
def f(x):
    return x ** 3 - 4

def f_prime(x):
    return 3 * x ** 2

# Newton-Raphson method
def newton_raphson(x0, tol=1e-10, max_iter=100):
    x = x0                    # Start with the initial guess
    for i in range(max_iter):
        fx = f(x)             # Calculate f(x_old)
        fpx = f_prime(x)      # Calculate f'(x_old)

        if fpx == 0:          # Cannot divide by zero
            print("Zero derivative. No solution found.")
            return None

        # This is the core update rule
        x_new = x - fx / fpx

        # Check if the change is small enough to stop
        if abs(x_new - x) < tol:
            print(f"Converged to {x_new} after {i+1} iterations.")
            return x_new

        # Update x for the next iteration
        x = x_new

    print("Exceeded maximum iterations. No solution found.")
    return None
```
The slides show that with a good initial guess
(x0 = 0.5), the algorithm converges quickly. With a bad one
(x0 = 50), it still converges but takes many more steps.
This highlights the importance of the starting point. The slides also
show an implementation of Gradient Descent, another
popular optimization algorithm which uses the update rule
x_new = x - learning_rate * gradient.
These slides provide a great case study on logistic regression, particularly on the important concept of confounding variables. Here’s a summary covering the math, code, and key insights.
Core Concept: Logistic Regression 📈 核心概念:逻辑回归 📈
Logistic regression is a statistical method used for binary
classification, which means predicting an outcome that can only
be one of two things (e.g., Yes/No, True/False, 1/0).
In this example, the goal is to predict the probability that a
customer will default on a loan (Yes or No) based on
factors like their account balance, income,
and whether they are a student.
The core of logistic regression is the sigmoid (or logistic)
function, which takes any real-valued number and squishes it to
a value between 0 and 1, representing a probability.
\[\hat{P}(Y=1|X) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 X_1 - \dots - \beta_p X_p)}\]
* \(\hat{P}(Y=1|X)\) is the predicted probability of the outcome being “Yes” (e.g., default).
* \(\beta_0\) is the intercept.
* \(\beta_1, ..., \beta_p\) are the coefficients for each input variable (\(X_1, ..., X_p\)). The model’s job is to find the best values for these \(\beta\) coefficients.
3.1 How the Model
“Learns” (Mathematical Foundation)
The slides show that the model’s coefficients (\(\beta\)) are found using an algorithm like
Newton-Raphson. This is an iterative process to find
the values that maximize the log-likelihood function.
Think of this as finding the coefficient values that make the observed
data most
probable.这是一个迭代过程,用于查找最大化对数似然函数的值。可以将其视为查找使观测数据概率最大的系数值。
The key slide for this is the one titled “Newton-Raphson Iterative
Algorithm”. It shows the formulas for:
* The Gradient (\(\nabla\ell\)): The direction of the steepest ascent of the log-likelihood function. 梯度 (\(\nabla\ell\)):对数似然函数最陡上升的方向。
* The Hessian (\(H\)): The curvature of the log-likelihood function. 黑森矩阵 (\(H\)):对数似然函数的曲率。
The updating rule is given by: \[
\beta^{new} = \beta^{old} - H^{-1}\nabla\ell
\] This formula is used repeatedly until the coefficient values
stop changing significantly, meaning the algorithm has converged to the
best fit. This process is also referred to as Iteratively
Reweighted Least Squares (IRLS).
此公式反复使用,直到系数值不再发生显著变化,这意味着算法已收敛到最佳拟合值。此过程也称为迭代重加权最小二乘法
(IRLS)。
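As a hedged, minimal sketch of this update (not the slides' code), the NumPy function below implements Newton-Raphson/IRLS for logistic regression; X is assumed to be an (n, p) array of predictors and y a 0/1 label array.

import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson / IRLS sketch for logistic regression."""
    X = np.column_stack([np.ones(len(X)), X])        # add an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # current predicted probabilities
        grad = X.T @ (y - p)                         # gradient of the log-likelihood
        W = p * (1.0 - p)                            # IRLS weights
        H = -(X.T * W) @ X                           # Hessian of the log-likelihood
        beta_new = beta - np.linalg.solve(H, grad)   # beta_new = beta_old - H^{-1} * gradient
        if np.max(np.abs(beta_new - beta)) < tol:    # stop once the coefficients settle
            return beta_new
        beta = beta_new
    return beta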
3.2 The Puzzle: A Tale of Two
Models 🕵️♂️
The most important story in these slides is how the effect of being a
student changes depending on the model. This is a classic example of a
confounding variable.
Model 1:
Simple Logistic Regression (Default vs. Student)
When predicting default using only student status, the model
is: default ~ student
From the slides, the coefficients are:
* Intercept (\(\beta_0\)): -3.5041
* student[Yes] (\(\beta_1\)): 0.4049 (positive)
The equation for the log-odds is: \[
\log\left(\frac{P(\text{default})}{1-P(\text{default})}\right) = -3.5041
+ 0.4049 \times (\text{is\_student})
\]
Conclusion: The positive coefficient (0.4049) suggests that students are more likely to default than non-students. The slides calculate the probabilities (a quick numeric check follows below):
* Student Default Probability: 4.31%
* Non-Student Default Probability: 2.92%
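A quick numeric check of these two probabilities from the quoted coefficients (plain Python, not from the slides):

import math

def prob_from_logodds(log_odds):
    return 1 / (1 + math.exp(-log_odds))

b0, b1 = -3.5041, 0.4049
print(prob_from_logodds(b0 + b1))  # student:     ≈ 0.0431 (4.31%)
print(prob_from_logodds(b0))       # non-student: ≈ 0.0292 (2.92%)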
3.3
Model 2: Multiple Logistic Regression (Default vs. All Variables) 模型
2:多元逻辑回归(违约 vs. 所有变量)
When we add balance and income to the
model, it becomes: default ~ student + balance + income
From the slides, the new coefficients are:
* Intercept (\(\beta_0\)): -10.8690
* balance (\(\beta_1\)): 0.0057
* income (\(\beta_2\)): 0.0030
* student[Yes] (\(\beta_3\)): -0.6468 (negative)
The Shocking Twist! The coefficient for
student[Yes] is now negative.
Conclusion: When we control for balance and income,
students are actually less likely to default
than non-students with the same balance and income.
Why the
Change? The Confounding Variable Explained
The key insight, explained on the slide with multi-colored text
bubbles, is that students, on average, have higher credit card
balances.
In the simple model, the student variable was
inadvertently capturing the risk associated with having a high
balance. The model mistakenly concluded “being a student
causes default.”
In the multiple model, the balance variable properly
accounts for the risk from a high balance. With that effect isolated,
the student variable can show its true, underlying
relationship with default, which is negative.
This demonstrates why it’s crucial to consider multiple relevant
variables to avoid drawing incorrect conclusions. The most
important slides are the ones that present this paradox and its
explanation.
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# Assume 'Default' is a pandas DataFrame with columns:
# 'default' (0/1), 'student' (0/1), 'balance', 'income'
y = Default['default']

# --- Using statsmodels (recommended for interpretation) ---
# For statsmodels, we need to manually add the intercept
X_simple = sm.add_constant(Default[['student']])
X_multiple = sm.add_constant(Default[['student', 'balance', 'income']])

# (The statsmodels fits themselves were cut off on the slides; sm.Logit is one way to complete them)
print(sm.Logit(y, X_simple).fit(disp=0).params)    # Simple model coefficients
print(sm.Logit(y, X_multiple).fit(disp=0).params)  # Multiple model coefficients

# --- Using scikit-learn ---
# Simple Model
clf_simple = LogisticRegression().fit(Default[['student']], y)
print(f"\nSimple Model Intercept (scikit-learn): {clf_simple.intercept_}")
print(f"Simple Model Coefficient (scikit-learn): {clf_simple.coef_}")

# Multiple Model
clf_multiple = LogisticRegression().fit(Default[['student', 'balance', 'income']], y)
print(f"\nMultiple Model Intercept (scikit-learn): {clf_multiple.intercept_}")
print(f"Multiple Model Coefficients (scikit-learn): {clf_multiple.coef_}")
4
Making Predictions and the Decision Boundary 🎯进行预测和决策边界
Once the model is trained (i.e., we have the coefficients \(\hat{\beta}\)), we can make predictions.
一旦模型训练完成(即,我们有了系数 \(\hat{\beta}\)),我们就可以进行预测了。
## Math Behind Predictions
The model outputs the log-odds, which can be
converted into a probability. A key concept is the decision
boundary, which is the threshold where the model is uncertain
(probability = 50%).
模型输出对数概率,它可以转换为概率。一个关键概念是决策边界,它是模型不确定的阈值(概率
= 50%)。
The Estimated Odds: The core output of the
linear part of the model is the exponential of the linear equation,
which gives the odds of the outcome being ‘Yes’ (or 1).
估计概率:模型线性部分的核心输出是线性方程的指数,它给出了结果为“是”(或
1)的概率。
The Decision Rule: We classify a new observation
\(\mathbf{x}_0\) by comparing its
predicted odds to a threshold \(\delta\).
决策规则:我们通过比较新观测值 \(\mathbf{x}_0\) 的预测概率与阈值 \(\delta\) 来对其进行分类。
Predict \(y=1\) if \(\exp(\hat{\beta}^\top \mathbf{x}_0) >
\delta\)
Predict \(y=0\) if \(\exp(\hat{\beta}^\top \mathbf{x}_0) <
\delta\) A common default is \(\delta=1\), which means we predict ‘Yes’ if
the probability is greater than 0.5.
The Linear Boundary: The decision boundary
itself is where the odds are exactly equal to the threshold. By taking
the logarithm, we see that this boundary is a linear
equation. This is why logistic regression is called a
linear classifier.
线性边界:决策边界本身就是概率恰好等于阈值的地方。取对数后,我们发现这个边界是一个线性方程。这就是逻辑回归被称为线性分类器的原因。
\[
\hat{\beta}^\top \mathbf{x} = \log(\delta)
\]
For \(\delta=1\), the boundary is simply \(\hat{\beta}^\top \mathbf{x} = 0\).
This concept is visualized perfectly in the slide titled “Linear
Classifier,” which shows a straight line neatly separating two classes
of data points.
题为“线性分类器”的幻灯片完美地展示了这一概念,它展示了一条直线,将两类数据点巧妙地分隔开来。
Visualizing the Confounding
Effect
The most important image in this set is Figure 4.3,
as it visually explains the confounding puzzle from the first set of
slides.
Right Panel (Boxplots): This shows that
students (Yes) tend to have higher credit card balances
than non-students (No). This is the source of the confounding.
Left Panel (Default Rates):
The dashed lines show the overall default
rates. The orange line (students) is higher than the blue line
(non-students). This matches our simple model
(default ~ student).
The solid S-shaped curves show the probability of
default as a function of balance. For any given balance, the
blue curve (non-students) is slightly higher than the orange curve
(students). This means that at the same level of debt, students
are less likely to default. This matches our multiple
regression model
(default ~ student + balance + income).
This single figure brilliantly illustrates how a variable can appear
to have one effect in isolation but the opposite effect when controlling
for a confounding factor.
* 右侧面板(箱线图):这表明学生(是)的信用卡余额往往高于非学生(否)。这就是混杂效应的根源。
* 左图(违约率):
  * 虚线显示总体违约率。橙色线(学生)高于蓝色线(非学生)。这与我们的简单模型(“违约 ~ 学生”)相符。
  * S 形实线显示违约概率与余额的关系。对于任何给定的余额,蓝色曲线(非学生)略高于橙色曲线(学生)。这意味着在相同的债务水平下,学生违约的可能性较小。这与我们的多元回归模型(“违约 ~ 学生 + 余额 + 收入”)相符。
What happens if the data can be perfectly separated by a straight
line? 如果数据可以用一条直线完美分离,会发生什么?
One might think this is the ideal scenario, but it causes a problem
for the logistic regression algorithm. The model will try to find
coefficients that make the probabilities for each class as close to 1
and 0 as possible. To do this, the magnitude of the coefficients (\(\hat{\beta}\)) must grow infinitely large.
人们可能认为这是理想情况,但它会给逻辑回归算法带来问题。模型会尝试找到使每个类别的概率尽可能接近
1 和 0 的系数。为此,系数 (\(\hat{\beta}\)) 的大小必须无限大。
The slide “Non-convergence for perfectly separated case” demonstrates
this:
The Code: It generates two distinct,
non-overlapping clusters of data points using Python’s
scikit-learn.
Parameter Estimates Graph: It shows the
Intercept, Coefficient 1, and
Coefficient 2 values increasing or decreasing without limit
as the algorithm runs through more iterations. They never converge to a
stable value.
Decision Boundary Graph: The decision boundary
itself might look reasonable, but the underlying coefficients are
unstable.
Key Takeaway: If your logistic regression model
fails to converge, the first thing you should check for is perfect
separation in your training data.
关键要点:如果您的逻辑回归模型未能收敛,您应该检查的第一件事就是训练数据是否完美分离。
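The slides' own demonstration is not reproduced here. As a hedged sketch of the same effect (the toy data and parameter values below are my own), weakening scikit-learn's L2 penalty (increasing C) on perfectly separated data shows the fitted coefficient growing without bound; with no penalty at all, the maximum-likelihood estimates would diverge.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two perfectly separated 1-D clusters
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.3, 50), rng.normal(2, 0.3, 50)]).reshape(-1, 1)
y = np.concatenate([np.zeros(50), np.ones(50)])

# As the L2 penalty is weakened (larger C), the coefficient keeps growing
# instead of settling at a finite maximum-likelihood value.
for C in [1, 100, 10_000, 1_000_000]:
    clf = LogisticRegression(C=C, max_iter=10_000).fit(X, y)
    print(f"C={C:>9}: coef={clf.coef_[0][0]:.2f}, intercept={clf.intercept_[0]:.2f}")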
Code Understanding
The slides provide useful code snippets in both R and Python.
R Code (Plotting Predictions)
This code generates the plot with the two S-shaped curves (one for
students, one for non-students) showing the probability of default as
balance increases.
# Create a data frame for prediction with a range of balances
# One version for students, one for non-students
Default.st <- data.frame(balance=seq(500, 2500, by=1), student="Yes")
Default.nonst <- data.frame(balance=seq(500, 2500, by=1), student="No")

# Use the trained multiple regression model (glmod3) to predict probabilities
pred.st <- predict(glmod3, Default.st, type="response")
pred.nonst <- predict(glmod3, Default.nonst, type="response")

# Plot the results
plot(Default.st$balance, pred.st, type="l", col="red", ...)    # Students
lines(Default.nonst$balance, pred.nonst, col="blue", ...)      # Non-students
Python Code
(Visualizing the Decision Boundary)
This Python code uses scikit-learn and
matplotlib to create the plot showing the linear decision
boundary.
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 1. Generate synthetic data with two classes
X, y = make_classification(...)

# 2. Initialize and fit the logistic regression model
model = LogisticRegression()
model.fit(X, y)

# 3. Create a mesh grid of points to make predictions over the entire plot area
xx, yy = np.meshgrid(...)

# 4. Predict the probability for each point on the grid
probs = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

# 5. Plot the decision boundary where the probability is 0.5
plt.contour(xx, yy, probs.reshape(xx.shape), levels=[0.5], ...)

# 6. Scatter plot the actual data points
plt.scatter(X[:, 0], X[:, 1], c=y, ...)
plt.show()
Other Important Remarks
The “Remarks” slide briefly mentions some key extensions:
Probit Model: An alternative to logistic
regression that uses the cumulative distribution function (CDF) of the
standard normal distribution instead of the sigmoid function. The
results are often very similar.
Softmax Regression: An extension of logistic
regression used for multi-class classification (when there are more than
two possible outcomes).
Probit
模型:逻辑回归的替代方法,它使用标准正态分布的累积分布函数
(CDF) 代替 S 型函数。结果通常非常相似。
Softmax
回归:逻辑回归的扩展,用于多类分类(当存在两个以上可能结果时)。
5.
Here is a summary of the slides on Linear Discriminant Analysis (LDA),
including the key mathematical formulas, visual explanations, and how to
implement it in Python.
The
Main Idea: Classification Using Probabilities 使用概率进行分类
Linear Discriminant Analysis (LDA) is a classification method. For a
given input x, it calculates the probability that
x belongs to each class and then assigns
x to the class with the highest
probability.
It does this using Bayes’ Theorem, which provides a
formula for the posterior probability \(P(Y=k
| X=x)\), or the probability that the class is \(k\) given the input \(x\). 线性判别分析 (LDA)
是一种分类方法。对于给定的输入 x,它计算
x 属于每个类别的概率,然后将 x
分配给概率最高的类别。
\[
p_k(x) = P(Y=k \mid X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}
\] where:
\(p_k(x)\) is the posterior probability we want to maximize.
\(\pi_k = P(Y=k)\) is the
prior probability of class \(k\) (how common the class is overall).
\(f_k(x) = f(x|Y=k)\) is the
class-conditional probability density function of
observing input \(x\) if it belongs to
class \(k\).
To classify a new observation \(x\),
we simply find the class \(k\) that
makes \(p_k(x)\) the largest.
为了对新的观察值 \(x\)
进行分类,我们只需找到使 \(p_k(x)\)
最大的类别 \(k\) 即可。
Key Assumptions of LDA
LDA’s power comes from a specific, simplifying assumption about the
data’s distribution. LDA
的强大之处在于它对数据分布进行了特定的简化假设。
Gaussian Distribution: LDA assumes that the data
within each class \(k\) follows a
p-dimensional multivariate normal (or Gaussian) distribution, denoted as
\(X|Y=k \sim \mathcal{N}(\mu_k,
\Sigma)\).
Common Covariance: A crucial assumption is that
all classes share the same covariance matrix \(\Sigma\). This means that while the classes
may have different centers (means, \(\mu_k\)), their shape and orientation
(covariance, \(\Sigma\)) are
identical.
高斯分布:LDA 假设每个类 \(k\) 中的数据服从 p
维多元正态(或高斯)分布,表示为 \(X|Y=k \sim
\mathcal{N}(\mu_k, \Sigma)\)。
The probability density function for a class \(k\) is: \[
f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp \left( -\frac{1}{2}(x
- \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)
\]
The image above (from your slide “Knowing normal distribution”)
illustrates this. The two “bells” have different centers (different
\(\mu_k\)) but similar shapes. The one
on the right is “tilted,” indicating correlation between variables,
which is captured in the shared covariance matrix \(\Sigma\).
上图(摘自幻灯片“了解正态分布”)说明了这一点。两个“钟”形的中心不同(\(\mu_k\)
不同),但形状相似。右边的钟形“倾斜”,表示变量之间存在相关性,这体现在共享协方差矩阵
\(\Sigma\) 中。
The Math
Behind LDA: The Discriminant Function 判别函数
Since we only need to find the class \(k\) that maximizes the posterior
probability \(p_k(x)\), we can simplify
the math. The denominator in Bayes’ theorem is the same for all classes,
so we only need to maximize the numerator: \(\pi_k f_k(x)\).
由于我们只需要找到使后验概率 \(p_k(x)\)
最大化的类别 \(k\),因此可以简化数学计算。贝叶斯定理中的分母对于所有类别都是相同的,因此我们只需要最大化分子:\(\pi_k f_k(x)\)。 Taking the logarithm
(which doesn’t change which class is maximal) and removing constant
terms gives us the linear discriminant function, \(\delta_k(x)\):
取对数(这不会改变哪个类别是最大值)并移除常数项,得到线性判别函数,\(\delta_k(x)\): \[
\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k
\]
This function is linear in \(x\), which is why the method is called
Linear Discriminant Analysis. The decision boundary between any
two classes, say class \(k\) and class
\(l\), is the set of points where \(\delta_k(x) = \delta_l(x)\), which defines
a linear hyperplane. 该函数关于 \(x\)
是线性的,因此该方法被称为线性判别分析。任意两个类别(例如类别
\(k\) 和类别 \(l\))之间的决策边界是满足 \(\delta_k(x) = \delta_l(x)\)
的点的集合,这定义了一个线性超平面。
The image above (from your “Graph of LDA” slide) is very important.
* Left: The ellipses show the true 95% probability contours for three Gaussian classes. The dashed lines are the ideal Bayes decision boundaries, which are perfectly linear because the assumption of common covariance holds.
* Right: This shows a sample of data points drawn from those distributions. The solid lines are the LDA decision boundaries calculated from the sample. They are a very good estimate of the ideal boundaries.
上图(来自您的“LDA 图”幻灯片)非常重要。
* 左图:椭圆显示了三个高斯类别的真实 95% 概率轮廓。虚线是理想的贝叶斯决策边界,由于共同协方差假设成立,因此它们是完美的线性。
* 右图:这显示了从这些分布中抽取的数据点样本。实线是根据样本计算出的 LDA 决策边界。它们是对理想边界的非常好的估计。
***
Practical
Implementation: Estimating the Parameters 实际应用:估计参数
In a real-world scenario, we don’t know the true parameters (\(\mu_k\), \(\Sigma\), \(\pi_k\)). Instead, we
estimate them from our training data (\(n\) total samples, with \(n_k\) samples in class \(k\)).
在实际场景中,我们不知道真正的参数(\(\mu_k\)、\(\Sigma\)、\(\pi_k\))。相反,我们根据训练数据(\(n\) 个样本,\(n_k\) 个样本属于 \(k\) 类)来估计它们。
Prior Probability (\(\hat{\pi}_k\)): The proportion of
training samples in class \(k\). \[\hat{\pi}_k = \frac{n_k}{n}\]
Class Mean (\(\hat{\mu}_k\)): The average of the
training samples in class \(k\). \[\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i=k}
x_i\]
Common Covariance (\(\hat{\Sigma}\)): A weighted
average of the sample covariance matrices for each class. This is often
called the “pooled” covariance. \[\hat{\Sigma} = \frac{1}{n-K} \sum_{k=1}^{K}
\sum_{i: y_i=k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T\]
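These three estimates can be computed directly from the training data; the function below is a minimal NumPy sketch (mine, not from the slides), assuming X is an (n, p) array and y a vector of class labels.

import numpy as np

def lda_estimates(X, y):
    """Estimate LDA parameters (pi_k, mu_k, pooled Sigma) from training data."""
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    pi, mu = {}, {}
    Sigma = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        pi[k] = len(Xk) / n              # prior probability: n_k / n
        mu[k] = Xk.mean(axis=0)          # class mean
        centered = Xk - mu[k]
        Sigma += centered.T @ centered   # accumulate within-class scatter
    Sigma /= (n - K)                     # pooled ("common") covariance
    return pi, mu, Sigma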
We then plug these estimates into the discriminant function to get
\(\hat{\delta}_k(x)\) and classify a
new observation \(x\) to the class with
the largest score. 然后,我们将这些估计值代入判别函数,得到 \(\hat{\delta}_k(x)\),并将新的观测值 \(x\) 归类到得分最高的类别。 ***
Evaluating Performance
After training the model, we evaluate its performance using a
confusion matrix.
训练模型后,我们使用混淆矩阵来评估其性能。
This matrix shows the true classes versus the predicted classes.
* Diagonal elements (9644, 81) are correct predictions.
* Off-diagonal elements (23, 252) are errors.
该矩阵显示了真实类别与预测类别的对比。
* 对角线元素 (9644, 81) 表示正确预测。
* 非对角线元素 (23, 252) 表示错误预测。
From this matrix, we can calculate key metrics:
* Overall Error Rate: Total incorrect predictions / Total predictions. Example: \((252 + 23) / 10000 = 2.75\%\)
* Sensitivity (True Positive Rate): Correctly predicted positives / Total actual positives. It answers: “Of all the people who actually defaulted, what fraction did we catch?” Example: \(81 / 333 = 24.3\%\). The sensitivity is \(1 - 75.7\% = 24.3\%\).
* Specificity (True Negative Rate): Correctly predicted negatives / Total actual negatives. It answers: “Of all the people who did not default, what fraction did we correctly identify?” Example: \(9644 / 9667 \approx 99.8\%\). The specificity is \(1 - 0.24\% \approx 99.8\%\).
The example in your slides shows a high error rate for “default”
people (75.7%) because the classes are unbalanced—there
are far fewer defaulters. This highlights the importance of looking at
class-specific metrics, not just the overall error rate.
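A quick check of these metrics in plain Python, using the counts quoted from the slide:

# Counts from the confusion matrix on the slide
TN, FP = 9644, 23      # actual 'No'  : predicted No / Yes
FN, TP = 252, 81       # actual 'Yes' : predicted No / Yes

total = TN + FP + FN + TP
error_rate  = (FP + FN) / total        # 0.0275
sensitivity = TP / (TP + FN)           # 81 / 333  ≈ 0.243
specificity = TN / (TN + FP)           # 9644 / 9667 ≈ 0.998
print(error_rate, sensitivity, specificity)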
Python Code Understanding
In Python, you can easily implement LDA using the
scikit-learn library. The code conceptually mirrors the
steps we discussed.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Assume you have your data X (features) and y (labels)
# X = features (e.g., balance, income)
# y = labels (e.g., 0 for 'no-default', 1 for 'default')

# 1. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Create an instance of the LDA model
lda = LinearDiscriminantAnalysis()

# 3. Fit the model to the training data
# This is where the model calculates the estimates:
# - Prior probabilities (pi_k)
# - Class means (mu_k)
# - Pooled covariance matrix (Sigma)
lda.fit(X_train, y_train)

# 4. Make predictions on new, unseen data
predictions = lda.predict(X_test)

# 5. Evaluate the model's performance
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))
LinearDiscriminantAnalysis() creates the classifier
object.
lda.fit(X_train, y_train) is the core training step
where the model learns the \(\hat{\pi}_k\), \(\hat{\mu}_k\), and \(\hat{\Sigma}\) parameters from the
data.
lda.predict(X_test) uses the learned discriminant
function \(\hat{\delta}_k(x)\) to
classify each sample in the test set.
confusion_matrix and classification_report
are tools to evaluate the results, just like in the slides.
6.
Here is a summary of the provided slides on Linear Discriminant Analysis
(LDA), focusing on mathematical concepts, Python code interpretation,
and key visuals.
Core Concept: LDA for
Classification
Linear Discriminant Analysis (LDA) is a classification method that
models the probability that an observation belongs to a certain class.
It works by finding a linear combination of features that best separates
two or more classes.
The decision is based on Bayes’ theorem. For a given
observation with features \(X=x\), LDA
calculates the posterior probability, \(p_k(x) = Pr(Y=k|X=x)\), for each class
\(k\). This is the probability that the
observation belongs to class \(k\)
given its features. 线性判别分析 (LDA)
是一种分类方法,它对观测值属于某个类别的概率进行建模。它的工作原理是找到能够最好地区分两个或多个类别的特征的线性组合。
By default, the Bayes classifier assigns an observation to the class
with the highest posterior probability. For a binary (two-class) problem
like ‘Yes’ vs. ‘No’, this means:
默认情况下,贝叶斯分类器将观测值分配给后验概率最高的类别。对于像“是”与“否”这样的二分类问题,这意味着:
Assign to ‘Yes’ if \(Pr(Y=\text{Yes}|X=x)
> 0.5\)
Assign to ‘No’ otherwise
Modifying the Decision
Threshold
The default 0.5 threshold isn’t always optimal. In many real-world
scenarios, the cost of one type of error is much higher than another.
For example, in credit card default prediction: 默认的 0.5
阈值并非总是最优的。在许多实际场景中,一种错误的代价远高于另一种。例如,在信用卡违约预测中:
False Negative: Incorrectly classifying a person
who will default as someone who won’t. (The bank loses money).
False Positive: Incorrectly classifying a person
who won’t default as someone who will. (The bank loses a potential
customer).
A bank might decide that missing a defaulter is much worse than
denying a good customer. To catch more potential defaulters, they can
lower the probability threshold.
银行可能会认为错过一个违约者比拒绝一个优质客户更糟糕。为了捕捉更多潜在的违约者,他们可以降低概率阈值。
A modified rule could be: \[
Pr(\text{default}=\text{Yes}|X=x) > 0.2
\] This makes the model more “sensitive” to flagging potential
defaulters, even at the cost of misclassifying more non-defaulters.
降低阈值会提高敏感度,但会降低特异性。
This decision leads to a trade-off between two key
performance metrics: * Sensitivity (True Positive
Rate): The ability to correctly identify positive cases. (e.g.,
Correctly identified defaulters / Total actual defaulters).
* Specificity (True Negative Rate): The ability to
correctly identify negative cases. (e.g.,
Correctly identified non-defaulters / Total actual non-defaulters).
Lowering the threshold increases sensitivity but decreases specificity.
## Python Code Explained
The slides show how to implement and adjust LDA using Python’s
scikit-learn library.
Basic LDA Implementation
# Import the necessary library
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Initialize and train the LDA model
lda = LinearDiscriminantAnalysis()
lda_train = lda.fit(X, y)

# Get predictions using the default 0.5 threshold
y_pred = lda.predict(X)
This code trains an LDA model and makes predictions using the
standard 50% probability boundary.
Adjusting the Prediction
Threshold
To use a custom threshold (e.g., 0.2), you don’t use the
.predict() method. Instead, you get the class probabilities
with .predict_proba() and apply the threshold manually.
import numpy as np

# 1. Get the probabilities for each class
# lda.predict_proba(X) returns an array like [[P(No), P(Yes)], ...]
# We select the second column [:, 1] for the 'Yes' class probability
lda_probs = lda.predict_proba(X)[:, 1]

# 2. Define a custom threshold
threshold = 0.2

# 3. Apply the threshold to get new predictions
# This creates a boolean array (True where prob > 0.2, else False)
# We then convert True/False to 'Yes'/'No' labels
lda_pred1 = np.where(lda_probs > threshold, "Yes", "No")
This is the core technique for tuning the classifier’s behavior to
meet specific business needs, as demonstrated on slides 55 and 56 for
both LDA and Logistic Regression.
Important Images to
Understand
Confusion Matrix (Slide 49): This table is crucial.
It breaks down the model’s predictions into True Positives, True
Negatives, False Positives, and False Negatives. All key metrics like
error rate, sensitivity, and specificity are calculated from this
matrix. 混淆矩阵(幻灯片
49):这张表至关重要。它将模型的预测分解为真阳性、真阴性、假阳性和假阴性。所有关键指标,例如错误率、灵敏度和特异性,都基于此矩阵计算得出。
LDA Decision Boundaries (Slide 51): This plot
provides a powerful visual intuition. It shows the data points for two
classes and the decision boundary line. The different parallel lines
show how changing the threshold from 0.5 to 0.1 or 0.9 shifts the
boundary, making the model classify more or fewer points into the
minority class. LDA 决策边界(幻灯片
51):这张图提供了强大的视觉直观性。它展示了两个类别的数据点和决策边界线。不同的平行线显示了将阈值从
0.5 更改为 0.1 或 0.9
时边界如何移动,从而使模型将更多或更少的点归入少数类
Error Rate Tradeoff Curve (Slide 53): This graph is
the most important for understanding the business implication of
changing the threshold. It clearly shows that as the threshold changes,
the error rate for one class goes down while the error rate for the
other goes up. The overall error is minimized at a certain point, but
that may not be the optimal point from a business perspective.
错误率权衡曲线(幻灯片
53):这张图对于理解更改阈值的业务含义至关重要。它清楚地表明,随着阈值的变化,一个类别的错误率下降,而另一个类别的错误率上升。总体误差在某个点达到最小,但从业务角度来看,这可能并非最佳点。
ROC Curve (Slides 54 & 55): The Receiver
Operating Characteristic (ROC) curve plots Sensitivity vs. (1 -
Specificity) for all possible thresholds. An ideal classifier
has a curve that “hugs” the top-left corner, indicating high sensitivity
and high specificity. It’s a standard way to visualize and compare the
overall performance of different classifiers. ROC 曲线(幻灯片
54 和 55): 接收者操作特性 (ROC)
曲线绘制了所有可能阈值的灵敏度与(1 -
特异性)的关系。理想的分类器曲线“紧贴”左上角,表示高灵敏度和高特异性。这是可视化和比较不同分类器整体性能的标准方法。
7.
Here is a summary of the provided slides on Linear and Quadratic
Discriminant Analysis, including the key formulas, Python code
equivalents, and explanations of the important concepts.
Key Goal:
Classification
Both Linear Discriminant Analysis (LDA) and
Quadratic Discriminant Analysis (QDA) are
classification algorithms. Their main goal is to find a decision
boundary to separate different classes (e.g., “default” vs. “not
default”) in the data. 线性判别分析 (LDA) 和
二次判别分析 (QDA)
都是分类算法。它们的主要目标是找到一个决策边界来区分数据中的不同类别(例如,“默认”与“非默认”)。
## Linear Discriminant
Analysis (LDA)
LDA creates a linear decision boundary between
classes. LDA 在类别之间创建线性决策边界。
Core Idea (Fisher’s
Interpretation)
Imagine you have data points for different classes in a 3D space.
Fisher’s idea is to find the best angle to shine a “flashlight” on the
data to project its shadow onto a 2D wall (or a 1D line). The “best”
projection is the one where the shadows of the different classes are
as far apart from each other as possible, while the
shadows within each class are as tightly packed as
possible. 想象一下,你在三维空间中拥有不同类别的数据点。Fisher
的思想是找到最佳角度,用“手电筒”照射数据,将其阴影投射到二维墙壁(或一维线上)。
“最佳”投影是不同类别的阴影彼此之间尽可能远,而每个类别内的阴影尽可能紧密的投影。
Maximize: The distance between the means of the
projected classes (Between-Class Variance).
投影类别均值之间的距离(类间方差)。
Minimize: The spread or variance within each
projected class (Within-Class Variance).
每个投影类别内的扩散或方差(类内方差)。 This is the most important
image for understanding the intuition behind LDA. It shows how
projecting the data onto a specific line (defined by vector
w) can make the two classes clearly separable.
这是理解LDA背后直觉的最重要图像。它展示了如何将数据投影到特定直线(由向量“w”定义)上,从而使两个类别清晰可分。
Key Mathematical
Formulas
To achieve this, LDA maximizes a ratio called the Rayleigh
quotient. LDA最大化一个称为瑞利商的比率。
Within-Class Covariance (\(\hat{\Sigma}_W\)): Measures the
spread of data inside each class. 类内协方差 (\(\hat{\Sigma}_W\)):衡量每个类别内部数据的扩散程度。
\[\hat{\Sigma}_W = \frac{1}{n-K}
\sum_{k=1}^{K} \sum_{i: y_i=k} (x_i - \hat{\mu}_k)(x_i -
\hat{\mu}_k)^\top\]
Between-Class Covariance (\(\hat{\Sigma}_B\)): Measures the
spread between the means of different classes.
类间协方差 (\(\hat{\Sigma}_B\)):衡量不同类别均值之间的差异。
\[\hat{\Sigma}_B = \sum_{k=1}^{K} n_k
(\hat{\mu}_k - \hat{\mu})(\hat{\mu}_k - \hat{\mu})^\top\]
Objective Function: Find the projection vector
\(w\) that maximizes the ratio of
between-class variance to within-class variance.
目标函数:找到投影向量 \(w\),使类间方差与类内方差之比最大化。 \[\max_w \frac{w^\top \hat{\Sigma}_B w}{w^\top
\hat{\Sigma}_W w}\]
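As a hedged sketch of how this objective is solved numerically (standard linear algebra, not code from the slides): the maximizing directions are the leading generalized eigenvectors of the pair \((\hat{\Sigma}_B, \hat{\Sigma}_W)\).

import numpy as np
from scipy.linalg import eigh

def fisher_directions(Sigma_B, Sigma_W, n_components=1):
    """Directions w maximizing the ratio (w^T Sigma_B w) / (w^T Sigma_W w):
    the leading generalized eigenvectors of the pair (Sigma_B, Sigma_W)."""
    eigvals, eigvecs = eigh(Sigma_B, Sigma_W)   # solves Sigma_B v = lambda * Sigma_W v
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues, largest first
    return eigvecs[:, order[:n_components]]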
LDA’s Main
Assumption
The key assumption of LDA is that all classes share the same
covariance matrix (\(\Sigma\)). They can have different
means (\(\mu_k\)), but their spread and
orientation must be identical. This assumption is what results in a
linear decision boundary. LDA
的关键假设是所有类别共享相同的协方差矩阵 (\(\Sigma\))。它们可以具有不同的均值
(\(\mu_k\)),但它们的散度和方向必须相同。正是这一假设导致了线性决策边界。
## Quadratic Discriminant
Analysis (QDA)
QDA is a more flexible extension of LDA that creates a
quadratic (curved) decision boundary. QDA 是 LDA
的更灵活的扩展,它创建了二次(曲线)决策边界。 ####
Core Idea & Key Assumption
QDA starts with the same principles as LDA but drops the key
assumption. QDA assumes that each class has its own unique
covariance matrix (\(\Sigma_k\)). QDA 的原理与 LDA
相同,但放弃了关键假设。QDA 假设每个类别都有自己独特的协方差矩阵
(\(\Sigma_k\))。
This means each class can have its own spread, shape, and
orientation. This additional flexibility allows for a more complex,
curved decision boundary.
这意味着每个类别可以拥有自己的散度、形状和方向。这种额外的灵活性使得决策边界更加复杂、曲线化。
Key Mathematical
Formula
The classification is made using a discrimination function, \(\delta_k(x)\). We assign a data point \(x\) to the class \(k\) for which \(\delta_k(x)\) is largest. The function for
QDA is: \[\delta_k(x) = -\frac{1}{2}(x -
\mu_k)^\top \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log(|\Sigma_k|) +
\log \pi_k\] The term containing \(x^\top \Sigma_k^{-1} x\) makes this
function a quadratic function of \(x\).
## LDA vs. QDA: The Trade-Off
The choice between LDA and QDA is a classic bias-variance
trade-off. 在 LDA 和 QDA
之间进行选择是典型的偏差-方差权衡。
Use LDA when:
The assumption of a common covariance matrix is reasonable (the
classes have similar shapes).
You have a small amount of training data, as LDA is less prone to
overfitting.
Simplicity is preferred. LDA is less flexible (high bias) but has
lower variance.
假设共同协方差矩阵是合理的(类别具有相似的形状)。
训练数据量较少,因为 LDA 不易过拟合。
简洁是首选。LDA 灵活性较差(偏差较大),但方差较小。
Use QDA when:
The classes have clearly different shapes and spreads (different
covariance matrices).
You have a large amount of training data to properly estimate the
separate covariance matrices for each class.
QDA is more flexible (low bias) but can have high variance, meaning
it might overfit on smaller datasets.
类别具有明显不同的形状和分布(不同的协方差矩阵)。
拥有大量训练数据,可以正确估计每个类别的独立协方差矩阵。
QDA
更灵活(偏差较小),但方差较大,这意味着它可能在较小的数据集上过拟合。
Rule of Thumb: If the class variances are equal or
close, LDA is better. Otherwise, QDA is better.
经验法则: 如果类别方差相等或接近,则 LDA
更佳。否则,QDA 更好。
## Code Understanding
(Python Equivalent)
The slides show code in R. Here’s how you would perform LDA and
evaluate it in Python using the popular scikit-learn
library.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc
import matplotlib.pyplot as plt

# Assume 'df' is your DataFrame with features and a 'target' column
# X = df.drop('target', axis=1)
# y = df['target']

# 1. Split data into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Fit an LDA model (equivalent to lda() in R)
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# 3. Make predictions (equivalent to predict() in R)
y_pred_lda = lda.predict(X_test)

# To fit a QDA model, the process is identical:
# qda = QuadraticDiscriminantAnalysis()
# qda.fit(X_train, y_train)
# y_pred_qda = qda.predict(X_test)

# 4. Create a confusion matrix (equivalent to table())
print("LDA Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_lda))

# 5. Plot the ROC curve (equivalent to the R code for ROC)
# Get prediction probabilities for the positive class
y_pred_proba = lda.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
print("LDA AUC:", auc(fpr, tpr))
plt.plot(fpr, tpr)
plt.show()
The ROC Curve is another important image. It helps
you visualize a classifier’s performance across all possible
classification thresholds. ROC 曲线
是另一个重要的图像。它可以帮助您直观地了解分类器在所有可能的分类阈值下的性能。
The Y-axis is the True Positive
Rate (Sensitivity): “Of all the actual positives, how many did
we correctly identify?”
The X-axis is the False Positive
Rate: “Of all the actual negatives, how many did we incorrectly
label as positive?”
A perfect classifier would have a curve that goes straight up to the
top-left corner (100% TPR, 0% FPR). The diagonal line represents a
random guess. The Area Under the Curve (AUC) summarizes
the model’s performance; a value closer to 1.0 is better.
8.
Here is a summary of the provided slides on Quadratic Discriminant
Analysis (QDA), including the key formulas, code explanations with
Python equivalents, and a guide to the most important images.
## Core Concept: QDA vs. LDA
The main difference between Linear Discriminant Analysis
(LDA) and Quadratic Discriminant Analysis
(QDA) lies in their assumptions about the data.
线性判别分析 (LDA) 和 二次判别分析
(QDA) 的主要区别在于它们对数据的假设。
* LDA assumes that all classes share the same covariance matrix (\(\Sigma\)). It models each class as a normal distribution with a different mean (\(\mu_k\)) but the same shape and orientation. This results in a linear decision boundary between classes. 假设所有类别共享相同的协方差矩阵 (\(\Sigma\))。它将每个类别建模为均值不同 (\(\mu_k\)) 但形状和方向相同的正态分布。这会导致类别之间出现线性决策边界。
* QDA is more flexible. It assumes that each class \(k\) has its own, separate covariance matrix (\(\Sigma_k\)). This allows each class’s distribution to have a unique shape, size, and orientation. This flexibility results in a quadratic decision boundary (like a parabola, hyperbola, or ellipse). 更灵活。它假设每个类别 \(k\) 都有其独立的协方差矩阵 (\(\Sigma_k\))。这使得每个类别的分布具有独特的形状、大小和方向。这种灵活性导致了二次决策边界(类似于抛物线、双曲线或椭圆)。
Analogy 💡: Imagine you’re drawing boundaries around
different clusters of stars. LDA gives you only straight lines to
separate the clusters. QDA gives you curved lines (circles, ellipses),
which can create a much better fit if the clusters themselves are
elliptical and point in different directions.
想象一下,你正在围绕不同的星团绘制边界。LDA 只提供直线来分隔星团。QDA
提供曲线(圆形、椭圆形),如果星团本身是椭圆形且指向不同的方向,则可以产生更好的拟合效果。
## The Math Behind QDA
QDA classifies a new observation \(x\) to the class \(k\) that has the highest discriminant
score, \(\delta_k(x)\). The formula for
this score is what makes the boundary quadratic. QDA 将新的观测值 \(x\) 归类到具有最高判别分数 \(\delta_k(x)\) 的类 \(k\)
中。该分数的公式使得边界具有二次项。
The discriminant function for class \(k\) is: \[\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T
\Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log(|\Sigma_k|) +
\log(\pi_k)\]
Let’s break it down:
\((x - \mu_k)^T \Sigma_k^{-1}(x -
\mu_k)\): This is a quadratic term (since it involves \(x^T \Sigma_k^{-1} x\)). It measures the
squared Mahalanobis distance from \(x\)
to the class mean \(\mu_k\), scaled by
that class’s specific covariance \(\Sigma_k\).
\(\log(|\Sigma_k|)\): A term that
penalizes classes with larger variance.
\(\log(\pi_k)\): The prior
probability of class \(k\). This is our
initial belief about how likely class \(k\) is, before seeing the data.
\(\log(\pi_k)\):类 \(k\) 的先验概率。这是我们在看到数据之前对类
\(k\) 可能性的初始信念。 Because each
class \(k\) has its own \(\Sigma_k\), the quadratic term doesn’t
cancel out when comparing scores between classes, leading to a quadratic
boundary. 由于每个类 \(k\) 都有其自己的
\(\Sigma_k\),因此在比较类之间的分数时,二次项不会抵消,从而导致二次边界。
Key Trade-off:
If the class variances (\(\Sigma_k\)) are truly different,
QDA is better.
If the class variances are similar, LDA is often
better because it’s less flexible and less likely to overfit,
especially with a small number of training samples.
The slides provide R code for fitting a QDA model and evaluating it.
Below is an explanation of the R code and its equivalent in Python using
the popular scikit-learn library.
R Code (from the slides)
The code uses the MASS library for QDA and the
ROCR library for evaluation.
# ######## QDA ##########
# 1. Fit the model on the training data
# This formula `Default~.` means "predict 'Default' using all other variables".
qda.fit.mod2 <- qda(Default~., data=Default, subset=train.ids)

# 2. Make predictions on the test data
# We are interested in the posterior probabilities for the ROC curve
qda.fit.pred3 <- predict(qda.fit.mod2, Default_test)$posterior[,2]

# 3. Evaluate using ROC and AUC
# 'prediction' and 'performance' are functions from the ROCR library
perf <- performance(prediction(qda.fit.pred3, Default_test$Default), "auc")

# 4. Get the AUC value
auc_value <- perf@y.values[[1]]  # Result from slide: 0.9638683
Python Equivalent
(scikit-learn)
Here’s how you would perform the same steps in Python.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Assume 'Default' is your DataFrame and 'default' is the target column
# (preprocessing 'student' and 'default' columns to numbers)
# Default['default_num'] = Default['default'].apply(lambda x: 1 if x == 'Yes' else 0)
# X = Default[['balance', 'income', ...]]
# y = Default['default_num']

# 1. Split data into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# 2. Initialize and fit the QDA model
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

# 3. Predict probabilities on the test set
# We need the probability of the positive class ('Yes') for the AUC calculation
y_pred_proba = qda.predict_proba(X_test)[:, 1]

# 4. Calculate the AUC score
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"AUC Score for QDA: {auc_score:.7f}")

# You can also plot the ROC curve
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
# plt.plot(fpr, tpr)
# plt.show()
## Model Evaluation: ROC and
AUC
The slides correctly emphasize using the ROC curve
and the Area Under the Curve (AUC) to compare model
performance.
ROC Curve (Receiver Operating Characteristic):
This plot shows how well a model can distinguish between two classes. It
plots the True Positive Rate (y-axis) against the
False Positive Rate (x-axis) at all possible
classification thresholds. A better model has a curve that is closer to
the top-left corner.
AUC (Area Under the Curve): This is a single
number that summarizes the entire ROC curve.
AUC = 1: Perfect classifier.
AUC = 0.5: A useless classifier (equivalent to
random guessing).
AUC > 0.7: Generally considered an acceptable
model.
The slides show that for the Default dataset,
LDA’s AUC (0.9647) was slightly higher than QDA’s
(0.9639). This suggests that the assumption of a common
covariance matrix (LDA) was a slightly better fit for this particular
test set, possibly because QDA’s extra flexibility wasn’t needed and it
may have slightly overfit the training data.
这表明,对于这个特定的测试集,公共协方差矩阵 (LDA)
的假设拟合度略高,可能是因为 QDA
的额外灵活性并非必需,并且可能对训练数据略微过拟合。
## Key Takeaways and
Important Images
Here’s
a ranking of the most important visual aids in your slides:
Slide 68/69 (Model Assumption & Formula):
These are the most critical slides. They present the
core theoretical difference between LDA and QDA and provide the
mathematical foundation (the discriminant function formula).
Understanding these is key to understanding QDA.
Slide 73 (ROC Comparison): This is the most
important image for practical evaluation. It visually
compares the performance of LDA and QDA side-by-side, making it easy to
see which one performs better on this specific dataset. The concept of
AUC is introduced here as the method for comparison.
Slide 71 (Decision Boundaries with Different
Thresholds): This is an excellent conceptual image. It shows
how the quadratic decision boundary (the curved lines) separates the
data points. It also illustrates how changing the probability threshold
(from 0.1 to 0.5 to 0.9) shifts the boundary, trading off between
precision and recall.
Here is a summary of the remaining slides, which compare QDA to other popular classification models like Logistic Regression and K-Nearest Neighbors (KNN).
Visualizing the Core
Trade-off: LDA vs. QDA
This is the most important concept in these slides. The choice
between LDA and QDA depends entirely on the underlying structure of your
data.
The slide shows two scenarios:
1. Left Plot (\(\Sigma_1 = \Sigma_2\)): When the true covariance matrices of the classes are the same, the optimal decision boundary (the Bayes classifier) is a straight line. LDA, which assumes equal covariances, creates a linear boundary that approximates this optimal boundary very well. QDA’s flexible, curved boundary is unnecessarily complex and might overfit the training data. In this case, LDA is better.
2. Right Plot (\(\Sigma_1 \neq \Sigma_2\)): When the true covariance matrices are different, the optimal decision boundary is a curve. QDA’s quadratic model can capture this non-linearity much better than LDA’s rigid linear model. In this case, QDA is better.
This perfectly illustrates the bias-variance
tradeoff. LDA has higher bias (it’s less flexible) but lower
variance. QDA has lower bias (it’s more flexible) but higher
variance.
Comparing
Performance on the “Default” Dataset
The slides compare four different models on the same classification
task. Let’s look at their performance using the Area Under the
Curve (AUC), where a higher score is better.
LDA AUC: 0.9647
QDA AUC: 0.9639
Logistic Regression AUC: 0.9645
K-Nearest Neighbors (KNN): The plot shows test
error vs. K. The error is lowest around K=4, but it’s not directly
converted to an AUC score in the slides.
Interestingly, for this particular dataset, LDA, QDA, and Logistic
Regression perform almost identically. This suggests that the decision
boundary for this problem is likely very close to linear, meaning the
extra flexibility of QDA isn’t providing much benefit.
Pros and Cons: Which Model
to Choose?
The final slide asks for a comparison of the models. Here’s a summary
of their key characteristics:
| Model | Type | Decision Boundary | Key Pro | Key Con |
| :-- | :-- | :-- | :-- | :-- |
| Logistic Regression | Parametric | Linear | Highly interpretable, no strong assumptions about data distribution. | Coefficient estimates become unstable (diverge) when the classes are perfectly or nearly perfectly separated. |
| Linear Discriminant Analysis (LDA) | Parametric | Linear | More stable than Logistic Regression when classes are well-separated. | Assumes data is normally distributed with equal covariance matrices for all classes. |
| Quadratic Discriminant Analysis (QDA) | Parametric | Quadratic (Curved) | More flexible than LDA; can model non-linear boundaries. | Requires more data to estimate parameters and is more prone to overfitting. Assumes normality. |
| K-Nearest Neighbors (KNN) | Non-Parametric | Highly Non-linear | Extremely flexible; makes no assumptions about the data’s distribution. | Can be slow on large datasets and suffers from the “curse of dimensionality.” Less interpretable. |
Summary of the Comparison:
Linear Models (Logistic Regression & LDA):
Choose these for simplicity, interpretability, and when you believe the
relationship between predictors and the class is linear. LDA often
outperforms Logistic Regression if its normality assumptions are
met.
Non-Linear Models (QDA & KNN): Choose these
when the decision boundary is likely more complex. QDA is a good middle
ground, offering more flexibility than LDA without being as completely
data-driven as KNN. KNN is the most flexible but requires careful tuning
of the parameter K to avoid overfitting or underfitting.
9.
Here is a more detailed, slide-by-slide analysis of the
presentation.
4.6 Four
Classification Methods: Comparison by Simulation
This section (slides 81-87) introduces four classification methods
and systematically compares their performance on six different simulated
datasets. The goal is to see which method works best under different
conditions (e.g., linear vs. non-linear boundaries, normal
vs. non-normal data).
The four methods being compared are:
* Logistic Regression: A linear method that models the log-odds as a linear function of the predictors.
* Linear Discriminant Analysis (LDA): Another linear method. It also assumes a linear decision boundary but makes stronger assumptions than logistic regression (e.g., that data within each class is normally distributed with a common covariance matrix).
* Quadratic Discriminant Analysis (QDA): A non-linear method. It assumes the log-odds are a quadratic function, which creates a more flexible, curved decision boundary. It assumes data within each class is normally distributed, but without a common covariance matrix.
* K-Nearest Neighbors (KNN): A non-parametric, highly flexible method. Two versions are tested:
  * KNN-1 (\(K=1\)): A very flexible (high variance) model.
  * KNN-CV: A tuned model where the best \(K\) is chosen via cross-validation.
The performance is measured by the test error rate
(lower is better), shown in the boxplots for each scenario.
性能通过测试错误率(越低越好)来衡量,每个场景的箱线图都显示了该错误率。
Scenario 1 (Slide 82):
Setup: A linear decision boundary.
Data is normally distributed with uncorrelated
predictors.
Result: LDA and Logistic Regression perform
best. Their test error rates are low and similar. This is
expected, as the setup perfectly matches their core assumption (linear
boundary). QDA is slightly worse because its extra flexibility (being
quadratic) is unnecessary. KNN-1 is the worst, as its high flexibility
leads to high variance (overfitting).
Scenario 2 (Slide 83):
Setup: Same as Scenario 1 (linear
boundary, normal data), but now the two predictors have
a correlation of 0.5.
Result: Almost no change from
Scenario 1. LDA and Logistic Regression are still the
best. This shows that these linear methods are robust to
correlation between predictors.
Scenario 3 (Slide 84):
Setup: A linear decision boundary,
but the data is drawn from a t-distribution (which is
non-normal and has “heavy tails,” or more extreme outliers).
Result: Logistic Regression is the clear
winner. LDA’s performance gets worse because its assumption of
normality is violated by the t-distribution. QDA’s performance
deteriorates significantly due to the non-normality. This highlights a
key difference: logistic regression is more robust to violations of the
normality assumption.
结果:逻辑回归明显胜出**。LDA 的性能会变差,因为 t
分布违反了其正态性假设。QDA
的性能由于非正态性而显著下降。这凸显了一个关键区别:逻辑回归对违反正态性假设的情况更稳健。
Scenario 4 (Slide 85):
Setup: A quadratic decision
boundary. Data is normally distributed with different
correlations in each class.
Result: QDA is the clear winner by
a large margin. This setup perfectly matches QDA’s assumption (quadratic
boundary from normal data with different covariance structures). All
other methods (LDA, Logistic, KNN) are linear or not flexible enough, so
they perform poorly.
Scenario 5 (Slide 86):
Setup: Another quadratic boundary,
but generated in a different way (using a logistic function of quadratic
terms).
Result: QDA performs best again,
closely followed by the flexible KNN-CV. The linear
methods (LDA, Logistic) have poor performance because they cannot
capture the curve.
Scenario 6 (Slide 87):
Setup: A complex, non-linear
decision boundary (more complex than a simple quadratic curve).
Result: The flexible KNN-CV method is the
winner. Its non-parametric nature allows it to approximate the
complex shape. QDA is not flexible enough and performs worse.
This slide highlights the bias-variance trade-off: the overly flexible KNN-1 (too much variance) is the worst, but the tuned KNN-CV is the best.
This section (slides 88-93) applies Logistic Regression and LDA to
the Smarket dataset from the ISLR package to
predict the stock market’s Direction (Up or Down).
本节(幻灯片 88-93)将逻辑回归和 LDA
应用于“ISLR”包中的“Smarket”数据集,以预测股市的“方向”(上涨或下跌)。
### Data Preparation (Slides 88, 89, 90)
Load Data: The ISLR library is loaded,
and the Smarket dataset is explored. It contains daily
percentage returns (Lag1…Lag5 for the previous
5 days, Today), Volume, and the
Year.
Explore Data: A correlation matrix
(cor(Smarket[,-9])) is computed, and a plot of
Volume over time is generated.
Split Data: The data is split into a training set
(Years 2001-2004) and a test set (Year 2005).
Model 2: Logistic Regression (Lag1 & Lag2)
Model: A logistic regression model is fit on the 2001-2004 training data using only the Lag1 and Lag2 predictors.
Prediction: Predictions are made on the 2005 test set.
Results:
Test Error Rate: 0.4404 (or 55.95%
accuracy). This is an improvement.
Confusion Matrix:

| | True Down | True Up |
| :-- | :-- | :-- |
| Pred Down | 77 | 69 |
| Pred Up | 35 | 71 |
ROC and AUC: The ROC (Receiver Operating
Characteristic) curve is plotted, and the AUC (Area Under the Curve) is
calculated.
AUC Value: 0.5584. This is very
close to 0.5 (which represents a random-chance model), indicating that
the model has very weak predictive power, even though its accuracy is
above 50%.
Model 3: LDA (Lag1 & Lag2)
(Slide 92)
Model: LDA is now performed using the same setup:
Lag1 and Lag2 as predictors, trained on the
2001-2004 data.
Prediction: Predictions are made on the 2005 test
set.
lda.pred <- predict(lda.fit, Smarket.2005)
Results:
Test Error Rate: 0.4404 (or 55.95%
accuracy).
Confusion Matrix:

| | True Down | True Up |
| :-- | :-- | :-- |
| Pred Down | 77 | 69 |
| Pred Up | 35 | 71 |
Observation: The confusion matrix and accuracy are
identical to the logistic regression model.
Final Comparison (Slide 93)
ROC and AUC for LDA: The ROC curve for the LDA
model is plotted.
AUC Value: 0.5584.
Main Conclusion: As highlighted in the green box,
“LDA has identical performance as Logistic regression!”
In this specific practical example, using these two predictors, both
linear methods produce the exact same confusion matrix, the same
accuracy (56%), and the same AUC (0.558). This reinforces the
theoretical idea that both are fitting a linear boundary.
The previous slides showed that Logistic Regression and Linear
Discriminant Analysis (LDA) had identical performance
on the Smarket dataset (using Lag1 and Lag2),
both achieving 56% test accuracy and an AUC of 0.558. The analysis now
tests a more flexible method, QDA.
Model 3: QDA (Lag1 &
Lag2) (Slides 94-95)
Model: A Quadratic Discriminant Analysis (QDA)
model is fit on the same training data (2001-2004) using only the
Lag1 and Lag2 predictors.
Prediction: The model is used to predict the market
direction for the 2005 test set.
Results:
Test Accuracy: The model achieves a test accuracy
of 0.5992 (or 60%).
AUC: The Area Under the Curve (AUC) for the QDA
model is 0.562.
Conclusion: As the slide highlights, “QDA
has better test performance than LDA and Logistic
regression!”
Smarket Example Summary

| Method | Model Type | Test Accuracy | AUC |
| :-- | :-- | :-- | :-- |
| Logistic Regression | Linear | ~56% | 0.558 |
| LDA | Linear | ~56% | 0.558 |
| QDA | Quadratic | ~60% | 0.562 |
This practical example reinforces the lessons from the simulations
(Section 4.6). The two linear methods (LDA, Logistic) had identical
performance. The more flexible, non-linear QDA model performed better,
suggesting that the true decision boundary between “Up” and “Down”
(based on Lag1 and Lag2) is not perfectly
linear.
4.8 Kernel LDA
This new section introduces an even more advanced non-linear method,
Kernel LDA.
The Problem: Linear
Inseparability (Slide 97)
The section starts with a clear visual example. A dataset of two
concentric circles (a “donut” shape) is linearly
inseparable. It is impossible to draw a single straight line to
separate the inner (purple) class from the outer (yellow) class.
The Solution: The
Kernel Trick (Slides 97, 99)
Nonlinear Transformation: The data is “lifted” into
a higher-dimensional feature space using a nonlinear
transformation, \(x \mapsto
\phi(x)\). In the example on the slide, the 2D data is
transformed, and in this new space, the two classes become linearly separable.
The “Kernel Trick”: The main idea (from slide 99)
is that we don’t need to explicitly compute this complex transformation
\(\phi(x)\). LDA (based on Fisher’s
approach) only requires inner products of the data points. The “kernel
trick” allows us to replace the inner product in the high-dimensional
feature space (\(x_i^T x_j\)) with a
simple kernel function, \(k(x_i, x_j)\), computed in the original,
low-dimensional space.
An example of such a kernel is the Gaussian (RBF)
kernel: \(k(x_i, x_j) \propto
e^{-\|x_i - x_j\|^2 / \sigma^2}\).
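As a hedged illustration of the "lifting" idea, the sketch below uses an explicit feature map \(\phi(x) = (x_1, x_2, x_1^2 + x_2^2)\) on scikit-learn's two-circles data rather than the kernel trick itself; the dataset and feature map are my own choices, not taken from the slides.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# "Donut" data: two concentric circles, not linearly separable in 2-D
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Plain LDA in the original space struggles (its boundary is a straight line)
print("LDA accuracy in 2-D:", LinearDiscriminantAnalysis().fit(X, y).score(X, y))

# Explicit nonlinear lift phi(x) = (x1, x2, x1^2 + x2^2):
# in this 3-D feature space the two classes become linearly separable
X_lift = np.column_stack([X, (X ** 2).sum(axis=1)])
print("LDA accuracy after lift:", LinearDiscriminantAnalysis().fit(X_lift, y).score(X_lift, y))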
Academic Foundations (Slide
98)
This method is based on foundational academic papers that generalized linear methods using kernels:
* Fisher discriminant analysis with kernels (Mika, 1999)
* Generalized Discriminant Analysis Using a Kernel Approach (Baudat, 2000)
* Kernel principal component analysis (Schölkopf, 1997)
In short, Kernel LDA is an extension of LDA that uses the kernel
trick to find a linear boundary in a high-dimensional feature space,
which corresponds to a highly non-linear boundary in the original
space.
import io
import ase.io

def parse_qm9_xyz(file_path):
    """Parses a QM9 extended XYZ file and returns a standard XYZ string."""
    with open(file_path, 'r') as f:
        lines = f.readlines()
    # First line is the number of atoms
    num_atoms = int(lines[0].strip())
    # The next line is properties (skip it)
    # The next num_atoms lines are the coordinates
    coord_lines = lines[2:2 + num_atoms]
    # Rebuild a standard XYZ format string in memory
    standard_xyz = f"{num_atoms}\n"
    standard_xyz += "Comment line\n"  # Add a standard comment line
    for line in coord_lines:
        parts = line.split()
        # Keep only the element and the x, y, z coordinates
        standard_xyz += f"{parts[0]} {parts[1]} {parts[2]} {parts[3]}\n"
    return standard_xyz

# Path to your data file
file_path = "/root/QM9/QM9/Data_for_6095_constitutional_isomers_of_C7H10O2.xyz/dsC7O2H10nsd_0001.xyz"

# 1. Parse the special file format into a standard XYZ string
standard_xyz_data = parse_qm9_xyz(file_path)

# 2. ASE reads the standard XYZ data from the string variable
# We use io.StringIO to make the string behave like a file
atoms = ase.io.read(io.StringIO(standard_xyz_data), format="xyz")
# Normal initialization of the 3D structure token
structure_token = nn.Parameter(torch.Tensor(structure_model.input_dim).unsqueeze(0))
nn.init.normal_(structure_token, mean=0.0, std=0.01)
self.structure_token = nn.Parameter(structure_token.squeeze(0))
self.structure_token: 一个可学习的向量
(nn.Parameter)。这个“令牌”不代表任何真实的原子或氨基酸,而是一个抽象的载体。在训练过程中,它将学习如何编码和表示整个蛋白质的全局
3D 结构信息。它就像一个信息信使。
# Linear transformations between the structure and sequence spaces
self.structure_linears = nn.ModuleList([...])
self.seq_linears = nn.ModuleList([...])

def forward(self, graph, input, all_loss=None, metric=None):
    # Build a new protein graph with the 3D token (the last node)
    new_graph = self.build_protein_graph_with_3d_token(graph)

# Sequence output
x = self.sequence_model.model.emb_layer_norm_after(x)
x = x.transpose(0, 1)  # (T, B, E) => (B, T, E)

# The last hidden representation should have layer norm applied
if (seq_layer_idx + 1) in repr_layers:
    hidden_representations[seq_layer_idx + 1] = x
x = self.sequence_model.model.lm_head(x)
This whiteboard provides a concise but detailed overview of two
important and related simulation techniques in computational physics and
chemistry: the Metropolis Monte Carlo (MC) method and Hamiltonian (or
Hybrid) Monte Carlo (HMC). Here is a detailed breakdown of the concepts
presented.
1. Metropolis Monte Carlo (MC)
Method
The heading “Metropolis MC method” introduces a foundational
algorithm in statistical mechanics. Metropolis Monte Carlo is a method
used to generate a sequence of states for a system, allowing for the
calculation of average properties. 左上角的这一部分介绍了基础的
Metropolis Monte Carlo
算法。它是一种生成状态序列的方法,使得处于任何状态的概率都符合期望的概率分布(在物理学中通常是玻尔兹曼分布)。
Conceptual Diagram: The small box with numbered
sites (0-5) and an arrow showing a move from state 0 to 2, and then to
3, illustrates a “random walk.” In Metropolis MC, the system transitions
from one state to another by making small, random changes.
小方框中标有编号的位点(0-5),箭头表示从状态 0 到状态 2,再到状态 3
的移动,代表“随机游走”。在 Metropolis MC
中,系统通过进行微小的随机变化从一个状态过渡到另一个状态。
Random Number Generation: The notation
rand t \in (0,1) indicates the use of a random number \(t\) drawn from a uniform distribution
between 0 and 1. This is a core component of the algorithm, used to
decide whether to accept or reject a proposed new state. 符号
rand t \in (0,1) 表示使用从 0 到 1
之间的均匀分布中抽取的随机数 \(t\)。这是算法的核心部分,用于决定是否接受或拒绝提议的新状态。
Detailed Balance Condition: The equation \(P_o T(o \to n) = P_n T(n \to o)\) is the
principle of detailed balance. It states that in a system at
equilibrium, the probability of being in an old state (\(o\)) and transitioning to a new state
(\(n\)) is equal to the probability of
being in the new state and transitioning back to the old one. This
condition is crucial because it ensures that the simulation will
eventually sample states according to their correct thermodynamic
probabilities (the Boltzmann distribution). 方程 \(P_o T(o \to n) = P_n T(n \to o)\)
是详细平衡的原理。它指出,在平衡系统中,处于旧状态 (\(o\)) 并转变为新状态 (\(n\))
的概率等于处于新状态并转变回旧状态的概率。此条件至关重要,因为它确保模拟最终将根据正确的热力学概率(玻尔兹曼分布)对状态进行采样。
Acceptance Rate: The note \sim 30\%?
likely refers to the target acceptance rate for an
efficient Metropolis MC simulation. If new states are accepted too often
or too rarely, the exploration of the system’s possible configurations
is inefficient. While the famous optimal acceptance rate for certain
high-dimensional problems is around 23.4%, a range of 20-50% is often
considered effective. 注释“30%?”指的是高效 Metropolis
蒙特卡罗模拟的目标接受率。如果新状态接受过于频繁或过于稀少,系统对可能配置的探索就会变得低效。虽然某些高维问题的最佳接受率约为
23.4%,但通常认为 20-50% 的范围是有效的。
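As a hedged, minimal sketch of the Metropolis loop described above (a 1-D toy example of my own, with an arbitrary step size, not the whiteboard's notation):

import numpy as np

def metropolis(energy, x0, n_steps=10_000, step=0.5, beta=1.0, seed=0):
    """Minimal Metropolis MC sketch: random-walk proposals + accept/reject."""
    rng = np.random.default_rng(seed)
    x, E = x0, energy(x0)
    samples, accepted = [], 0
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)          # small random move
        E_new = energy(x_new)
        # Metropolis criterion: accept with probability min(1, exp(-beta * dE))
        if rng.random() < np.exp(-beta * (E_new - E)):
            x, E = x_new, E_new
            accepted += 1
        samples.append(x)
    print(f"acceptance rate: {accepted / n_steps:.0%}")
    return np.array(samples)

# Example: sample a harmonic potential V(x) = x^2 / 2 at beta = 1
samples = metropolis(lambda x: 0.5 * x**2, x0=0.0)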
2. Hamiltonian / Hybrid
Monte Carlo (HMC)
The second topic, “Hamiltonian/Hybrid MC (HMC),” is a more advanced
Monte Carlo method that uses principles from classical mechanics to
propose new states more intelligently than the simple random-walk
approach of the standard Metropolis method. This often leads to a much
higher acceptance rate and more efficient exploration of the state
space. 第二个主题“哈密顿/混合蒙特卡罗
(HMC)”是一种更先进的蒙特卡罗方法,它利用经典力学原理,比标准 Metropolis
方法中简单的随机游走方法更智能地提出新状态。这通常会带来更高的接受率和更高效的状态空间探索。
The whiteboard outlines a four-step HMC algorithm:
Step 1: Randomize Velocities. The first step is to randomize the velocities: \(\vec{v}_i \sim \mathcal{N}(0, k_B T)\). 第一步是随机化速度:\(\vec{v}_i \sim \mathcal{N}(0, k_B T)\)。
* This step introduces momentum into the system. For each particle \(i\), a velocity vector \(\vec{v}_i\) is randomly drawn from a normal (Gaussian) distribution with a mean of 0 and a variance related to the temperature \(T\) and the Boltzmann constant \(k_B\). 此步骤将动量引入系统。对于每个粒子 \(i\),速度矢量 \(\vec{v}_i\) 会随机地从正态(高斯)分布中抽取,该分布的均值为 0,方差与温度 \(T\) 和玻尔兹曼常数 \(k_B\) 相关。
* The full formula for this probability distribution, \(f(\vec{v})\), is the Maxwell-Boltzmann distribution, which is written out further down the board. 该概率分布的完整公式 \(f(\vec{v})\) 是麦克斯韦-玻尔兹曼分布。
Step 2: Molecular Dynamics (MD) Integration. The board notes this as \(t = 0 \to h\) (or \(mh\)) MD and mentions the Verlet algorithm.
This is the “Hamiltonian dynamics” part of the algorithm. Starting
from the current positions and the newly randomized velocities, the
system’s trajectory is calculated for a short period of time (\(h\) or \(mh\)) using Molecular Dynamics (MD).
这是算法的“哈密顿动力学”部分。从当前位置和新随机化的速度开始,使用分子动力学
(MD) 计算系统在短时间内(\(h\) 或 \(mh\))的轨迹。
The name Verlet refers to the Verlet integration
algorithm, a numerical method used to solve Newton’s equations of
motion. It is popular in MD simulations because it is time-reversible
and conserves energy well over long simulations. 指的是 Verlet
积分算法,这是一种用于求解牛顿运动方程的数值方法。它在 MD
模拟中很受欢迎,因为它具有时间可逆性,并且在长时间模拟中能量守恒效果良好。
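As a concrete illustration, here is a minimal sketch of the velocity-Verlet variant of the integrator in Python. The harmonic `force` function and the parameter names are assumptions for the sketch; in a real MD code the force would come from the system's actual potential.

```python
import numpy as np

def force(x):
    # Illustrative harmonic force, F = -dV/dx with V = x^2 / 2.
    return -x

def velocity_verlet(x, v, dt, n_steps, m=1.0):
    f = force(x)
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * (f / m) * dt**2    # position update
        f_new = force(x)
        v = v + 0.5 * (f + f_new) / m * dt        # velocity update with averaged force
        f = f_new
    return x, v
```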
Step 3: Calculate Total Energy The third step is to
calculate total energy: \(E_n =
K_n + V_n\). 第三步是“计算总能量”:\(E_n = K_n + V_n\)。 * After the MD
trajectory, the system is in a new state \(n\). The total energy of this new state,
\(E_n\), is calculated as the sum of
its kinetic energy (\(K_n\), from the
velocities) and its potential energy (\(V_n\), from the positions). MD
轨迹之后,系统处于新状态 \(n\)。新状态的总能量 \(E_n\) 等于其动能 (\(K_n\),由速度计算得出)和势能 (\(V_n\),由位置计算得出)之和。
Step 4: Acceptance Test The final step is the
acceptance criterion: \(\text{acc}(o \to n) =
\min(1, e^{-\beta(E_n - E_o)})\). 最后一步是接受准则:\(\text{acc}(o \to n) = \min(1, e^{-\beta(E_n -
E_o)})\)。 * This is the Metropolis acceptance criterion. The
algorithm decides whether to accept the new state \(n\) or reject it and stay in the old state
\(o\). 这是 Metropolis
接受准则。算法决定是接受新状态 \(n\)
还是拒绝它并保持旧状态 \(o\)。 * The
probability of acceptance depends on the change in total energy (\(E_n - E_o\)). If the new energy is lower,
the move is always accepted. If the new energy is higher, it might still
be accepted with a probability \(e^{-\beta(E_n
- E_o)}\), where \(\beta = 1/(k_B
T)\). This allows the system to escape from local energy minima.
接受概率取决于总能量的变化 (\(E_n -
E_o\))。如果新能量较低,则始终接受该移动。如果新的能量更高,它仍然可能以概率
\(e^{-\beta(E_n - E_o)}\) 被接受,其中
\(\beta = 1/(k_B
T)\)。这使得系统能够摆脱局部能量最小值。
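The four steps above can be sketched in Python roughly as follows. The harmonic potential, reduced units (\(k_B = 1\)), the time step, and the trajectory length are all illustrative assumptions; the velocities are drawn with variance \(k_B T / m\), which reduces to the board's \(\mathcal{N}(0, k_B T)\) for unit mass.

```python
import numpy as np

rng = np.random.default_rng(1)
kB = 1.0                                       # reduced units

def potential(x):      return 0.5 * np.sum(x**2)     # illustrative choice
def force(x):          return -x
def kinetic(v, m=1.0): return 0.5 * m * np.sum(v**2)

def hmc_step(x, T=1.0, m=1.0, dt=0.05, n_md=20):
    # Step 1: randomize velocities from the Maxwell-Boltzmann distribution
    v = rng.normal(0.0, np.sqrt(kB * T / m), size=x.shape)
    E_old = kinetic(v, m) + potential(x)

    # Step 2: short MD trajectory (velocity Verlet)
    x_new, v_new, f = x.copy(), v.copy(), force(x)
    for _ in range(n_md):
        x_new = x_new + v_new * dt + 0.5 * (f / m) * dt**2
        f_new = force(x_new)
        v_new = v_new + 0.5 * (f + f_new) / m * dt
        f = f_new

    # Step 3: total energy of the trial state
    E_new = kinetic(v_new, m) + potential(x_new)

    # Step 4: Metropolis acceptance on the total-energy change
    dE = E_new - E_old
    if dE <= 0.0 or rng.uniform(0.0, 1.0) < np.exp(-dE / (kB * T)):
        return x_new, True
    return x, False

# Example usage on a hypothetical 3-coordinate system:
# x, accepted = hmc_step(np.zeros(3))
```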
Key Formulas and Notations
Maxwell-Boltzmann
Distribution麦克斯韦-玻尔兹曼分布: The formula for the velocity
distribution is given as: \(f(\vec{v}) =
\left(\frac{m}{2\pi k_B T}\right)^{3/2} \exp\left(-\frac{m v^2}{2 k_B
T}\right)\) This gives the probability density for a particle of
mass \(m\) to have a velocity \(\vec{v}\) at a given temperature \(T\). 质量为 \(m\) 的粒子在给定温度 \(T\) 下速度为 \(\vec{v}\) 的概率密度。
Energy Conservation and Acceptance Rate: The
notes \(E_n \approx E_o\) and \(75\%\) highlight a key advantage of HMC.
Because the Verlet integrator approximately conserves energy, the final
energy \(E_n\) after the MD trajectory
is usually very close to the initial energy \(E_o\). This means the term \((E_n - E_o)\) is small, and the acceptance
probability is high. The \(75\%\)
indicates a typical or target acceptance rate for HMC, which is
significantly higher than for standard Metropolis MC. 注释 \(E_n \approx E_o\) 和 \(75\%\) 凸显了 HMC 的一个关键优势。由于
Verlet 积分器近似地守恒能量,MD 轨迹后的最终能量 \(E_n\) 通常非常接近初始能量 \(E_o\)。这意味着 \((E_n - E_o)\) 项很小,接受概率很高。\(75\%\) 表示 HMC
的典型或目标接受率,明显高于标准 Metropolis MC。
Hamiltonian Operator: The symbol \(\hat{H}\) written on the adjacent board
represents the Hamiltonian operator, which gives the total energy of the
system. The note Δ Adiabatic suggests that the MD evolution
is ideally an adiabatic process (no heat exchange), during which the
total energy (the Hamiltonian) is conserved. 相邻板上的符号 \(\hat{H}\)
代表哈密顿算符,它给出了系统的总能量。注释“Δ Adiabatic”表明 MD
演化在理想情况下是一个绝热过程(无热交换),在此过程中总能量(哈密顿量)守恒。
This whiteboard displays the fundamental equation of quantum
chemistry: the time-dependent Schrödinger equation, along with the
detailed breakdown of the molecular Hamiltonian operator. This equation
is the starting point for almost all ab initio
(first-principles) quantum mechanical calculations of molecular systems.
这块白板展示了量子化学的基本方程:含时薛定谔方程,以及分子哈密顿算符的详细分解。该方程是几乎所有分子系统从头算(第一性原理)量子力学计算的起点。
3. The Time-Dependent
Schrödinger Equation
At the top of the board, the fundamental equation governing the
evolution of a quantum mechanical system is presented:
白板顶部显示了控制量子力学系统演化的基本方程: \(i\hbar \frac{\partial \Psi}{\partial t} =
\hat{\mathcal{H}} \Psi\)
\(\Psi\) (Psi)
is the wave function of the system. It contains all the
information that can be known about the system (e.g., the positions and
momenta of all particles).
是系统的波函数。它包含了关于系统的所有已知信息(例如,所有粒子的位置和动量)。
\(\hat{\mathcal{H}}\) is the
Hamiltonian operator, which represents the total energy
of the system.
是哈密顿算符,表示系统的总能量。
\(i\)
是虚数单位。
\(i\) is the
imaginary unit.
\(\hbar\) is
the reduced Planck
constant.是约化普朗克常数。
\(\frac{\partial \Psi}{\partial
t}\) represents how the wave function changes over
time.表示波函数随时间的变化。
This equation states that the time evolution of the quantum state is
dictated by the system’s total energy operator, the Hamiltonian. The
note “Δ Adiabatic process” likely connects to the context of the
Born-Oppenheimer approximation, where the electronic Schrödinger
equation is solved for fixed nuclear positions, assuming the electrons
adjust adiabatically (instantaneously) to the motion of the nuclei.
该方程表明,量子态的时间演化由系统的总能量算符——哈密顿算符决定。注释“Δ绝热过程”与玻恩-奥本海默近似相关,在该近似中,电子薛定谔方程是针对固定原子核位置求解的,假设电子以绝热方式(瞬时)调整以适应原子核的运动。
4. The Full
Molecular Hamiltonian (\(\hat{\mathcal{H}}\))
The main part of the whiteboard is the detailed expression for the
non-relativistic, time-independent molecular Hamiltonian. It is the sum
of the kinetic and potential energies of all the nuclei and electrons in
the system. The equation can be broken down into five distinct terms:
白板的主要部分是非相对论性、时间无关的分子哈密顿量的详细表达式。它是系统中所有原子核和电子的动能和势能之和。
A. Kinetic Energy Terms 动能项
Kinetic Energy of the Nuclei 原子核的动能:\(-\sum_{I=1}^{P}
\frac{\hbar^2}{2M_I}\nabla_I^2\) This term is the sum of the
kinetic energy operators for all the nuclei in the
system.此项是系统中所有原子核的动能算符之和。
The sum is over all nuclei, indexed by \(I\) from 1 to \(P\).该和涵盖所有原子核,索引为 \(I\),从 1 到 \(P\)。
\(M_I\) is the mass of nucleus
\(I\).是原子核 \(I\) 的质量。
\(\nabla_I^2\) is the Laplacian
operator, which involves the second spatial derivatives with respect to
the coordinates of nucleus \(I\).是拉普拉斯算符,它涉及原子核 \(I\) 坐标的二阶空间导数。
Kinetic Energy of the Electrons 电子的动能:\(-\sum_{i=1}^{N}
\frac{\hbar^2}{2m}\nabla_i^2\) This is the corresponding sum of
the kinetic energy operators for all the
electrons.这是所有电子的动能算符的对应和。
The sum is over all electrons, indexed by \(i\) from 1 to \(N\).该和是针对所有电子的,索引为 \(i\),从 1 到 \(N\)。
\(m\) is the mass of an
electron.是电子的质量。
\(\nabla_i^2\) is the Laplacian
operator with respect to the coordinates of electron \(i\).是关于电子 \(i\) 坐标的拉普拉斯算符。
B. Potential Energy Terms (Electrostatic Interactions)
势能项(静电相互作用)
Nuclear-Nuclear Repulsion 核间排斥力:\(+\frac{e^2}{2}\sum_{I=1}^{P}\sum_{J \neq I}^{P}
\frac{Z_I Z_J}{|\vec{R}_I - \vec{R}_J|}\) This term represents
the potential energy from the electrostatic (Coulomb) repulsion between
all pairs of positively charged
nuclei.该项表示所有带正电原子核对之间静电(库仑)排斥力产生的势能。
The double summation runs over all unique pairs of nuclei (\(I, J\)).对所有唯一的原子核对 (\(I, J\)) 进行双重求和。
\(Z_I\) is the atomic number (i.e.,
the charge) of nucleus \(I\).是原子核
\(I\) 的原子序数(即电荷)。
\(\vec{R}_I\) is the position
vector of nucleus \(I\).是原子核 \(I\) 的位置矢量。
\(e\) is the elementary
charge.是基本电荷。
Electron-Electron Repulsion 电子间排斥力:\(+\frac{e^2}{2}\sum_{i=1}^{N}\sum_{j \neq i}^{N}
\frac{1}{|\vec{r}_i - \vec{r}_j|}\) This term represents the
potential energy from the electrostatic repulsion between all pairs of
negatively charged
electrons.该项表示所有带负电的电子对之间静电排斥的势能。
The double summation runs over all unique pairs of electrons (\(i, j\)).对所有不同的电子对 (\(i, j\)) 进行双重求和。
\(\vec{r}_i\) is the position
vector of electron \(i\).是电子 \(i\) 的位置矢量。
Nuclear-Electron Attraction 核-电子引力:\(-e^2\sum_{I=1}^{P}\sum_{i=1}^{N}
\frac{Z_I}{|\vec{R}_I - \vec{r}_i|}\) This final term represents
the potential energy from the electrostatic attraction between the
nuclei and the electrons.这最后一项表示原子核和电子之间静电引力的势能。
The summation runs over all nuclei and all
electrons.该求和适用于所有原子核和所有电子。
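Putting the five terms above together (a reconstruction assembled from the terms as listed, with all physical constants written explicitly), the full molecular Hamiltonian reads:
\[\hat{\mathcal{H}} = -\sum_{I=1}^{P}\frac{\hbar^{2}}{2M_I}\nabla_I^{2} -\sum_{i=1}^{N}\frac{\hbar^{2}}{2m}\nabla_i^{2} +\frac{e^{2}}{2}\sum_{I=1}^{P}\sum_{J\neq I}^{P}\frac{Z_I Z_J}{|\vec{R}_I-\vec{R}_J|} +\frac{e^{2}}{2}\sum_{i=1}^{N}\sum_{j\neq i}^{N}\frac{1}{|\vec{r}_i-\vec{r}_j|} -e^{2}\sum_{I=1}^{P}\sum_{i=1}^{N}\frac{Z_I}{|\vec{R}_I-\vec{r}_i|}\]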
5. Notations and Conventions
Atomic Units: The note \(\frac{1}{4\pi\epsilon_0} = k = 1\) is a key
indicator of the convention being used. This sets the Coulomb constant
to 1, which is a hallmark of Hartree atomic units. In
this system, the elementary charge (\(e\)), electron mass (\(m\)), and reduced Planck constant (\(\hbar\)) are also set to 1. This simplifies
the Hamiltonian significantly, removing the physical constants and
making the equations easier to work with computationally.
是所用约定的关键指标。这将库仑常数设置为 1,这是Hartree
原子单位的标志。在这个系统中,基本电荷 (\(e\))、电子质量 (\(m\)) 和约化普朗克常数 (\(\hbar\)) 也设为
1。这显著简化了哈密顿量,消除了物理常数,使方程更易于计算。
Interaction Terms: The notations \(\{i, j\}\), \(\{i, j, k\}\), etc., refer to the
“many-body” problem. The Hamiltonian contains two-body terms
(interactions between pairs of particles), and solving the Schrödinger
equation exactly is extremely difficult because the motion of every
particle is correlated with every other particle. Computational methods
are designed to approximate these interactions. 符号 \(\{i, j\}\)、\(\{i, j, k\}\)
等指的是“多体”问题。哈密顿量包含二体项(粒子对之间的相互作用),而精确求解薛定谔方程极其困难,因为每个粒子的运动都与其他粒子相关。计算方法旨在近似这些相互作用。
This whiteboard presents the mathematical foundation for
non-adiabatic molecular dynamics, a sophisticated
method in theoretical chemistry and physics used to simulate processes
where the Born-Oppenheimer approximation breaks down. This typically
occurs in photochemistry, electron transfer reactions, and when
molecules interact with intense laser fields.
这块白板展示了非绝热分子动力学的数学基础,这是理论化学和物理学中一种复杂的方法,用于模拟玻恩-奥本海默近似失效的过程。这通常发生在光化学、电子转移反应以及分子与强激光场相互作用时。
The title “Δ non-adiabatic MD” indicates that the topic moves beyond
the standard Born-Oppenheimer approximation. In this approximation, it
is assumed that the light electrons adjust instantaneously to the motion
of the heavy nuclei, allowing the system to be described by a single
potential energy surface. Non-adiabatic methods, by contrast, account
for the quantum mechanical coupling between multiple electronic
states.
The starting point for this method is the “ansatz” (an educated guess
for the form of the solution). This is the Born-Huang expansion for the
total molecular wave function, \(\Psi\).
该方法的起点是“拟设”(对解形式的合理猜测)。这是分子总波函数 \(\Psi\) 的玻恩-黄展开式。
\(\Psi(\vec{R}, \vec{r},
t)\) is the total wave function for the entire molecule.
It depends on the coordinates of all nuclei (\(\vec{R}\)), all electrons (\(\vec{r}\)), and time (\(t\)).
是整个分子的总波函数。它取决于所有原子核 (\(\vec{R}\))、所有电子 (\(\vec{r}\)) 和时间 (\(t\)) 的坐标。
\(\Phi_n(\vec{R},
\vec{r})\) are the electronic wave
functions. They are the solutions to the electronic Schrödinger
equation for a fixed nuclear geometry \(\vec{R}\) and form a complete basis set.
The index \(n\) labels the electronic
state (e.g., ground state, first excited state, etc.).
它们是给定原子核几何构型 \(\vec{R}\)
的电子薛定谔方程的解,并构成一个完整的基组。下标 \(n\)
标记电子态(例如,基态、第一激发态等)。
\(\Theta_n(\vec{R},
t)\) are the nuclear wave functions.
Each \(\Theta_n\) describes the motion
of the nuclei on the potential energy surface of the corresponding
electronic state, \(\Phi_n\).
Crucially, they depend on time. 是核波函数。每个 \(\Theta_n\) 描述原子核在相应电子态 \(\Phi_n\)
势能面上的运动。至关重要的是,它们依赖于时间。
This ansatz expresses the total molecular state as a superposition of
electronic states, where the coefficients of the superposition are the
nuclear wave functions.
该拟设将总分子态表示为电子态的叠加,其中叠加的系数是核波函数。
8. The
Partitioned Molecular Hamiltonian 分割分子哈密顿量
The total molecular Hamiltonian, \(\hat{\mathcal{H}}\), is partitioned into
terms that act on the nuclei and electrons separately. 总分子哈密顿量
\(\hat{\mathcal{H}}\)
被分割成分别作用于原子核和电子的项。
\(-\sum_{I}
\frac{\hbar^2}{2M_I}\nabla_I^2\): This is the kinetic
energy operator for the nuclei, often denoted as \(\hat{T}_n\).这是原子核的动能算符,通常表示为
\(\hat{T}_n\)。
\(\hat{\mathcal{H}}_e\): This is the
electronic Hamiltonian, which includes the kinetic
energy of the electrons and the potential energy of electron-electron
and electron-nuclear interactions.
这是电子哈密顿量,包含电子的动能以及电子-电子和电子-核相互作用的势能。
\(\hat{V}_{nn}\): This is the
potential energy operator for nuclear-nuclear
repulsion.这是核-核排斥的势能算符。
9. The
Electronic Schrödinger Equation 电子薛定谔方程
The electronic basis functions, \(\Phi_n\), are defined as the eigenfunctions
of the electronic Hamiltonian (plus the nuclear repulsion term) for a
fixed nuclear configuration \(\vec{R}\). 电子基函数 \(\Phi_n\) 定义为对于固定的核构型 \(\vec{R}\),电子哈密顿量(加上核排斥项)的本征函数。
\(E_n(\vec{R})\)
are the eigenvalues, which are the potential energy surfaces
(PES). Each electronic state \(n\) has its own PES, which dictates the
forces acting on the nuclei when the molecule is in that electronic
state. 是特征值,即势能面 (PES)。每个电子态 \(n\)
都有其自身的势能面,它决定了分子处于该电子态时作用于原子核的力。
10.
Deriving the Equations of Motion for the Nuclei 推导原子核运动方程
The final part of the whiteboard begins the derivation of the
time-dependent Schrödinger equation for the nuclear wave functions,
\(\Theta_k\). The process starts with
the full time-dependent Schrödinger equation, \(i\hbar \frac{\partial \Psi}{\partial t} =
\hat{\mathcal{H}} \Psi\). To find the equation for a specific
nuclear wave function \(\Theta_k\),
this main equation is projected onto the corresponding electronic basis
state \(\Phi_k\).
白板的最后一部分开始推导原子核波函数 \(\Theta_k\)
的含时薛定谔方程。该过程从完整的含时薛定谔方程 \(i\hbar \frac{\partial \Psi}{\partial t} =
\hat{\mathcal{H}} \Psi\) 开始。为了找到特定原子核波函数 \(\Theta_k\)
的方程,需要将这个主方程投影到相应的电子基态 \(\Phi_k\) 上。
This is done by multiplying from the left by the complex conjugate of
the electronic wave function, \(\Phi_k^*\), and integrating over all
electronic coordinates, \(d\vec{r}\).
可以通过从左边乘以电子波函数 \(\Phi_k^*\) 的复共轭,然后在所有电子坐标
\(d\vec{r}\) 上积分来实现。
The board then shows the result of substituting the Born-Huang ansatz
for \(\Psi\) and the partitioned
Hamiltonian for \(\hat{\mathcal{H}}\)
into this projected equation: 然后,黑板显示将 Born-Huang 拟设式代入
\(\Psi\),将分块哈密顿量代入以下投影方程的结果:
Left Hand Side: The left side of the projection
has been simplified. Because the electronic basis functions \(\Phi_n\) form an orthonormal set (\(\int \Phi_k^* \Phi_n d\vec{r} =
\delta_{kn}\)), the sum collapses to a single term for \(n=k\). 投影左侧已简化。由于电子基函数 \(\Phi_n\) 构成一个正交集 (\(\int \Phi_k^* \Phi_n d\vec{r} =
\delta_{kn}\)),因此当 \(n=k\)
时,和将折叠为一个项。
Right Hand Side: This complex integral is the
core of non-adiabatic dynamics. When the nuclear kinetic energy
operator, \(\nabla_I^2\), acts on the
product \(\Theta_n \Phi_n\), it acts on
both functions (via the product rule). The terms that arise from \(\nabla_I\) acting on the electronic wave
functions \(\Phi_n\) are known as
non-adiabatic coupling terms. These terms are
responsible for enabling transitions between different electronic
potential energy surfaces, which is the essence of non-adiabatic
dynamics. 这个复积分是非绝热动力学的核心。当核动能算符 \(\nabla_I^2\) 作用于乘积 \(\Theta_n \Phi_n\)
时,它会作用于这两个函数(通过乘积规则)。由 \(\nabla_I\) 作用于电子波函数 \(\Phi_n\)
而产生的项称为非绝热耦合项。这些项使不同电子势能面之间的跃迁成为可能,这是非绝热动力学的本质。
This whiteboard continues the mathematical derivation for
non-adiabatic molecular dynamics started in the previous image. It
focuses on expanding the nuclear kinetic energy term to reveal the
crucial couplings between different electronic
states.这块白板延续了上一张图片中非绝热分子动力学的数学推导。它着重于扩展核动能项,以揭示不同电子态之间的关键耦合。
11.
Starting Point: The Projected Schrödinger Equation
起点:投影薛定谔方程
The derivation picks up from the equation for the time evolution of
the nuclear wave function, \(\Theta_k\). The right-hand side of this
equation is being evaluated. 推导过程取自核波函数 \(\Theta_k\)
的时间演化方程。该方程的右边正在求值。
This equation separates the total energy into two parts
该方程将总能量分为两部分 : * The first term is the contribution from the
nuclear kinetic energy operator, \(-\sum_{I} \frac{\hbar^2}{2M_I}\nabla_I^2\).
第一项是核动能算符的贡献 * The second term, \(E_k \Theta_k\), is the contribution from
the potential energy. This term arises from the action
of the electronic Hamiltonian part \((\hat{\mathcal{H}}_e + \hat{V}_{nn})\) on
the basis functions. Due to the orthonormality of the electronic
wavefunctions (\(\int \Phi_k^* \Phi_n
\,d\vec{r} = \delta_{kn}\)), the sum over \(n\) collapses to a single term for the
potential energy. 第二项,\(E_k
\Theta_k\),是势能的贡献。这一项源于电子哈密顿量部分
\((\hat{\mathcal{H}}_e +
\hat{V}_{nn})\) 对基函数的作用。由于电子波函数(\(\int \Phi_k^* \Phi_n \,d\vec{r} =
\delta_{kn}\))的正交性,\(n\)项的和会坍缩为势能的一项。
The challenge, and the core of the physics, lies in evaluating the
first term, as the nuclear derivative \(\nabla_I\) acts on both the
nuclear wave function \(\Theta_n\) and
the electronic wave function \(\Phi_n\).
难点在于,也是物理的核心在于如何计算第一项,因为核导数 \(\nabla_I\) 同时作用于核波函数 \(\Theta_n\) 和电子波函数 \(\Phi_n\)。
12.
Applying the Product Rule for the Laplacian
应用拉普拉斯算子的乘积规则
To expand the kinetic energy term, the product rule for the Laplacian
operator acting on two functions (A and B) is used. The board writes
this rule as: 为了展开动能项,我们利用了拉普拉斯算子作用于两个函数(A 和
B)的乘积规则。白板上将这条规则写成: \(\nabla^2(AB) = (\nabla^2 A)B + 2(\nabla
A)\cdot(\nabla B) + A(\nabla^2 B)\)
In our case, \(A = \Theta_n(\vec{R},
t)\) and \(B = \Phi_n(\vec{R},
\vec{r})\). The derivative \(\nabla_I\) is with respect to the nuclear
coordinates \(\vec{R}_I\).
在我们的例子中,\(A = \Theta_n(\vec{R},
t)\),\(B = \Phi_n(\vec{R},
\vec{r})\)。导数 \(\nabla_I\)
是关于原子核坐标 \(\vec{R}_I\) 的。
13. Expanding the
Kinetic Energy Term 展开动能项
Applying this rule, the integral containing the kinetic energy
operator is expanded: 应用此规则,展开包含动能算符的积分: \(= -\sum_I \frac{\hbar^2}{2M_I} \int \Phi_k^*
\sum_n \left( (\nabla_I^2 \Theta_n)\Phi_n + 2(\nabla_I
\Theta_n)\cdot(\nabla_I \Phi_n) + \Theta_n(\nabla_I^2 \Phi_n) \right)
d\vec{r} + E_k \Theta_k\)
This step explicitly shows how the nuclear kinetic energy operator
gives rise to three distinct types of
terms.此步骤明确展示了核动能算符如何产生三种不同类型的项。
14.
Final Result and Identification of Coupling Terms
最终结果及耦合项的识别
The final step is to take the integral over the electronic
coordinates (\(d\vec{r}\)) and
rearrange the terms. The expression is simplified by again using the
orthonormality of the electronic wave functions, \(\int \Phi_k^* \Phi_n \, d\vec{r} =
\delta_{kn}\). 最后一步是对电子坐标 (\(d\vec{r}\))
进行积分,并重新排列各项。再次利用电子波函数的正交性简化表达式,\(\int \Phi_k^* \Phi_n \, d\vec{r} =
\delta_{kn}\)。
This final equation is profound. It represents the time-dependent
Schrödinger equation for the nuclear wave function \(\Theta_k\), but it is coupled to all other
nuclear wave functions \(\Theta_n\).
Let’s break down the key terms within the parentheses:
最后一个方程意义深远。它代表了核波函数 \(\Theta_k\)
的含时薛定谔方程,但它与所有其他核波函数 \(\Theta_n\)
耦合。让我们分解一下括号内的关键项:
\(\nabla_I^2
\Theta_k\): This is the standard kinetic energy term for
the nuclei moving on the potential energy surface of state \(k\). This is the only term that would
remain in the simple Born-Oppenheimer (adiabatic) approximation.
这是原子核在势能面 \(k\)
上运动的标准动能项。这是在简单的
Born-Oppenheimer(绝热)近似中唯一保留的项。
\(\left( \int \Phi_k^* \nabla_I
\Phi_n \, d\vec{r} \right)\): This is the
first-derivative non-adiabatic coupling term (NACT),
often called the derivative coupling. This vector quantity determines
the strength of the coupling between electronic states \(k\) and \(n\) due to the velocity of the nuclei. It
is the primary term responsible for enabling transitions between
different potential energy surfaces. 这是一阶导数非绝热耦合项
(NACT),通常称为导数耦合。该矢量决定了由于原子核速度而导致的电子态
\(k\) 和 \(n\)
之间耦合的强度。它是实现不同势能面之间跃迁的主要项。
\(\left( \int \Phi_k^*
\nabla_I^2 \Phi_n \, d\vec{r} \right)\): This is the
second-derivative non-adiabatic coupling term, a scalar
quantity. While often smaller than the first-derivative term, it is also
part of the complete description of non-adiabatic effects.
是二阶导数非绝热耦合项,一个标量。虽然它通常小于一阶导数项,但它也是非绝热效应完整描述的一部分。
In summary, this derivation shows mathematically how the motion of
the nuclei (via the \(\nabla_I\)
operator) can induce quantum mechanical transitions between different
electronic states (\(\Phi_k \leftrightarrow
\Phi_n\)). The strength of these transitions is governed by the
non-adiabatic coupling terms, which depend on how the electronic wave
functions change as the nuclear geometry changes.
总之,该推导从数学上展示了原子核的运动(通过 \(\nabla_I\)
算符)如何诱导不同电子态之间的量子力学跃迁(\(\Phi_k \leftrightarrow
\Phi_n\))。这些跃迁的强度由非绝热耦合项控制,而非绝热耦合项又取决于电子波函数如何随原子核几何结构的变化而变化。
This whiteboard concludes the derivation of the equations for
non-adiabatic molecular dynamics by defining the coupling operator and
then showing how different levels of approximation—specifically the
Born-Huang and the more restrictive Born-Oppenheimer
approximations—arise from neglecting certain coupling terms.
这块白板通过定义耦合算符,并展示不同程度的近似——特别是 Born-Huang
近似和更严格的 Born-Oppenheimer
近似——是如何通过忽略某些耦合项而产生的,从而推导出非绝热分子动力学方程的。
15.
Definition of the Non-Adiabatic Coupling Operator
非绝热耦合算符的定义
The whiteboard begins by collecting all the non-adiabatic coupling
terms derived previously into a single operator, \(C_{kn}\).
白板首先将之前推导的所有非绝热耦合项合并为一个算符 \(C_{kn}\)。
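The exact board notation is not transcribed here; a standard form of this operator, consistent with the first- and second-derivative coupling terms identified above, is:
\[C_{kn} = -\sum_{I}\frac{\hbar^{2}}{2M_I}\left[\,2\left(\int \Phi_k^{*}\,\nabla_I \Phi_n\,d\vec{r}\right)\!\cdot\nabla_I \;+\; \int \Phi_k^{*}\,\nabla_I^{2}\Phi_n\,d\vec{r}\,\right]\]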
This operator, \(C_{kn}\),
represents the total effect of the coupling between electronic state
\(k\) and electronic state \(n\), which is induced by the kinetic energy
of the nuclei. 此算符 \(C_{kn}\)
表示由原子核动能引起的电子态 \(k\)
和电子态 \(n\) 之间耦合的总效应。
The operator acts on the nuclear wave function that follows it in
the full equation. The \(\nabla_I\)
term acts as a derivative on that wave function.
该算符作用于完整方程中跟随它的核波函数。\(\nabla_I\) 项充当该波函数的导数。
16. The Coupled
Equations of Motion 耦合运动方程
Using this compact definition, the full set of coupled time-dependent
Schrödinger equations for the nuclear wave functions can be written as:
基于此简洁定义,核波函数的完整耦合含时薛定谔方程组可以写成:
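A standard way to write these coupled equations, consistent with the terms derived above (the board's exact layout is not transcribed), is:
\[i\hbar\,\frac{\partial \Theta_k}{\partial t} = \left[-\sum_{I}\frac{\hbar^{2}}{2M_I}\nabla_I^{2} + E_k(\vec{R})\right]\Theta_k + \sum_{n} C_{kn}\,\Theta_n\]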
This is the central result. It shows that the time evolution of the
nuclear wave function on a given potential energy surface \(k\) (described by \(\Theta_k\)) depends on two things:
这是核心结论。它表明,核波函数在给定势能面 \(k\)(用 \(\Theta_k\)
描述)上的时间演化取决于两个因素: 1. The motion on its own surface,
governed by its kinetic energy and the potential \(E_k\). 其自身表面上的运动,由其动能和势能
\(E_k\) 控制。 2. The influence of the
nuclear wave functions on all other electronic surfaces (\(\Theta_n\)), mediated by the coupling
operators \(C_{kn}\).
核波函数对所有其他电子表面(\(\Theta_n\))的影响,由耦合算符 \(C_{kn}\) 介导。
17. The Born-Huang
Approximation 玻恩-黄近似
The first and most crucial approximation is introduced to simplify
this complex set of coupled equations.
为了简化这组复杂的耦合方程,引入了第一个也是最重要的近似。
If \(C_{kn} = 0\) for \(k \neq n\) (Born-Huang
approximation)
This approximation assumes that the off-diagonal
coupling terms, which are responsible for transitions between different
electronic states, are negligible. However, it retains the
diagonal coupling term (\(C_{kk}\)). This leads to a simplified,
uncoupled equation:
该近似假设导致不同电子态之间跃迁的非对角耦合项可以忽略不计。然而,它保留了对角耦合项(\(C_{kk}\))。这可以得到一个简化的非耦合方程:
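In the notation above, the simplified equation (keeping only the diagonal coupling; a reconstruction of the standard form) reads:
\[i\hbar\,\frac{\partial \Theta_k}{\partial t} = \left[-\sum_{I}\frac{\hbar^{2}}{2M_I}\nabla_I^{2} + E_k(\vec{R}) + C_{kk}\right]\Theta_k\]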
The term \(C_{kk}\) is known as the
diagonal Born-Oppenheimer correction (DBOC). It
represents a small correction to the potential energy surface \(E_k\) that arises from the fact that the
electrons do not adjust perfectly and instantaneously to the nuclear
motion, even within the same electronic state. \(C_{kk}\)
项被称为对角玻恩-奥本海默修正 (DBOC)。它表示对势能面
\(E_k\)
的微小修正,其原因是即使在相同的电子态下,电子也无法完美且即时地适应核运动。
Note on Real Wavefunctions 关于实波函数的注释: The
board shows that for real wavefunctions, the first-derivative part of
the diagonal correction vanishes: \(\int
\Phi_k \nabla_I \Phi_k \, d\vec{r} = 0\). This is because the
integral is related to the gradient of the normalization condition,
\(\nabla_I \int \Phi_k^2 \, d\vec{r} =
\nabla_I(1) = 0\), which expands to \(2\int \Phi_k \nabla_I \Phi_k \, d\vec{r} =
0\). 黑板显示,对于实波函数,对角修正的一阶导数部分为零:\(\int \Phi_k \nabla_I \Phi_k \, d\vec{r} =
0\)。这是因为积分与归一化条件的梯度有关,\(\nabla_I \int \Phi_k^2 \, d\vec{r} = \nabla_I(1) =
0\),其展开为 \(2\int \Phi_k \nabla_I
\Phi_k \, d\vec{r} = 0\)。
18. The
Born-Oppenheimer Approximation 玻恩-奥本海默近似
The final and most widely used approximation is the Born-Oppenheimer
approximation. It is more restrictive than the Born-Huang approximation.
最后一种也是最广泛使用的近似方法是玻恩-奥本海默近似。它比玻恩-黄近似更具限制性。
If \(C_{kk} = 0\)
(Born-Oppenheimer approximation) 若\(C_{kk} =
0\)(玻恩-奥本海默近似)
This assumes that the diagonal correction term is also negligible. By
setting all \(C_{kn}=0\) (both diagonal
and off-diagonal), the equations become completely decoupled, and the
nuclear motion evolves independently on each potential energy surface.
这假设对角修正项也可忽略不计。通过令所有\(C_{kn}=0\)(包括对角和非对角),方程组完全解耦,原子核运动在每个势能面上独立演化。
The result is the standard time-dependent Schrödinger
equation for the nuclei:
由此可得标准的原子核的含时薛定谔方程:
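In standard form (a reconstruction consistent with the discussion above), this equation is:
\[i\hbar\,\frac{\partial \Theta_k}{\partial t} = \left[-\sum_{I}\frac{\hbar^{2}}{2M_I}\nabla_I^{2} + E_k(\vec{R})\right]\Theta_k\]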
This equation is the foundation of most of quantum chemistry. It
states that the nuclei move on a static potential energy surface \(E_k(\vec{R})\) provided by the electrons,
without any possibility of transitioning to other electronic states or
having the surface be corrected by their own motion.
This whiteboard explains the process of calculating the
radial distribution function, often denoted as \(g(r)\), to analyze the atomic structure of
a material, which is referred to here as a “film”.
本白板解释了计算径向分布函数(通常表示为 \(g(r)\))的过程,用于分析材料(本文中称为“薄膜”)的原子结构。
In simple terms, the radial distribution function tells you the
probability of finding an atom at a certain distance from another
reference atom. It’s a powerful way to see the local structure in a
disordered system like a liquid or an amorphous solid.
## Core
Concept: Radial Distribution Function 径向分布函数
The main goal is to compute the radial distribution function, \(g(r)\), which is defined as the ratio of
the actual number of atoms found in a thin shell at a distance \(r\) to the number of atoms you’d expect to
find if the material were an ideal gas (completely random).
主要目标是计算径向分布函数 \(g(r)\),其定义为在距离 \(r\)
的薄壳层中实际发现的原子数与材料为理想气体(完全随机)时预期发现的原子数之比。
The formula is expressed as: \[g(r) = \frac{n(r)}{n_{\text{ideal gas}}(r)}\]
\(n(r)\):
Represents the average number of atoms found in a thin spherical shell
between a distance \(r\) and \(r+dr\) from a central atom.
表示距离中心原子 \(r\) 到 \(r+dr\) 之间的薄球壳中原子的平均数量。
ideal gas: Represents the number of atoms you would
expect in that same shell if the atoms were distributed completely
randomly with the same average density (\(\rho\)). The volume of this shell is
approximately \(4\pi r^2
dr\).表示如果原子完全随机分布且平均密度 (\(\rho\))
相同,则该球壳中原子的数量。该球壳的体积约为 \(4\pi r^2 dr\)。
A peak in the \(g(r)\) plot
indicates a high probability of finding neighboring atoms at that
specific distance, revealing the material’s structural shells (e.g.,
nearest neighbors, second-nearest neighbors, etc.).\(g(r)\)
图中的峰值表示在该特定距离处找到相邻原子的概率很高,从而揭示了材料的结构壳(例如,最近邻、次近邻等)。
## Calculation Method
The board outlines a two-step averaging process to get a
statistically meaningful result from simulation data (a “film” at 20
frames per second); a code sketch follows below.
Average over atoms: In a single frame (a
snapshot in time), you pick one atom as the center. Then, you count how
many other atoms (\(n(r)\)) are in
concentric spherical shells around it. This process is repeated,
treating each atom in the frame as the center, and the results are
averaged.
Average over frames: The entire process
described above is repeated for multiple frames from the simulation or
video. This time-averaging ensures that the final result represents the
typical structure of the material over time, smoothing out random
fluctuations.
The board notes “dx = bin width 0.01Å”, which is a practical detail
for the calculation. To create a histogram, the distance r
is divided into small segments (bins) of 0.01 angstroms.
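The two-step average can be sketched in Python as below. The array shape `(n_frames, N, 3)`, the cubic box of side `L` with the minimum-image convention, and the function name are assumptions for illustration; the bin width follows the board's 0.01 Å.

```python
import numpy as np

def radial_distribution(positions, L, r_max, dr=0.01):
    """positions: array of shape (n_frames, N, 3), coordinates in a cubic box of side L."""
    n_frames, N, _ = positions.shape
    rho = N / L**3
    bins = np.arange(0.0, r_max + dr, dr)
    hist = np.zeros(len(bins) - 1)

    for frame in positions:                               # average over frames
        diff = frame[:, None, :] - frame[None, :, :]      # all pair separations
        diff -= L * np.round(diff / L)                    # minimum-image convention
        dist = np.linalg.norm(diff, axis=-1)
        dist = dist[np.triu_indices(N, k=1)]              # unique pairs only
        hist += np.histogram(dist, bins=bins)[0]

    r = 0.5 * (bins[1:] + bins[:-1])                      # bin centres
    shell_volume = 4.0 * np.pi * r**2 * dr
    n_of_r = 2.0 * hist / (n_frames * N)                  # average n(r) per central atom
    return r, n_of_r / (rho * shell_volume)               # g(r) = n(r) / ideal-gas count
```

Plotting the returned `g` against `r` should then show the first-, second-, and third-shell peaks discussed above.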
## Connection to Experiments
Finally, the whiteboard mentions “frame X-ray
scattering”. This is a crucial point because it connects this
computational analysis to real-world experiments. Experimental
techniques like X-ray or neutron scattering can be used to measure a
quantity called the structure factor, \(S(q)\), which is directly related to the
radial distribution function \(g(r)\)
through a mathematical operation called a Fourier transform. This allows
scientists to directly compare the structure produced in their
simulations with the structure of a real material measured in a lab.
最后,白板上提到了“帧 X
射线散射”。这一点至关重要,因为它将计算分析与实际实验联系起来。X射线或中子散射等实验技术可以用来测量一个称为结构因子\(S(q)\)的量,该量通过傅里叶变换的数学运算与径向分布函数\(g(r)\)直接相关。这使得科学家能够直接将模拟中产生的结构与实验室测量的真实材料结构进行比较。
The board correctly links \(g(r)\)
to X-ray scattering experiments. The quantity measured in these
experiments is the static structure factor, \(S(q)\), which describes how the material
scatters radiation. The relationship between the two is a Fourier
transform: 该板正确地将\(g(r)\)与X射线散射实验联系起来。这些实验中测量的量是静态结构因子\(S(q)\),它描述了材料如何散射辐射。两者之间的关系是傅里叶变换:
\[S(q) = 1 + 4 \pi \rho \int_0^\infty [g(r) -
1] r^2 \frac{\sin(qr)}{qr} dr\] This equation is crucial because
it bridges the gap between computer simulations (which calculate \(g(r)\)) and physical experiments (which
measure \(S(q)\)).
这个方程至关重要,因为它弥合了计算机模拟(计算 \(g(r)\))和物理实验(测量 \(S(q)\))之间的差距。
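As a numerical sketch of this relation (assuming `r` and `g` arrays from the \(g(r)\) calculation and the number density `rho`; the trapezoidal quadrature and the cutoff at the last tabulated \(r\) are simplifications):

```python
import numpy as np

def structure_factor(r, g, rho, q_values):
    S = []
    for q in q_values:
        # sin(qr)/(qr) via np.sinc, which is defined as sin(pi x)/(pi x)
        integrand = (g - 1.0) * r**2 * np.sinc(q * r / np.pi)
        S.append(1.0 + 4.0 * np.pi * rho * np.trapz(integrand, r))
    return np.array(S)
```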
##
2. The Gaussian Distribution: Probability of Particle Position
高斯分布:粒子位置的概率
The board starts with the formula for a one-dimensional
Gaussian (or normal) distribution:
白板首先展示的是一维高斯(或正态)分布的公式:
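The formula itself is not transcribed above; its standard one-dimensional form, in the notation used below, is:
\[P(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,\exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)\]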
This equation describes the probability of finding a particle at a
specific position x after a certain amount of time has
passed. * \(\mu\) (mu)
is the mean or average position. For a simple diffusion
process starting at the origin, the particles spread out symmetrically,
so the average position remains at the origin (\(\mu = 0\)). * \(\sigma^2\) (sigma squared) is the
variance, which measures how spread out the particles
are from the mean position. A larger variance means the particles have,
on average, traveled farther from the starting point.
这个方程描述了经过一定时间后,在特定位置“x”找到粒子的概率。 *
\(\mu\) (mu)
是平均值或平均位置。对于从原点开始的简单扩散过程,粒子对称扩散,因此平均位置保持在原点(\(\mu = 0\))。 * \(\sigma^2\)(sigma 平方)
是方差,用于衡量粒子与平均位置的扩散程度。方差越大,意味着粒子平均距离起点越远。
The note “Black-Scholes” is a side reference. The Black-Scholes
model, famous in financial mathematics for pricing options, uses similar
mathematical principles based on Brownian motion to model the random
fluctuations of stock prices. “Black-Scholes”注释仅供参考。Black-Scholes
模型在金融数学中以期权定价而闻名,它使用基于布朗运动的类似数学原理来模拟股票价格的随机波动。
##
3. Mean Squared Displacement (MSD): Quantifying the Spread 均方位移
(MSD):量化扩散
The core of the board is dedicated to the Mean Squared
Displacement (MSD). This is the primary tool used to measure
how far, on average, particles have moved over a time interval
t. 本版块的核心内容是均方位移
(MSD)。这是用于测量粒子在时间间隔“t”内平均移动距离的主要工具。
The variance \(\sigma^2\) is
formally defined as the average of the squared deviations from the mean:
\[\sigma^2 = \langle x^2(t) \rangle - \langle
x(t) \rangle^2\] * \(\langle x(t)
\rangle\) is the average displacement. As mentioned, for simple
diffusion, \(\langle x(t) \rangle =
0\). * \(\langle x^2(t)
\rangle\) is the average of the square of the
displacement. 方差\(\sigma^2\)的正式定义为与平均值偏差平方的平均值:
\[\sigma^2 = \langle x^2(t) \rangle - \langle
x(t) \rangle^2\] * \(\langle x(t)
\rangle\)是平均位移。如上所述,对于简单扩散,\(\langle x(t) \rangle = 0\)。 * \(\langle x^2(t)
\rangle\)是位移平方的平均值。
Since \(\langle x(t) \rangle = 0\),
the variance is simply equal to the MSD: \[\sigma^2 = \langle x^2(t) \rangle\] 由于
\(\langle x(t) \rangle =
0\),方差等于均方位移 (MSD): \[\sigma^2
= \langle x^2(t) \rangle\]
The crucial insight for a diffusive process is that the MSD
grows linearly with time. The rate of this growth is determined
by the diffusion coefficient, D. The board shows this
relationship for different dimensions: 扩散过程的关键在于MSD
随时间线性增长。其增长率由扩散系数
D决定。白板显示了不同维度下的这种关系:
1D:\(\langle x^2(t)
\rangle = 2Dt\) (Movement along a line) (沿直线运动)
2D: The board has a slight typo or ambiguity with
\(\langle z^2(t) \rangle = 2Dt\). For
2D motion in the x-y plane, the total MSD would be \(\langle r^2(t) \rangle = \langle x^2(t) \rangle +
\langle y^2(t) \rangle = 4Dt\). The note on the board might be
referring to just one component of motion. 白板上的 \(\langle z^2(t) \rangle = 2Dt\)
存在轻微笔误或歧义。对于 x-y 平面上的二维运动,总均方位移 (MSD) 为
\(\langle r^2(t) \rangle = \langle x^2(t)
\rangle + \langle y^2(t) \rangle =
4Dt\)。黑板上的注释可能仅指运动的一个分量。
3D:\(\langle r^2(t)
\rangle = \langle |\vec{r}(t) - \vec{r}(0)|^2 \rangle = 6Dt\)
(Movement in 3D space, which is the most common case in molecular
simulations) (三维空间中的运动,这是分子模拟中最常见的情况) Here,
\(\vec{r}(t)\) is the position vector
of a particle at time t. The quantity \(\langle |\vec{r}(t) - \vec{r}(0)|^2
\rangle\) is the average of the squared distance a particle has
traveled from its initial position \(\vec{r}(0)\). 这里,\(\vec{r}(t)\) 是粒子在时间 t
的位置矢量。 \(\langle |\vec{r}(t) -
\vec{r}(0)|^2 \rangle\) 是粒子从其初始位置 \(\vec{r}(0)\) 行进距离的平方平均值。
##
4. The Einstein Relation: Connecting Microscopic Motion to a Macroscopic
Property 爱因斯坦关系:将微观运动与宏观特性联系起来
Finally, the board presents the famous Einstein
relation, which rearranges the 3D MSD equation to solve for the
diffusion coefficient D:
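The relation, as it also appears on the summary board discussed later, is:
\[D = \lim_{t\to\infty}\frac{\langle |\vec{r}(t) - \vec{r}(0)|^{2}\rangle}{6t}\]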
This is a cornerstone equation in statistical mechanics. It provides
a practical way to calculate a macroscopic property—the
diffusion coefficient D—from the
microscopic movements of individual particles observed in a computer
simulation.
这是统计力学中的一个基石方程。它提供了一种实用的方法,可以通过计算机模拟中观察到的单个粒子的微观运动来计算宏观属性——扩散系数“D”。
In practice, one would (a short code sketch follows this list): 1. Run a simulation of particles.
运行粒子模拟。 2. Track the position of each particle over time.
跟踪每个粒子随时间的位置。 3. Calculate the squared displacement \(|\vec{r}(t) - \vec{r}(0)|^2\) for each
particle at various time intervals t.
计算每个粒子在不同时间间隔“t”的位移平方\(|\vec{r}(t) - \vec{r}(0)|^2\)。 4. Average
this value over all particles to get the MSD, \(\langle |\vec{r}(t) - \vec{r}(0)|^2
\rangle\). 对所有粒子取平均值,得到均方位移 (MSD),即\(\langle |\vec{r}(t) - \vec{r}(0)|^2
\rangle\)。 5. Plot the MSD as a function of time.
将MSD绘制成时间函数。 6. The slope of this line, divided by 6, gives the
diffusion coefficient D. The lim t→∞ indicates
that this linear relationship is most accurate for long time scales,
after initial transient effects have died down.
这条直线的斜率除以6,即扩散系数“D”。“lim
t→∞”表明,在初始瞬态效应消退后,这种线性关系在长时间尺度上最为准确。
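A compact Python sketch of this recipe is shown below. It assumes an unwrapped trajectory `positions` of shape `(n_frames, N, 3)` and a frame spacing `dt`, uses a single time origin for simplicity (production analyses average over many origins), and fits only the later, linear part of the MSD curve.

```python
import numpy as np

def diffusion_from_msd(positions, dt):
    disp = positions - positions[0]                      # r(t) - r(0) for every atom
    msd = np.mean(np.sum(disp**2, axis=-1), axis=1)      # average over all atoms
    t = np.arange(len(msd)) * dt
    start = len(msd) // 2                                # skip the initial transient
    slope, _ = np.polyfit(t[start:], msd[start:], 1)     # linear fit of MSD vs time
    return slope / 6.0                                   # Einstein relation in 3D

# Example (hypothetical trajectory): D = diffusion_from_msd(positions, dt=0.001)
```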
## 5. Right Board: Green-Kubo
Relations
This board introduces a more advanced and powerful method to
calculate transport coefficients like the diffusion coefficient, known
as the Green-Kubo relations.
本面板介绍了一种更先进、更强大的方法来计算扩散系数等传输系数,即Green-Kubo
关系。
###
Velocity Autocorrelation Function (VACF) 速度自相关函数
(VACF)
The key idea is to look at how a particle’s velocity at one point in
time is related to its velocity at a later time. This is measured by the
Velocity Autocorrelation Function (VACF): \[C_{vv}(t) = \langle \vec{v}(t') \cdot
\vec{v}(t' + t) \rangle\] This function tells us how long a
particle “remembers” its velocity. For a typical liquid, the velocity is
quickly randomized by collisions, so the VACF decays to zero rapidly.
其核心思想是考察粒子在某一时间点的速度与其在之后时间点的速度之间的关系。这可以通过速度自相关函数
(VACF)来测量: \[C_{vv}(t) = \langle
\vec{v}(t') \cdot \vec{v}(t' + t) \rangle\]
此函数告诉我们粒子“记住”其速度的时间。对于典型的液体,速度会因碰撞而迅速随机化,因此
VACF 会迅速衰减为零。
### Connecting MSD and
VACF
The board shows the mathematical link between the MSD and the VACF.
Starting with the definition of position as the integral of velocity,
\(\vec{r}(t) = \int_0^t \vec{v}(t')
dt'\), one can show that the MSD is a double integral of the
VACF. The board writes this as: \[\langle
x^2(t) \rangle = \left\langle \left( \int_0^t v(t') dt' \right)
\left( \int_0^t v(t'') dt'' \right) \right\rangle =
\int_0^t dt' \int_0^t dt'' \langle v(t') v(t'')
\rangle\] This shows that the two pictures of motion—the
particle’s displacement (MSD) and its velocity fluctuations (VACF)—are
deeply connected. 该面板展示了 MSD 和 VACF
之间的数学联系。从位置定义为速度的积分开始,\(\vec{r}(t) = \int_0^t \vec{v}(t')
dt'\),可以证明 MSD 是 VACF 的二重积分。黑板上写着: \[\langle x^2(t) \rangle = \left\langle \left(
\int_0^t v(t') dt' \right) \left( \int_0^t v(t'')
dt'' \right) \right\rangle = \int_0^t dt' \int_0^t
dt'' \langle v(t') v(t'') \rangle\]
这表明,粒子运动的两幅图像——粒子的位移(MSD)和速度涨落(VACF)——之间存在着深刻的联系。
###
The Green-Kubo Formula for Diffusion
扩散的格林-久保公式
By combining the Einstein relation with the integral of the VACF, one
arrives at the Green-Kubo formula for the diffusion coefficient: \[D = \frac{1}{3} \int_0^\infty \langle \vec{v}(0)
\cdot \vec{v}(t) \rangle dt\] This incredible result states that
the macroscopic property of diffusion (\(D\)) is determined by the integral of the
microscopic velocity correlations. It’s often a more
efficient way to compute \(D\) in
simulations than calculating the long-time limit of the MSD.
将爱因斯坦关系与VACF积分相结合,可以得到扩散系数的格林-久保公式: \[D = \frac{1}{3} \int_0^\infty \langle \vec{v}(0)
\cdot \vec{v}(t) \rangle dt\]
这个令人难以置信的结果表明,扩散的宏观特性(\(D\))由微观速度关联的积分决定。在模拟中,这通常是计算\(D\)比计算MSD的长期极限更有效的方法。
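In code, once the VACF \(\langle \vec{v}(0)\cdot\vec{v}(t)\rangle\) has been tabulated on a uniform time grid, the formula reduces to a single quadrature (a minimal sketch; `vacf` and `dt` are assumed inputs):

```python
import numpy as np

def diffusion_from_vacf(vacf, dt):
    # D = (1/3) * integral of <v(0).v(t)> dt, evaluated by the trapezoidal rule
    return np.trapz(vacf, dx=dt) / 3.0
```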
##
6. The Grand Narrative: From Micro to Macro 宏大叙事:从微观到宏观
The previous whiteboards gave us two ways to calculate the
diffusion constant, D, from the microscopic random walk
of individual atoms:
之前的白板提供了两种从单个原子的微观随机游动计算扩散常数
D的方法: 1. Einstein Relation: From the
long-term slope of the Mean Squared Displacement (MSD). 根据均方位移
(MSD) 的长期斜率。 2. Green-Kubo Relation: From the
integral of the Velocity Autocorrelation Function (VACF).
根据速度自相关函数 (VACF) 的积分。
This new whiteboard shows how that single microscopic parameter,
D, governs the large-scale, observable process of diffusion
described by Fick’s Laws and the Diffusion
Equation. 这块新的白板展示了单个微观参数 D
如何控制菲克定律和扩散方程所描述的大规模可观测扩散过程。
## 1. The
Starting Point: A Liquid’s Structure 起点:液体的结构
The plot on the top left is the Radial Distribution Function,
\(g(r)\), which we discussed
in detail from the first whiteboard. 左上角的图是径向分布函数
\(g(r)\),我们在第一个白板上详细讨论过它。
The Plot: It shows the characteristic structure of
a liquid. The peaks are labeled “1st”, “2nd”, and “3rd”, corresponding
to the first, second, and third solvation shells
(layers of neighboring atoms).
它显示了液体的特征结构。峰分别标记为“第一”、“第二”和“第三”,分别对应于第一、第二和第三溶剂化壳层(相邻原子层)。
The Limit: The note lim r→∞ g(r) = 1
confirms that at large distances, the liquid has no long-range order, as
expected.注释“lim r→∞ g(r) =
1”证实了在远距离下,液体没有长程有序,这与预期一致。
System Parameters: The values T = 0.71
and ρ = 0.844 are the temperature and density of the
simulated system (likely in reduced or “Lennard-Jones” units) for which
this \(g(r)\) was calculated. 值“T =
0.71”和“ρ =
0.844”分别是模拟系统的温度和密度(可能采用约化或“Lennard-Jones”单位),用于计算此
\(g(r)\)。
This section sets the stage: we are looking at the dynamics within a
system that has this specific liquid-like structure.
本节奠定了基础:我们将研究具有特定类液体结构的系统内的动力学。
## 2. The
Macroscopic Laws of Diffusion 宏观扩散定律
The bottom-left and top-right sections introduce the continuum
equations that describe how concentration changes in space and time.
左下角和右上角部分介绍了描述浓度随空间和时间变化的连续方程。
### Fick’s First Law
菲克第一定律
\[\vec{J} = -D \nabla C\] This is
Fick’s first law of diffusion. It states that there is a
flux of particles (\(\vec{J}\)), meaning a net flow. This flow
is directed from high concentration to low concentration (hence the
minus sign) and its magnitude is proportional to the
concentration gradient (\(\nabla C\)).
这是菲克第一扩散定律。它指出存在粒子的通量 (\(\vec{J}\)),即净流量。该流量从高浓度流向低浓度(因此带有负号),其大小与浓度梯度
(\(\nabla C\)) 成正比。
The Crucial Link: The proportionality constant is
D, the very same diffusion constant we
calculated from the microscopic random walk (MSD/VACF). This is the key
connection: the collective result of countless individual random walks
is a predictable net flow of particles.
比例常数是D,与我们根据微观随机游走 (MSD/VACF)
计算出的扩散常数完全相同。这是关键的联系:无数个体随机游动的集合结果是可预测的粒子净流。
###
The Diffusion Equation (Fick’s Second Law)
扩散方程(菲克第二定律)
\[\frac{\partial C(\vec{r},t)}{\partial t}
= D \nabla^2 C(\vec{r},t)\] This is the diffusion
equation, one of the most important equations in physics and
chemistry (also called the heat equation, as noted). It’s derived from
Fick’s first law and the principle of mass conservation (\(\frac{\partial C}{\partial t} + \nabla \cdot
\vec{J} = 0\)). It’s a differential equation that tells you
exactly how the concentration at any point, \(C(\vec{r},t)\), will change over time.
这就是扩散方程,它是物理学和化学中最重要的方程之一(也称为热方程)。它源于菲克第一定律和质量守恒定律(\(\frac{\partial C}{\partial t} + \nabla \cdot
\vec{J} = 0\))。它是一个微分方程,可以精确地告诉你任意一点的浓度
\(C(\vec{r},t)\) 随时间的变化。
##
3. The Solution: Connecting Back to the Random Walk
与随机游动联系起来
This is the most beautiful part. The board shows the solution to the
diffusion equation for a very specific scenario, linking the macroscopic
equation directly back to the microscopic random walk.
黑板上展示了一个非常具体场景下扩散方程的解,将宏观方程直接与微观随机游动联系起来。
### The Initial
Condition 初始条件
The problem is set up by assuming all particles start at a single
point at time zero: \[C(\vec{r}, 0) =
\delta(\vec{r})\] This is a Dirac delta
function, representing an infinitely concentrated point source
at the origin. 问题假设所有粒子在时间零点处从一个点开始: \[C(\vec{r}, 0) = \delta(\vec{r})\]
这是一个狄拉克函数,表示一个在原点处无限集中的点源。
###
The Fundamental Solution (Green’s Function)
基本解(格林函数)
The solution to the diffusion equation with this starting condition
is called the fundamental solution or Green’s
function. For one dimension, it is: \[C(x,t) = \frac{1}{\sqrt{4\pi Dt}}
\exp\left(-\frac{x^2}{4Dt}\right)\]
The “Aha!” Moment: This is a Gaussian
distribution. Let’s compare it to the formula from the second
whiteboard: * The mean is \(\mu=0\).
均值为 \(\mu=0\)。 * The variance is
\(\sigma^2 = 2Dt\). 方差为 \(\sigma^2 = 2Dt\)。
This is an incredible result. The macroscopic diffusion equation
predicts that a concentration pulse will spread out over time, and the
shape of the concentration profile will be a Gaussian curve. The width
of this curve, measured by its variance \(\sigma^2\), is exactly the Mean
Squared Displacement, \(\langle x^2(t)
\rangle\), of the individual random-walking particles.
宏观扩散方程预测浓度脉冲会随时间扩散,浓度分布的形状将是高斯曲线。这条曲线的宽度,用其方差
\(\sigma^2\)
来衡量,恰好是单个随机游动粒子的均方位移 \(\langle x^2(t) \rangle\)。
This perfectly unites the two perspectives: * Microscopic微观
(Board 2): Particles undergo a random walk, and their average
squared displacement from the origin grows as \(\langle x^2(t) \rangle = 2Dt\).
粒子进行随机游动,它们相对于原点的平均平方位移随着 \(\langle x^2(t) \rangle = 2Dt\)
的增长而增长。 * Macroscopic宏观 (This Board): A
collection of these particles, described by a continuum concentration
C, spreads out in a Gaussian profile whose variance is
\(\sigma^2 = 2Dt\).
这些粒子的集合,用连续浓度“C”来描述,呈方差为 \(\sigma^2 = 2Dt\) 的高斯分布。
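A small numerical sketch can make this micro-macro link concrete: simulate many independent 1D random walkers, check that the variance of their positions grows as \(2Dt\), and compare with the Gaussian solution of the diffusion equation. The step size, time step, and walker count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_walkers, n_steps, dt, step = 100_000, 1_000, 1.0, 1.0
D = step**2 / (2.0 * dt)                         # 1D relation <x^2> = 2 D t

x = np.zeros(n_walkers)
for _ in range(n_steps):
    x += rng.choice([-step, step], size=n_walkers)   # unbiased random walk

t = n_steps * dt
print("measured variance :", x.var())            # should be close to 2 D t
print("predicted 2*D*t   :", 2.0 * D * t)

# Gaussian concentration profile predicted by the diffusion equation at time t
xs = np.linspace(-4.0 * np.sqrt(2 * D * t), 4.0 * np.sqrt(2 * D * t), 200)
C = np.exp(-xs**2 / (4.0 * D * t)) / np.sqrt(4.0 * np.pi * D * t)
```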
This whiteboard serves as an excellent
summary, pulling together all the key concepts we’ve discussed into a
single, cohesive picture. Let’s connect everything on this slide to our
detailed conversation.
1. RDF: The Static
Structure RDF静态结构
On the top left, you see RDF (Radial Distribution
Function).
The Plots: The board shows the familiar \(g(r)\) plot with its characteristic peaks
for a liquid. Below it is a plot of the interatomic potential energy,
\(V(r)\). This addition is very
insightful! It shows why the first peak in \(g(r)\) exists: it corresponds to the
minimum energy distance (\(\sigma\))
where particles are most stable and likely to be found.
白板展示了我们熟悉的\(g(r)\)图,它带有液体的特征峰。下方是原子间势能\(V(r)\)的图。这个补充非常有见地!它解释了为什么
\(g(r)\)
中的第一个峰值存在:它对应于粒子最稳定且最有可能被发现的最小能量距离
(\(\sigma\))。
Connection: This section summarizes our first
discussion. It’s the starting point for our analysis—a static snapshot
of the material’s average atomic arrangement before we consider how the
atoms move.
本节总结了我们的第一个讨论。这是我们分析的起点——在我们考虑原子如何运动之前,它是材料平均原子排列的静态快照。
2.
MSD and The Einstein Relation: The Displacement Picture 均方位移 (MSD)
和爱因斯坦关系:位移图像
The board then moves to dynamics, presenting two methods to calculate
the diffusion constant, D. The first is the
Einstein relation. 两种计算扩散常数
D的方法。第一种是爱因斯坦关系。
The Formula: It correctly states that the Mean
Squared Displacement (MSD), \(\langle r^2
\rangle\), is equal to \(6Dt\)
in three dimensions. It then rearranges this to solve for \(D\): 它正确地指出了均方位移 (MSD),\(\langle r^2 \rangle\),在三维空间中等于
\(6Dt\)。然后重新排列该公式以求解 \(D\): \[D =
\lim_{t\to\infty} \frac{\langle |\vec{r}(t) - \vec{r}(0)|^2
\rangle}{6t}\]
The Diagram: The central diagram beautifully
illustrates the concept. It shows a particle in a simulation box (with
“N=108” likely being the number of particles simulated) moving from an
initial position \(\vec{r}_i(0)\) to a
final position \(\vec{r}_i(t_j)\). The
MSD is the average of the square of this displacement over all particles
and many time origins. The graph labeled “MSD” shows how you would plot
this data and find the slope (“fitting”) to calculate \(D\).
中间的图表完美地阐释了这个概念。它展示了一个粒子在模拟框中(“N=108”
可能是模拟粒子的数量)从初始位置 \(\vec{r}_i(0)\) 移动到最终位置 \(\vec{r}_i(t_j)\)。MSD
是该位移平方在所有粒子和多个时间原点上的平均值。标有“MSD”的图表显示了如何绘制这些数据并找到斜率(“拟合”)来计算
\(D\)。
Connection: This is a perfect summary of the
“Displacement Picture” we analyzed on the second whiteboard. It’s the
most intuitive way to think about diffusion: how far particles spread
out over
time.这完美地总结了我们在第二个白板上分析的“位移图”。这是思考扩散最直观的方式:粒子随时间扩散的距离。
3.
The Green-Kubo Relation: The Fluctuation Picture
格林-久保关系:涨落图
Finally, the board presents the more advanced but often more
practical method: the Green-Kubo relation.
The Equations: This section displays the two key
equations from our last discussion:
The MSD written as the double integral of the Velocity Autocorrelation
Function (VACF). 均方位移 (MSD) 表示为速度自相关函数 (VACF) 的二重积分。
The Green-Kubo expression for \(D\) as the time integral of the VACF. 扩散系数 \(D\) 的格林-久保表达式,即 VACF 的时间积分。
The Diagram: The small diagram of a square with
axes \(t'\) and \(t''\) visually represents the
two-dimensional domain of integration for the double integral.
一个带有轴 \(t'\) 和 \(t''\)
的小正方形图直观地表示了二重积分的二维积分域。
Connection: This summarizes the “Fluctuation
Picture.” It shows the mathematical heart of the derivation that proves
the Einstein and Green-Kubo methods are equivalent. As we concluded,
this method is often numerically superior because it involves
integrating a rapidly decaying function (the VACF) rather than finding
the slope of a noisy, unbounded function (the MSD).
这概括了“涨落图”。它展示了证明爱因斯坦方法和格林-久保方法等价的推导过程的数学核心。正如我们总结的那样,这种方法通常在数值上更胜一筹,因为它涉及对快速衰减函数(VACF)进行积分,而不是求噪声无界函数(MSD)的斜率。
In essence, this single whiteboard is a complete roadmap for
analyzing diffusion in a molecular simulation. It shows how to first
characterize the material’s structure (\(g(r)\)) and then how to compute its key
dynamic property—the diffusion constant
D—using two powerful, interconnected methods.
本质上,这块白板就是分子模拟中分析扩散的完整路线图。它展示了如何首先表征材料的结构(\(g(r)\)),然后如何使用两种强大且相互关联的方法计算其关键的动态特性——扩散常数
D。
This whiteboard beautifully concludes the derivation of the
Green-Kubo relation, showing the final formulas and how they are used in
practice. It provides the punchline to the mathematical story we’ve been
following.
Let’s break down the details.
4. Finalizing the Derivation
The top lines of the board show the final step in connecting the Mean
Squared Displacement (MSD) to the Velocity Autocorrelation Function
(VACF).
The Left Side: As we know from the Einstein
relation, the long-time limit of the derivative of the 1D MSD,
\(\lim_{t\to\infty} \frac{d\langle x^2
\rangle}{dt}\), is simply equal to \(2D\).
The Right Side: This is the result of the
mathematical derivation from the previous slide. It shows that this same
quantity is also equal to twice the total integral of the VACF.
By equating these two, we can solve for the diffusion coefficient,
D.
5. The Velocity
Autocorrelation Function (VACF)
The board explicitly names the key quantity here:
\[\Phi(\tau) = \langle V_x(0) V_x(\tau)
\rangle\]
This is the “Velocity autocorrelation function”
(abbreviated as VAF on the board), which we’ve denoted as VACF. The
variable has been changed from t to τ (tau) to
represent a “time lag” or interval, which is common notation.
The Plot: The graph on the board shows a typical
plot of the VACF, \(\Phi(\tau)\),
versus the time lag \(\tau\).
It starts at a maximum positive value at \(\tau=0\) (when the velocity is perfectly
correlated with itself).
It rapidly decays towards zero as the particle undergoes collisions
that randomize its velocity.
The Integral: The shaded area under this curve
represents the value of the integral \(\int_0^\infty \Phi(\tau) d\tau\). The
Green-Kubo formula states that the diffusion coefficient is directly
proportional to this area.
6. The
Green-Kubo Formulas for the Diffusion Coefficient
After canceling the factor of 2, the board presents the final,
practical formulas for D.
In 1 Dimension:\[D =
\int_0^\infty d\tau \langle V_x(0) V_x(\tau) \rangle\]
In 3 Dimensions: This is the more general and
useful formula. \[D = \frac{1}{3}
\int_0^\infty d\tau \langle \vec{v}(0) \cdot \vec{v}(\tau)
\rangle\] There are two important changes for 3D:
We use the full velocity vectors and their dot
product, \(\vec{v}(0) \cdot
\vec{v}(\tau)\), to capture motion in all directions.
We divide by 3 to get the average contribution to
diffusion in any one direction (x, y, or z).
7. Practical Calculation
in a Simulation
The last formula on the board shows how this is implemented in a
computer simulation with a finite number of atoms.
\(\sum_{i=1}^{N}\): This
summation symbol indicates that you must compute the
VACF for each individual atom (from atom i=1 to
atom N).
\(\frac{1}{N}\):
You then average the results over all N
atoms in your simulation box.
\(\langle \dots
\rangle\): The angle brackets here still imply an
additional average over multiple different starting times
(t=0) to get good statistics.
This formula is the practical recipe: to get the diffusion
coefficient, you track the velocity of every atom, calculate each one’s
VACF, average them together, and then integrate the result over
time.
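A sketch of this recipe in Python, assuming a velocity array of shape `(n_frames, N, 3)` and a frame spacing `dt` (the maximum lag and the names are illustrative):

```python
import numpy as np

def green_kubo_diffusion(velocities, dt, max_lag):
    n_frames, N, _ = velocities.shape
    vacf = np.zeros(max_lag)
    for lag in range(max_lag):
        # <v(t0) . v(t0 + lag)>: dot product per atom, averaged over atoms
        # and over all available time origins t0
        dots = np.sum(velocities[: n_frames - lag] * velocities[lag:], axis=-1)
        vacf[lag] = dots.mean()
    D = np.trapz(vacf, dx=dt) / 3.0          # Green-Kubo integral in 3D
    return D, vacf

# Example (hypothetical data): D, vacf = green_kubo_diffusion(velocities, 0.001, 2000)
```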