
MSDM 5054 - Statistical Machine Learning-L6


Lecturer: Prof. XIA DONG

1. Linear Model Selection and Regularization

Summary of Core Concepts

This lecture covers Chapter 6: Linear Model Selection and Regularization, focusing specifically on Section 6.1: Subset Selection.

  • The Problem: You have a dataset with many potential predictor variables (features). If you include all of them (like Model 1 with \(p\) predictors in slide ...221320.png), you risk including “noise” variables. These irrelevant features can decrease model accuracy (overfitting) and make the model difficult to interpret.

  • The Goal: Identify a smaller subset of variables that are truly related to the response. This creates a simpler, more interpretable, and often more accurate model (like Model 2 with \(q\) predictors).

  • The Main Method Discussed: Best Subset Selection

    This is an exhaustive search algorithm. It checks every possible combination of predictors to find the “best” model. With \(p\) variables, this means checking \(2^p\) models in total.

    The algorithm (from slide ...221333.png) works in three steps:

    1. Step 1: Fit the “null model” \(M_0\), which has no predictors (it just predicts the average of the response).

    2. Step 2: For each \(k\) (from 1 to \(p\)):

      • Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors. (e.g., fit all models with 1 predictor, then all models with 2 predictors, etc.).

      • From this group, select the single best model for that size \(k\). This “best” model is the one with the highest \(R^2\) (or lowest RSS - Residual Sum of Squares) on the training data. Call this model \(M_k\).

    3. Step 3: You now have \(p+1\) models: \(M_0, M_1, \dots, M_p\). You must select the single best one from this list. To do this, you cannot use training \(R^2\) (as it will always pick the biggest model \(M_p\)). Instead, you must use a metric that estimates test error, such as:

      • Cross-Validation (CV) (this is what the Python code uses)
      • AIC (Akaike Information Criterion)
      • BIC (Bayesian Information Criterion)
      • Adjusted \(R^2\)
  • Key Takeaway: The slides show that this “subset selection” concept can be applied beyond linear models. The Python code demonstrates this by applying best subset selection to a K-Nearest Neighbors (KNN) Regressor, a non-linear model. A minimal sketch of the three-step algorithm follows below.
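
As a concrete reference, here is a minimal sketch of the three steps for ordinary linear regression (the lecture’s actual code applies the same idea to a KNN regressor with parallel evaluation, analyzed later). The function name best_subset_selection and the use of scikit-learn are illustrative assumptions, not the slide code.

from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def best_subset_selection(X, y, cv=5):
    # Step 1 (the null model M_0) is omitted for brevity.
    candidates = []                          # will hold M_1, ..., M_p
    for k in range(1, X.shape[1] + 1):       # Step 2: one best model per size k
        best_subset, best_r2 = None, -np.inf
        for subset in combinations(X.columns, k):
            r2 = LinearRegression().fit(X[list(subset)], y).score(X[list(subset)], y)
            if r2 > best_r2:                 # same size, so training R^2 is a fair comparison
                best_subset, best_r2 = subset, r2
        candidates.append(best_subset)
    # Step 3: compare M_1, ..., M_p by estimated test error (cross-validation)
    cv_mse = [-cross_val_score(LinearRegression(), X[list(s)], y, cv=cv,
                               scoring='neg_mean_squared_error').mean()
              for s in candidates]
    return candidates[int(np.argmin(cv_mse))]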

Mathematical Understanding & Key Questions

This section directly answers the questions posed on your slides.

How to compare which model is better?

(From slides ...221320.png and ...221326.png)

You cannot use training error (like \(R^2\) or RSS) to compare models with different numbers of predictors. A model with more predictors will almost always have a better training score, even if those extra predictors are just noise. This is called overfitting.

To compare models of different sizes (like Model 1 vs. Model 2, or \(M_2\) vs. \(M_5\)), you must use a method that estimates test error (how the model performs on new, unseen data). The slides mention:

  • Cross-Validation (CV): This is the gold standard. You split your data into “folds,” train the model on some folds, and test it on the remaining fold. You repeat this and average the test scores. The model with the best (e.g., lowest) average CV error is chosen.

  • AIC & BIC: These are mathematical adjustments to the training error (like RSS) that add a penalty for having more predictors. They balance model fit with model complexity.

Why use \(R^2\) in Step 2?

(From slide ...221333.png)

In Step 2, you are only comparing models of the same size (i.e., all models that have exactly \(k\) predictors). For models with the same number of parameters, a higher \(R^2\) (or lower RSS) on the training data directly corresponds to a better fit. You don’t need to penalize for complexity because all models being compared have the same complexity.

Why can’t we use training error in Step 3?

(From slide ...221333.png)

In Step 3, you are comparing models of different sizes (\(M_0\) vs. \(M_1\) vs. \(M_2\), etc.). As you add predictors, the training \(R^2\) will always go up (or stay the same), and the training RSS will always go down (or stay the same). If you used \(R^2\) to pick the best model in Step 3, you would always pick the most complex model \(M_p\), which is almost certainly overfit.

Therefore, you must use a metric that estimates test error (like CV) or penalizes for complexity (like AIC, BIC, or Adjusted \(R^2\)) to find the right balance between fit and simplicity.

Code Analysis

The Python code (slides ...221249.jpg and ...221303.jpg) implements the Best Subset Selection algorithm using KNN Regression.

Key Functions

  • main():
    1. Loads Data: Reads the Credit.csv file.
    2. Preprocesses Data:
      • Converts categorical features (‘Gender’, ‘Student’, ‘Married’, ‘Ethnicity’) into numerical ones (dummy variables).
      • Creates the feature matrix X and target variable y (‘Balance’).
      • Scales the features using StandardScaler. This is crucial for KNN, which is sensitive to the scale of features.
    3. Adds Noise (in the second example): Slide ...221303.jpg shows code that adds 20 new “noisy” columns to the data. This is to test if the selection algorithm is smart enough to ignore them.
    4. Runs Selection: Calls best_subset_selection_parallel to do the main work.
    5. Prints Results: Finds the best subset (lowest error) and prints the top 20 best-performing subsets.
    6. Final Evaluation: It re-trains a KNN model on only the best subset and calculates the final cross-validated RMSE.
  • evaluate_subset(subset, ...):
    • This is the “worker” function. It’s called for every single possible subset.
    • It takes a subset (a list of feature names, e.g., ['Income', 'Limit']).
    • It creates a new X_subset containing only those columns.
    • It runs 5-fold cross-validation (cross_val_score) on a KNN model using this X_subset.
    • It uses 'neg_mean_squared_error' as the metric. This is negative MSE; a higher score (closer to 0) is better.
    • It returns the subset and its average CV score.
  • best_subset_selection_parallel(model, ...):
    • This is the “manager” function.
    • It iterates from k=1 up to the total number of features.
    • For each k, it generates all combinations of features of that size (this is the \(\binom{p}{k}\) part).
    • It uses Parallel and delayed (from joblib) to run evaluate_subset for all these combinations in parallel, speeding up the process significantly.
    • It collects all the results and returns them. A sketch of this worker/manager pair is shown below.
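
A hedged sketch of the worker/manager pair is below; the slide code may differ in argument names and details, and the column name 'CV_Score' is chosen to match the sorting step described in main().

from itertools import combinations

import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.model_selection import cross_val_score

def evaluate_subset(subset, X, y, model, cv=5):
    # Worker: 5-fold CV score (negative MSE) for one feature subset
    scores = cross_val_score(model, X[list(subset)], y, cv=cv,
                             scoring='neg_mean_squared_error')
    return subset, np.mean(scores)

def best_subset_selection_parallel(model, X, y, n_jobs=-1):
    # Manager: enumerate all C(p, k) subsets for every k and farm them out
    results = []
    for k in range(1, X.shape[1] + 1):
        subsets = list(combinations(X.columns, k))
        results += Parallel(n_jobs=n_jobs)(
            delayed(evaluate_subset)(s, X, y, model) for s in subsets)
    return pd.DataFrame(results, columns=['Subset', 'CV_Score'])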

Analysis of the Output

  • Slide ...221255.png (Original Data):
    • The code runs subset selection on the original dataset.
    • The “Top 20 Best Feature Subsets” are shown. The CV scores are negative (they are neg_mean_squared_error), so the scores closest to zero (smallest magnitude) are best.
    • The Best feature subset is found to be ('Income', 'Limit', 'Rating', 'Student').
    • The final cross-validated RMSE for this model is 105.41.
  • Slide ...221309.png (Data with 20 Noisy Variables):
    • The code is re-run after adding 20 useless “Noisy” features.
    • The algorithm still works. It correctly identifies that the “Noisy” variables are useless.
    • The Best feature subset is now ('Income', 'Limit', 'Student'). (Note: ‘Rating’ was dropped, likely because it’s highly correlated with ‘Limit’, and the noisy data made the simpler model perform slightly better in CV).
    • The final RMSE is 114.94. This is higher than the original 105.41, which is expected: the presence of so many noise variables makes the selection problem harder, but the final model is still good and, most importantly, it successfully excluded all 20 noisy features.

Conceptual Overview: The “Why”

The slides cover Chapter 6: Linear Model Selection and Regularization, which is all about a fundamental trade-off in machine learning: the bias-variance trade-off.

  • The Problem (Slide ...221320.png): Imagine you have a dataset with 50 predictors (\(p=50\)). You want to predict a response \(y\).

    • Model 1 (Full Model): You use all 50 predictors. This model is very flexible. It will fit the training data extremely well, resulting in a low bias. However, it’s highly likely that many of those 50 predictors are just “noise” (random, unrelated variables). By fitting to this noise, the model will be overfit. When you show it new, unseen data (the test data), it will perform poorly. This is called high variance.
    • Model 2 (Subset Model): You intelligently select only the 3 predictors (\(q=3\)) that are actually related to \(y\). This model is less flexible. It won’t fit the training data as perfectly as Model 1 (it has higher bias). But, because it’s not fitting the noise, it will generalize much better to new data. It will have a much lower variance, and thus a lower overall test error.
  • The Goal: The goal is to find the model that has the lowest test error. We need a formal method to find the best subset (like Model 2) without just guessing.

  • Two Main Strategies (Slide ...221314.png):

    1. Subset Selection (Section 6.1): This is what we’re focused on. It’s an “all-or-nothing” approach. You either keep a variable in the model or you discard it completely. The “Best Subset Selection” algorithm is the most extreme, “brute-force” way to do this.

    2. Shrinkage/Regularization (Section 6.2): This is a more subtle approach (e.g., Ridge Regression, LASSO). Instead of discarding variables, you keep all \(p\) variables but add a penalty to the model that “shrinks” the coefficients (\(\beta\)) of the useless variables towards zero.

Questions 🎯

Q1: “How to compare which model is better?”

(From slides ...221320.png and ...221326.png)

This is the most important question. You cannot use metrics based on training data (like \(R^2\) or RSS - Residual Sum of Squares) to compare models with different numbers of predictors.

  • The Trap: A model with more predictors will always have a higher \(R^2\) (or lower RSS) on the data it was trained on. \(R^2\) will always increase as you add variables, even if they are pure noise. If you used \(R^2\) to compare a 3-predictor model to a 10-predictor model, the 10-predictor model would always look better on paper, even if it’s terribly overfit.

  • The Correct Way: You must use a metric that estimates the test error. The slides and code show two ways:

    1. Cross-Validation (CV): This is the method used in your Python code. It works by:
      • Splitting your training data into \(k\) “folds” (e.g., 5 folds).
      • Training the model on 4 folds and testing it on the 5th fold.
      • Repeating this 5 times, so each fold gets to be the test set once.
      • Averaging the 5 test errors. This gives you a robust estimate of how your model will perform on unseen data. You then choose the model with the best (lowest) average CV error.
    2. Mathematical Adjustments (AIC, BIC, Adjusted \(R^2\)): These are formulas that take the training error (like RSS) and add a penalty for each predictor (\(k\)) you add.
      • \(AIC \approx RSS + 2k\sigma^2\)
      • \(BIC \approx RSS + \log(n)k\sigma^2\) A model with more predictors (larger \(k\)) gets a bigger penalty. To be chosen, a more complex model must significantly improve the RSS to overcome this penalty. A small illustration of comparing models by CV instead of training \(R^2\) follows below.
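
For illustration, here is a quick comparison of two nested models by cross-validation rather than training \(R^2\). X and y are assumed to be the preprocessed features and target, and the 'Noise1'–'Noise3' columns are hypothetical.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

small = ['Income', 'Limit', 'Student']             # candidate 3-predictor model
large = small + ['Noise1', 'Noise2', 'Noise3']     # same model plus noise columns

for cols in (small, large):
    mse = -cross_val_score(LinearRegression(), X[cols], y, cv=5,
                           scoring='neg_mean_squared_error').mean()
    print(len(cols), 'predictors, CV MSE =', round(mse, 1))
# Training R^2 would always favour the larger model; its CV MSE need not be lower.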

Q2: “Why using \(R^2\) for step 2?”

(From slide ...221333.png)

Step 2 of the “Best Subset Selection” algorithm says: “For \(k = 1, \dots, p\): Fit all \(\binom{p}{k}\) models… Pick the best model, that with the largest \(R^2\), … and call it \(M_k\).”

  • The Reason: In Step 2, you are only comparing models of the same size. For example, when \(k=3\), you are comparing all possible 3-predictor models:
    • Model A: (\(X_1, X_2, X_3\))
    • Model B: (\(X_1, X_2, X_4\))
    • Model C: (\(X_1, X_3, X_5\))
    • …and so on.
    Since all these models have the exact same complexity (they all have \(k=3\) predictors), there is no risk of unfairly favoring a more complex model. Therefore, you are free to use a training metric like \(R^2\) (or RSS). The model with the highest \(R^2\) is, by definition, the one that best fits the training data for that specific size \(k\).

Q3: “Cannot use training error in Step 3.” Why not?

(From slide ...221333.png)

Step 3 says: “Select a single best model from \(M_0, M_1, \dots, M_p\) by cross validation, AIC, or BIC.”

  • The Reason: In Step 3, you are now comparing models of different sizes. You are comparing the best 1-predictor model (\(M_1\)) vs. the best 2-predictor model (\(M_2\)) vs. the best 3-predictor model (\(M_3\)), and so on, all the way up to \(M_p\).

    As explained in Q1, if you used a training error metric like \(R^2\) here, the \(R^2\) would just keep going up, and you would always select the largest, most complex model, \(M_p\). This completely defeats the purpose of model selection.

    Therefore, in Step 3, you must use a method that estimates test error (like Cross-Validation) or one that penalizes for complexity (like AIC or BIC) to find the “sweet spot” model that balances fit and simplicity.

Mathematical Deep Dive 🧮

  • \(Y = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p + \epsilon\): The full linear model. The goal of subset selection is to find the subset of \(X_j\)’s with \(\beta_j \neq 0\) and set all other \(\beta\)’s to 0.
  • \(2^p\) combinations: (Slide ...221333.png) This is the total number of models you have to check. For each of the \(p\) variables, you have two choices: either it is IN the model or it is OUT.
    • Example: \(p=3\) (variables \(X_1, X_2, X_3\))
    • The \(2^3 = 8\) possible models are:
      1. {} (The null model, \(M_0\))
      2. { \(X_1\) }
      3. { \(X_2\) }
      4. { \(X_3\) }
      5. { \(X_1, X_2\) }
      6. { \(X_1, X_3\) }
      7. { \(X_2, X_3\) }
      8. { \(X_1, X_2, X_3\) } (The full model, \(M_3\))
    • This is why this method is called an “exhaustive search”. It literally checks every single one. For \(p=20\), \(2^{20}\) is over a million models!
  • \(\binom{p}{k} = \frac{p!}{k!(p-k)!}\): (Slide ...221333.png) This is the “combinations” formula. It tells you how many models you fit in Step 2 for a specific \(k\).
    • Example: \(p=10\) total predictors.
    • For \(k=1\): You fit \(\binom{10}{1} = 10\) models.
    • For \(k=2\): You fit \(\binom{10}{2} = \frac{10 \times 9}{2 \times 1} = 45\) models.
    • For \(k=3\): You fit \(\binom{10}{3} = \frac{10 \times 9 \times 8}{3 \times 2 \times 1} = 120\) models.
    • …and so on. The sum of all these \(\binom{p}{k}\) from \(k=0\) to \(k=p\) equals \(2^p\), as the short check below confirms.
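
A quick numeric check of these counts (math.comb requires Python 3.8 or newer):

from math import comb

p = 10
counts = [comb(p, k) for k in range(p + 1)]
print(counts[1], counts[2], counts[3])   # 10, 45, 120, as computed above
print(sum(counts), 2 ** p)               # both equal 1024
print(2 ** 20)                           # 1048576 models when p = 20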

Detailed Code Analysis 💻

Your slides show Python code that applies the Best Subset Selection algorithm to a KNN Regressor. This is a great example of how the selection algorithm is independent of the model type (as mentioned in slide ...221314.png).

Key Functions

  • main()
    1. Load & Preprocess: Reads Credit.csv. The most important step here is converting categorical text (like ‘Male’/‘Female’) into numbers (1/0).
    2. Scale Data: scaler = StandardScaler() and X_scaled = scaler.fit_transform(X).
      • WHY? This is CRITICAL for KNN. KNN works by measuring distance. If ‘Income’ (e.g., 50,000) is on a vastly different scale than ‘Cards’ (e.g., 3), the ‘Income’ feature will completely dominate the distance calculation, making ‘Cards’ irrelevant. Scaling resizes all features to have a mean of 0 and standard deviation of 1, so they all contribute fairly.
    3. Handle Noisy Data (Slide ...221303.jpg): This version of the code intentionally adds 20 columns of useless, random numbers. This is a test to see if the algorithm is smart enough to ignore them.
    4. Run Selection: results_df = best_subset_selection_parallel(...). This function does all the heavy lifting (explained next).
    5. Find Best Model: results_df.sort_values(by='CV_Score', ascending=False).
      • WHY ascending=False? The code uses the metric 'neg_mean_squared_error'. This is MSE, but negative (e.g., -15000). A better model has an error closer to 0 (e.g., -10000). Since -10000 is greater than -15000, you sort in descending (high-to-low) order to put the best models at the top.
    6. Final Evaluation (Step 3): final_scores = cross_val_score(knn, X_best, y, ...)
      • This is the implementation of Step 3. It takes only the single best subset (X_best) and runs a new cross-validation on it. This gives a final, unbiased estimate of how good that one model is.
    7. Print RMSE: final_rmse = np.sqrt(-final_scores). It converts the negative MSE back into a positive RMSE (Root Mean Squared Error), which is in the same units as the target \(y\) (in this case, ‘Balance’ in dollars). A sketch of this final evaluation step appears after this list.
  • best_subset_selection_parallel(model, ...)
    1. This is the “manager” function. It implements the loop from Step 2.
    2. for k in range(1, n_features + 1): This is the loop “For \(k = 1, \dots, p\)”.
    3. subsets = list(combinations(feature_names, k)): This generates the \(\binom{p}{k}\) combinations for the current \(k\).
    4. results = Parallel(n_jobs=n_jobs)(...): This is a non-core, “speed-up” command. It uses the joblib library to run the evaluations on all your computer’s CPU cores at once (in parallel). Without this, checking millions of models would take days.
    5. subset_scores = ... [delayed(evaluate_subset)(...) ...] This line farms out the actual work to the evaluate_subset function for every single subset.
  • evaluate_subset(subset, ...)
    1. This is the “worker” function. It gets called thousands or millions of times.
    2. Its job is to evaluate one single subset (e.g., ('Income', 'Limit', 'Student')).
    3. X_subset = X[list(subset)]: It slices the data to get only these columns.
    4. scores = cross_val_score(model, X_subset, ...): This is the most important line. It takes the subset and performs a full 5-fold cross-validation on it.
    5. return (subset, np.mean(scores)): It returns the subset and its average CV score.
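
A hedged sketch of the final evaluation step, assuming the preprocessed Credit DataFrame from main(); the subset shown is the one reported for the original data, and n_neighbors=5 is an assumption.

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

best_subset = ['Income', 'Limit', 'Rating', 'Student']
X_best = pd.DataFrame(StandardScaler().fit_transform(Credit[best_subset]),
                      columns=best_subset)        # KNN needs comparable feature scales
y = Credit['Balance']

knn = KNeighborsRegressor(n_neighbors=5)          # 5 neighbors is an assumption
final_scores = cross_val_score(knn, X_best, y, cv=5,
                               scoring='neg_mean_squared_error')
final_rmse = np.sqrt(-final_scores).mean()        # back to the units of Balance (dollars)
print('Cross-validated RMSE:', round(final_rmse, 2))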

Summary of Outputs (Slides ...221255.png & ...221309.png)

  • Original Data (Slide ...221255.png):
    • Best Subset: ('Income', 'Limit', 'Rating', 'Student')
    • Final RMSE: ~105.4
  • Data with 20 “Noisy” Variables (Slide ...221309.png):
    • Best Subset: ('Income', 'Limit', 'Student')
    • Result: The algorithm successfully identified that all 20 “Noisy” variables were useless and excluded every single one of them from the best models.
    • Final RMSE: ~114.9
    • Key Takeaway: The RMSE is slightly higher, which makes sense because the selection problem was much harder. But the method worked perfectly. It filtered all the “noise” and found a simple, powerful model, just as the theory on slide ...221320.png predicted.

2. The Core Problem: Training Error vs. Test Error

The central theme of these slides is finding the “best” model. The problem is that a model with more predictors (more complex) will always fit the data it was trained on better. This is a trap.

  • Training Error: How well the model fits the data we used to build it. \(R^2\) and \(RSS\) measure this.
  • Test Error: How well the model predicts new, unseen data. This is what we actually care about. A model that is too complex (e.g., has 10 predictors when only 3 are useful) will have low training error but very high test error. This is called overfitting.

The goal is to choose a model that has the lowest test error. The metrics below (Adjusted \(R^2\), AIC, BIC) are all attempts to estimate this test error without having to actually collect new data. They do this by adding a penalty for complexity.

Basic Metrics (Measures of Fit)

These formulas from slide 13 describe how well a model fits the training data.

Residual (Error)

  • Formula: \(\hat{\epsilon}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_j x_{ij}\)
  • Concept: This is the most basic building block. It’s the difference between the actual observed value (\(y_i\)) and the value your model predicted (\(\hat{y}_i\)). It is the “error” for a single data point.

Residual Sum of Squares (RSS)

  • Formula: \(RSS = \sum_{i=1}^{n} \hat{\epsilon}_i^2\)
  • Concept: This is the overall measure of model error. You square all the individual errors (residuals) to make them positive and then add them all up.
  • Goal: The entire process of linear regression (called “Ordinary Least Squares”) is designed to find the \(\hat{\beta}\) coefficients that make this RSS value as small as possible.
  • The Flaw: \(RSS\) will always decrease (or stay the same) as you add more predictors (\(p\)). A model with all 10 predictors will have a lower \(RSS\) than a model with 9, even if that 10th predictor is useless. Therefore, \(RSS\) is useless for choosing between models of different sizes.

R-squared (\(R^2\))

  • Formula: \(R^2 = 1 - \frac{SS_{error}}{SS_{total}} = 1 - \frac{RSS}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\)
  • Concept: This metric reframes \(RSS\) into a more interpretable percentage.
    • \(SS_{total}\) (the denominator) represents the total variance of the data. It’s the error you would get if your “model” was just guessing the average value (\(\bar{y}\)) for every single observation.
    • \(SS_{error}\) (the \(RSS\)) is the error after using your model.
    • \(R^2\) is the “proportion of total variance explained by the model.” An \(R^2\) of 0.75 means your model can explain 75% of the variation in the response variable.
  • The Flaw: Just like \(RSS\), \(R^2\) will always increase (or stay the same) as you add more predictors. This is visually confirmed in Figure 6.1, where the red line for \(R^2\) only goes up. It will always pick the most complex model. (The snippet below translates these fit measures directly into code.)
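
A direct numpy translation of these formulas, assuming arrays y (observed values) and y_hat (fitted values) from some regression:

import numpy as np

resid = y - y_hat                          # epsilon_hat_i = y_i - y_hat_i
RSS = np.sum(resid ** 2)                   # residual sum of squares
SS_total = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
R2 = 1 - RSS / SS_total                    # proportion of variance explained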

Advanced Metrics (For Model Selection)

These metrics “fix” the flaw of \(R^2\) by including a penalty for the number of predictors.

Adjusted \(R^2\)

  • Formula: \[ \text{Adjusted } R^2 = 1 - \frac{RSS / (n - p - 1)}{SS_{total} / (n - 1)} \]
  • Mathematical Concept: This formula replaces the “Sum of Squares” (\(SS\)) with “Mean Squares” (\(MS\)).
    • \(MS_{error} = \frac{RSS}{n-p-1}\)
    • \(MS_{total} = \frac{SS_{total}}{n-1}\)
  • The “Penalty” Explained: The penalty is degrees of freedom.
    • \(n\) = number of data points.
    • \(p\) = number of predictors.
    • The term \(n-p-1\) is the degrees of freedom for the residuals. You start with \(n\) data points, but you “use up” one degree of freedom to estimate the intercept (\(\hat{\beta}_0\)) and \(p\) more to estimate the \(p\) slopes.
  • How it Works:
    1. When you add a new predictor (increase \(p\)), \(RSS\) goes down, which makes the numerator (\(MS_{error}\)) smaller.
    2. …But, increasing \(p\) also decreases the denominator (\(n-p-1\)), which makes the numerator (\(MS_{error}\)) larger.
    • This creates a “tug-of-war.” If the new predictor is useful, it will drop \(RSS\) a lot, and Adjusted \(R^2\) will increase. If the new predictor is useless, \(RSS\) will barely change, and the penalty from decreasing the denominator will win, causing Adjusted \(R^2\) to decrease.
  • Goal: You select the model with the highest Adjusted \(R^2\). (A small helper implementing this formula is sketched below.)
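
A small helper for the formula above, assuming RSS, SS_total, the sample size n, and the number of predictors p are known for the fitted model:

def adjusted_r2(RSS, SS_total, n, p):
    ms_error = RSS / (n - p - 1)       # mean squared residual (residual degrees of freedom)
    ms_total = SS_total / (n - 1)      # total mean square
    return 1 - ms_error / ms_total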

Akaike Information Criterion (AIC)

  • General Formula: \(AIC = -2 \log \ell(\hat{\theta}) + 2d\)
  • Concept Breakdown:
    • \(\ell(\hat{\theta})\): This is the Maximized Likelihood Function.
      • The Likelihood Function \(\ell(\theta)\) asks: “Given a set of model parameters \(\theta\), how probable is the data we observed?”
      • The Maximum Likelihood Estimate (MLE) \(\hat{\theta}\) is the specific set of parameters (the \(\hat{\beta}\)’s) that maximizes this probability.
    • \(\log \ell(\hat{\theta})\): The log-likelihood. This is just a number that represents the best possible fit the model can achieve for the data. A higher number is a better fit.
    • \(-2 \log \ell(\hat{\theta})\): This is the Deviance. Since a higher log-likelihood is better, a lower deviance is better. This term measures poorness-of-fit.
    • \(d\): The number of parameters estimated by the model. (e.g., \(p\) predictors + 1 intercept).
    • \(2d\): This is the Penalty Term.
  • How it Works: \(AIC = (\text{Poorness-of-Fit}) + (\text{Complexity Penalty})\). As you add predictors, the fit gets better (the deviance term goes down), but the penalty term (\(2d\)) goes up.
  • Goal: You select the model with the lowest AIC.

Bayesian Information Criterion (BIC)

  • General Formula: \(BIC = -2 \log \ell(\hat{\theta}) + \log(n)d\)
  • Concept: This is mathematically identical to AIC, but the penalty term is different.
    • AIC Penalty: \(2d\)
    • BIC Penalty: \(\log(n)d\)
  • Comparison:
    • \(n\) is the number of observations in your dataset.
    • As long as your dataset has 8 or more observations (\(n \ge 8\)), \(\log(n)\) will be greater than 2.
    • This means BIC applies a much harsher penalty for complexity than AIC.
  • Consequence: BIC will tend to choose simpler models (fewer predictors) than AIC.
  • Goal: You select the model with the lowest BIC. (The two penalties are compared numerically below.)
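
The only difference is the penalty per parameter: 2 for AIC versus \(\log(n)\) for BIC. A quick check for a few sample sizes:

from math import log

for n in (8, 100, 400):
    print(n, round(log(n), 2))   # 2.08, 4.61, 5.99 -- all larger than AIC's penalty of 2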

The Deeper Theory: Why AIC Works

Slide 27 (“Understanding AIC”) gives the deep mathematical justification.

  • Goal: We have a true, unknown process \(p\) that generates our data. We are creating a model \(\hat{p}_j\). We want our model to be as “close” to the truth as possible.
  • Kullback-Leibler (K-L) Distance: This is a function \(K(p, \hat{p}_j)\) that measures the “information lost” when you use your model \(\hat{p}_j\) to approximate the truth \(p\). You want to minimize this distance.
  • The Math:
    1. \(K(p, \hat{p}_j) = \int p(y) \log \left( \frac{p(y)}{\hat{p}_j(y)} \right) dy\)
    2. This splits into: \(K(p, \hat{p}_j) = \underbrace{\int p(y) \log(p(y)) dy}_{\text{Constant}} - \underbrace{\int p(y) \log(\hat{p}_j(y)) dy}_{\text{This is what we need to maximize}}\)
  • The Problem: We can’t calculate that second term because it requires knowing the true function \(p\).
  • Akaike’s Insight: Akaike proved that the log-likelihood we can calculate, \(\log \ell(\hat{\theta})\), is a biased estimator of that target. He also proved that the bias is approximately \(-d\).
  • The Solution: An unbiased estimate of the target is \(\log \ell(\hat{\theta}) - d\).
  • Final Step: For historical and statistical reasons, he multiplied this by \(-2\) to create the final AIC formula.
  • Conclusion: AIC is not just a random formula. It is a carefully derived estimate of how much information your model loses compared to the “truth” (i.e., its expected performance on new data).

AIC/BIC for Linear Regression

Slide 26 shows how these general formulas simplify for linear regression (assuming normal, Gaussian errors).

  • General Formula: \(AIC = -2 \log \ell(\hat{\theta}) + 2d\)
  • Linear Regression Formula: \(AIC = \frac{1}{n\hat{\sigma}^2}(RSS + 2d\hat{\sigma}^2)\)

Key Insight: For linear regression, the “poorness-of-fit” term (\(-2 \log \ell(\hat{\theta})\)) is directly proportional to the \(RSS\).

This makes it much easier to understand. You can just think of the formulas as: * AIC \(\approx\) \(RSS + 2d\hat{\sigma}^2\) * BIC \(\approx\) \(RSS + \log(n)d\hat{\sigma}^2\)

(Here \(\hat{\sigma}^2\) is an estimate of the error variance, which can often be treated as a constant).

This clearly shows the trade-off: We want a model with a low \(RSS\) (good fit) and a low \(d\) (low complexity). These two goals are in direct competition.

Mallow’s \(C_p\): The slide notes that \(C_p\) is equivalent to AIC for linear regression. The \(C_p\) formula is \(C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2_{full})\), where \(\hat{\sigma}^2_{full}\) is the error variance estimated from the full model. Since \(n\) and \(\hat{\sigma}^2_{full}\) are constants, minimizing \(C_p\) is mathematically identical to minimizing \(RSS + 2d\hat{\sigma}^2_{full}\), which is the same logic as AIC.
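
A sketch of RSS-based stand-ins for these criteria; each is defined only up to a positive constant (\(1/(n\hat{\sigma}^2)\) or \(1/n\) on the slides), which does not change which model attains the minimum.

from math import log

def aic_like(RSS, d, sigma2_hat):
    # AIC up to constants: fit term plus a 2 * d penalty
    return RSS + 2 * d * sigma2_hat

def bic_like(RSS, d, n, sigma2_hat):
    # BIC up to constants: fit term plus a log(n) * d penalty
    return RSS + log(n) * d * sigma2_hat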


3. Variable Selection

Core Concept: The Problem of Variable Selection

In regression, we want to model a response variable \(Y\) using a set of \(p\) predictor variables \(X_1, X_2, ..., X_p\).

  • The “Kitchen Sink” Problem: A common temptation is to include all available predictors in the model: \[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon\] This often leads to overfitting. The model may fit the training data well but will perform poorly on new, unseen data. It’s also hard to interpret a model with dozens of predictors.

  • The Solution: Subset Selection. The goal is to find a smaller subset of the predictors that builds a model that is:

    1. Accurate: Has low prediction error.
    2. Parsimonious: Uses the fewest predictors necessary.
    3. Interpretable: Is simple enough for a human to understand.

Your slides present two main methods to achieve this: Best Subset Selection and Forward Stepwise Selection.

Method 1: Best Subset Selection (BSS)

This is the “brute force” approach. It considers every single possible model.

Conceptual Algorithm

  1. Fit all models with \(k=1\) predictor (there are \(p\) of these). Find the best one (lowest RSS) and call it \(M_1\).
  2. Fit all models with \(k=2\) predictors (there are \(\binom{p}{2}\) of these). Find the best one and call it \(M_2\).
  3. Fit the one model with \(k=p\) predictors (the full model), \(M_p\).
  4. You now have a list of \(p\) “best” models: \(M_1, M_2, ..., M_p\).
  5. Use a selection criterion (like Adjusted \(R^2\), BIC, AIC, or \(C_p\)) to choose the single best model from this list.

Mathematical & Computational Cost (from slide 225641.png)

  • For each predictor, there are two possibilities: it’s either IN the model or OUT.
  • With \(p\) predictors, the total number of models to test is \(2 \times 2 \times ... \times 2\) (\(p\) times).
  • Total Models = \(2^p\)
  • This is a “combinatorial explosion.” As the slide notes, if \(p=20\), \(2^{20} = 1,048,576\) models. This is computationally infeasible for large \(p\).

Method 2: Forward Stepwise Selection (FSS)

This is a “greedy” algorithm. It’s an efficient alternative to BSS that does not test every model.

Conceptual Algorithm (from slides 225645.png & 225648.png)

  • Step 1: Start with the null model, \(M_0\), which has no predictors. \[M_0: Y = \beta_0 + \epsilon\] The prediction is just the sample mean of \(Y\).

  • Step 2 (Iterative):

    • For \(k=0\) (to get \(M_1\)): Fit all \(p\) models that add one predictor to \(M_0\). Choose the best one (lowest RSS or highest \(R^2\)). This is \(M_1\). Let’s say it contains \(X_1\).
    • For \(k=1\) (to get \(M_2\)): Keep \(X_1\) in the model. Fit all \(p-1\) models that add one more predictor to \(M_1\) (e.g., \(M_1+X_2\), \(M_1+X_3\), …). Choose the best of these. This is \(M_2\).
    • Repeat: Continue this process, adding one variable at a time, until all \(p\) predictors are in the model \(M_p\).
  • Step 3: You now have a sequence of \(p+1\) models: \(M_0, M_1, ..., M_p\). Choose the single best model from this sequence using Adjusted \(R^2\), AIC, BIC, or \(C_p\).

Mathematical & Computational Cost (from slide 225651.png)

  • To find \(M_1\), you fit \(p\) models.
  • To find \(M_2\), you fit \(p-1\) models.
  • To find \(M_p\), you fit \(1\) model.
  • The null model \(M_0\) is 1 model.
  • Total Models = \(1 + \sum_{k=0}^{p-1} (p-k) = 1 + p + (p-1) + ... + 1 = 1 + \frac{p(p+1)}{2}\)
  • As the slide notes, if \(p=20\), this is only \(1 + 20(21)/2 = 211\) models. This is vastly more efficient than BSS.
  • Key weakness: The method is “greedy.” If it adds \(X_1\) in Step 1, it can never be removed. It’s possible the true best 2-variable model is \((X_2, X_3)\), but if FSS chose \(X_1\) as the best 1-variable model, it will never find \((X_2, X_3)\). A minimal sketch of this greedy loop follows below.
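
A minimal sketch of the greedy forward loop, assuming a DataFrame X and Series y and using training RSS as the within-step criterion; the helper names are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

def rss(model, X, y):
    return float(np.sum((y - model.predict(X)) ** 2))

def forward_stepwise(X, y):
    selected, remaining, path = [], list(X.columns), []
    while remaining:
        # add the single variable that lowers training RSS the most
        best_var = min(remaining,
                       key=lambda v: rss(LinearRegression().fit(X[selected + [v]], y),
                                         X[selected + [v]], y))
        selected.append(best_var)
        remaining.remove(best_var)
        path.append(list(selected))   # this is M_1, M_2, ..., M_p
    return path                       # then compare these models by CV, AIC, or BIC

# Only 1 + p(p+1)/2 fits in total: 211 for p = 20, versus 2**20 > 1,000,000.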

4. How to Choose the “Best” Model: The Criteria

You can’t use RSS or \(R^2\) to compare models with different numbers of predictors (\(k\)). This is because RSS always decreases (and \(R^2\) always increases) as you add more variables. You must use a criterion that penalizes complexity.

  • RSS (Residual Sum of Squares): Goal is to minimize. \[RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] Good for comparing models of the same size \(k\).

  • Adjusted R-squared (\(Adj. R^2\)): Goal is to maximize. \[Adj. R^2 = 1 - \frac{(1-R^2)(n-1)}{n-p-1}\] This “adjusts” \(R^2\) by adding a penalty for having more predictors (\(p\)). Adding a useless predictor will make \(Adj. R^2\) go down.

  • Mallow’s \(C_p\): Goal is to minimize. \[C_p \approx \frac{1}{n}(RSS + 2p\hat{\sigma}^2)\] Here, \(\hat{\sigma}^2\) is an estimate of the error variance from the full model (with all \(p\) predictors). A good model will have \(C_p \approx p\).

  • AIC (Akaike Information Criterion) & BIC (Bayesian Information Criterion): Goal is to minimize. \[AIC = 2p - 2\ln(\hat{L})\] \[BIC = p\ln(n) - 2\ln(\hat{L})\] Here, \(\hat{L}\) is the maximized likelihood of the model. You don’t need to calculate this by hand; software provides it.

    • Key difference: BIC’s penalty for \(p\) is \(p\ln(n)\), while AIC’s is \(2p\). Since \(\ln(n)\) is almost always \(> 2\) (for \(n>7\)), BIC applies a much heavier penalty for complexity.
    • This means BIC tends to choose smaller, more parsimonious models than AIC or \(Adj. R^2\). (The snippet below shows how these criteria can be read off a fitted statsmodels model.)
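
For models fitted with statsmodels, these criteria can be read directly off the fitted results object. A sketch, assuming a preprocessed numeric Credit DataFrame (as in the code analyzed in the next section):

import statsmodels.api as sm

X_const = sm.add_constant(Credit[['Income', 'Limit', 'Cards', 'Student']])
fit = sm.OLS(Credit['Balance'], X_const).fit()

print('RSS      :', fit.ssr)            # residual sum of squares
print('Adj. R^2 :', fit.rsquared_adj)   # maximize
print('AIC      :', fit.aic)            # minimize
print('BIC      :', fit.bic)            # minimize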

5. Python Code Analysis (Slide 225546.jpg)

This slide shows the Python code for Best Subset Selection (BSS).

# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from itertools import combinations # <-- This is the BSS engine

Block 1: Load the Credit dataset

# 1. Load the Credit dataset
Credit = pd.read_csv('Credit.csv')
Credit['ID'] = Credit['ID'].astype(str)
(num_samples, num_predictors) = Credit.shape

# Convert categorical text data to numerical (dummy variables)
Credit['Gender'] = Credit['Gender'].map({'Male': 1, 'Female': 0})
Credit['Student'] = Credit['Student'].map({'Yes': 1, 'No': 0})
Credit['Married'] = Credit['Married'].map({'Yes': 1, 'No': 0})
Credit['Ethnicity'] = Credit['Ethnicity'].map({'Asian': 1, 'Caucasian': 1, 'African American': 0})
  • pd.read_csv: Reads the data into a pandas DataFrame.
  • .map(): This is a crucial preprocessing step. Regression models require numbers, not text like ‘Yes’ or ‘Male’. This line converts those strings into 1s and 0s.

Block 2: Plot scatterplot matrix

# 2. Plot scatterplot matrix
selected_columns = ['Balance', 'Education', 'Age', 'Cards', 'Rating', 'Limit', 'Income']
sns.set(style="ticks")
sns.pairplot(Credit[selected_columns], diag_kind='kde')
plt.suptitle('Scatterplot Matrix', y=1.02)
plt.show()
  • sns.pairplot: A powerful visualization from the seaborn library. The resulting plot (right side of the slide) is a grid.
    • Diagonal plots (kde): Show the distribution (Kernel Density Estimate) of a single variable (e.g., ‘Balance’ is skewed right).
    • Off-diagonal plots (scatter): Show the relationship between two variables (e.g., ‘Limit’ and ‘Rating’ are almost perfectly linear). This helps you visually spot potentially strong predictors.

Block 3: Best Subset Selection

# 3. Best Subset Selection
# (The slide truncates this block; the loop below fills in the remaining logic.)

# Define the target and the candidate predictors (drop the non-numeric ID column)
target = 'Balance'
predictors = [col for col in Credit.columns if col not in (target, 'ID')]
nvmax = 10  # Max number of predictors to test (up to 10)

# Initialize a list to store statistics for every fitted model
model_stats = []

# Iterate over the number of predictors from 1 to nvmax
for k in range(1, nvmax + 1):

    # Generate all possible combinations of predictors of size k
    # This is the core of BSS
    for subset in list(combinations(predictors, k)):

        # Get the design matrix (X)
        X_subset = Credit[list(subset)]

        # Add a constant (intercept) term to the model
        # Y = B0 + B1*X1 -> statsmodels needs B0 to be added manually
        X_subset_const = sm.add_constant(X_subset)

        # Get the target variable (y)
        y_target = Credit[target]

        # Fit the Ordinary Least Squares (OLS) model
        model = sm.OLS(y_target, X_subset_const).fit()

        # Calculate RSS and the penalized selection criteria
        RSS = ((model.resid) ** 2).sum()
        model_stats.append({'k': k, 'subset': subset, 'RSS': RSS,
                            'Adj_R_squared': model.rsquared_adj,
                            'AIC': model.aic, 'BIC': model.bic})
  • for k in range(1, nvmax + 1): This is the outer loop that iterates from \(k=1\) (1 predictor) to \(k=10\) (10 predictors).
  • list(combinations(predictors, k)): This is the inner loop and the most important line. The itertools.combinations function is a highly efficient way to generate all unique subsets.
    • When \(k=1\), it returns [('Income',), ('Limit',), ('Rating',), ...].
    • When \(k=2\), it returns [('Income', 'Limit'), ('Income', 'Rating'), ('Limit', 'Rating'), ...].
    • This is what generates the \(2^p\) (or in this case, \(\sum_{k=1}^{10} \binom{p}{k}\)) models to test.
  • sm.add_constant(X_subset): Your regression equation is \(Y = \beta_0 + \beta_1X_1\). The \(X_1\) is your X_subset. The sm.add_constant function adds a column of 1s to your data, which allows the statsmodels library to estimate the \(\beta_0\) (intercept) term.
  • sm.OLS(y_target, X_subset_const).fit(): This fits the Ordinary Least Squares (OLS) model, which finds the \(\beta\) coefficients that minimize the RSS.
  • model.resid: This attribute of the fitted model contains the residuals (\(e_i = y_i - \hat{y}_i\)) for each data point.
  • ((model.resid) ** 2).sum(): This line is the direct code implementation of the formula \(RSS = \sum e_i^2\). Once model_stats is filled, the per-size best models can be summarized as sketched below.
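
Once model_stats has been filled by the loop above, pandas can pick out the best model for each size k (the “red line” in the plots) and then apply a penalized criterion across sizes. A sketch, assuming each entry stores k, subset, RSS, Adj_R_squared, AIC and BIC as in the completed loop:

import pandas as pd

stats_df = pd.DataFrame(model_stats)

# Best (lowest-RSS) model for each number of predictors k
best_per_k = stats_df.loc[stats_df.groupby('k')['RSS'].idxmin()]

# Final choice across sizes: minimize BIC (or AIC), maximize Adj_R_squared
print(best_per_k.sort_values('BIC').head(1))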

Synthesizing the Results (The Plots)

After running the BSS code, you get the data used in the plots and the table.

  • Image 225550.png (Adjusted R-squared)

    • Goal: Maximize.
    • What it shows: The gray dots are all the models tested for each \(k\). The red line connects the single best model for each \(k\).
    • Conclusion: The plot shows a sharp “elbow.” The \(Adj. R^2\) increases dramatically up to \(k=4\), then increases very slowly. The maximum is around \(k=6\) or \(k=7\), but the gain after \(k=4\) is minimal.
  • Image 225554.png (BIC)

    • Goal: Minimize.
    • What it shows: BIC heavily penalizes complexity.
    • Conclusion: The plot shows a very clear minimum. The BIC value plummets from \(k=2\) to \(k=3\) and hits its lowest point at \(k=4\). After \(k=4\), the penalty for adding more variables is larger than the benefit in model fit, so the BIC score starts to rise. This is a very strong vote for the 4-predictor model.
  • Image 225635.png (Mallow’s \(C_p\))

    • Goal: Minimize.
    • What it shows: A very similar story to BIC.
    • Conclusion: The \(C_p\) value drops significantly and hits its minimum at \(k=4\).
  • Image 225638.png (Summary Table)

    • This is the most important image for the final conclusion. It summarizes the red line from all the plots.
    • Look at the row for Num_Predictors = 4. The predictors are (Income, Limit, Cards, Student).
    • Now look at the columns for BIC and Cp.
      • BIC: 4841.615607. This is the lowest value in the entire BIC column (the value at \(k=3\) is 4865.352851).
      • Cp: 7.122228. This is also the lowest value in the Cp column.
    • The Adj_R_squared at \(k=4\) is 0.953580, which is very close to its maximum of ~0.954 at \(k=7-10\).

Final Conclusion: All three “penalized” criteria (Adjusted \(R^2\), BIC, and \(C_p\)) point to the same conclusion. While \(Adj. R^2\) is a bit ambiguous, BIC and \(C_p\) provide a clear signal that the best, most parsimonious model is the 4-predictor model using Income, Limit, Cards, and Student.

6. Subset Selection

Summary of Subset Selection

These slides introduce subset selection, a process in statistical learning used to identify the best subset of predictors (variables) for a regression model. The goal is to find a model that has low prediction error and avoids overfitting by excluding irrelevant variables.

The slides cover two main “greedy” (stepwise) algorithms and the criteria used to select the final best model.

Stepwise Selection Algorithms

Instead of testing all \(2^p\) possible models (which is “best subset selection” and computationally unfeasible), stepwise methods build a single path of models.

Forward Stepwise Selection

This is an additive (bottom-up) approach:

  1. Start with the null model (no predictors).
  2. Find the best 1-variable model (the one that gives the lowest Residual Sum of Squares, or RSS).
  3. Add the single variable that, when added to the current model, results in the new best model (lowest RSS).
  4. Repeat this process until all \(p\) predictors are in the model.
  5. This generates a sequence of \(p+1\) models, from \(\mathcal{M}_0\) to \(\mathcal{M}_p\).

Backward Stepwise Selection

This is a subtractive (top-down) approach:

  1. Start with the full model containing all \(p\) predictors.
  2. Find the best \((p-1)\)-variable model by removing the single variable that results in the lowest RSS (or highest \(R^2\)). This variable is considered the least significant.
  3. Remove the next variable that, when removed from the current best model, gives the new best model.
  4. Repeat until only the null model remains.
  5. This also generates a sequence of \(p+1\) models.

Pros and Cons (Backward Selection)

  • Pro: Computationally efficient compared to best subset. It fits \(1 + \sum_{k=0}^{p-1}(p-k) = \mathbf{1 + p(p+1)/2}\) models, which is much less than \(2^p\). (e.g., for \(p=20\), it’s 211 models vs. >1 million).
  • Con: Cannot be used if \(p > n\) (more predictors than observations), because the initial full model cannot be fit.
  • Con (for both): These methods are greedy. A variable added in forward selection is never removed, and a variable removed in backward selection is never added back. This means they are not guaranteed to find the true best model.

Choosing the Final Best Model

Both forward and backward selection give you a set of candidate models (e.g., the best 1-variable model, best 2-variable model, etc.). You must then choose the single best one. The slides show two main approaches:

A. Direct Error Estimation

Use a validation set or cross-validation (CV) to estimate the test error for each model (e.g., the 1-variable, 2-variable… models). Choose the model with the lowest estimated test error.

B. Adjusted Metrics (Penalizing for Complexity)

Standard RSS and \(R^2\) will always improve as you add variables, leading to overfitting. Instead, use metrics that penalize the model for having too many predictors.

  • Mallows’ \(C_p\): An estimate of test Mean Squared Error (MSE). \[C_p = \frac{1}{n} (RSS + 2d\hat{\sigma}^2)\] (where \(d\) is the number of predictors, and \(\hat{\sigma}^2\) is an estimate of the error variance). You want to find the model with the minimum \(C_p\).

  • BIC (Bayesian Information Criterion): \[BIC = \frac{1}{n} (RSS + \log(n)d\hat{\sigma}^2)\] BIC’s penalty \(\log(n)\) is stronger than \(C_p\)’s (or AIC’s) penalty of \(2\), so it tends to select smaller (more parsimonious) models. You want to find the model with the minimum BIC.

  • Adjusted \(R^2\): \[R^2_{adj} = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}\] (where \(TSS\) is the Total Sum of Squares). Unlike \(R^2\), this metric can decrease if adding a variable doesn’t help enough. You want to find the model with the maximum Adjusted \(R^2\).
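
As a quick sketch of how these criteria are computed in practice (assuming you already have the training RSS and TSS of a candidate model with d predictors, n observations, and an estimate sigma2_hat of the error variance; all names are illustrative):

# Minimal sketch: evaluate the three penalized criteria for one candidate model
import numpy as np

def cp_bic_adj_r2(rss, tss, n, d, sigma2_hat):
    cp = (rss + 2 * d * sigma2_hat) / n
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2

# Compute these for each model size k = 1..p, then pick the size with minimum C_p or
# BIC, or maximum adjusted R^2 (sigma2_hat is usually estimated from the full model).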

Python Code Understanding

The slides use the regsubsets() function from the leaps package in R.

# R Code from slides
library(leaps)
# Forward Selection
regfit.fwd <- regsubsets(Balance~., data=Credit, method="forward", nvmax=11)
# Backward Selection
regfit.bwd <- regsubsets(Balance~., data=Credit, method="backward", nvmax=11)

In Python, the standard tool for this is SequentialFeatureSelector from scikit-learn.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split

# Assume 'Credit' is a pandas DataFrame with 'Balance' as the target
X = Credit.drop('Balance', axis=1)
y = Credit['Balance']

# Initialize the linear regression estimator
model = LinearRegression()

# --- Forward Selection ---
# direction='forward' starts with 0 features and adds them
# To get the best 4-variable model, for example:
sfs_forward = SequentialFeatureSelector(
    model,
    n_features_to_select=4,
    direction='forward',
    cv=None  # or use cross-validation, e.g., cv=10
)
sfs_forward.fit(X, y)
print("Forward selection best 4 features:")
print(sfs_forward.get_feature_names_out())


# --- Backward Selection ---
# direction='backward' starts with all features and removes them
sfs_backward = SequentialFeatureSelector(
    model,
    n_features_to_select=4,
    direction='backward',
    cv=None
)
sfs_backward.fit(X, y)
print("\nBackward selection best 4 features:")
print(sfs_backward.get_feature_names_out())

# Note: To replicate the plots, you would loop this process,
# changing 'n_features_to_select' from 1 to p,
# record the model scores (e.g., RSS, AIC, BIC) at each step,
# and then plot the results.
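
One way to carry out the loop described in the note above (a sketch that scores each model size by 10-fold cross-validated MSE rather than training RSS; it reuses X, y, model, and SequentialFeatureSelector from the block above):

# Sketch: best k-variable model for k = 1, ..., p-1, scored by 10-fold CV
import numpy as np
from sklearn.model_selection import cross_val_score

cv_mse_by_size = {}
for k in range(1, X.shape[1]):  # the full model with all p predictors can be scored separately
    sfs = SequentialFeatureSelector(model, n_features_to_select=k, direction='forward')
    sfs.fit(X, y)
    X_k = sfs.transform(X)
    scores = cross_val_score(model, X_k, y, cv=10, scoring='neg_mean_squared_error')
    cv_mse_by_size[k] = -scores.mean()

best_k = min(cv_mse_by_size, key=cv_mse_by_size.get)
print(f"Best model size by 10-fold CV: {best_k} variables")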

Important Images

  1. Slide ...230014.png (Forward Selection Plots) & ...230036.png (Backward Selection Plots):

    • What they are: These \(2 \times 2\) plot grids are the most important visuals. They show Residual Sum of Squares (RSS), Adjusted \(R^2\), BIC, and Mallows’ \(C_p\) plotted against the Number of Variables.
    • Why they’re important: They are the decision-making tool. You use these plots to choose the best model.
      • You look for the “elbow” or minimum value for BIC and \(C_p\).
      • You look for the “peak” or maximum value for Adjusted \(R^2\).
      • (RSS is not used for selection as it always decreases).
  2. Slide ...230040.png (Find the best model):

    • What it is: This slide shows a close-up of the \(C_p\), BIC, and Adjusted \(R^2\) plots, with the “best” model (the min/max) marked with a blue ‘x’.
    • Why it’s important: It explicitly states the selection criteria. The text highlights that BIC suggests a 4-variable model, while the other two are “rather flat” after 4, making the choice less obvious but pointing to a simple model.
  3. Slide ...230045.png (BIC vs. Validation vs. CV):

    • What it is: This shows three plots for selecting the best model using different criteria: BIC, Validation Set Error, and Cross-Validation Error.
    • Why it’s important: It shows that different selection criteria can lead to different “best” models. Here, BIC (a mathematical adjustment) picks a 4-variable model, while validation and CV (direct error estimation) both pick a 6-variable model.

The slides use the Credit dataset to demonstrate two key tasks: 1. Running different subset selection algorithms (forward, backward, best). 2. Using various statistical metrics (BIC, \(C_p\), CV error) to choose the single best model.

Comparing Selection Algorithms (The Path)

This part of the example compares the sequence of models selected by “Forward Stepwise” selection versus “Best Subset” selection.

Key Result (from Table 6.1):

This table is the most important result for comparing the algorithms.

| Variables | Best Subset | Forward Stepwise |
|---|---|---|
| one | rating | rating |
| two | rating, income | rating, income |
| three | rating, income, student | rating, income, student |
| four | cards, income, student, limit | rating, income, student, limit |

Summary of this result:

  • Identical for 1, 2, and 3 variables: Both methods agree on the best one-variable model (rating), the best two-variable model (rating, income), and the best three-variable model (rating, income, student).
  • They Diverge at 4 variables:
    • Forward selection is greedy. It started with rating, income, student and was “stuck” with them. It then added limit, as that was the best variable to add to its existing 3-variable model.
    • Best subset selection is not greedy. It tests all possible 4-variable combinations. It discovered that the model cards, income, student, limit has a slightly lower RSS than the model forward selection found.
  • Main Takeaway: This demonstrates the limitation of a greedy algorithm. Forward selection missed the “true” best 4-variable model because it was locked into its previous choices and couldn’t “swap out” rating for cards.

Choosing the Single Best Model (The Destination)

This is the most critical part of the analysis. After running a selection algorithm (like forward, backward, or best subset), you get a list of the “best” models for each size (best 1-variable, best 2-variable, etc.). Now you must decide: is the best model the 4-variable one, the 6-variable one, or another?

The slides show several plots to help make this decision, all plotted against the “Number of Predictors.”

Summary of Plot Results:

Here’s what each plot tells you:

  • Residual Sum of Squares (RSS) (e.g., in slide ...230014.png, top-left)
    • What it shows: RSS always decreases as you add more variables. It drops sharply until 4 variables, then flattens out.
    • Conclusion: This plot is not useful for picking the best model because it will always pick the full model, which is overfit. It’s only used to see the diminishing returns of adding new variables.
  • Adjusted \(R^2\) (e.g., in slide ...230040.png, right)
    • What it shows: This metric penalizes adding useless variables. The plot rises quickly, then flattens, peaking at its maximum value around 6 or 7 variables.
    • Conclusion: This metric suggests a 6 or 7-variable model.
  • Mallows’ \(C_p\) (e.g., in slide ...230040.png, left)
    • What it shows: This is an estimate of test error. We want the model with the minimum \(C_p\). The plot drops to a low value at 4 variables and stays low, with its absolute minimum around 6 or 7 variables.
    • Conclusion: This metric also suggests a 6 or 7-variable model.
  • BIC (Bayesian Information Criterion) (e.g., in slide ...230040.png, center)
    • What it shows: This is another estimate of test error, but it has a stronger penalty for model complexity. The plot shows a clear “U” shape, reaching its minimum value at 4 variables and then increasing afterward.
    • Conclusion: This metric strongly suggests a 4-variable model.
  • Validation Set & Cross-Validation (CV) Error (Slide ...230045.png)
    • What it shows: These plots show the direct estimate of test error (not a mathematical adjustment like BIC or \(C_p\)). Both the validation set error and the 10-fold CV error show a “U” shape.
    • Conclusion: Both methods reach their minimum error at 6 variables. This is considered a very reliable result.

Final Summary of Results

The analysis of the Credit dataset reveals two strong candidates for the “best” model, depending on your goal:

  1. The 6-Variable Model: This model is supported by the Adjusted \(R^2\), Mallows’ \(C_p\), and (most importantly) the Validation Set and 10-fold Cross-Validation results. These metrics all indicate that the 6-variable model has the lowest prediction error on new data.

  2. The 4-Variable Model: This model is supported by BIC. Because BIC penalizes complexity more heavily, it selects a simpler (more parsimonious) model.

Overall Conclusion: If your primary goal is maximum predictive accuracy, you should choose the 6-variable model. If your goal is a simpler, more interpretable model that is still very good (and avoids any risk of overfitting), the 4-variable model is an excellent choice.

5. Two main strategies for controlling model complexity in linear regression

This presentation covers two main strategies for controlling model complexity in linear regression: Subset Selection (choosing which variables to include) and Shrinkage Methods (keeping all variables but reducing the impact of their coefficients).

Subset Selection

This method involves selecting a subset of the \(p\) total predictors to use in the model.

Key Concepts & Formulas

  • The Model: The standard linear regression model is represented in matrix form: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\] The goal of subset selection is to find a coefficient vector \(\boldsymbol{\beta}\) that is sparse, meaning it has many zero entries.

  • Forward Selection: This is a greedy algorithm that starts with an empty model and iteratively adds the single predictor that most improves the fit.

  • Theoretical Guarantee: Can forward selection find the true sparse set of variables?

    • Yes, if the predictors are not strongly correlated.
    • This is quantified by the Mutual Coherence Condition. Assuming the predictors \(\mathbf{x}_i\) are normalized, the method is guaranteed to work if: \[\mu = \max_{i \neq j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle| < \frac{1}{2s - 1}\] where \(s\) is the number of true non-zero coefficients and \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\) represents the correlation between predictors.
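
A small sketch of checking this condition numerically (assuming X is an n-by-p NumPy array of predictors; the function name is illustrative):

# Minimal sketch: compute the mutual coherence of a design matrix
import numpy as np

def mutual_coherence(X):
    Xn = X / np.linalg.norm(X, axis=0)  # normalize each column to unit length
    G = Xn.T @ Xn                       # Gram matrix of pairwise inner products
    np.fill_diagonal(G, 0.0)            # ignore the <x_i, x_i> = 1 terms
    return np.max(np.abs(G))

# The guarantee holds if mutual_coherence(X) < 1 / (2*s - 1), where s is the true sparsity.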

Practical Application: Finding the Best Model Size

How do you know whether to choose a model with 3, 4, or 5 variables? You use Cross-Validation (CV).

  • Important Image: The plot titled “10-fold CV” (from the first slide) is the most important visual. It plots the estimated test error (CV Error) on the y-axis against the number of variables in the model on the x-axis.

  • The “One Standard Deviation Rule”: Looking at the plot, the error drops sharply and then flattens. The absolute minimum error might be at 6 variables, but it’s only slightly better than the 3-variable model.

    1. Find the model with the lowest CV error.
    2. Calculate the standard error for that error estimate.
    3. Select the simplest model (fewest variables) whose error is within one standard error of the minimum.
    4. This follows Occam’s razor: choose the simplest explanation (model) that fits the data well enough. In the example given, this rule selects the 3-variable model.
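
A minimal sketch of this rule (assuming cv_errors and cv_ses are arrays holding the CV error and its standard error for model sizes 1..p; the names are illustrative):

# Minimal sketch of the one-standard-error rule
import numpy as np

def one_se_rule(cv_errors, cv_ses):
    cv_errors = np.asarray(cv_errors)
    k_min = int(np.argmin(cv_errors))             # size with the lowest CV error (0-indexed)
    threshold = cv_errors[k_min] + cv_ses[k_min]  # tolerance: minimum error + one SE
    k_best = int(np.min(np.where(cv_errors <= threshold)[0]))  # simplest size within tolerance
    return k_best + 1                             # return a 1-indexed model size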

Code Interpretation (R vs. Python)

The R code in the first slide performs this 10-fold CV manually for forward selection:

  1. It loops from p = 1 to 10 (model sizes).
  2. Inside the loop, it identifies the p variables chosen by a pre-computed forward selection model (regfit.fwd).
  3. It fits a new model (glm.fit) using only those p variables.
  4. It runs 10-fold CV (cv.glm) on that specific model to get its test error.
  5. It stores the error in CV10.err[p].
  6. Finally, it plots the results.

In Python (with scikit-learn): This entire process is often automated.

  • You would use sklearn.feature_selection.RFECV (Recursive Feature Elimination with Cross-Validation).
  • RFECV automatically performs cross-validation to find the optimal number of features, effectively producing the same plot and result as the R code.
# Conceptual Python equivalent for finding the best model size
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_regression

# X, y = load_your_data()
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=10, random_state=42)

estimator = LinearRegression()
# RFECV will test models with 1 feature, 2 features, etc.,
# and use cross-validation (cv=10) to find the best number.
selector = RFECV(estimator, step=1, cv=10, scoring='neg_mean_squared_error')
selector = selector.fit(X, y)

print(f"Optimal number of features: {selector.n_features_}")
# You can plot selector.cv_results_['mean_test_score'] to get the CV curve

Shrinkage Methods (Regularization)

Instead of explicitly removing variables, shrinkage methods keep all \(p\) variables but shrink their coefficients \(\beta_j\) towards zero.

Ridge Regression

Ridge regression is a prime example of a shrinkage method.

  • Objective Function: It finds the coefficients \(\boldsymbol{\beta}\) that minimize a new quantity: \[\underbrace{\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2}_{\text{RSS (Goodness of Fit)}} + \underbrace{\lambda \sum_{j=1}^{p} \beta_j^2}_{\text{$\ell_2$ Penalty (Shrinkage)}}\]

  • The \(\lambda\) Tuning Parameter: This parameter controls the strength of the penalty:

    • If \(\lambda = 0\): The penalty term disappears. Ridge regression is identical to standard Ordinary Least Squares (OLS).
    • If \(\lambda \to \infty\): The penalty is “infinitely” strong. To minimize the function, all coefficients \(\beta_j\) (for \(j=1...p\)) are forced to be zero. The model becomes an intercept-only model.
    • Note: The intercept \(\beta_0\) is not penalized.
  • The Bias-Variance Trade-off: This is the core concept of regularization.

    • Standard OLS has low bias but can have high variance (it overfits).
    • Ridge regression adds a small amount of bias (the coefficients are “wrong” on purpose) to significantly reduce the model’s variance.
    • This trade-off often leads to a model with a lower overall test error.
  • Matrix Solution: The discussion slide asks “What is the solution?”. While OLS has the solution \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\), the Ridge solution is: \[\hat{\boldsymbol{\beta}}^R = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\] where \(\mathbf{I}\) is the identity matrix. The \(\lambda \mathbf{I}\) term adds a “ridge” to the diagonal, making the matrix invertible even if \(\mathbf{X}^T\mathbf{X}\) is singular (which happens if \(p > n\) or predictors are collinear).
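
A small NumPy sketch of this closed-form solution (assuming the columns of X are already standardized, so the unpenalized intercept can be handled by centering y):

# Minimal sketch: ridge regression via its closed-form solution
import numpy as np

def ridge_closed_form(X, y, lam):
    # X: n x p standardized predictors; y: response; lam: penalty strength
    y_mean = y.mean()
    yc = y - y_mean                     # centering handles the (unpenalized) intercept
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ yc)
    return y_mean, beta                 # intercept, coefficient vector

# lam = 0 recovers OLS (when X^T X is invertible); larger lam shrinks beta toward zero.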

An Essential Step: Standardization

  • Problem: The \(\ell_2\) penalty \(\lambda \sum \beta_j^2\) is applied equally to all coefficients. If predictor \(x_1\) (e.g., house size in sq-ft) is on a much larger scale than \(x_2\) (e.g., number of rooms), its coefficient \(\beta_1\) will naturally be much smaller than \(\beta_2\). The penalty will unfairly punish \(\beta_2\) more.
  • Solution: You must standardize your inputs before fitting a Ridge model.
  • Formula: For each predictor \(X_j\), all its observations \(x_{ij}\) are rescaled: \[\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}\] (where \(\bar{x}_j\) is the mean of the predictor and \(\sigma_j\) is its standard deviation). This puts all predictors on a common scale (mean=0, std=1).

In Python (with scikit-learn):

  • You use sklearn.preprocessing.StandardScaler to standardize your data.
  • You use sklearn.linear_model.Ridge to fit the model.
  • You use sklearn.linear_model.RidgeCV to automatically find the best value for \(\lambda\) (called alpha in scikit-learn) using cross-validation.
# Conceptual Python code for Ridge Regression
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X, y = load_your_data()

# Create a pipeline that first standardizes the data,
# then fits a Ridge model.
# RidgeCV tests a range of alphas (lambdas) automatically.
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0], scoring='neg_mean_squared_error')
)

model.fit(X, y)

print(f"Best alpha (lambda): {model.named_steps['ridgecv'].alpha_}")
print(f"Model coefficients: {model.named_steps['ridgecv'].coef_}")

Subset Selection

This section is about choosing which predictors (variables) to include in your linear model. The main idea is to find a “sparse” model (one with few variables) that performs well.

The Model and The Goal

  • Slide: “Forward selection in Linear Regression”
  • Formula: The standard linear regression model is \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\)
    • \(\mathbf{y}\) is the \(n \times 1\) vector of outcomes.
    • \(\mathbf{X}\) is the \(n \times (p+1)\) matrix of predictors (with a leading column of 1s for the intercept).
    • \(\boldsymbol{\beta}\) is the \((p+1) \times 1\) vector of coefficients (\(\beta_0, \beta_1, ..., \beta_p\)).
    • \(\boldsymbol{\epsilon}\) is the \(n \times 1\) vector of irreducible error.
  • Key Question: “If \(\boldsymbol{\beta}\) is sparse with at most \(s\) non-zero entries, can forward selection find those variables?”
    • Sparse means most coefficients are zero.
    • Forward Selection is a greedy algorithm:
      1. Start with no variables.
      2. Add the one variable that gives the best fit.
      3. Add the next best variable to the existing model.
      4. Repeat until you have a model with \(s\) variables.
    • The slide suggests the answer is yes, but only under certain conditions.

The Condition for Success

  • Slide: “Orthogonal Matching Pursuit”
  • Key Concept: Forward selection can provably find the correct variables if those variables are not strongly correlated.
  • Formula: This is formalized by the Mutual Coherence Condition: \[\mu = \max_{i \neq j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle| < \frac{1}{2s - 1}\]
    • What it means:
      • “Assuming the \(\mathbf{x}_i\)'s are normalized” means each predictor column has been rescaled to have length 1.
      • \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\) is the dot product, which is just their correlation since they are normalized.
      • \(\mu\) (mu) is the largest absolute correlation you can find between any two different predictors.
      • \(s\) is the true number of important variables.
    • In English: If the maximum correlation between any of your predictors is less than this threshold, the greedy forward selection algorithm is guaranteed to find the true, sparse set of variables.

How to Choose the Model Size (Practice)

The theory is nice, but in practice, you don’t know \(s\). How many variables should you pick?

  • Slide: “10-fold CV Errors”

  • This is the most important practical slide for this section.

  • What the plot shows:

    • X-axis: “Number of Variables” (from 1 to 10).
    • Y-axis: “CV Error” (the 10-fold cross-validated Mean Squared Error).
    • The Curve: The error drops very fast as we add the first 2-3 variables. Then, it flattens out. Adding more than 3 variables doesn’t really help much.
  • Slide: “The one standard deviation rule”

  • This rule helps you pick the “best” model from the CV plot.

    1. Find the model with the absolute minimum CV error (in the plot, this looks to be around 6 or 7 variables).
    2. Calculate the standard error of that minimum CV error.
    3. Draw a “tolerance” line at (minimum error) + (one standard error).
    4. Choose the simplest model (fewest variables) whose CV error is below this tolerance line.
    • The slide states that this rule gives the 3-variable model for this example. This is because the 3-variable model is much simpler than the 6-variable one, and its error is “good enough” (within one standard error of the minimum). This is an application of Occam’s razor.

Code: R vs. Python

The R code on the “10-fold CV Errors” slide generates that exact plot.

  • R Code Explained:

    • library(boot): Loads the cross-validation library.
    • CV10.err=rep(0,10): Creates an empty vector to store the 10 error scores.
    • for(p in 1:10): A loop that will test model sizes from 1 to 10.
    • x<-which(summary(regfit.fwd)$which[p,]): Gets the names of the \(p\) variables chosen by a pre-run forward selection (regfit.fwd).
    • glm.fit=glm(Balance~.,data=newCred): Fits a model using only those \(p\) variables.
    • cv.err=cv.glm(newCred,glm.fit,K=10): Performs 10-fold CV on that specific \(p\)-variable model.
    • CV10.err[p]<-cv.err$delta[1]: Stores the CV error.
    • plot(...): Plots the 10 errors against the 10 model sizes.
  • Python Equivalent (Conceptual):

    • In scikit-learn, this process is often automated. You wouldn’t write the CV loop yourself.
    • You would use sklearn.feature_selection.RFECV (Recursive Feature Elimination with Cross-Validation). This tool automatically wraps a model (like LinearRegression), performs cross-validation, and finds the optimal number of features, effectively producing the same plot and result.
# --- Python equivalent for 6.1 ---
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Assume X and y are your data

# 1. Create a pipeline
# (Note: It's good practice to scale, even for OLS, if you're comparing)
pipeline = make_pipeline(
    StandardScaler(),
    LinearRegression()
)

# 2. Create the RFECV (Recursive Feature Elimination w/ CV) object
# This is an *alternative* to forward selection, but serves the same purpose
# It will test models with 1, 2, 3... features using 10-fold CV
feature_selector = RFECV(
    estimator=pipeline,
    min_features_to_select=1,
    step=1,
    cv=10,
    scoring='neg_mean_squared_error',  # We want to minimize error
    # RFECV ranks features by coefficient magnitude; with a pipeline we must point it
    # at the inner linear model's coef_ attribute:
    importance_getter='named_steps.linearregression.coef_'
)

# 3. Fit it
feature_selector.fit(X, y)

print(f"Optimal number of features found: {feature_selector.n_features_}")

# You could then plot feature_selector.cv_results_['mean_test_score']
# to replicate the R plot.

Shrinkage Methods by Regularization

This is a different approach. Instead of removing variables, we keep all \(p\) variables but shrink their coefficients \(\beta_j\) towards 0.

Ridge Regression: The Core Idea

  • Slide: “Ridge regression”
  • Formula: Ridge regression minimizes a new objective function: \[\min_{\boldsymbol{\beta}} \left( \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right)\]
    • Term 1: \(\text{RSS}\) (Residual Sum of Squares). This is the original OLS “goodness of fit” term. We want this to be small.
    • Term 2: \(\lambda \sum \beta_j^2\). This is the \(\ell_2\) penalty or “shrinkage penalty”. It adds a “cost” for having large coefficients.
  • The \(\lambda\) (lambda) Parameter:
    • This is the tuning parameter that controls the trade-off between fit and simplicity.
    • \(\lambda = 0\): No penalty. The objective is just to minimize RSS. The solution \(\hat{\boldsymbol{\beta}}^R\) is identical to the OLS solution \(\hat{\boldsymbol{\beta}}\).
    • \(\lambda \to \infty\): Infinite penalty. The only way to minimize the cost is to make all \(\beta_j = 0\) (for \(j \ge 1\)). The model becomes an intercept-only model.
    • Large \(\lambda\): Heavy penalty, more shrinkage.
    • Crucial Note: The intercept \(\beta_0\) is not penalized. This is because \(\beta_0\) just represents the mean of \(y\) when all \(x\)’s are 0; shrinking it makes no sense.

The Need for Standardization

  • Slide: “Standardize the inputs”
  • Problem: The penalty \(\lambda \sum \beta_j^2\) is applied to all coefficients. But what if \(x_1\) is “house size in sq-ft” (values 1000-5000) and \(x_2\) is “number of bedrooms” (values 1-5)?
    • The coefficient \(\beta_1\) for house size will naturally be tiny, while the coefficient \(\beta_2\) for bedrooms will be large, even if they are equally important.
    • Ridge regression would unfairly and heavily penalize \(\beta_2\) while barely touching \(\beta_1\).
  • Solution: You must standardize all predictors before fitting a Ridge model.
  • Formula: For each observation \(i\) of each predictor \(j\): \[\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{(1/n) \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}\]
    • This formula rescales every predictor to have a mean of 0 and a standard deviation of 1.
    • Now, all coefficients \(\beta_j\) are on a “level playing field” and can be penalized fairly.

Answering the Discussion Questions

  • Slide: “DISCUSSION”
    • What is the solution of Ridge regression?
    • What is the bias and the variance?

1. What is the solution of Ridge regression?

The solution can be written in matrix form, which is very elegant.

  • Standard OLS Solution: The coefficients \(\hat{\boldsymbol{\beta}}^{\text{OLS}}\) that minimize RSS are found by: \[\hat{\boldsymbol{\beta}}^{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

  • Ridge Regression Solution: The coefficients \(\hat{\boldsymbol{\beta}}^{R}\) that minimize the Ridge objective are: \[\hat{\boldsymbol{\beta}}^{R} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

    • Explanation:
      • \(\mathbf{I}\) is the identity matrix (a matrix of 1s on the diagonal, 0s everywhere else).
      • By adding \(\lambda\mathbf{I}\), we are adding a positive value \(\lambda\) to the diagonal of the \(\mathbf{X}^T\mathbf{X}\) matrix.
      • This addition stabilizes the matrix. \(\mathbf{X}^T\mathbf{X}\) might not be invertible (if \(p > n\) or if predictors are perfectly collinear), but \((\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})\) is always invertible for \(\lambda > 0\).
      • This addition is what mathematically “shrinks” the coefficients toward zero.
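
A tiny sketch illustrating this stabilization (arbitrary dimensions): with \(p > n\), \(\mathbf{X}^T\mathbf{X}\) is rank-deficient, but \(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}\) can still be solved.

# Minimal sketch: X^T X is singular when p > n, but X^T X + lambda*I is invertible
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 50, 1.0                 # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))       # at most n = 20 < p, so XtX is singular
# np.linalg.solve(XtX, X.T @ y)         # would fail or be numerically meaningless

beta_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape)                 # (50,): a unique, stable solution exists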

2. What is the bias and the variance?

This is the most important concept in regularization. It’s the bias-variance trade-off.

  • Standard OLS (where \(\lambda=0\)):

    • Bias: Low. The OLS estimator is unbiased, meaning that if you took many samples and fit many OLS models, their average \(\hat{\boldsymbol{\beta}}\) would be the true \(\boldsymbol{\beta}\).
    • Variance: High. The OLS solution can be highly sensitive to the training data. If you change a few data points, the coefficients can swing wildly. This is especially true if \(p\) is large or predictors are correlated. This “sensitivity” is high variance, which leads to overfitting.
  • Ridge Regression (where \(\lambda > 0\)):

    • Bias: High(er). Ridge regression is a biased estimator. By adding the penalty, we are purposefully pulling the coefficients away from the OLS solution and towards zero. The average \(\hat{\boldsymbol{\beta}}^R\) from many samples will not equal the true \(\boldsymbol{\beta}\). We have introduced bias into our model.
    • Variance: Low(er). In exchange for this bias, we get a massive reduction in variance. The \(\lambda\mathbf{I}\) term stabilizes the solution. The coefficients won’t change wildly even if the training data changes. The model is more robust and less sensitive.

The Trade-off: The total expected test error of a model is: \(\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\)

By using Ridge regression, we increase the \(\text{Bias}^2\) term a little, but we decrease the \(\text{Variance}\) term a lot. The goal is to find a \(\lambda\) where the total error is minimized. Ridge regression reduces variance at the cost of increased bias.
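
A rough simulation sketch of this trade-off (synthetic data with arbitrary settings): it repeatedly regenerates training sets, fits OLS and ridge, and compares the average offset (bias) and spread (variance) of one coefficient estimate.

# Rough simulation: ridge trades a little bias for a large drop in variance
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n, p = 50, 30
beta_true = np.zeros(p)
beta_true[:5] = 1.0                                # sparse true signal
Sigma = 0.9 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # correlated predictors

ols_est, ridge_est = [], []
for _ in range(200):                               # 200 simulated training sets
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta_true + rng.normal(scale=2.0, size=n)
    ols_est.append(LinearRegression().fit(X, y).coef_[0])
    ridge_est.append(Ridge(alpha=10.0).fit(X, y).coef_[0])

for name, est in [("OLS", ols_est), ("Ridge", ridge_est)]:
    est = np.array(est)
    bias = est.mean() - beta_true[0]
    print(f"{name}: bias^2 = {bias**2:.3f}, variance = {est.var():.3f}")
# Typically OLS shows near-zero bias but much larger variance; ridge is slightly biased
# toward zero yet far more stable across training sets.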

Python Equivalent for 6.2

# --- Python equivalent for 6.2 ---
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Assume X and y are your data

# 1. Create a pipeline that AUTOMATICALLY
# - Standardizes the data
# - Fits a Ridge Regression model
# - Uses Cross-Validation to find the BEST lambda (alpha in scikit-learn)
alphas_to_test = [0.01, 0.1, 1.0, 10.0, 100.0]

# RidgeCV handles everything for us
pipeline = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=alphas_to_test, scoring='neg_mean_squared_error', cv=10)
)

# 2. Fit the pipeline
pipeline.fit(X, y)

# 3. Get the results
best_lambda = pipeline.named_steps['ridgecv'].alpha_
ridge_coefficients = pipeline.named_steps['ridgecv'].coef_
intercept = pipeline.named_steps['ridgecv'].intercept_

print(f"Best lambda (alpha) found by CV: {best_lambda}")
print(f"Model intercept (beta_0): {intercept}")
print(f"Model coefficients (beta_j): {ridge_coefficients}")

6. The “Why” of Ridge Regression

Core Concepts: The “Why” of Ridge Regression

Your slides explain that ridge regression is a “shrinkage method” designed to solve a major problem with standard Ordinary Least Squares (OLS) regression: high variance.

The Bias-Variance Tradeoff (Slide 3)

This is the most important theoretical concept. In prediction, the total error (Mean Squared Error, or MSE) of a model is composed of three parts: \(\text{Error} = \text{Variance} + \text{Bias}^2 + \text{Irreducible Error}\)

  • Ordinary Least Squares (OLS): Aims to be unbiased (low bias). However, when you have many predictors (\(p\)), especially if they are correlated, or if \(p\) is large compared to the number of samples \(n\) (\(p \approx n\) or \(p > n\)), the OLS model becomes highly unstable. A small change in the training data can cause the coefficients to change wildly. This is high variance. (See Slide 6, “Remarks”).
  • Ridge Regression: By adding a penalty, ridge intentionally introduces a small amount of bias (it pulls coefficients away from their “true” OLS values). In return, it achieves a massive reduction in variance.

As Slide 3 shows:

  • The green line (Variance) starts very high for low \(\lambda\) (left side) and drops quickly.
  • The black line (Squared Bias) starts at zero (for OLS at \(\lambda=0\)) and slowly increases as \(\lambda\) grows.
  • The purple line (Test MSE) is the sum of the two. It’s U-shaped. The goal of ridge is to find the \(\lambda\) (marked by the ‘x’) at the bottom of this “U,” which gives the lowest possible total error.

Why Is It Called “Ridge”? The 3D Spatial Meaning (Slide 5)

This slide explains the problem of collinearity and the origin of the name.

  • Left Plot (Least Squares): Imagine a model with two correlated predictors, \(\beta_1\) and \(\beta_2\). The y-axis (SS1) is the error (RSS). Because the predictors are correlated, there isn’t one single “point” that is the minimum. Instead, there’s a long, flat valley or trough (marked “unstable”). Many different combinations of \(\beta_1\) and \(\beta_2\) along this valley give a similarly low error. The OLS solution is unstable because it can pick any point in this flat-bottomed valley.
  • Right Plot (Ridge): The ridge objective function adds a penalty term: \(\lambda(\beta_1^2 + \beta_2^2)\). This penalty term, by itself, is a perfect circular bowl centered at (0,0). When you add this “bowl” to the OLS “valley,” it stabilizes the function. It pulls the minimum towards (0,0) and creates a single, stable, well-defined minimum.
  • The “Ridge” Name: The penalty \(\lambda\mathbf{I}\) (from the matrix formula) adds a “ridge” of values to the diagonal of the \(\mathbf{X}^T\mathbf{X}\) matrix, which geometrically turns the unstable flat valley into a stable bowl.

Mathematical Formulas

The key difference between OLS and Ridge is the function they try to minimize.

  1. OLS Objective Function: Minimize the Residual Sum of Squares (RSS). \[\text{RSS} = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2\]

  2. Ridge Objective Function (Slide 6): Minimize the RSS plus an L2 penalty term. \[\text{Minimize: } \left[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \right] + \lambda \sum_{j=1}^{p} \beta_j^2\]

    • \(\lambda\) is the tuning parameter controlling the penalty strength.
    • \(\sum_{j=1}^{p} \beta_j^2\) is the L2-norm (squared) of the coefficients. It penalizes large coefficients.
  3. L2 Norm (Slide 1): The L2 norm of a vector \(\mathbf{a}\) is its standard Euclidean length. The plot on Slide 1 uses this to show the total magnitude of the ridge coefficients. \[\|\mathbf{a}\|_2 = \sqrt{\sum_{j=1}^p a_j^2}\]

  4. Matrix Solution (Slide 6): This is the “closed-form” solution for the ridge coefficients \(\hat{\beta}^R\). \[\hat{\beta}^R = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

    • \(\mathbf{I}\) is the identity matrix.
    • The term \(\lambda\mathbf{I}\) is what stabilizes the \(\mathbf{X}^T\mathbf{X}\) matrix, making it invertible even if it’s singular (due to \(p > n\) or collinearity).

Walkthrough of the “Credit Data” Example (All Slides)

Here is the logical story of the R code, from start to finish.

Step 1: Data Preparation (Slide 8)

  • x=scale(model.matrix(Balance~., Credit)[,-1])
    • model.matrix(...) creates the predictor matrix x.
    • scale(...) is critically important. It standardizes all predictors to have a mean of 0 and a standard deviation of 1. This is necessary because the ridge penalty \(\lambda \sum \beta_j^2\) is unit-dependent. If Income (in 10,000s) and Cards (1-10) were unscaled, the penalty would unfairly crush the Income coefficient. Scaling puts all predictors on a level playing field.
  • y=Credit$Balance
    • This sets the y (target) variable.

Step 2: Fit the Ridge Model (Slide 8)

  • grid=10^seq(4,-2,length=100)
    • This creates a grid of 100 \(\lambda\) values to test, ranging from \(10^4\) (a huge penalty) down to \(10^{-2}\) (a tiny penalty).
  • ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
    • This is the main command. It fits a separate ridge model for every single \(\lambda\) in the grid.
    • alpha=0 is the specific command that tells glmnet to perform Ridge Regression. (Setting alpha=1 would be LASSO).
  • coef(ridge.mod)[,50]
    • This inspects the model. It pulls out the vector of coefficients for the 50th \(\lambda\) in the grid (which is \(\lambda=10.72\)).

Step 3: Visualize the Coefficient “Solution Path” (Slides 1, 4, 9)

These plots all show the same thing: how the coefficients change as \(\lambda\) changes.

  • Slide 9 Plot: This plots the standardized coefficients for 4 predictors (Income, Limit, Rating, Student) against the index (1 to 100). Index 1 (left) is the largest \(\lambda\), and index 100 (right) is the smallest \(\lambda\) (closest to OLS). You can see the coefficients “grow” from 0 as the penalty (\(\lambda\)) gets smaller.
  • Slide 1 (Left Plot): This is the same plot as Slide 9, but more professional. It plots the coefficients against \(\lambda\) on a log scale. You can clearly see all coefficients (gray lines) being “shrunk” toward zero as \(\lambda\) increases (moves right). The key predictors (Income, Rating, etc.) are highlighted.
  • Slide 1 (Right Plot): This is the exact same data again, but with a different x-axis: \(\|\hat{\beta}_\lambda^R\|_2 / \|\hat{\beta}\|_2\).
    • 1.0 on the right means \(\lambda=0\). The ratio of the ridge norm to the OLS norm is 1 (they are the same).
    • 0.0 on the left means \(\lambda=\infty\). The ridge coefficients are all 0, so their norm is 0.
    • This axis shows the “fraction” of the full OLS coefficient magnitude that the model is using.
  • Slide 4 Plot: This plots the total L2 norm of all coefficients (\(\|\hat{\beta}_\lambda^R\|_2\)) against the index. As the index goes from 1 to 100 (i.e., \(\lambda\) gets smaller), the total magnitude of the coefficients gets larger, which is exactly what we expect.

Step 4: Find the Best \(\lambda\) using Cross-Validation (Slides 4 & 7)

We have 100 models. Which one is best?

  • The “Manual” Way (Slide 4):

    • The code splits the data into a train and test set.
    • It fits a model only on the train set.
    • It tests two \(\lambda\) values:
      • s=4: Gives a test MSE of 10293.33.
      • s=10: Gives a test MSE of 168981.1 (much worse!).
    • This shows that \(\lambda=4\) is better than \(\lambda=10\), but we don’t know if it’s the best.
  • The “Automatic” Way (Slide 7):

    • cv.out=cv.glmnet(x[train,], y[train], alpha=0)
    • This runs 10-fold Cross-Validation on the training set. It automatically splits the training set into 10 “folds,” trains on 9, tests on 1, and repeats this 10 times for every \(\lambda\).
    • The Plot: The plot on this slide is the result. It shows the average MSE (y-axis) for each \(\log(\lambda)\) (x-axis). This is the real-data version of the theoretical purple curve from Slide 3.
    • bestlam=cv.out$lambda.min
    • This command finds the \(\lambda\) at the very bottom of the U-shaped curve. The output shows bestlam is 41.6.
    • ridge.pred=predict(ridge.mod, s=bestlam, newx=x[test,])
    • Now, we use this one best \(\lambda\) to make predictions on our held-out test set.
    • mean((ridge.pred-y.test)^2)
    • The final, reliable test MSE is 16129.68. This is our best estimate of how the model will perform on new, unseen data.

Python (scikit-learn) Equivalents

Here is how you would perform the entire R workflow from your slides in Python.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# --- 1. Load and Prepare Data (like Slide 8) ---
# Assuming 'Credit' is a pandas DataFrame
# X = Credit.drop('Balance', axis=1)
# y = Credit['Balance']
# ... (need to handle categorical variables first, e.g., with pd.get_dummies) ...
# For this example, let's assume X and y are already loaded and numeric.

# Standardize the predictors (CRITICAL)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- 2. Train/Test Split (like Slide 4) ---
# test_size=0.5 and random_state=1 mimic the R code
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.5, random_state=1
)

# --- 3. Find Best Lambda (alpha) with Cross-Validation (like Slide 7) ---
# Create the same log-spaced grid of lambdas (sklearn calls it 'alpha')
lambda_grid = np.logspace(4, -2, 100)

# RidgeCV performs cross-validation to find the best alpha.
# store_cv_values=True is needed to plot the CV error curve, but scikit-learn only
# allows it with the default efficient leave-one-out CV (so we do not pass cv=10);
# this approximates the 10-fold CV shown on the slides.
cv_model = RidgeCV(alphas=lambda_grid, store_cv_values=True)
cv_model.fit(X_train, y_train)

# Get the best lambda found
best_lambda = cv_model.alpha_
print(f"Best lambda (alpha) found by CV: {best_lambda}")

# Plot the CV error curve (like Slide 7 plot)
# cv_model.cv_values_ has shape (n_samples, n_alphas)
# We need to average over the samples for each alpha
mse_path = np.mean(cv_model.cv_values_, axis=0)
plt.figure()
plt.plot(np.log10(lambda_grid), mse_path, marker='o')  # lambda_grid matches the column order of cv_values_
plt.xlabel("Log(lambda)")
plt.ylabel("Mean Squared Error")
plt.title("Cross-Validation Error Path")
plt.show()

# --- 4. Evaluate on Test Set (like Slide 7) ---
# 'cv_model' is already refit on the full training set using the best_lambda
test_pred = cv_model.predict(X_test)
final_test_mse = mean_squared_error(y_test, test_pred)
print(f"Final Test MSE with best lambda: {final_test_mse}")

# --- 5. Get Final Coefficients (like Slide 7, bottom) ---
# The coefficients from the CV-trained model:
print(f"Intercept: {cv_model.intercept_}")
print("Coefficients:")
for coef, feature in zip(cv_model.coef_, X.columns):
    print(f"  {feature}: {coef}")

# --- 6. Plot the Solution Path (like Slide 1) ---
# To do this, we fit a Ridge model for each lambda and store the coefficients
coefs = []
for lam in lambda_grid:
    model = Ridge(alpha=lam)
    model.fit(X_scaled, y)  # Fit on all data
    coefs.append(model.coef_)

# Plot
plt.figure()
plt.plot(np.log10(lambda_grid), coefs)
plt.xlabel("Log(lambda)")
plt.ylabel("Standardized Coefficients")
plt.title("Ridge Solution Path")
plt.show()

7. Shrinkage Methods (Regularization)

These slides cover Shrinkage Methods, also known as Regularization, which are techniques used to improve on the standard least squares model, particularly when dealing with many variables or multicollinearity. The main focus is on LASSO regression.

Key Mathematical Formulas

The slides present two main, but equivalent, ways to formulate these methods.

1. Penalized Formulation (Slide 1)

This is the most common formulation. The goal is to minimize a function that is a combination of the Residual Sum of Squares (RSS) and a penalty term. The penalty discourages large coefficients.

  • LASSO (Least Absolute Shrinkage and Selection Operator): The goal is to find coefficients (\(\beta_0, \beta_j\)) that minimize: \[\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} |\beta_j|\]
    • Penalty: The \(L_1\) norm (\(\|\beta\|_1\)), which is the sum of the absolute values of the coefficients.
    • Key Property: This penalty can force some coefficients to be exactly zero, effectively performing automatic variable selection.

2. Constrained Formulation (Slide 2)

This alternative formulation minimizes the RSS subject to a constraint (a “budget”) on the size of the coefficients.

  • For Lasso: Minimize RSS subject to: \[\sum_{j=1}^{p} |\beta_j| \le s\] (The sum of the absolute values of the coefficients must be less than some budget \(s\).)

  • For Ridge: Minimize RSS subject to: \[\sum_{j=1}^{p} \beta_j^2 \le s\] (The sum of the squares of the coefficients (\(L_2\) norm) must be less than \(s\).)

Equivalence (Slide 3): For any penalty value \(\lambda\) used in the first formulation, there is a corresponding budget \(s\) in the second formulation that will give the exact same set of coefficients. \(\lambda\) and \(s\) are inversely related: a large \(\lambda\) (high penalty) corresponds to a small \(s\) (small budget).
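
A short sketch illustrating this inverse relationship on synthetic data (arbitrary settings, using LASSO since its budget is the \(L_1\) norm): as the penalty \(\lambda\) (alpha in scikit-learn) grows, the implied budget \(s = \|\hat{\beta}_\lambda\|_1\) shrinks.

# Sketch: a larger penalty lambda corresponds to a smaller L1 "budget" s
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4, noise=5.0, random_state=0)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    beta = Lasso(alpha=lam, max_iter=10000).fit(X, y).coef_
    s = np.sum(np.abs(beta))            # the implied budget ||beta||_1
    print(f"lambda = {lam:>6}: ||beta||_1 = {s:8.2f}, non-zero coefficients = {np.sum(beta != 0)}")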

Important Plots and Interpretation

Your slides show the two most important plots for understanding and using LASSO.

1. The Cross-Validation (CV) Plot (Slide 5)

This plot is crucial for choosing the best tuning parameter (\(\lambda\)).

  • X-axis: \(\text{Log}(\lambda)\). This is the penalty strength.
    • Right side (high \(\lambda\)): High penalty, simple model (many coefficients are 0), high bias, high Mean-Squared Error (MSE).
    • Left side (low \(\lambda\)): Low penalty, complex model (like standard linear regression), high variance, MSE starts to increase (overfitting).
  • Y-axis: Mean-Squared Error (MSE) from cross-validation.
  • Goal: Find the \(\lambda\) at the bottom of the “U” shape, which gives the lowest MSE. This is the optimal trade-off between bias and variance. The top axis shows how many variables are included in the model at each \(\lambda\).

2. The Coefficient Path Plot (Slide 6)

This plot is the best visualization for understanding what LASSO does.

  • Left Plot (vs. \(\lambda\)):
    • X-axis: The penalty strength \(\lambda\).
    • Y-axis: The standardized value of each coefficient.
    • How to read it: Start from the right (high \(\lambda\)). All coefficients are 0. As you move left, \(\lambda\) decreases, and the penalty is relaxed. Variables “enter” the model one by one (their coefficients become non-zero). You can see that ‘Rating’, ‘Income’, and ‘Student’ are the most important variables, as they are the first to become non-zero.
  • Right Plot (vs. \(L_1\) Norm Ratio):
    • This shows the exact same information as the left plot, but the x-axis is reversed and rescaled. An axis value of 0.0 means full penalty (all \(\beta=0\)), and 1.0 means no penalty.

Code Understanding (R to Python)

The slides use the glmnet package in R. The equivalent and most popular library in Python is scikit-learn.

1. Finding the Best \(\lambda\) (CV)

The R code cv.out=cv.glmnet(x[train,],y[train],alpha=1) performs cross-validation to find the best \(\lambda\).

  • Python Equivalent: Use LassoCV. It does the same thing: tests many \(\lambda\) values (called alphas in scikit-learn) and picks the best one.
from sklearn.linear_model import LassoCV

# Create the LassoCV object
# cv=5 means 5-fold cross-validation
lasso_cv = LassoCV(cv=5, random_state=0)

# Fit the model to the training data
lasso_cv.fit(X_train, y_train)

# Get the best lambda (called alpha_ in sklearn)
best_lambda = lasso_cv.alpha_
print(f"Best lambda (alpha): {best_lambda}")

# Get the MSEs
# This is what's plotted in the CV plot
print(lasso_cv.mse_path_)

2. Fitting with the Best \(\lambda\) and Getting Coefficients

The R code lasso.coef=predict(out,type="coefficients",s=bestlam) gets the coefficients for the best \(\lambda\).

  • Python Equivalent: The LassoCV object is already refitted on the full training data using the best \(\lambda\). You can also fit a new Lasso model with that specific \(\lambda\).
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# --- Option 1: Use the already-fitted LassoCV object ---
print("Coefficients from LassoCV:")
print(lasso_cv.coef_)

# Make predictions on the test set
y_pred = lasso_cv.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {test_mse}")


# --- Option 2: Fit a new Lasso model with the best lambda ---
final_lasso = Lasso(alpha=best_lambda)
final_lasso.fit(X_train, y_train)

# Get coefficients (Slide 7 shows this)
# Note how some are 0!
print("\nCoefficients from new Lasso model:")
print(final_lasso.coef_)

The Core Problem: Two Equivalent Formulas

The slides show two ways of writing the same problem. Understanding this equivalence is key.

Formulation 1: The Penalized Method (Slides 1 & 4)

  • Formula: \[\min_{\beta} \left( \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \beta)^2 + \lambda \|\beta\|_1 \right)\]

    • \(\sum (y_i - \mathbf{x}_i^T \beta)^2\): This is the normal Residual Sum of Squares (RSS). We want to make this small (fit the data well).
    • \(\lambda \|\beta\|_1\): This is the \(L_1\) penalty.
      • \(\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|\) is the sum of the absolute values of the coefficients.
      • \(\lambda\) (lambda) is a tuning parameter. Think of it as a “penalty knob”.
  • How to think about \(\lambda\):

    • If \(\lambda = 0\): There is no penalty. This is just standard Ordinary Least Squares (OLS) regression. The model will likely overfit.
    • If \(\lambda\) is small: There’s a small penalty. Coefficients will shrink a little bit.
    • If \(\lambda\) is very large: The penalty is severe. The only way to make the penalty term small is to make the coefficients (\(\beta\)) themselves small. The model will eventually shrink all coefficients to exactly 0.

Formulation 2: The Constrained Method (Slides 2 & 3)

  • Formula: \[\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \beta)^2 \quad \text{subject to} \quad \|\beta\|_1 \le s\]

  • How to think about \(s\):

    • This says: “Find the best-fitting model (minimize RSS) but you have a limited ‘budget’ \(s\) for the total size of your coefficients.”
    • If \(s\) is very large: The budget is huge. This constraint does nothing. You get the standard OLS solution.
    • If \(s\) is small: The budget is tight. You must shrink your coefficients to stay under the budget \(s\). To get the best fit, the model will be forced to set unimportant coefficients to 0 and only “spend” its budget on the most important variables.

The Equivalence: These two forms are equivalent. For any \(\lambda\) you pick, there’s a corresponding budget \(s\) that gives the exact same solution.

  • High \(\lambda\) (strong penalty) \(\iff\) Small \(s\) (tight budget)
  • Low \(\lambda\) (weak penalty) \(\iff\) Large \(s\) (loose budget)

This equivalence is why you see plots with both \(\lambda\) and \(L_1\) Norm on the x-axis. They are just two different ways of looking at the same “penalty” spectrum.

Detailed Plot & Code Analysis

Let’s look at the plots and code, which answer the practical questions: (1) How do we pick the best \(\lambda\)? and (2) What does LASSO do to the coefficients?

Question 1: How to pick the best \(\lambda\)? (Slide 5)

This is the Cross-Validation (CV) Plot. Its one and only job is to help you find the optimal \(\lambda\).

  • R Code: cv.out=cv.glmnet(x[train,],y[train],alpha=1)
    • cv.glmnet: This R function automatically does K-fold cross-validation. alpha=1 explicitly tells it to use LASSO (alpha=0 would be Ridge).
    • It tries a whole range of \(\lambda\) values, calculates the Mean-Squared Error (MSE) for each, and stores the results in cv.out.
  • Plot Analysis:
    • X-axis: \(\text{Log}(\lambda)\). The penalty strength. Right = High Penalty (simple model), Left = Low Penalty (complex model).
    • Y-axis: Mean-Squared Error (MSE). Lower is better.
    • Red Dots: The average MSE for each \(\lambda\).
    • Gray Bars: The error bars (standard error).
    • The “U” Shape: This is the classic bias-variance trade-off.
      • Right Side (High \(\lambda\)): The model is too simple (too many coefficients are 0). It’s “underfitting.” The error is high (high bias).
      • Left Side (Low \(\lambda\)): The model is too complex (low penalty, like OLS). It’s “overfitting” the training data. The error on new data is high (high variance).
      • Bottom of the “U”: This is the “sweet spot.” The \(\lambda\) at the very bottom (marked by the left vertical dotted line) gives the lowest possible MSE. This is lambda.min.

Answer: You pick the \(\lambda\) that corresponds to the lowest point on this graph.

Question 2: What does LASSO do? (Slides 5, 6, 7)

These slides all show the effect of LASSO.

A. The Coefficient Path Plots (Slides 5 & 6)

These plots visualize how coefficients change. They show the same information just with different x-axes.

  • Left Plot (Slide 6) vs. \(\lambda\):
    • How to read: Read from RIGHT to LEFT.
    • At the far right (\(\lambda\) is large), all coefficients are 0.
    • As you move left, \(\lambda\) gets smaller, and the penalty is relaxed. Variables “enter” the model one by one as their coefficients become non-zero.
    • You can see ‘Rating’ (red-dashed), ‘Student’ (black-solid), and ‘Income’ (blue-dotted) are the first to enter, suggesting they are the most important predictors.
  • Right Plot (Slide 6) vs. \(L_1\) Norm Ratio:
    • This is the same plot, just flipped and rescaled. The x-axis is \(\|\hat{\beta}_\lambda\|_1 / \|\hat{\beta}_{OLS}\|_1\).
    • How to read: Read from LEFT to RIGHT.
    • At 0.0: This is a “0% budget” (like \(s=0\) or \(\lambda=\infty\)). All coefficients are 0.
    • At 1.0: This is a “100% budget” (like \(s=\infty\) or \(\lambda=0\)). This is the full OLS model.
    • This view clearly shows the coefficients “growing” from 0 as their “budget” (\(L_1\) Norm) increases.

B. The Code Output (Slide 7) - This is the most important “answer”

This slide explicitly demonstrates variable selection by comparing the coefficients from two different \(\lambda\) values.

  • First Block (The “Optimal” Model):

    • bestlam.cv <- cv.out$lambda.min: This gets the \(\lambda\) from the bottom of the “U” in the CV plot.
    • lasso.conf <- predict(out,type="coefficients",s=bestlam.cv)[1:12,]: This gets the coefficients using that best \(\lambda\).
    • lasso.conf[lasso.conf!=0]: This R command filters the list to show only the non-zero coefficients.
    • Result: The optimal model still keeps 10 variables (‘Income’, ‘Limit’, ‘Rating’, etc.). It has shrunk them, but it hasn’t set many to 0.
  • Second Block (The “High Penalty” Model):

    • The slide text says “if we choose a larger regularization parameter.” Here, they’ve picked an arbitrary larger value, s=10. (Note: R’s predict.glmnet can be confusing; s=10 here means \(\lambda=10\)).
    • lasso.conf <- predict(out,type="coefficients",s=10)[1:12,]: This gets the coefficients using a stronger penalty (\(\lambda=10\)).
    • lasso.conf[lasso.conf!=0]: Again, show only the non-zero coefficients.
    • Result: Look! The list is much shorter. The coefficients for ‘Age’, ‘Education’, ‘GenderFemale’, ‘MarriedYes’, and ‘Ethnicity’ are all gone (shrunk to 0.000000). The model has decided these are not important enough to “spend” budget on.

Conclusion: LASSO performs automatic variable selection. By increasing \(\lambda\), you create a sparser (simpler) model. Slide 7 is the concrete proof.

Python Equivalents (in more detail)

Here is how you would replicate the entire workflow from the slides in Python.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV, lasso_path
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# --- Assume X_train, y_train, X_test, y_test are loaded ---
# Example:
# data = pd.read_csv('Credit.csv')
# X = pd.get_dummies(data.drop(['ID', 'Balance'], axis=1), drop_first=True)
# y = data['Balance']
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# It's CRITICAL to scale data before regularization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
feature_names = X.columns


# 1. Replicate the CV Plot (Slide 5: ...000200.png)
# LassoCV does what cv.glmnet does: finds the best lambda (alpha)
print("Running LassoCV to find best lambda (alpha)...")
# 'alphas' is the list of lambdas to try. We can let it choose automatically.
# cv=10 means 10-fold cross-validation.
lasso_cv = LassoCV(cv=10, random_state=1, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)

# The best lambda found
best_lambda = lasso_cv.alpha_
print(f"Best lambda (alpha) found: {best_lambda}")

# --- Plotting the CV (MSE vs. Log(Lambda)) ---
# This recreates the R plot
plt.figure(figsize=(10, 6))
# lasso_cv.mse_path_ is a (n_alphas, n_folds) array of MSEs
# We take the mean across the folds (axis=1)
mean_mses = np.mean(lasso_cv.mse_path_, axis=1)
log_lambdas = np.log10(lasso_cv.alphas_)

plt.plot(log_lambdas, mean_mses, 'r.-')
plt.xlabel('Log(Lambda / Alpha)')
plt.ylabel('Mean-Squared Error')
plt.title('LASSO Cross-Validation Path (Replicating R Plot)')
# Plot a vertical line at the best lambda
plt.axvline(np.log10(best_lambda), linestyle='--', color='k', label=f'Best Lambda (alpha) = {best_lambda:.2f}')
plt.legend()
# No axis flip needed: log(lambda) already increases to the right, so high lambda sits on the right as in the R plot
plt.show()


# 2. Replicate the Coefficient Path Plot (Slide 6: ...000206.png)
# We can use the lasso_path function, or just use the CV object

# The lasso_cv object already calculated the paths!
coefs = lasso_cv.path(X_train_scaled, y_train, alphas=lasso_cv.alphas_)[1].T

plt.figure(figsize=(10, 6))
for i in range(X_train_scaled.shape[1]):
    plt.plot(log_lambdas, coefs[:, i], label=feature_names[i])

plt.xlabel('Log(Lambda / Alpha)')
plt.ylabel('Standardized Coefficients')
plt.title('LASSO Coefficient Path (Replicating R Plot)')
plt.legend(loc='upper right')
# log(lambda) already increases to the right, matching the slide's orientation (read right to left as the penalty grows)
plt.show()


# 3. Replicate the Code Output (Slide 7: ...000202.png)
print("\n--- Replicating R Output ---")

# --- First Block: Coefficients with BEST lambda ---
print(f"Coefficients using best lambda (alpha = {best_lambda:.4f}):")
# The lasso_cv object is already fitted with the best lambda
best_coefs = lasso_cv.coef_
coef_series_best = pd.Series(best_coefs, index=feature_names)
# This is like R's `lasso.conf[lasso.conf != 0]`
print(coef_series_best[coef_series_best != 0])


# --- Second Block: Coefficients with a LARGER lambda ---
# Let's pick a larger lambda, e.g., 10 (like the slide)
large_lambda = 10
lasso_high_penalty = Lasso(alpha=large_lambda)
lasso_high_penalty.fit(X_train_scaled, y_train)

print(f"\nCoefficients using larger lambda (alpha = {large_lambda}):")
high_pen_coefs = lasso_high_penalty.coef_
coef_series_high = pd.Series(high_pen_coefs, index=feature_names)
# This is the second R command: `lasso.conf[lasso.conf != 0]`
print(coef_series_high[coef_series_high != 0])

# --- Final Prediction ---
# This is R's `mean((lasso.pred-y.test)^2)`
y_pred = lasso_cv.predict(X_test_scaled)
test_mse = mean_squared_error(y_test, y_pred)
print(f"\nTest MSE using best lambda: {test_mse:.2f}")

The “Game” of Regularization

First, let’s understand what these plots are showing. This is a “map” of a constrained optimization problem.

  • The Red Ellipses (RSS Contours): Think of these as contour lines on a topographic map.
    • The Center (\(\hat{\beta}\)): This point is the “bottom of the valley.” It represents the perfect, unconstrained solution—the standard Ordinary Least Squares (OLS) coefficients. This point has the lowest possible Residual Sum of Squares (RSS), or error.
    • The Lines: Every point on a single red ellipse has the exact same RSS. As the ellipses get bigger (moving away from the center \(\hat{\beta}\)), the error gets higher.
  • The Blue Shaded Area (Constraint Region): This is the “rule” of the game.
    • This is our “budget.” We are only allowed to pick a solution (\(\beta_1, \beta_2\)) from inside or on the boundary of this blue shape.
    • LASSO: The constraint is \(|\beta_1| + |\beta_2| \le s\). This equation forms a diamond (or a rotated square).
    • Ridge: The constraint is \(\beta_1^2 + \beta_2^2 \le s\). This equation forms a circle.
  • The Goal: Find the “best” point that is inside the blue area.
    • The “best” point is the one with the lowest possible error (RSS).
    • Geometrically, this means we start at the center (\(\hat{\beta}\)) and expand our ellipse outward. The very first point where the ellipse touches the blue constraint region is our solution.

Why LASSO Performs Variable Selection (The Diamond) 🎯

This is the most important concept. Look at the LASSO diagrams.

  • The Shape: The LASSO constraint is a diamond.
  • The Key Feature: This diamond has sharp corners (vertices). And most importantly, these corners lie exactly on the axes.
    • The top corner is at \((\beta_1=0, \beta_2=s)\).
    • The right corner is at \((\beta_1=s, \beta_2=0)\).
  • The “Collision”: Now, imagine the red ellipses (representing our error) expanding from the OLS solution (\(\hat{\beta}\)). They will almost always “hit” the blue diamond at one of its sharp corners.
    • Look at your textbook diagram (slide ...000304.png). The ellipse clearly makes contact with the diamond at the top corner, where \(\beta_1 = 0\).
    • Look at your example (slide ...000259.jpg). The center of the ellipses is at (4, 0.1). The first point on the diamond that the expanding ellipses hit is the corner at (2, 0). At this solution, the coefficient on the vertical axis is exactly 0.

Conclusion: Because the \(L_1\) “diamond” has corners on the axes, the optimal solution is very likely to land on one of them. When it does, the coefficient for the other axis is set to exactly zero. This is the variable selection property.

Why Ridge Regression Only Shrinks (The Circle) 🤏

Now, look at the Ridge regression diagram.

  • The Shape: The Ridge constraint is a circle.
  • The Key Feature: A circle is perfectly smooth and has no corners.
  • The “Collision”: Imagine the same ellipses expanding and hitting the blue circle. The contact point will be a tangent point.
    • Because the circle is round, this tangent point can be anywhere on its circumference.
    • It is extremely unlikely that the contact point will be exactly on an axis (e.g., at \((\beta_1=0, \beta_2=s)\)). This would only happen if the OLS solution \(\hat{\beta}\) was already perfectly aligned with that axis.
  • Conclusion: The Ridge solution will find a point where both \(\beta_1\) and \(\beta_2\) are non-zero. The coefficients are “shrunk” (pulled in from \(\hat{\beta}\) towards the origin), but they never become zero. This is why Ridge is called a “shrinkage” method, but not a “variable selection” method.

Summary: Diamond vs. Circle

| Feature | LASSO (\(L_1\) Norm) | Ridge (\(L_2\) Norm) |
| --- | --- | --- |
| Constraint shape | Diamond (or hyper-rhombus) | Circle (or hypersphere) |
| Key feature | Sharp corners on the axes | Smooth curve with no corners |
| Geometric solution | Ellipses hit the corners | Ellipses hit a smooth part |
| Result | Forces some coefficients to exactly 0 | Shrinks all coefficients towards 0 |
| Name | Variable selection | Shrinkage |

The “space meaning” is that the sharp corners of the \(L_1\) diamond are what make variable selection possible. The smooth circle of the \(L_2\) norm does not have these corners and thus cannot force coefficients to zero.
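To make the geometry concrete, here is a minimal matplotlib sketch of the two constraint regions with illustrative RSS contours. The OLS point, the ellipse shape, and the budget \(s\) are made-up numbers chosen only for the picture, not values from the slides.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative (assumed) OLS solution and budget s -- not from the slides
beta_ols = np.array([1.5, 1.0])
s = 1.0

b1, b2 = np.meshgrid(np.linspace(-2, 2.5, 400), np.linspace(-2, 2.5, 400))
rss = (b1 - beta_ols[0])**2 + 2 * (b2 - beta_ols[1])**2  # toy elliptical RSS surface

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
titles = ['LASSO: $|b_1| + |b_2| \\leq s$', 'Ridge: $b_1^2 + b_2^2 \\leq s^2$']
for ax, title in zip(axes, titles):
    ax.contour(b1, b2, rss, levels=10, colors='red', linewidths=0.8)  # RSS contours
    ax.plot(beta_ols[0], beta_ols[1], 'r*', markersize=12)            # OLS solution
    ax.axhline(0, color='grey', lw=0.5)
    ax.axvline(0, color='grey', lw=0.5)
    ax.set_title(title)
    ax.set_aspect('equal')

# L1 "diamond" and L2 "circle" constraint regions
t = np.linspace(0, 2 * np.pi, 400)
axes[0].fill([s, 0, -s, 0], [0, s, 0, -s], alpha=0.3)
axes[1].fill(s * np.cos(t), s * np.sin(t), alpha=0.3)
plt.show()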

8. Shrinkage Methods (Lasso vs. Ridge)

Core Concept: Shrinkage Methods

Both Ridge (L2) and Lasso (L1) are regularization techniques used to improve upon standard Ordinary Least Squares (OLS) regression.

Their main goal is to manage the bias-variance tradeoff. OLS often has low bias but very high variance, especially when you have many predictors (\(p\)) or when predictors are correlated. Ridge and Lasso improve prediction accuracy by shrinking the regression coefficients towards zero. This adds a small amount of bias but significantly reduces the variance, leading to a lower overall Test Mean Squared Error (MSE).

The Key Difference: Math & How They Shrink

The slides show that the two methods use different penalties, which leads to very different mathematical forms and practical outcomes.

  • Ridge Regression (L2 Penalty): Minimizes \(RSS + \lambda \sum_{j=1}^{p} \beta_j^2\)
  • Lasso Regression (L1 Penalty): Minimizes \(RSS + \lambda \sum_{j=1}^{p} |\beta_j|\)

Slide 80 provides the exact formulas for their coefficient estimates in a simple, orthogonal case (where predictors are independent):

Ridge Regression (Proportional Shrinkage)

  • Formula: \(\hat{\beta}_j^R = \hat{\beta}_j^{LSE} / (1 + \lambda)\)
  • What this means: Ridge shrinks every least squares coefficient by a proportional amount. It will make coefficients smaller, but it will never set them to exactly zero (unless \(\lambda\) is \(\infty\)).

Lasso Regression (Soft-Thresholding)

  • Formula: \(\hat{\beta}_j^L = \text{sign}(\hat{\beta}_j^{LSE})(|\hat{\beta}_j^{LSE}| - \lambda/2)_+\)
  • What this means: This is a “soft-thresholding” operator.
    • If the original coefficient \(\hat{\beta}_j^{LSE}\) is small (its absolute value is less than \(\lambda/2\)), Lasso sets it to exactly zero.
    • If the coefficient is large, Lasso subtracts \(\lambda/2\) from its absolute value, shrinking it towards zero.
  • Key Property: Because of this, Lasso performs automatic feature selection by eliminating predictors.

Important Images Explained

Most Important: Figure 6.10 (Slide 82)

This is the best visual for understanding the mathematical difference from the formulas above.

  • Left (Ridge): The red line shows the Ridge estimate vs. the OLS estimate. It’s a straight, diagonal line with a slope less than 1. It shrinks everything proportionally.
  • Right (Lasso): The red line shows the Lasso estimate. It’s “flat” at zero for a range, showing it sets small coefficients to zero. Then, it slopes up, but it’s shifted (it shrinks the large coefficients by a fixed amount).
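The two estimates above can be plotted directly. Here is a short sketch that reproduces the shape of Figure 6.10 using the orthogonal-case formulas from Slide 80; the value \(\lambda = 2\) is just an illustrative choice.

import numpy as np
import matplotlib.pyplot as plt

lam = 2.0  # illustrative penalty strength (assumption)
beta_ols = np.linspace(-4, 4, 400)

ridge = beta_ols / (1 + lam)                                            # proportional shrinkage
lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0)   # soft-thresholding

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, est, name in zip(axes, [ridge, lasso], ['Ridge', 'Lasso']):
    ax.plot(beta_ols, beta_ols, 'k--', label='OLS (no shrinkage)')
    ax.plot(beta_ols, est, 'r-', label=name)
    ax.set_xlabel('Least squares coefficient estimate')
    ax.set_title(name)
    ax.legend()
axes[0].set_ylabel('Shrunken coefficient estimate')
plt.show()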

Scenario 1: Figure 6.8 (Slide 76)

This plot shows what happens when all 45 predictors are truly related to the response.

  • Result (Slide 77): Ridge performs slightly better (has a lower minimum MSE, shown by the dotted purple line).
  • Why: Lasso’s assumption (that some coefficients are zero) is wrong in this case. By forcing some relevant predictors to zero, it adds too much bias. Ridge, by just shrinking all of them, finds a better balance.

Scenario 2: Figure 6.9 (Slide 78)

This plot shows the opposite scenario: only 2 out of 45 predictors are truly related (a “sparse” model).

  • Result: Lasso performs much better (its solid purple line has a much lower minimum MSE).
  • Why: Lasso’s assumption is correct. It successfully sets the 43 “noise” predictors to zero, which dramatically reduces variance, while correctly keeping the 2 important ones.

Python & Code Understanding

The slides don’t contain Python code, but they describe the exact concepts you would use, primarily in scikit-learn.

  • Implementing Ridge & Lasso:

    from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    # It's crucial to scale data before regularization
    # alpha is the same as the λ (lambda) in your slides

    # --- Ridge ---
    # The math for Ridge is a "closed-form solution" (Slide 80)
    # ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

    # --- Lasso ---
    # Lasso requires a numerical solver (like coordinate descent)
    # lasso_model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
  • The Soft-Thresholding Formula: The math from Slide 80, \(\text{sign}(y)(|y| - \lambda/2)_+\), is the core operation in the “coordinate descent” algorithm used to solve Lasso. You could write it in Python/Numpy:

    import numpy as np

    def soft_threshold(x, lambda_val):
        """Implements the Lasso soft-thresholding formula."""
        return np.sign(x) * np.maximum(0, np.abs(x) - (lambda_val / 2))

    # Example:
    # ols_coefficient = 1.5
    # threshold = 4.0
    # lasso_coefficient = soft_threshold(ols_coefficient, threshold)
    # print(lasso_coefficient) # Output: 0.0

    # ols_coefficient = 3.0
    # threshold = 4.0
    # lasso_coefficient = soft_threshold(ols_coefficient, threshold)
    # print(lasso_coefficient) # Output: 1.0 (it was 3.0, shrunk by 4/2 = 2)
  • Choosing \(\lambda\) (alpha): Slide 79 says to “Use cross validation to determine which one has better prediction.” In scikit-learn, this is done for you with RidgeCV and LassoCV, which automatically test a range of alpha values.

Summary: Lasso vs. Ridge

| Feature | Ridge (L2) | Lasso (L1) |
| --- | --- | --- |
| Penalty | \(L_2\) norm: \(\lambda \sum \beta_j^2\) | \(L_1\) norm: \(\lambda \sum |\beta_j|\) |
| Coefficient shrinkage | Proportional; shrinks all coefficients, but never to exactly zero. | Soft-thresholding; can force coefficients to be exactly zero. |
| Feature selection? | No | Yes, this is its main advantage. |
| Interpretability | Less interpretable (keeps all \(p\) variables). | More interpretable (produces a “sparse” model with fewer variables). |
| Best used when… | …most predictors are useful (e.g., Slide 76: 45/45 relevant). | …many predictors are “noise” and only a few are strong (e.g., Slide 78: 2/45 relevant). |
| Computation | Has a simple, closed-form solution. | Requires numerical optimization (e.g., coordinate descent). |

9. Shrinkage Methods (Ridge & LASSO)

Summary of Shrinkage Methods (Ridge & LASSO)

These slides introduce shrinkage methods, also known as regularization, a technique used in regression (like linear regression) to improve model performance. The main idea is to add a penalty to the model’s loss function to “shrink” the size of the coefficients. This helps to reduce model variance and prevent overfitting, especially when you have many features.

The two main methods discussed are Ridge Regression (\(L_2\) penalty) and LASSO (\(L_1\) penalty).

Key Mathematical Formulas

  1. Standard Linear Model: The problem starts with the standard linear regression model (from slide 1):

    \[ \mathbf{y} = \mathbf{X}\beta + \epsilon \]

    • \(\mathbf{y}\) is the \(n \times 1\) vector of observed outcomes.

    • \(\mathbf{X}\) is the \(n \times p\) matrix of \(p\) predictor features for \(n\) observations.
    • \(\beta\) is the \(p \times 1\) vector of coefficients (what we want to find).
    • \(\epsilon\) is the \(n \times 1\) vector of random errors.
    • The goal of standard “Ordinary Least Squares” (OLS) regression is to find the \(\beta\) that minimizes the loss: \(\|\mathbf{X}\beta - \mathbf{y}\|^2_2\).
  2. LASSO (L1 Regularization): LASSO (Least Absolute Shrinkage and Selection Operator) adds a penalty based on the absolute value of the coefficients (the \(L_1\)-norm). This is the key formula from slide 1:

    \[ \hat{\beta}(\lambda) \leftarrow \arg \min_{\beta} \left( \|\mathbf{X}\beta - \mathbf{y}\|^2_2 + \lambda\|\beta\|_1 \right) \]

    • \(\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|\)

    • \(\lambda\) (lambda) is the tuning parameter that controls the strength of the penalty. A larger \(\lambda\) means more shrinkage.
    • Key Property (Variable Selection): The \(L_1\) penalty can force some coefficients (\(\beta_j\)) to become exactly zero. This means LASSO simultaneously performs feature selection by automatically removing irrelevant predictors.
    • Support (Slide 1): The question “Can it recover the support of \(\beta\)?” is asking if LASSO can correctly identify the set of true non-zero coefficients (defined as \(S := \{j : \beta_j \neq 0\}\)).
  3. Ridge Regression (L2 Regularization): Ridge regression (mentioned on slide 2, shown on slide 3) adds a penalty based on the squared value of the coefficients (the \(L_2\)-norm).

    \[ \hat{\beta}(\lambda) \leftarrow \arg \min_{\beta} \left( \|\mathbf{X}\beta - \mathbf{y}\|^2_2 + \lambda\|\beta\|^2_2 \right) \]

    • \(\|\beta\|^2_2 = \sum_{j=1}^{p} \beta_j^2\)

    • Key Property (Shrinkage): The \(L_2\) penalty shrinks coefficients towards zero but never sets them to exactly zero (unless \(\lambda = \infty\)). It is effective at handling multicollinearity.

Important Images & Concepts

The most important images are the plots from slides 3 and 4. They illustrate the two most critical concepts: how to choose \(\lambda\) and what the penalty does to the coefficients.

Tuning Parameter Selection (Slides 3 & 4, Left Plots)

  • Problem: How do you find the best value for \(\lambda\)?
  • Solution: Cross-Validation (CV). The slides show 10-fold CV.
  • What the Plots Show: The left plots on slides 3 and 4 show the Cross-Validation Error (like MSE) for different values of the penalty.
    • The x-axis represents the penalty strength (either \(\lambda\) itself or a related measure like the shrinkage ratio \(\|\hat{\beta}_\lambda\|_1 / \|\hat{\beta}\|_1\)).
    • The y-axis is the prediction error.
    • The curve is typically U-shaped. The vertical dashed line marks the minimum of this curve. This minimum point corresponds to the optimal \(\lambda\), which provides the best balance between bias and variance, leading to the best-performing model on unseen data.

Coefficient Paths (Slides 3 & 4, Right Plots)

These “trace” plots are crucial for understanding the difference between Ridge and LASSO. They show how the value of each coefficient (y-axis) changes as the penalty strength (x-axis) changes.

  • Slide 3 (Ridge): As \(\lambda\) increases (moving right), all coefficient values are smoothly shrunk towards zero, but none of them actually hit zero.
  • Slide 4 (LASSO): As the penalty increases (moving from right to left, as the ratio \(s\) goes from 1.0 to 0.0), you can see coefficients “drop off” and become exactly zero one by one. The model with the optimal \(\lambda\) (vertical line) has selected only a few non-zero coefficients (the pink and teal lines), while all the grey lines have been set to zero. This is feature selection in action.

Key Discussion Points (Slide 2)

  • Non-linear models: You can apply these methods to non-linear models by first creating non-linear features (e.g., \(x_1^2\), \(x_2^2\), \(x_1 \cdot x_2\)) and then feeding them into a LASSO or Ridge model. The regularization will then select which of these linear or non-linear terms are important.
  • Correlated Features (Multicollinearity): The question “If \(x_j \approx x_k\), how does LASSO behave?” is a key weakness of LASSO.
    • LASSO: Tends to arbitrarily select one of the correlated features and set the others to zero. This can make the model unstable.
    • Ridge: Tends to shrink the coefficients of correlated features together, giving them similar (but smaller) values.
    • Elastic Net (not shown) is a hybrid of Ridge and LASSO that is often used to get the best of both worlds: it can select groups of correlated variables.

Python Code Understanding (using scikit-learn)

Here is how you would implement these concepts in Python.

# Import necessary libraries
import numpy as np
from sklearn.linear_model import Lasso, Ridge, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# --- Assume you have your data ---
# X: your feature matrix (e.g., shape 100, 20)
# y: your target vector (e.g., shape 100,)
# X, y = ... load your data ...

# 1. It's crucial to scale your data before regularization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Find the optimal lambda (alpha) using Cross-Validation
# scikit-learn uses 'alpha' instead of 'lambda' for the tuning parameter.

# --- For LASSO ---
# LassoCV automatically performs cross-validation (e.g., cv=10)
# to find the best alpha.
lasso_cv_model = LassoCV(cv=10, random_state=0)
lasso_cv_model.fit(X_scaled, y)

# Get the best alpha (lambda)
best_alpha_lasso = lasso_cv_model.alpha_
print(f"Optimal alpha (lambda) for LASSO: {best_alpha_lasso}")

# Get the final coefficients
lasso_coeffs = lasso_cv_model.coef_
print(f"LASSO coefficients: {lasso_coeffs}")
# You will see that many of these are exactly 0.0

# --- For Ridge ---
# RidgeCV works similarly. It's often good to test alphas on a log scale.
ridge_alphas = np.logspace(-3, 3, 100) # 100 values from 0.001 to 1000
ridge_cv_model = RidgeCV(alphas=ridge_alphas, store_cv_values=True)
ridge_cv_model.fit(X_scaled, y)

# Get the best alpha (lambda)
best_alpha_ridge = ridge_cv_model.alpha_
print(f"Optimal alpha (lambda) for Ridge: {best_alpha_ridge}")

# Get the final coefficients
ridge_coeffs = ridge_cv_model.coef_
print(f"Ridge coefficients: {ridge_coeffs}")
# You will see these are small, but not exactly zero.

Bias-variance tradeoff

Key Mathematical Formulas & Concepts

LASSO: Sign Consistency

This is the “ideal” scenario for LASSO. Sign consistency means that, with enough data, the LASSO model not only selects the correct set of features (it recovers the “support” \(S\)) but also correctly identifies the sign (positive or negative) of their coefficients.

  • The Goal (Slide 1):

    \[ \text{sign}(\hat{\beta}(\lambda)) = \text{sign}(\beta) \]

    This means the signs of our estimated coefficients \(\hat{\beta}(\lambda)\) match the signs of the true underlying coefficients \(\beta\).

  • The “Irrepresentable Condition” (Slide 1): This is the mathematical guarantee required for LASSO to achieve sign consistency.

    \[ \left\| \mathbf{X}_{S^c}^\top \mathbf{X}_S (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \text{sign}(\beta_S) \right\|_\infty < 1 \]

    • Plain English: This formula says that the irrelevant features (\(\mathbf{X}_{S^c}\)) cannot be too strongly correlated with the true, relevant features (\(\mathbf{X}_S\)).

    • If an irrelevant feature is very similar (highly correlated) to a true feature, LASSO can get “confused” and might pick the wrong one, or its estimate will be unstable. This condition fails.
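As a rough numerical illustration, the quantity in the Irrepresentable Condition can be computed directly with numpy. The design below (one irrelevant feature that nearly copies a true feature) and the assumed support and signs are made up for this sketch.

import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 6
S = [0, 1]                                          # assumed true support
X = rng.standard_normal((n, p))
X[:, 2] = X[:, 0] + 0.05 * rng.standard_normal(n)   # irrelevant feature ~ a true feature

X_S = X[:, S]
X_Sc = X[:, [j for j in range(p) if j not in S]]
sign_beta_S = np.array([1.0, 1.0])                  # assumed signs of the true coefficients

# || X_Sc^T X_S (X_S^T X_S)^{-1} sign(beta_S) ||_inf must be < 1 for sign consistency
irrep = np.abs(X_Sc.T @ X_S @ np.linalg.solve(X_S.T @ X_S, sign_beta_S)).max()
print(f"Irrepresentable quantity: {irrep:.2f} (needs to be < 1)")

With the near-duplicate feature included, this quantity sits around 1, so the condition is essentially violated, matching the intuition above.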

Ridge Regression: The Bias-Variance Tradeoff

  • The Formula (Slide 3):

    \[ \hat{\beta}_{\text{ridge}}(\lambda) \leftarrow \arg \min_{\beta} \left( \|\mathbf{y} - \mathbf{X}\beta\|^2 + \lambda\|\beta\|^2 \right) \]

    (Note: This is the \(L_2\) penalty, so \(\|\beta\|^2 = \sum \beta_j^2\).)

  • The Problem it Solves: Collinearity (Slide 2) When features are strongly correlated (e.g., \(x_i \approx x_j\)), regular methods fail:

    • LSE (OLS): Fails because the matrix \(\mathbf{X}^\top \mathbf{X}\) is “non-invertible” (or singular), so the math for the solution \(\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) breaks down.
    • LASSO: Fails because the Irrepresentable Condition is violated. LASSO will tend to arbitrarily pick one of the correlated features and set the others to zero.
  • The Ridge Solution (Slide 3):

    1. Always has a solution: Adding the \(\lambda\) penalty makes the matrix math work, even if \(\mathbf{X}^\top \mathbf{X}\) is non-invertible.
    2. Groups variables: This is the key takeaway. Instead of arbitrarily picking one feature, Ridge tends to shrink the coefficients of collinear variables together.
    3. Bias-Variance Tradeoff: Ridge introduces bias into the estimates (they are “wrong” on purpose) to massively reduce variance (they are more stable and less sensitive to the specific training data). This trade-off usually leads to a much lower overall error (Mean Squared Error).

Important Images & Key Takeaways

  1. Slide 2 (Collinearity Failures): This is the most important “problem” slide. It clearly explains why you can’t always use standard LSE or LASSO. The fact that all three methods (LSE, LASSO, Forward Selection) fail with strong collinearity motivates the need for Ridge.

  2. Slide 3 (Ridge Properties): This is the most important “solution” slide. The two most critical points are:

    • Always unique solution for λ > 0
    • Collinear variables tend to be grouped! (This is the “fix” for the problem on Slide 2).

Python Code Understanding

Let’s demonstrate the key difference (Slide 3) in how LASSO and Ridge handle collinear features.

We will create two features, x1 and x2, that are nearly identical.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# 1. Create a dataset with 2 strongly correlated features
np.random.seed(0)
n_samples = 100
# x1: a standard feature
x1 = np.random.randn(n_samples)
# x2: almost identical to x1
x2 = x1 + 0.01 * np.random.randn(n_samples)

# Combine into our feature matrix X
X = np.c_[x1, x2]

# y: The target variable (let's say y = 2*x1 + 2*x2)
y = 2 * x1 + 2 * x2 + np.random.randn(n_samples)

# 2. Fit LASSO (alpha is the same as lambda)
# We use a moderate alpha
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X, y)

# 3. Fit Ridge (alpha is the same as lambda)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X, y)

# 4. Compare the coefficients
print("--- Results for Correlated Features ---")
print(f"True Coefficients: [2.0, 2.0]")
print(f"LASSO Coefficients: {np.round(lasso_model.coef_, 2)}")
print(f"Ridge Coefficients: {np.round(ridge_model.coef_, 2)}")

Example Output:

--- Results for Correlated Features ---
True Coefficients: [2.0, 2.0]
LASSO Coefficients: [3.89 0. ]
Ridge Coefficients: [1.95 1.94]

Code Explanation:

  • LASSO: As predicted by the slides, LASSO failed to find the true model. It arbitrarily picked x1, gave it a large coefficient, and set x2 to zero. This is unstable and not what we wanted.
  • Ridge: As predicted by Slide 3, Ridge handled the collinearity perfectly. It identified that both x1 and x2 were important and “grouped” them by assigning them nearly identical, stable coefficients (1.95 and 1.94), which are very close to the true values of 2.0.

10. Elastic Net

Overall Summary

These slides introduce Elastic Net, a modern regression method that solves the major weaknesses of its two predecessors, Ridge and LASSO regression.

  • Ridge is good for collinearity (correlated features) but can’t do variable selection (it can’t set any feature’s coefficient to exactly zero).
  • LASSO is good for variable selection (it creates sparse models by setting coefficients to zero) but behaves unstably when features are correlated (it tends to randomly pick one and discard the others).

Elastic Net combines the L1 penalty of LASSO and the L2 penalty of Ridge. The result is a single, flexible model that:

  1. Performs variable selection (like LASSO).
  2. Handles correlated features stably by grouping them together (like Ridge).
  3. Can select more features than samples (\(p > n\)), which LASSO cannot do.

Slide 1: The Definition and Formula (File: ...020245.png)

This slide explains why Elastic Net was created and defines it mathematically.

  • The Problem: It states the exact trade-off:
    • “Ridge regression can handle collinearity, but cannot perform variable selection;”
    • “LASSO can perform variable selection, but performs poorly when collinearity;”
  • The Solution (The Formula): The core of the method is this optimization formula: \[\hat{\beta}_{eNet}(\lambda, \alpha) \leftarrow \arg \min_{\beta} \left( \underbrace{\|\mathbf{y} - \mathbf{X}\beta\|^2}_{\text{Loss}} + \lambda \left( \underbrace{\alpha\|\beta\|_1}_{\text{L1 Penalty}} + \underbrace{\frac{1-\alpha}{2}\|\beta\|_2^2}_{\text{L2 Penalty}} \right) \right)\]
  • Breaking Down the Formula:
    • \(\|\mathbf{y} - \mathbf{X}\beta\|^2\): This is the standard “Residual Sum of Squares” (RSS). We want to find coefficients (\(\beta\)) that make the model’s predictions (\(X\beta\)) as close as possible to the true values (\(y\)).
    • \(\lambda\) (Lambda): This is the master knob for total regularization strength. A larger \(\lambda\) means a bigger penalty, which “shrinks” all coefficients more.
    • \(\alpha\) (Alpha): This is the mixing parameter that balances L1 and L2. This is the key innovation.
      • \(\alpha\|\beta\|_1\): This is the L1 (LASSO) part. It forces weak coefficients to become exactly zero, thus selecting variables.
      • \(\frac{1-\alpha}{2}\|\beta\|_2^2\): This is the L2 (Ridge) part. It shrinks all coefficients and, crucially, encourages correlated features to have similar coefficients (the grouping effect).
  • The Special Cases:
    • If \(\alpha = 0\), the L1 term vanishes, and the model becomes pure Ridge Regression.
    • If \(\alpha = 1\), the L2 term vanishes, and the model becomes pure LASSO Regression.
    • If \(0 < \alpha < 1\), you get Elastic Net, which “encourages grouping of correlated variables” and “can perform variable selection.”

Slide 2: The Intuition and The Grouping Effect (File: ...020249.jpg)

This slide gives you the visual intuition and the practical proof of why Elastic Net works. It has two parts.

Part 1: The Three Graphs (Geometric Intuition)

These graphs show the constraint region (the shaded shape) for each penalty. The model tries to find the best coefficients (\(\theta_{opt}\)), and the final solution (the green dot) is the first point where the cost function (the blue ellipses) “touches” the constraint region.

  • L1 Norm (LASSO): The region is a diamond. Because of its sharp corners, the ellipses are very likely to hit a corner first. At a corner, one of the coefficients (e.g., \(\theta_1\)) is zero. This is a visual explanation of how LASSO creates sparsity (variable selection).
  • L2 Norm (Ridge): The region is a circle. It has no corners. The ellipses will hit a “smooth” point on the circle, shrinking both coefficients (\(\theta_1\) and \(\theta_2\)) but not setting either to zero. This is weight sharing.
  • L1 + L2 (Elastic Net): The region is a “rounded square”. It’s the perfect compromise.
    • It has “corners” (like LASSO) so it can still set coefficients to zero.
    • It has “curved edges” (like Ridge) so it’s more stable and handles correlated variables by finding a solution on an edge rather than a single sharp corner.

Part 2: The Formula (The Grouping Effect)

The text at the bottom explains Elastic Net’s “grouping effect.”

  • The Implication: “If \(x_j \approx x_k\), then \(\hat{\beta}_j \approx \hat{\beta}_k\).”
  • Meaning: If two features (\(x_j\) and \(x_k\)) are highly correlated (their values are very similar), Elastic Net will force their coefficients (\(\hat{\beta}_j\) and \(\hat{\beta}_k\)) to also be very similar.
  • Why this is good: This is the opposite of LASSO. LASSO would be unstable and might arbitrarily set \(\hat{\beta}_j\) to a large value and \(\hat{\beta}_k\) to zero. Elastic Net “groups” them: it will either keep both in the model with similar importance, or it will shrink both of them out of the model together. This is a much more stable and realistic result.
  • The Warning: “LASSO may be unstable in this case!” This directly highlights the problem that Elastic Net solves.
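To make the grouping effect concrete, here is a small sketch reusing the same kind of correlated-feature setup as the earlier LASSO vs. Ridge demo; the penalty values (alpha=1.0, l1_ratio=0.5) are arbitrary choices.

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Two nearly identical features, true model y = 2*x1 + 2*x2 + noise
rng = np.random.default_rng(0)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)
X = np.c_[x1, x2]
y = 2 * x1 + 2 * x2 + rng.standard_normal(n)

lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # 50/50 L1/L2 mix

print(f"LASSO coefficients:       {np.round(lasso.coef_, 2)}")   # tends to keep one, drop the other
print(f"Elastic Net coefficients: {np.round(enet.coef_, 2)}")    # tends to give both similar values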

Slide 3: The Feature Comparison Table (File: ...020255.png)

This table is your “cheat sheet” for choosing the right model. It compares Ridge, LASSO, and Elastic Net on all their key properties.

  • Penalty: Shows the L2, L1, and combined penalties.
  • Sparsity: Can the model set coefficients to 0?
    • Ridge: No ❌
    • LASSO: Yes ✅
    • Elastic Net: Yes ✅
  • Variable Selection: This is a crucial row.
    • LASSO: Yes ✅, BUT it has a major limitation: if you have more features than samples (\(p > n\)), LASSO can select at most \(n\) features.
    • Elastic Net: Yes ✅, and it can select more than \(n\) variables. This makes it the clear choice for “wide” data problems (e.g., in genomics, where \(p=20,000\) features and \(n=100\) samples).
  • Grouping Effect: How does it handle correlated features?
    • Ridge: Strong ✅
    • LASSO: Weak ❌ (it “picks one”)
    • Elastic Net: Strong ✅
  • Solution Uniqueness: Is the answer stable?
    • Ridge: Always ✅
    • LASSO: No ❌ (not if \(X\) is “rank-deficient,” e.g., \(p > n\) or correlated features)
    • Elastic Net: Always ✅ (as long as \(\alpha < 1\), the Ridge component guarantees a unique, stable solution).
  • Use Case: When should you use each?
    • Ridge: For prediction, especially with multicollinearity.
    • LASSO: For interpretability and creating sparse models (when you think only a few features matter).
    • Elastic Net: The best all-arounder. Use it for correlated predictors, when \(p \gg n\), or when you need both sparsity + stability.

Code Understanding (Python scikit-learn)

When you use this in Python, be aware of a common confusion in the parameter names:

| Concept (from your slides) | scikit-learn parameter | Description |
| --- | --- | --- |
| \(\lambda\) (Lambda) | alpha | The overall strength of regularization. |
| \(\alpha\) (Alpha) | l1_ratio | The mixing parameter between L1 and L2. |

Example: An l1_ratio of 0 is Ridge. An l1_ratio of 1 is LASSO. An l1_ratio of 0.5 is a 50/50 mix.

from sklearn.linear_model import ElasticNet, ElasticNetCV

# 1. Initialize a specific model
# alpha=0.1 is the overall strength (the slide's lambda); l1_ratio=0.5 is the L1/L2 mix (the slide's alpha)
model = ElasticNet(alpha=0.1, l1_ratio=0.5)

# 2. A much better way: Find the best parameters automatically
# This will test l1_ratios of 0.1, 0.5, and 0.9
# and automatically find the best 'alpha' (strength) for each.
cv_model = ElasticNetCV(
    l1_ratio=[.1, .5, .9],
    cv=5  # 5-fold cross-validation
)

# 3. Fit the model to your data (X_train, y_train)
# cv_model.fit(X_train, y_train)

# 4. See the best parameters it found
# print(f"Best l1_ratio (slide's alpha): {cv_model.l1_ratio_}")
# print(f"Best alpha (slide's lambda): {cv_model.alpha_}")

11. High-Dimensional Data Analysis

The Core Problem: Large \(p\), Small \(n\)

The slides introduce the challenge of high-dimensional data, which is defined by having many more features (predictors) \(p\) than observations (samples) \(n\). This is often written as \(p \gg n\).

  • Example: Predicting blood pressure (the response \(y\)) using millions of genetic markers (SNPs) as features \(X\), but only having data from a few hundred patients.
  • Troubles:
    • Overfitting: Models become “too flexible” and learn the noise in the training data, rather than the true underlying pattern.
    • Non-Unique Solution: When \(p > n\), the standard least squares linear regression model doesn’t even have a unique solution.
    • Misleading Metrics: This leads to a common symptom: a very small training error (or high \(R^2\)) but a very large test error.

Most Important Image: The Overfitting Trap (Figure 6.23)

Figure 6.23 (from the first uploaded image) is the most critical visual for understanding the problem. It shows what happens when you add features (variables) that are completely unrelated to the outcome.

  • Left Plot (R²): The \(R^2\) on the training data increases towards 1. This looks like a perfect fit.
  • Center Plot (Training MSE): The Mean Squared Error on the training data decreases to 0. This also looks perfect.
  • Right Plot (Test MSE): The Mean Squared Error on the test data (new, unseen data) explodes. This reveals the model is garbage and has just memorized the training set.

⚠️ This is the key takeaway: In high dimensions, \(R^2\) and training MSE are useless and misleading metrics for model quality.
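A quick simulation (synthetic data, not the textbook's) reproduces this pattern: fitting OLS on pure-noise features with \(n = 20\) makes training \(R^2\) climb toward 1 while test MSE deteriorates.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n_train, n_test = 20, 1000
y_train = rng.standard_normal(n_train)    # response is pure noise
y_test = rng.standard_normal(n_test)

for p in [1, 5, 10, 15, 19]:
    X_train = rng.standard_normal((n_train, p))   # completely irrelevant features
    X_test = rng.standard_normal((n_test, p))
    model = LinearRegression().fit(X_train, y_train)
    r2_train = model.score(X_train, y_train)
    mse_test = mean_squared_error(y_test, model.predict(X_test))
    print(f"p={p:2d}  training R^2={r2_train:.2f}  test MSE={mse_test:.2f}")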

The Solution: Regularization & Model Selection

To combat overfitting, we must use less flexible models. The main strategy is regularization (also called shrinkage), which involves adding a penalty term to the cost function to “shrink” the model coefficients (\(\beta\)).

Mathematical Formulas & Python Code 🐍

The standard Least Squares cost function you try to minimize is: \[\text{RSS} = \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 \quad \text{or} \quad \|y - X\beta\|^2_2\] This fails when \(p > n\). The solutions modify this:

A. Ridge Regression (\(L_2\) Penalty)

  • Concept: Shrinks all coefficients towards zero, but never to zero. It’s good when many features are related to the outcome.
  • Math Formula: \[\text{Minimize: } \left( \|y - X\beta\|^2_2 + \lambda \sum_{j=1}^p \beta_j^2 \right)\]
    • The \(\lambda \sum_{j=1}^p \beta_j^2\) is the \(L_2\) penalty.
    • \(\lambda\) (lambda) is a tuning parameter that controls the penalty strength. A larger \(\lambda\) means more shrinkage.
  • Python (Scikit-learn):
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # alpha is the lambda (λ) tuning parameter
    # In practice, choose it by cross-validation (e.g., cross_val_score or RidgeCV)
    ridge_model = Ridge(alpha=1.0)

    # Fit the model
    ridge_model.fit(X_train, y_train)

    # Evaluate using test error (e.g., MSE on test set)
    # NOT with training R-squared
    test_score = ridge_model.score(X_test, y_test)

B. The Lasso (\(L_1\) Penalty)

  • Concept: This is a very important method. The \(L_1\) penalty can force coefficients to be exactly zero. This means Lasso performs automatic feature selection, creating a sparse model.
  • Math Formula: \[\text{Minimize: } \left( \|y - X\beta\|^2_2 + \lambda \sum_{j=1}^p |\beta_j| \right)\]
    • The \(\lambda \sum_{j=1}^p |\beta_j|\) is the \(L_1\) penalty.
    • Again, \(\lambda\) is the tuning parameter.
  • Python (Scikit-learn):
    from sklearn.linear_model import Lasso

    # alpha is the lambda (λ) tuning parameter
    lasso_model = Lasso(alpha=0.1)

    # Fit the model
    lasso_model.fit(X_train, y_train)

    # The model automatically selects features
    # Coefficients that are zero were 'dropped'
    print(lasso_model.coef_)

C. Other Methods

The slides also mention:

  • Forward Stepwise Selection: A different approach where you start with no features and add them one by one, picking the one that improves the model most (based on a criterion like cross-validation error).
  • Principal Components Regression (PCR): A dimensionality reduction technique.

The Curse of Dimensionality (Figure 6.24)

This example (Figures 6.24 and its description) shows a more subtle problem.

  • Setup: A model with \(n=100\) observations and 20 true features.
  • Plots: They test Lasso by adding more and more irrelevant features:
    • \(p=20\) (Left): Lasso performs well. The lowest test MSE is found with minimal regularization.
    • \(p=50\) (Center): Lasso still works well, but it needs more regularization (a smaller “Degrees of Freedom”) to filter out the 30 junk features.
    • \(p=2000\) (Right): This is the curse of dimensionality. Even with a good method like Lasso, the 1,980 irrelevant features add so much noise that the model performs poorly regardless of the tuning parameter. The true signal is “lost in the noise.”

Summary: Cautions for \(p > n\)

The final slide gives the most important rules to follow:

  1. Beware Extreme Multicollinearity: When \(p > n\), your features are mathematically guaranteed to be linearly related, which breaks standard regression.
  2. Don’t Overstate Results: A model you find (e.g., with Lasso) is just one of many potentially good models.
  3. 🚫 DO NOT USE training \(R^2\), \(p\)-values, or training MSE to justify your model. As Figure 6.23 showed, they are misleading.
  4. ✅ DO USE test error and cross-validation error to choose your model and assess its performance.

The Core Problem: \(p \gg n\) (The “Troubles” Slide)

This slide (filename: ...020259.png) sets up the entire problem. The issue isn’t just “overfitting”; it’s a fundamental mathematical breakdown of standard methods.

  • “Large \(p\) makes our linear regression model too flexible”: This is an understatement. It leads to a problem called an underdetermined system.
  • “If \(p > n\), the LSE is not even uniquely determined”: This is the most important technical point.
    • Mathematical Reason: The standard solution for Ordinary Least Squares (OLS) is \(\hat{\beta} = (X^T X)^{-1} X^T y\).
    • \(X\) is the data matrix with \(n\) rows (observations) and \(p\) columns (features).
    • The matrix \(X^T X\) has dimensions \(p \times p\).
    • When \(p > n\), the \(X^T X\) matrix is singular, which means its determinant is zero and it cannot be inverted. The \((X^T X)^{-1}\) term does not exist.
    • “Extreme multicollinearity” (from slide ...020744.png) is the direct cause. When \(p > n\), the columns of \(X\) (the features) are guaranteed to be linearly dependent. There are infinite combinations of the features that can explain the data.
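A tiny numpy check of this point: when \(p > n\), the \(p \times p\) matrix \(X^T X\) has rank at most \(n\), so it cannot be inverted (the dimensions below are arbitrary).

import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                       # p > n
X = rng.standard_normal((n, p))

XtX = X.T @ X                       # p x p matrix
print(XtX.shape)                    # (50, 50)
print(np.linalg.matrix_rank(XtX))   # at most n = 10, so XtX is singular (not invertible)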

The Simplest Example: \(n=2\) (Figure 6.22)

This slide (filename: ...020728.png) is the perfect illustration of the “not uniquely determined” problem.

  • Left Plot (Low-D): Many points (\(n\)), only two parameters (\(p=2\): intercept \(\beta_0\) and slope \(\beta_1\)). The line is a “best fit” that balances the errors. The training error (RSS) is non-zero.
  • Right Plot (High-D): We have \(n=2\) observations and \(p=2\) parameters.
    • You have two equations (one for each point) and two unknowns (\(\beta_0\) and \(\beta_1\)).
    • The model has exactly enough flexibility to pass perfectly through both points.
    • The result is zero training error.
    • This “perfect” fit is an illusion. If you got a new data point, this line would almost certainly be a terrible predictor. This is the essence of overfitting.

The Consequence: Misleading Metrics (Figure 6.23)

This slide (filename: ...020730.png) scales up the problem from \(n=2\) to \(n=20\) and shows why you must be cautious.

  • The Setup: \(n=20\) observations. We start with 1 feature and add more and more irrelevant, junk features.
  • Left Plot (\(R^2\)): The \(R^2\) on the training data steadily increases towards 1 as we add features. This is because, by pure chance, each new junk feature can explain a tiny bit more of the noise in the training set.
  • Center Plot (Training MSE): The training error drops to 0. This is the same as the \(n=2\) plot. Once the number of features (\(p\)) gets close to the number of observations (\(n=20\)), the model can perfectly fit the 20 data points, even if the features are random noise.
  • Right Plot (Test MSE): This is the “truth.” The actual error on new, unseen data gets worse and worse. By adding noise features, we are just “memorizing” the training set, and our model’s ability to generalize is destroyed.
  • Key Lesson: (from slide ...020744.png) This is why you must “Avoid using… \(p\)-values, \(R^2\), or other traditional measures of model on training as evidence of good fit.” They are guaranteed to lie to you when \(p > n\).

The Solutions (The “Deal with…” Slide)

This slide (filename: ...020734.png) lists the strategies to fix this. The core idea is regularization (or shrinkage). We add a “penalty” to the cost function to stop the \(\beta\) coefficients from getting too large or too numerous.

A. Ridge Regression (\(L_2\) Penalty)

  • Concept: Keeps all \(p\) features, but shrinks their coefficients. It’s excellent for handling multicollinearity.
  • Math: \(\text{Minimize: } \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^p \beta_j^2\)
    • The first part is the standard RSS.
    • The \(\lambda \sum \beta_j^2\) is the \(L_2\) penalty. It punishes large coefficient values.
  • \(\lambda\) (Lambda): This is the tuning parameter.
    • If \(\lambda=0\), it’s just OLS (which fails).
    • If \(\lambda \to \infty\), all \(\beta\)’s are shrunk to 0.
    • The right \(\lambda\) is chosen via cross-validation.

B. The Lasso (\(L_1\) Penalty)

  • Concept: This is often preferred because it performs automatic feature selection. It shrinks many coefficients to be exactly zero.
  • Math: \(\text{Minimize: } \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^p |\beta_j|\)
    • The \(\lambda \sum |\beta_j|\) is the \(L_1\) penalty. This absolute value penalty is what allows coefficients to become exactly 0.
  • Benefit: The final model is sparse (e.g., it might say “out of 2,000 features, only these 15 matter”).

C. Tuning Parameter Choice (The Real Work)

How do you pick the best \(\lambda\)? You must use the data you have. The slides mention this and “cross validation error” (from ...020744.png).

  • Python Code (Scikit-learn): You don’t just guess alpha (which is \(\lambda\) in scikit-learn). You use a tool like LassoCV or GridSearchCV to find the best one.
    from sklearn.linear_model import LassoCV
    from sklearn.datasets import make_regression
    import numpy as np

    # Create a high-dimensional dataset
    X, y = make_regression(n_samples=100, n_features=500, n_informative=10, noise=0.1)

    # LassoCV automatically performs cross-validation to find the best alpha (lambda)
    # cv=10 means 10-fold cross-validation
    lasso_cv_model = LassoCV(cv=10, random_state=0, max_iter=10000)

    # Fit the model
    lasso_cv_model.fit(X, y)

    # This is the best lambda (alpha) it found:
    print(f"Best alpha (lambda): {lasso_cv_model.alpha_}")

    # You can now see the coefficients
    # Most of the 500 coefficients will be 0.0
    print(f"Number of non-zero features: {np.sum(lasso_cv_model.coef_ != 0)}")

A Final Warning: The Curse of Dimensionality (Figure 6.24)

This final set of slides (filenames: ...020738.png and ...020741.jpg) provides a crucial, subtle warning: Regularization is not magic.

  • The Setup: \(n=100\) observations. There are 20 real features that truly affect the response.
  • The Experiment: They run Lasso three times, adding more and more noise features:
    • Left Plot (\(p=20\)): All 20 features are real. The lowest test MSE is found with minimal regularization (high “Degrees of Freedom,” meaning many non-zero coefficients). This makes sense; you want to keep all 20 real features.
    • Center Plot (\(p=50\)): Now we have 20 real features + 30 noise features. Lasso still works! The best model is found with more regularization (fewer “Degrees of Freedom”). Lasso successfully “zeroed out” many of the 30 noise features.
    • Right Plot (\(p=2000\)): This is the curse of dimensionality. We have 20 real features + 1980 noise features. The noise has completely overwhelmed the signal. Lasso fails. The test MSE is high no matter what tuning parameter you choose. The model cannot distinguish the 20 real features from the 1980 junk ones.

Final Takeaway: Even with advanced methods like Lasso, if your \(p \gg n\) problem is too extreme (i.e., the signal-to-noise ratio is too low), it may be impossible to build a good predictive model.

The Goal: “Collaborative Filtering”

The first slide (...021218.png) uses the term Collaborative Filtering. This is the key concept. The model “collaborates” by using the ratings of all users to fill in the blanks for a single user.

  • How it works: The model assumes your “taste” (vector \(\mathbf{u}_i\)) can be described as a combination of \(r\) “latent features” (e.g., \(r=3\): % action, % comedy, % drama). It also assumes each movie (vector \(\mathbf{v}_j\)) has a profile on these same features.
  • Your predicted rating for a movie is the dot product of your taste vector and the movie’s feature vector.
  • The model finds the best “taste” vectors \(\mathbf{U}\) and “movie” vectors \(\mathbf{V}\) that explain all the known ratings simultaneously. It’s collaborative because Lee’s ratings help define the features of “Bullet Train” (\(\mathbf{v}_2\)), which in turn helps predict Yang’s rating for that same movie.
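A minimal numpy sketch of this idea, with made-up toy dimensions: each predicted rating is the dot product \(\mathbf{u}_i^\top \mathbf{v}_j\), so the full predicted rating matrix is \(\mathbf{U}\mathbf{V}^\top\).

import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 5, 4, 3            # 5 users, 4 movies, 3 latent features (toy sizes)

U = rng.random((d1, r))        # user "taste" vectors u_i (rows)
V = rng.random((d2, r))        # movie "feature" vectors v_j (rows)

M_hat = U @ V.T                # rank <= r matrix of predicted ratings
print(M_hat.shape)             # (5, 4)
print(M_hat[0, 1])             # predicted rating of user 0 for movie 1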

The Hard Problem (and its 2 Flavors)

The second slide (...021222.png) presents the intuitive, but computationally very hard, way to frame the problem.

Detail 1: Noise vs. No Noise

The slide shows \(\mathbf{Y} = \mathbf{M} + \mathbf{E}\). This is critical.

  • \(\mathbf{M}\) is the “true,” “clean,” underlying low-rank matrix of everyone’s “true” preferences.
  • \(\mathbf{E}\) is a matrix of random noise (e.g., your true rating is 4.3, but you entered a 4; or you were in a bad mood and rated a 3).
  • \(\mathbf{Y}\) is the noisy data we actually observe.

Because of this noise, we don’t expect to find a matrix \(\mathbf{N}\) that perfectly matches our data. Instead, we try to find a low-rank \(\mathbf{N}\) that is as close as possible. This leads to the formula: \[\underset{\text{rank}(\mathbf{N}) \le r}{\text{minimize}} \quad \left\| \mathcal{P}_{\mathcal{O}}(\mathbf{Y} - \mathbf{N}) \right\|_{\text{F}}^2\] This says: “Find a matrix \(\mathbf{N}\) (of rank \(r\) or less) that minimizes the sum of squared errors only on the ratings we observed (\(\mathcal{O}\)).”

Detail 2: Why is \(\text{rank}(\mathbf{N}) \le r\) a “Non-convex constraint”?

This is the “difficult to optimize” part. A convex problem is (simplistically) one with a single valley, making it easy to find the single lowest point. A non-convex problem has many local valleys, and an algorithm can get stuck in a “pretty good” valley instead of the “best” one.

The rank constraint is non-convex. For example, the average of two rank-1 matrices is not necessarily a rank-1 matrix (it could be rank-2). This lack of a “smooth valley” property makes the problem NP-hard.
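Here is a one-line numerical version of the “average of two rank-1 matrices” remark:

import numpy as np

A = np.outer([1, 0], [1, 0])   # rank-1 matrix
B = np.outer([0, 1], [0, 1])   # rank-1 matrix
C = 0.5 * (A + B)              # convex combination of two rank-1 matrices

print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(B), np.linalg.matrix_rank(C))  # 1 1 2

Since the midpoint of two rank-1 matrices can have rank 2, the set of matrices with rank at most 1 is not convex.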

Detail 3: The Number of Parameters: \(r(d_1 + d_2)\)

The slide asks, “how many entries are needed?” The answer is based on the number of unknown parameters.

  • A rank-\(r\) matrix \(\mathbf{M}\) can be factored into \(\mathbf{U}\) (which is \(d_1 \times r\)) and \(\mathbf{V}^T\) (which is \(r \times d_2\)).
  • The number of entries in \(\mathbf{U}\) is \(d_1 \times r\).
  • The number of entries in \(\mathbf{V}\) is \(d_2 \times r\).
  • Total “unknowns” to solve for: \(d_1 r + d_2 r = r(d_1 + d_2)\).
  • This means we must have at least \(r(d_1 + d_2)\) observed ratings to have any hope of uniquely solving for \(\mathbf{U}\) and \(\mathbf{V}\). If our number of observations \(|\mathcal{O}|\) is less than this, the problem is hopelessly underdetermined.

The “Magic” Solution: Convex Relaxation

The final slide (...021225.png) presents the groundbreaking solution from Candès and Recht. This solution cleverly changes the problem to one that is convex and solvable.

Detail 1: The L1-Norm Analogy (This is the most important concept)

This is the key to understanding why this works.

  • In Vectors (Lasso):
    • Hard Problem: Find the sparsest vector \(\beta\) (fewest non-zeros). This is \(L_0\) norm, \(\text{minimize } \|\beta\|_0\). This is non-convex.
    • Easy Problem: Minimize the \(L_1\) norm, \(\text{minimize } \|\beta\|_1 = \sum |\beta_j|\). This is convex, and it’s a “relaxation” that also produces sparse solutions.
  • In Matrices (Matrix Completion):
    • Hard Problem: Find the lowest-rank matrix \(\mathbf{X}\). Rank is the number of non-zero singular values. This is \(\text{minimize } \text{rank}(\mathbf{X})\). This is non-convex.
    • Easy Problem: Minimize the Nuclear Norm, \(\text{minimize } \|\mathbf{X}\|_* = \sum \sigma_i(\mathbf{X})\) (where \(\sigma_i\) are the singular values). This is convex, and it’s the “matrix equivalent” of the \(L_1\) norm. It’s a relaxation that also produces low-rank solutions.

Detail 2: Noiseless vs. Noisy (Again)

Notice the constraint in this new problem: \[\text{Minimize } \quad \|\mathbf{X}\|_*\] \[\text{Subject to } \quad X_{ij} = M_{ij}, \quad (i, j) \in \mathcal{O}\]

This formulation is for the noiseless case. It assumes the \(M_{ij}\) we observed are perfectly accurate. It demands that our solution \(\mathbf{X}\) exactly matches the known ratings. This is different from the optimization problem on the previous slide, which just tried to get close to the noisy data \(\mathbf{Y}\).

(In practice, you solve a noisy-aware version that combines both ideas, but the slide shows the original, “exact completion” problem.)
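One common practical way to handle the noisy version is an iterative “fill in, then soft-threshold the singular values” scheme (often called soft-impute). The sketch below is a rough numpy illustration on synthetic data; the threshold, iteration count, and sampling rate are arbitrary choices, not part of the slides.

import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 40, 30, 2
M = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))  # true low-rank matrix
mask = rng.random((d1, d2)) < 0.5                                # observed entries O (~50%)
Y = M + 0.01 * rng.standard_normal((d1, d2))                     # noisy observations

X = np.zeros((d1, d2))
tau = 1.0                        # singular-value threshold (arbitrary)
for _ in range(200):
    Z = np.where(mask, Y, X)     # keep observed entries, fill the rest with current estimate
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    X = (U * np.maximum(s - tau, 0)) @ Vt   # soft-threshold singular values (nuclear-norm surrogate)

print(f"Relative recovery error: {np.linalg.norm(X - M) / np.linalg.norm(M):.3f}")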

Detail 3: The Guarantee (What the math at the bottom means)

\[\text{If } \mathcal{O} \text{ is randomly sampled and } |\mathcal{O}| \gg r(d_1+d_2)\log(d_1+d_2), \text{ then the solution is unique and equals } \mathbf{M}.\]

This is the punchline. The Candès paper proved that if you have enough (but still very few) randomly sampled ratings, solving this easy convex problem (minimizing the nuclear norm) will magically give you the exact, true, low-rank matrix \(\mathbf{M}\).

  • \(|\mathcal{O}| \gg r(d_1+d_2)\): This part makes sense. We need at least as many observations as our \(r(d_1+d_2)\) degrees of freedom.
  • \(\log(d_1+d_2)\): This “log” factor is the “price” we pay for not knowing where the information is. It’s an astonishingly small price.
  • Example: For a 1,000,000 user x 10,000 movie matrix (like Netflix) with \(r=10\), you don’t need \(\approx 10^{10}\) ratings. You need a number closer to \(10 \times (10^6 + 10^4) \times \log(\dots)\), which is dramatically smaller. This is why this method is practical.