
MSDM 5054 - Statistical Machine Learning-L5


Lecturer: Prof. XIA DONG

1. Resampling

Resampling is a statistical tool used to assess the accuracy of models. Its main goal here is to estimate the test error (a model’s performance on new, unseen data), because the training error is overly optimistic due to overfitting.


Key Concepts

  • Resampling: The process of repeatedly drawing samples from a dataset. The two main types mentioned are Cross-validation (to estimate model test error) and Bootstrap (to quantify the uncertainty of estimates).
  • Data Splitting (Ideal Scenario): In a “data-rich” situation, you split your data into three parts:
    1. Training Data: Used to fit and train the parameters of various models.
    2. Validation Data: Used to assess the trained models, tune hyperparameters (e.g., choose the polynomial degree), and select the best model. This helps prevent overfitting.
    3. Test Data: Used only once, on the final selected model, to get an unbiased estimate of its real-world performance.
  • Validation vs. Test Data: The slides emphasize this difference (Slide 7). The validation set is part of the model-building and selection process. The test set is kept separate and is only used for the final report card after all decisions are made.

The Validation Set Approach

This is the simplest cross-validation method.

  1. Split: The total dataset is randomly divided into two parts: a training set and a validation set (often a 50/50 or 70/30 split).
  2. Train: Various models are fit only on the training set.
  3. Validate: The performance of each trained model is evaluated using the validation set.
  4. Select: The model with the best performance (e.g., the lowest error) on the validation set is chosen as the final model.

Important Image: Schematic (Slide 10)

This diagram clearly shows a set of \(n\) observations being randomly split into a training set (blue, with observations 7, 22, 13) and a validation set (beige, with observation 91). The model learns from the blue set and is tested on the beige set.

Example: Auto Data (Formulas & Code)

The slides use the Auto dataset to decide the best polynomial degree to predict mpg from horsepower.

Mathematical Models

The models being compared are polynomials of different degrees. For example:

  • Linear: \(mpg = \beta_0 + \beta_1(horsepower)\)

  • Quadratic: \(mpg = \beta_0 + \beta_1(horsepower) + \beta_2(horsepower)^2\)

  • Cubic: \(mpg = \beta_0 + \beta_1(horsepower) + \beta_2(horsepower)^2 + \beta_3(horsepower)^3\)


The performance metric used is the Mean Squared Error (MSE) on the validation set: \[MSE_{val} = \frac{1}{n_{val}} \sum_{i \in val} (y_i - \hat{f}(x_i))^2\] where \(n_{val}\) is the number of observations in the validation set, \(y_i\) is the true mpg value, and \(\hat{f}(x_i)\) is the model’s prediction for the \(i\)-th observation in the validation set.

Important Image: Polynomial Fits (Slide 8)

This plot is crucial. It shows the Auto data with linear (red), quadratic (green), and cubic (blue) regression lines.

  • The linear fit is clearly poor.
  • The quadratic and cubic fits follow the data’s curve much better.
  • The inset box shows the MSE calculated on the full dataset (this is training MSE):
    • Linear MSE: ~26.42
    • Quadratic MSE: ~21.60
    • Cubic MSE: ~21.51

This suggests a non-linear fit is necessary, but it doesn’t tell us which one will generalize better.

Code Analysis

The slides show two different approaches in code:

1. Python Code (Slide 9): Model Selection Criteria

  • What it does: This Python code (using pandas and statsmodels) does not implement the validation set approach. Instead, it fits polynomial models (degrees 1 through 5) to the entire dataset.
  • How it works: It calculates statistical criteria like BIC, Mallows’ \(C_p\), and Adjusted \(R^2\). These are mathematical adjustments to the training error that estimate the test error without needing a validation set.
  • Key line (logic): sm.OLS(y, X).fit() is used to fit the model, and then metrics like model.bic and model.rsquared_adj are extracted.
  • Result: The table shows that the model with [horsepower, horsepower2] (quadratic) has the lowest BIC and \(C_p\) values, suggesting it’s the best model according to these criteria.

2. R Code (Slides 14 & 15): The Validation Set Approach

  • What it does: This R code directly implements the validation set approach described on Slide 13.
  • How it works:
    1. set.seed(...): Sets a random seed to make the split reproducible.
    2. train=sample(392, 196): Randomly selects 196 indices (out of 392) to be the training set.
    3. lm.fit=lm(mpg~poly(horsepower, 2), ..., subset=train): Fits a quadratic model only using the train data.
    4. mean((mpg-predict(lm.fit,Auto))[-train]^2): This is the key calculation.
      • predict(lm.fit, Auto): Predicts mpg for all data.
      • [-train]: Selects only the predictions for the validation set (the data not in train).
      • mean(...): Calculates the MSE on the validation set.
  • Result: The code is run three times with different seeds (1, 2022, 1997).
    • Seed 1: Quadratic MSE (18.71) is lowest.
    • Seed 2022: Quadratic MSE (19.70) is lowest.
    • Seed 1997: Quadratic MSE (19.08) is lowest.
  • Main Takeaway: In all random splits, the quadratic model gives the lowest validation set MSE. This provides evidence that the quadratic model is the best choice for generalizing to new data. The fact that the MSE values change with each seed also highlights a key disadvantage of this simple method: the results can vary depending on the random split.

2. The Validation Set Approach

This method is a simple way to estimate a model’s performance on new, unseen data (the “test error”). The core idea is to randomly split your available data into two parts:

  1. Training Set: Used to fit (or “train”) your model.
  2. Validation Set (or Test Set): Used to evaluate the trained model’s performance. You calculate the error (like Mean Squared Error) on this set.

Python Code Explained (Slide 1)

The first slide shows a Python example using the Auto dataset to predict mpg from horsepower.

  1. Setup & Data Loading:
    • import statements load libraries like pandas (for data), sklearn.model_selection.train_test_split (the key function for this method), and sklearn.linear_model.LinearRegression.
    • Auto = pd.read_csv(...) loads the data.
    • X = Auto['horsepower'].values and y = Auto['mpg'].values select the variables of interest.
  2. The Split:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=7)
    • This is the most important line for this method. It splits the data X and y into training and testing (validation) sets.
    • train_size=0.5 means 50% of the data is for training and 50% is for validation.
    • random_state=7 ensures the split is “random” but “reproducible” (the same seed always produces the same split). Note that the slide writes random_state=007, but an integer literal with a leading zero is a syntax error in Python 3, so the seed should be written as 7.
  3. Model Fitting & Evaluation:
    • The code fits three different polynomial models, but it only uses the training data (X_train, y_train) to do so.
    • Linear (Degree 1): A simple LinearRegression.
    • Quadratic (Degree 2): Uses PolynomialFeatures(2) to create \(x\) and \(x^2\) terms, then fits a linear model to them.
    • Cubic (Degree 3): Uses PolynomialFeatures(3) to create \(x\), \(x^2\), and \(x^3\) terms.
    • It then calculates the Mean Squared Error (MSE) for all three models using the test data (X_test, y_test).
  4. Results (from the text on the slide):
    • Linear MSE: \(\approx 23.3\)
    • Quadratic MSE: \(\approx 19.4\)
    • Cubic MSE: \(\approx 19.4\)
    • Conclusion: The quadratic model gives a significantly lower error than the linear model. The cubic model does not offer any real improvement over the quadratic one. (A minimal sketch of this workflow follows below.)

Key Images: The Problem with a Single Split

The most important images are on slide 9 (labeled “Figure” and “Page 20”).

  • Plot on the Left (Single Split): This graph shows the validation MSE for polynomial degrees 1 through 10, based on the single random split from the R code (slide 2). Just like the Python example, it shows that the MSE drops sharply from degree 1 to 2, and then stays relatively low. Based on this one chart, you might pick degree 2 (quadratic) as the best model.


  • Plot on the Right (Ten Splits): This is the most critical plot. It shows the results of repeating the entire process 10 times, each with a new random split (from R code on slide 3).
    • You can see 10 different error curves.
    • While they all agree that degree 1 (linear) is bad, they do not agree on the best model. Some curves suggest degree 2 is best, others suggest 3, 4, or even 6.

Summary of Drawbacks (Slides 7, 8, 9, 23, 25)

The slides repeatedly emphasize the two main drawbacks of this simple validation set approach:

  1. High Variability: The estimated test MSE can be highly variable, depending on which observations happen to land in the training set versus the validation set. The plot with 10 curves (slide 9, right) proves this perfectly.

  2. Overestimation of Test Error:

    • The model is only trained on a subset (e.g., 50%) of the available data. The validation data is “wasted” and not used for model building.
    • Statistical methods tend to perform worse when trained on fewer observations.
    • Therefore, the model trained on just the training set is likely worse than a model trained on the entire dataset.
    • This “worse” model will have a higher error rate on the validation set. This means the validation set MSE tends to overestimate the true test error you would get from a model trained on all your data.

3. Cross-Validation: The Solution

The slides introduce Cross-Validation (CV) as the method to overcome these drawbacks. The core idea is to use all data points for both training and validation, just at different times.

Leave-One-Out Cross-Validation (LOOCV)

This is the first type of CV introduced (slide 10, page 26). For a dataset with \(n\) data points:

  1. Hold out the 1st data point (this is your validation set).
  2. Train the model on the other \(n-1\) data points.
  3. Calculate the error (e.g., \(\text{MSE}_1\)) using only that 1st held-out point.
  4. Repeat this \(n\) times, holding out the 2nd point, then the 3rd, and so on, until every point has been used as the validation set exactly once.
  5. Your final test error estimate is the average of all \(n\) errors.

Key Formula (from Slide 10)

The formula for the \(n\)-fold LOOCV error estimate is: \[\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{MSE}_i\]

Where:

  • \(n\) is the total number of data points.
  • \(\text{MSE}_i\) is the Mean Squared Error calculated on the \(i\)-th data point when it was held out.

3. What is LOOCV (Leave-One-Out Cross-Validation)

Leave-One-Out Cross-Validation (LOOCV) is a method for estimating the test error of a model. For a dataset with \(n\) observations, you:

  1. Fit the model \(n\) times.
  2. For each fit \(i\) (from \(1\) to \(n\)), you train the model on all data points except for observation \(i\).
  3. You then use this trained model to make a prediction for the single observation \(i\) that was left out.
  4. The final LOOCV error is the average of the \(n\) prediction errors (typically the Mean Squared Error, or MSE).

This process is shown visually in the slide titled “LOOCV” (slide 27), which is a key image for understanding the concept.

Pros & Cons (from slide 28):

  • Pro: It has low bias because the training set (\(n-1\) samples) is almost identical to the full dataset.
  • Pro: It produces a stable, non-random error estimate (unlike \(k\)-fold CV, which depends on the random fold assignments).
  • Con: It can be extremely computationally expensive, as the model must be refit \(n\) times.
  • Con: The \(n\) error estimates can be highly correlated, which can sometimes lead to high variance in the final \(CV\) estimate.

Key Mathematical Formulas

The main challenge of LOOCV (being computationally expensive) has a very efficient solution for linear models.

1. The Standard (Slow) Formula

As defined on slide 33, the LOOCV estimate of the MSE is:

\[CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i^{(i)})^2\]

  • \(y_i\) is the true value of the \(i\)-th observation.
  • \(\hat{y}_i^{(i)}\) is the predicted value for \(y_i\) from a model trained on all data except observation \(i\).

Calculating \(\hat{y}_i^{(i)}\) requires refitting the model \(n\) times.

2. The Shortcut (Fast) Formula

Slide 34 provides a much simpler formula that only requires fitting the model once on the entire dataset:

\[CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2\]

  • \(\hat{y}_i\) is the prediction for \(y_i\) from the model trained on all \(n\) data points.
  • \(h_i\) is the leverage of the \(i\)-th observation.

3. What is Leverage (\(h_i\))?

Slide 35 defines leverage:

  • Hat Matrix (\(\mathbf{H}\)): In a linear model, the fitted values \(\hat{\mathbf{y}}\) are related to the true values \(\mathbf{y}\) by the hat matrix: \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\).

  • Formula: The hat matrix is defined as \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\).

  • Leverage (\(h_i\)): The leverage for the \(i\)-th observation is simply the \(i\)-th diagonal element of the hat matrix, \(h_{ii}\) (often just written as \(h_i\)).

    • \(h_i = \mathbf{x}_i^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_i\)
  • Meaning: Leverage measures how “influential” an observation’s \(x_i\) value is in determining its own predicted value \(\hat{y}_i\). A high leverage score means that point has a lot of influence on the model’s fit.


This shortcut formula is extremely important because it makes LOOCV as fast to compute as a single model fit.
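A minimal sketch of the shortcut, assuming synthetic data in place of the Auto example (the hat-matrix algebra is exactly the formulas above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])          # design matrix with an intercept
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # single full-data OLS fit
residuals = y - X @ beta_hat                  # e_i
H = X @ np.linalg.solve(X.T @ X, X.T)         # hat matrix H = X (X^T X)^{-1} X^T
h = np.diag(H)                                # leverages h_i

cv_n = np.mean((residuals / (1 - h)) ** 2)    # CV_(n) from the shortcut formula
print(f'LOOCV MSE from one fit: {cv_n:.4f}')
```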

Python Code Explained (Slide 29)

This slide shows how to use LOOCV to select the best polynomial degree for predicting mpg from horsepower.

  1. Imports: It imports standard libraries (pandas, matplotlib) and key modules from sklearn:
    • LinearRegression: The model to be fit.
    • PolynomialFeatures: A tool to create polynomial terms (e.g., \(x, x^2, x^3\)).
    • LeaveOneOut: The LOOCV cross-validation strategy object.
    • cross_val_score: A function that automatically runs a cross-validation test.
  2. Setup:
    • It loads the Auto.csv data.
    • It defines \(X\) (horsepower) and \(y\) (mpg).
    • It creates a LeaveOneOut object: loo = LeaveOneOut().
  3. Looping through Degrees:
    • The code loops degree from 1 to 10.
    • make_pipeline: For each degree, it creates a model using make_pipeline. This pipeline is a crucial concept:
      • It first runs PolynomialFeatures(degree) to transform \(X\) into \([X, X^2, ..., X^{\text{degree}}]\).
      • It then feeds those features into LinearRegression() to fit the model.
    • cross_val_score: This is the most important line.
      • scores = cross_val_score(model, X, y, cv=loo, scoring='neg_mean_squared_error')
      • This function automatically does the entire LOOCV process. It takes the model (the pipeline), the data \(X\) and \(y\), and the CV strategy (cv=loo).
      • Caution: cross_val_score does not use the “fast” leverage shortcut internally; it simply clones and refits the pipeline once per split, so with cv=loo it really does fit the model \(n\) times.
      • It uses scoring='neg_mean_squared_error' because the scoring function assumes “higher is better.” By calculating the negative MSE, the best model will have the highest score (i.e., closest to 0).
    • Storing Results: It calculates the mean of the scores (which is the \(CV_{(n)}\)) and stores it.
  4. Visualization:
    • The code then plots the final cv_errors (after flipping the sign back to positive) against the degree.
    • The resulting plot (also on slide 32) shows the test MSE, allowing you to visually pick the best degree (where the error is minimized). A hedged reconstruction of this loop follows below.
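A hedged reconstruction of the loop just described (Auto.csv and its column names are assumptions carried over from the slides):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneOut, cross_val_score

Auto = pd.read_csv('Auto.csv')
X = Auto[['horsepower']].values
y = Auto['mpg'].values

loo = LeaveOneOut()
degrees = range(1, 11)
cv_errors = []
for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # One score per held-out observation; scores are negated MSEs, so flip the sign.
    scores = cross_val_score(model, X, y, cv=loo,
                             scoring='neg_mean_squared_error')
    cv_errors.append(-scores.mean())

plt.plot(degrees, cv_errors, marker='o')
plt.xlabel('Degree of polynomial')
plt.ylabel('LOOCV MSE')
plt.show()
```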

Important Images

  • Slide 27 (.../103628.png): This is the best conceptual image. It visually demonstrates how LOOCV splits the data \(n\) times, with each observation getting one turn as the validation set.

  • Slide 34 (.../103711.png): This slide presents the most important formula: the “Easy formula” or shortcut, \(CV_{(n)} = \frac{1}{n} \sum (\frac{y_i - \hat{y}_i}{1 - h_i})^2\). This is the key takeaway for computing LOOCV efficiently in linear models.

  • Slide 32 (.../103701.jpg): This is the key results image. It contrasts the LOOCV error curve (left) with the 10-fold CV error curves (right). It clearly shows that LOOCV produces a single, stable error curve, while 10-fold CV results vary slightly each time it’s run due to the random data splits.

4. Cross-Validation Overview

These slides explain Cross-Validation (CV), a method used to estimate the test error of a model, helping to select the best level of flexibility (e.g., the best polynomial degree). It’s an improvement over a single validation set because it uses all the data for both training and validation at different times.

The two main types discussed are K-fold CV and Leave-One-Out CV (LOOCV).

K-Fold Cross-Validation

This is the most common method.

The Process

As shown in the slides, the K-fold CV process is:

  1. Divide the dataset randomly into \(K\) non-overlapping groups (or “folds”), usually of equal size. Common choices are \(K=5\) or \(K=10\).
  2. Iterate \(K\) times: in each iteration \(i\), use the \(i\)-th fold as the validation set and all other \(K-1\) folds combined as the training set.
  3. Calculate the Mean Squared Error (\(MSE_i\)) on the validation fold.
  4. Average all \(K\) error estimates to get the final CV score.

Key Formula

The final K-fold CV error estimate is the average of the errors from each fold: \[CV_{(K)} = \frac{1}{K} \sum_{i=1}^{K} MSE_i\]

Important Image: The Concept

The diagram in slide 104145.png is the most important for understanding the concept of K-fold CV. It shows a dataset split into 5 folds (\(K=5\)). The process is repeated 5 times, with a different fold (in beige) held out as the validation set in each run, while the rest (in blue) is used for training.

Leave-One-Out Cross-Validation (LOOCV)

LOOCV is just a special case of K-fold CV where \(K = n\) (the total number of observations).

  • You create \(n\) “folds,” each containing just one data point.
  • You train the model \(n\) times, each time leaving out a single different observation and then calculating the error for that one point.

Key Formulas

  1. Standard Definition: The LOOCV error is the average of the \(N\) squared errors: \[CV = \frac{1}{N} \sum_{i=1}^{N} e_{[i]}^2\] where \(e_{[i]} = y_i - \hat{y}_{[i]}\) is the prediction error for the \(i\)-th observation, calculated from a model that was trained on all data except the \(i\)-th observation. This looks computationally expensive.

  2. Fast Computation (for Linear Regression): A key point from the slides is that for linear regression, you don’t need to re-fit the model \(N\) times. You can fit the model once on all \(N\) data points and use the following shortcut: \[CV = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{e_i}{1 - h_i} \right)^2\]

    • \(e_i = y_i - \hat{y}_i\) is the standard residual (from the model fit on all data).
    • \(h_i\) is the leverage statistic for the \(i\)-th observation (the \(i\)-th diagonal entry of the “hat matrix” \(H\)). This makes LOOCV as fast to compute as a single model fit.

Python Code & Results

The Python code in slide 104156.jpg shows how to use 10-fold CV to find the best polynomial degree for a model.

Code Understanding (Slide 104156.jpg)

Here’s a breakdown of the key sklearn parts:

  1. from sklearn.pipeline import make_pipeline: This is used to chain steps. The pipeline make_pipeline(PolynomialFeatures(degree), LinearRegression()) first creates polynomial features (like \(x\), \(x^2\), \(x^3\)) and then fits a linear model to them.
  2. from sklearn.model_selection import KFold: This object is used to define the \(K\)-fold split strategy. kf = KFold(n_splits=10, shuffle=True, random_state=1) creates a 10-fold splitter that shuffles the data first.
  3. from sklearn.model_selection import cross_val_score: This is the most important function.
    • scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
    • This one function does all the work: it takes the model (the pipeline), the data X and y, and the CV splitter kf. It automatically trains and evaluates the model 10 times and returns an array of 10 scores (one for each fold).
    • scoring='neg_mean_squared_error' is used because cross_val_score expects a higher score to be better. Since we want to minimize MSE, we use negative MSE.
  4. avg_mse = -scores.mean(): The code averages the 10 scores and flips the sign back to positive to get the final CV (MSE) estimate for that polynomial degree. (A sketch of this recipe follows below.)
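A minimal sketch of the 10-fold recipe just described, under the same Auto.csv assumptions as before:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score

Auto = pd.read_csv('Auto.csv')
X = Auto[['horsepower']].values
y = Auto['mpg'].values

kf = KFold(n_splits=10, shuffle=True, random_state=1)   # shuffled 10-fold splitter
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=kf,
                             scoring='neg_mean_squared_error')
    avg_mse = -scores.mean()                            # average the 10 fold scores
    print(f'degree {degree}: 10-fold CV MSE = {avg_mse:.2f}')
```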

Important Image: The Results

The plots in slides 104156.jpg (Python) and 104224.png (R) show the key result.

  • X-axis: Degree of Polynomial (model complexity).
  • Y-axis: Estimated Test Error (CV Error / MSE).
  • Interpretation: The plot shows a clear “U” shape. The error is high for degree 1 (a simple line), drops to its minimum at degree 2 (a quadratic \(ax^2 + bx + c\)), and then starts to rise again for higher degrees. This rise indicates overfitting: the more complex models are fitting the training data’s noise, leading to worse performance on unseen validation data.
  • Conclusion: The 10-fold CV analysis suggests that a quadratic model (degree 2) is the best choice, as it provides the lowest estimated test error.

Let’s dive into the details of that proof.

Detailed Summary: The “Fast Computation of LOOCV” Proof

The most mathematically dense and important part of your slides is the proof (spanning slides 104126.jpg, 104132.png, and 104136.png) that LOOCV, which seems computationally very expensive, can be calculated quickly for linear regression.

The Goal

The goal is to prove that the LOOCV statistic, which is defined as: \[CV = \frac{1}{N} \sum_{i=1}^{N} e_{[i]}^2 \quad \text{where } e_{[i]} = y_i - \hat{y}_{[i]}\] (Here, \(\hat{y}_{[i]}\) is the prediction for \(y_i\) from a model trained on all data except point \(i\).)

…can be computed without re-fitting the model \(N\) times, using this “fast” formula: \[CV = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{e_i}{1 - h_i} \right)^2\] (Here, \(e_i\) is the standard residual and \(h_i\) is the leverage, both from a single model fit on all data.)

The entire proof boils down to showing one identity: \(e_{[i]} = e_i / (1 - h_i)\).

Key Definitions (The Matrix Algebra Setup)

  • Model: \(\mathbf{Y} = \mathbf{X}\beta + \mathbf{e}\)
  • Full Data Estimate (\(\hat{\beta}\)): \(\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\)
  • Hat Matrix (\(\mathbf{H}\)): \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)
  • Full Data Residual (\(e_i\)): \(e_i = y_i - \hat{y}_i = y_i - \mathbf{x}_i^T\hat{\beta}\)
  • Leverage (\(h_i\)): The \(i\)-th diagonal element of \(\mathbf{H}\): \(h_i = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\)
  • Leave-One-Out Estimate (\(\hat{\beta}_{[i]}\)): \(\hat{\beta}_{[i]} = (\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1}\mathbf{X}_{[i]}^T\mathbf{Y}_{[i]}\)
    • \(\mathbf{X}_{[i]}\) and \(\mathbf{Y}_{[i]}\) are the data with the \(i\)-th row removed.
  • LOOCV Residual (\(e_{[i]}\)): \(e_{[i]} = y_i - \mathbf{x}_i^T\hat{\beta}_{[i]}\)

The Proof Step-by-Step

Here is the logic from your slides, broken down:

Step 1: Relating the Matrices (Slide 104132.png)

The proof’s “trick” is to relate the “full data” matrix \((\mathbf{X}^T\mathbf{X})\) to the “leave-one-out” matrix \((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})\).

  • The full sum-of-squares matrix is just the leave-one-out matrix plus the one observation’s contribution:

    \[\mathbf{X}^T\mathbf{X} = \mathbf{X}_{[i]}^T\mathbf{X}_{[i]} + \mathbf{x}_i\mathbf{x}_i^T\]

  • This means: \(\mathbf{X}_{[i]}^T\mathbf{X}_{[i]} = \mathbf{X}^T\mathbf{X} - \mathbf{x}_i\mathbf{x}_i^T\)

Step 2: The Key Matrix Trick (Slide 104132.png)

We need the inverse \((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1}\) to calculate \(\hat{\beta}_{[i]}\). Finding this inverse directly is hard. Instead, we use the Sherman-Morrison-Woodbury formula, which tells us how to find the inverse of a matrix that’s been “updated” (in this case, by subtracting \(\mathbf{x}_i\mathbf{x}_i^T\)).

The slide applies this formula to get: \[(\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_i}\]

  • This is the most complex step, but it’s a standard matrix identity. It’s crucial because it expresses the “leave-one-out” inverse in terms of the “full data” inverse \((\mathbf{X}^T\mathbf{X})^{-1}\), which we already have.

Step 3: Finding \(\hat{\beta}_{[i]}\) (Slide 104136.png)

Now we can write a new formula for \(\hat{\beta}_{[i]}\) by substituting the result from Step 2. We also note that \(\mathbf{X}_{[i]}^T\mathbf{Y}_{[i]} = \mathbf{X}^T\mathbf{Y} - \mathbf{x}_i y_i\).

\[\hat{\beta}_{[i]} = (\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1} (\mathbf{X}_{[i]}^T\mathbf{Y}_{[i]})\] \[\hat{\beta}_{[i]} = \left[ (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_i} \right] (\mathbf{X}^T\mathbf{Y} - \mathbf{x}_i y_i)\]

The slide then shows the algebra to simplify this big expression. When you expand and simplify everything, you get a much cleaner result:

\[\hat{\beta}_{[i]} = \hat{\beta} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \frac{e_i}{1 - h_i}\]

  • This is a beautiful result! It says the LOOCV coefficient vector is just the full coefficient vector minus a small adjustment term related to the \(i\)-th observation’s residual (\(e_i\)) and leverage (\(h_i\)).

Step 4: Finding \(e_{[i]}\) (Slide 104136.png)

This is the final step. We use the definition of \(e_{[i]}\) and the result from Step 3.

  • Start with the definition: \(e_{[i]} = y_i - \mathbf{x}_i^T\hat{\beta}_{[i]}\)
  • Substitute \(\hat{\beta}_{[i]}\): \(e_{[i]} = y_i - \mathbf{x}_i^T \left[ \hat{\beta} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \frac{e_i}{1 - h_i} \right]\)
  • Distribute \(\mathbf{x}_i^T\): \(e_{[i]} = (y_i - \mathbf{x}_i^T\hat{\beta}) + \left( \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \right) \frac{e_i}{1 - h_i}\)
  • Recognize the terms!
    • The first term is just the standard residual: \((y_i - \mathbf{x}_i^T\hat{\beta}) = e_i\)
    • The second term in parentheses is the definition of leverage: \((\mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i) = h_i\)
  • Substitute back: \(e_{[i]} = e_i + h_i \left( \frac{e_i}{1 - h_i} \right)\)
  • Get a common denominator: \(e_{[i]} = \frac{e_i(1 - h_i) + h_i e_i}{1 - h_i}\)
  • Simplify the numerator: \(e_{[i]} = \frac{e_i - e_ih_i + e_ih_i}{1 - h_i}\)

This gives the final, simple relationship: \[e_{[i]} = \frac{e_i}{1 - h_i}\]

Conclusion

By proving this identity, the slides show that to get all \(N\) of the “leave-one-out” errors, you only need to:

  1. Fit one linear regression model on all the data.
  2. Calculate the standard residuals \(e_i\) and the leverage values \(h_i\) for all \(N\) points.
  3. Apply the formula \(e_i / (1 - h_i)\) for each point.

This turns a procedure that looked like it would take \(N\) times the work into a procedure that takes only 1 model fit. This is why LOOCV is a practical and efficient method for linear regression.
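The identity is easy to check numerically. A small sketch on synthetic data that compares explicit leave-one-out refits against the one-fit shortcut:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 - 0.3 * x + rng.normal(0, 1, n)

# Shortcut: one fit, then e_i / (1 - h_i) from the hat matrix.
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
shortcut = e / (1 - h)

# Brute force: refit with observation i removed, predict the held-out point.
brute = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    beta_i = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    brute[i] = y[i] - X[i] @ beta_i

print(np.allclose(shortcut, brute))   # True: e_[i] = e_i / (1 - h_i)
```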


5. Main Goal of Cross-Validation

The central purpose of cross-validation is to estimate the true test error of a machine learning model. This is crucial for:

  1. Model Assessment: Evaluating how well a model will perform on new, unseen data.
  2. Model Selection: Choosing the best level of model flexibility (e.g., the degree of a polynomial or the value of \(K\) in KNN) to avoid overfitting.

As the slides show, training error (the error on the data the model was trained on) consistently decreases as model complexity increases. However, the test error follows a U-shape: it first decreases (as the model learns the true signal) and then increases (as the model starts fitting the noise, or “overfitting”). CV helps find the minimum point of this U-shaped test error curve.

Important Images 🖼️

The most important image is on Slide 61.

These two plots perfectly illustrate the concept:

  • Blue Line (Training Error): Always goes down.
  • Brown Line (True Test Error): Forms a “U” shape. This is what we want to find the minimum of, but it’s unknown in practice.
  • Black Line (10-fold CV Error): This is our estimate of the test error. Notice how closely it tracks the brown line. The minimum of the CV curve (marked with an ‘x’) is very close to the minimum of the true test error.

This shows why CV works: it provides a reliable estimate to guide our choice of model (e.g., polynomial degree 3-4 for logistic regression, or \(K \approx 10\) for KNN).


Key Formulas for Classification

For regression, we often use Mean Squared Error (MSE). For classification, the slides introduce the classification error rate.

For Leave-One-Out Cross-Validation (LOOCV), the error for a single observation \(i\) is: \[Err_i = I(y_i \neq \hat{y}_i^{(i)})\]

  • \(y_i\) is the true label for observation \(i\).
  • \(\hat{y}_i^{(i)}\) is the model’s prediction for observation \(i\) when the model was trained on all other observations except \(i\).
  • \(I(\dots)\) is an indicator function: it’s \(1\) if the condition is true (prediction is wrong) and \(0\) if false (prediction is correct).

The total CV error is simply the average of these individual errors, which is the overall fraction of incorrect classifications: \[CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i\]

The slides also show examples using Log Loss (Slide 64), which is another common and sensitive metric for classification. The logistic regression model itself is defined by: \[P(Y=1 | X) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 X_1 - \beta_2 X_2 - \dots)}\]

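A tiny illustration of the indicator-based error rate (the labels here are made up): comparing predicted and true labels elementwise and averaging the 0/1 results gives the fraction misclassified.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0])

# Each comparison is the indicator I(y_i != y_hat_i); their mean is the error rate.
err = np.mean(y_true != y_pred)
print(err)   # 0.4: two of the five predictions are wrong
```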

Python Code Explained 🐍

The slides provide two key Python examples. Both manually implement K-fold cross-validation to show how it works.

1. KNN Regression (Slide 52)

  • Goal: Find the best n_neighbors (K) for a KNeighborsRegressor.
  • Logic (a reconstruction sketch follows this list):
    1. It creates a KFold object to split the data into 10 folds (n_splits=10).
    2. It has an outer loop that iterates through different values of \(K\) (from 1 to 10).
    3. It has an inner loop that iterates through the 10 folds (for train_index, test_index in kfold.split(X)).
    4. Inside the inner loop:
      • It trains a KNeighborsRegressor on the 9 training folds (X_train, y_train).
      • It makes predictions on the 1 held-out test fold (X_test).
      • It calculates the mean squared error for that fold and stores it.
    5. After the inner loop: It averages the 10 error scores (one from each fold) to get the final CV error for that specific \(K\).
    6. The final plot shows this CV error vs. \(K\), allowing us to pick the \(K\) with the lowest error.
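A hedged reconstruction of this manual loop; the arrays X and y are assumed to be loaded already (e.g., horsepower and mpg from the Auto data):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
neighbors = range(1, 11)
cv_errors = []
for n_k in neighbors:                                   # outer loop over K in KNN
    mse_errors_k = []
    for train_index, test_index in kfold.split(X):      # inner loop over the folds
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model = KNeighborsRegressor(n_neighbors=n_k)
        model.fit(X_train, y_train)                     # train on the 9 folds
        pred = model.predict(X_test)                    # predict the held-out fold
        mse_errors_k.append(mean_squared_error(y_test, pred))
    cv_errors.append(np.mean(mse_errors_k))             # average the 10 fold errors
```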

2. Logistic Regression with Polynomials (Slide 64)

  • Goal: Find the best degree for PolynomialFeatures used with LogisticRegression.
  • Logic: This is very similar to the KNN example but uses a different model and error metric.
    1. It sets up a 10-fold split (kf = KFold(...)).
    2. An outer loop iterates through the degree \(d\) (from 1 to 10).
    3. An inner loop iterates through the 10 folds.
    4. Inside the inner loop:
      • It creates PolynomialFeatures of degree \(d\).
      • It transforms the 9 training folds (X_train) into polynomial features (X_train_poly).
      • It trains a LogisticRegression model on X_train_poly.
      • It transforms the 1 held-out test fold (X_test) using the same polynomial transformer.
      • It calculates the log_loss on the test fold.
    5. After the inner loop: It averages the 10 log_loss scores to get the final CV error for that degree.
    6. The plot shows CV error vs. degree, and the minimum is clearly at degree=3. (A reconstruction sketch follows this list.)
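A hedged reconstruction of that loop; X and y are assumed to be a loaded two-class dataset with 0/1 labels, as in the slides:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

kf = KFold(n_splits=10, shuffle=True, random_state=1)
cv_errors = []
for degree in range(1, 11):                             # outer loop over the degree
    fold_losses = []
    for train_index, test_index in kf.split(X):         # inner loop over the folds
        poly = PolynomialFeatures(degree)
        X_train_poly = poly.fit_transform(X[train_index])
        X_test_poly = poly.transform(X[test_index])     # same transformer for the test fold
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train_poly, y[train_index])
        proba = clf.predict_proba(X_test_poly)
        fold_losses.append(log_loss(y[test_index], proba, labels=clf.classes_))
    cv_errors.append(np.mean(fold_losses))              # 10-fold CV log loss for this degree
```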

The Bias-Variance Trade-off in CV

This is a key theoretical point from Slide 54 that answers the questions on Slide 65. It compares LOOCV (\(K=n\)) with K-fold CV (\(K=5\) or \(10\)).

  • LOOCV (K=n):
    • Bias: Very low. The model is trained on \(n-1\) samples, which is almost the full dataset. The resulting error estimate is nearly unbiased for the true test error.
    • Variance: Very high. You are training \(n\) models that are almost identical to each other (they only differ by one data point). Averaging these highly correlated error estimates doesn’t reduce the variance much, making the CV estimate unstable.
  • K-Fold CV (K=5 or 10):
    • Bias: Slightly higher than LOOCV. The models are trained on, for example, 90% of the data. Since they are trained on less data, they might perform slightly worse. This means K-fold CV tends to slightly overestimate the true test error (Slide 66).
    • Variance: Much lower than LOOCV. The 10 models are trained on more different “chunks” of data (they overlap less), so their error estimates are less correlated. Averaging less-correlated estimates significantly reduces the overall variance.

Conclusion: We generally prefer 10-fold CV over LOOCV. It gives a much more stable (low-variance) estimate of the test error, even if it’s slightly more biased (overestimating the error, which is a safe/conservative estimate).

The Core Problem & Scenarios (Slides 47-51)

These slides use three scenarios to show why we need cross-validation (CV). The goal is to pick the right level of model flexibility (e.g., the degree of a polynomial or the complexity of a spline) to minimize the Test MSE (Mean Squared Error), which we can’t see in real life.

  • The Curves (Slide 47): This slide is central.

    • True Test MSE (Blue): This is the real error on new data. It has a U-shape. Error is high for simple models (high bias), drops as the model fits the data, and rises again for overly complex models (high variance, or overfitting).

    • LOOCV (Black Dashed) & 10-Fold CV (Orange): These are our estimates of the true test MSE. Notice how closely they track the blue curve. The ‘x’ marks the minimum of the CV curve, which is our best guess for the model with the minimum test MSE.

  • Scenario 1 (Slide 48): The true relationship is non-linear. The right-hand plot shows that the test MSE (red curve) is high for the simple linear model (blue square), but lower for the more flexible smoothing splines (teal squares). CV helps us find the “sweet spot.”

  • Scenario 2 (Slide 49): The true relationship is linear. Here, the test MSE (red curve) is lowest for the simplest model (the linear one, blue square). CV correctly identifies this, and its error estimate (blue square) is lowest for that model.

  • Scenario 3 (Slide 50): The true relationship is highly non-linear. The linear model (orange) is a very poor fit. The test MSE (red curve) is minimized by the most flexible model (teal square). CV again finds this.

  • Key Takeaway (Slide 51): We use CV to find the tuning parameter (like polynomial degree) that minimizes the test error. We care less about the actual value of the CV error and more about where its minimum is.

CV for Classification (Slides 55-61)

This section shifts from regression (predicting a number, using MSE) to classification (predicting a category, like “blue” or “orange”).

  • New Error Metric (Slide 55): We can’t use MSE. A natural choice is the classification error rate.
    • \(Err_i = I(y_i \neq \hat{y}_i^{(i)})\)
    • This is an indicator function: it is 1 if the prediction for the \(i\)-th data point (when trained without it) is wrong, and 0 if it’s correct.
    • The final CV error is just the average of these 0s and 1s, giving the total fraction of misclassified points: \(CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i\)
  • The Example (Slides 56-61):
    • Slides 56-58: We are shown a “true” (but unknown) non-linear boundary (purple dashed line) separating two classes. We then try to estimate this boundary using logistic regression with different polynomial degrees (degree 1, 2, 3, 4).
    • Slides 59-60: This is a crucial point. In this simulated example, we do know the true test error rates. The true errors are [0.201, 0.197, 0.160, 0.162]. The lowest error is for the 3rd-degree polynomial. But in a real-world problem, we can never know these true errors.
    • Slide 61 (The Solution): This is the most important image. It shows how CV solves the problem from slide 60.
      • Brown Curve (Test Error): This is the true test error (from slide 59). We can’t see this in practice. Its minimum is at degree 3.
      • Black Curve (10-fold CV Error): This is what we can calculate. It’s our estimate of the test error. Crucially, its minimum is also at degree 3.
      • This proves that CV successfully found the best model (degree 3) without ever seeing the true test error. The same logic is shown for the KNN classifier on the right.

Python Code Explained (Slides 52, 63, 64)

The slides show how to manually implement K-fold CV. This is great for understanding, even though libraries like GridSearchCV can do this automatically.

  • KNN Regression (Slide 52):
    1. kfold = KFold(n_splits=10, ...): Creates an object that knows how to split the data into 10 folds.
    2. for n_k in neighbors:: This is the outer loop to test different \(K\) values (e.g., \(K\)=1, 2, 3…).
    3. for train_index, test_index in kfold.split(X):: This is the inner loop. For a single \(K\), it loops 10 times.
    4. Inside the inner loop:
      • It splits the data into a 9-fold training set (X_train) and a 1-fold test set (X_test).
      • It trains a KNeighborsRegressor on X_train.
      • It makes predictions on X_test and calculates the error (mean_squared_error).
    5. cv_errors.append(np.mean(mse_errors_k)): After the inner loop finishes 10 runs, it averages the 10 error scores for that \(K\) and stores it.
    6. The final plot shows cv_errors vs. neighbors, letting you pick the \(K\) with the lowest average error.
  • Logistic Regression Classification (Slides 63-64):
    • This code is almost identical, but with three key differences:
      1. The model is LogisticRegression.
      2. It uses PolynomialFeatures to create new features (\(X^2, X^3,\) etc.) inside the loop.
      3. The error metric is log_loss (a common, more sensitive metric than the simple 0/1 error rate).
    • The plot on slide 64 shows the 10-fold CV error (using Log Loss) vs. the Degree of the Polynomial. The minimum is clearly at Degree = 3, matching the finding from slide 61.

Answering the Key Questions (Slides 54 & 65)

Slide 65 asks two critical questions, which are answered directly by the concepts on Slide 54 (Bias and variance trade-off).

Q1: How does K affect the bias and variance of the CV error?

This refers to \(K\) in K-fold CV (not to be confused with \(K\) in KNN).

  • Bias:
    • LOOCV (K = n): This has very low bias. The model is trained on \(n-1\) samples, which is almost the full dataset. So, the error estimate \(CV_{(n)}\) is an almost-unbiased estimate of the true test error.

    • K-Fold (K < n, e.g., K=10): This has slightly higher bias. The models are trained on, for example, 90% of the data. Because they are trained on less data, they might perform slightly worse than a model trained on 100% of the data. This “pessimism” is the source of the bias.

  • Variance:
    • LOOCV (K = n): This has very high variance. You are training \(n\) models that are almost identical (they only differ by one data point). Averaging \(n\) highly-correlated error estimates doesn’t reduce the variance much. This makes the final \(CV_{(n)}\) estimate unstable.

    • K-Fold (K < n, e.g., K=10): This has much lower variance. The 10 models are trained on more different “chunks” of data (they overlap less). Their error estimates are less correlated, and averaging 10 less-correlated numbers gives a much more stable (low-variance) final estimate.

Conclusion (The Trade-off): We prefer K-fold CV (K=5 or 10) over LOOCV. It gives a much more stable (low-variance) estimate, and we are willing to accept a tiny increase in bias to get it.

Q2: Does Cross Validation over-estimate or under-estimate the true test error?


Based on the bias discussion above:

Cross-validation (especially K-fold) generally over-estimates the true test error.

Reasoning:

  1. The “true test error” is the error of a model trained on the entire dataset (\(n\) samples).
  2. K-fold CV trains its models on subsets of the data (e.g., \(n \times (K-1)/K\) samples).
  3. Since these models are trained on less data, they are (on average) slightly worse than the final model trained on all the data.
  4. Because the CV models are slightly worse, their error rates will be slightly higher.
  5. Therefore, the final CV error score is a slightly “pessimistic” or high estimate. This is considered a good thing, as it’s a conservative estimate of how our model will perform.

6. Summary of Bootstrap

Bootstrap is a resampling technique used to estimate the uncertainty (like standard error or confidence intervals) of a statistic. Its key idea is to treat your original data sample as a proxy for the true population. It then simulates the process of drawing new samples by instead sampling with replacement from your original sample.

The Problem

You have a single data sample (e.g., \(n=100\) people) and you calculate a statistic, like the sample mean (\(\bar{x}\)) or a regression coefficient (\(\hat{\beta}\)). You want to know how accurate this statistic is. How much would it vary if you could repeat your experiment many times? This variation is measured by the standard error (SE).

The Bootstrap Solution

Since you can’t re-run the whole experiment, you simulate it using the one sample you have.

The Process:

  1. Original Sample (\(Z\)): You have your one dataset with \(n\) observations.
  2. Bootstrap Sample (\(Z^{*1}\)): Create a new dataset of size \(n\) by randomly pulling observations from your original sample with replacement. (This means some original observations will be picked multiple times, and some not at all.)
  3. Calculate Statistic (\(\hat{\theta}^{*1}\)): Calculate your statistic of interest (e.g., the mean, \(\hat{\alpha}\), regression coefficients) on this new bootstrap sample.
  4. Repeat: Repeat steps 2 and 3 a large number of times (\(B\), e.g., \(B=1000\)). This gives you \(B\) bootstrap statistics: \(\hat{\theta}^{*1}, \hat{\theta}^{*2}, ..., \hat{\theta}^{*B}\).
  5. Analyze the Bootstrap Distribution: This collection of \(B\) statistics is your “bootstrap distribution.”
    • Standard Error: The standard deviation of this bootstrap distribution is your estimate of the standard error of your original statistic.
    • Confidence Interval: A 95% confidence interval can be found by taking the 2.5th and 97.5th percentiles of this bootstrap distribution.

Why use it? It’s powerful because it doesn’t rely on strong theoretical assumptions (like data being normally distributed). It can be applied to almost any statistic, even very complex ones (like the prediction from a KNN model), for which a simple mathematical formula for the standard error doesn’t exist.

Mathematical Understanding

The core idea is to use the empirical distribution (your sample) as an estimate for the true population distribution.

Example: Estimating \(\alpha\)

Your slides provide an example of finding the \(\alpha\) that minimizes the variance of a portfolio, \(var(\alpha X + (1-\alpha)Y)\).

  1. True Population Parameter (\(\alpha\)): The true \(\alpha\) is a function of the population variances and covariance: \[\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}\] We can never know this value exactly unless we know the entire population.

  2. Sample Statistic (\(\hat{\alpha}\)): We estimate \(\alpha\) using our sample, creating the statistic \(\hat{\alpha}\) by plugging in our sample variances and covariance: \[\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}\] This \(\hat{\alpha}\) is just one number from our single sample. How confident are we in it? We need its standard error, \(SE(\hat{\alpha})\).

  3. Bootstrap Statistic (\(\hat{\alpha}^*\)): We apply the bootstrap process:

    • Create a bootstrap sample (by resampling with replacement).
    • Calculate \(\hat{\alpha}^*\) using the sample (co)variances of this new bootstrap sample.
    • Repeat \(B\) times to get \(B\) values: \(\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, ..., \hat{\alpha}^{*B}\).
  4. Estimating the Standard Error: The standard error of our original estimate \(\hat{\alpha}\) is estimated by the standard deviation of all our bootstrap estimates: \[SE_{boot}(\hat{\alpha}) = \sqrt{\frac{1}{B-1} \sum_{j=1}^{B} (\hat{\alpha}^{*j} - \bar{\alpha}^*)^2}\] where \(\bar{\alpha}^*\) is the average of all \(B\) bootstrap estimates. (A minimal code sketch of this recipe follows below.)
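A minimal sketch of this recipe, using the sample mean as the statistic for simplicity (the same loop works for \(\hat{\alpha}\) or any other statistic):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(10, 2, size=100)              # the one original sample Z
B = 1000

boot_stats = np.empty(B)
for j in range(B):
    idx = rng.integers(0, len(z), len(z))    # resample indices WITH replacement
    boot_stats[j] = z[idx].mean()            # statistic on the bootstrap sample

se_boot = boot_stats.std(ddof=1)             # SD of the bootstrap distribution = SE estimate
ci = np.percentile(boot_stats, [2.5, 97.5])  # 95% percentile confidence interval
print(se_boot, ci)
```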

The slides (p. 73, 77-78) show this visually. The “sampling from population” histogram (left) is the true sampling distribution, which we can only create in a simulation. The “Bootstrap” histogram (right) is the bootstrap distribution created from one sample. They look very similar, which shows the method works.

Code Analysis

R: \(\alpha\) Example (Slides 75 & 77)

  • Slide 75 (The R code): This is a SIMULATION, not Bootstrap.
    • for(i in 1:m){...}: This loop runs m=1000 times.
    • returns <- rmvnorm(...): Inside the loop, it draws a brand new sample from the true population every time.
    • alpha[i] <- ...: It calculates \(\hat{\alpha}\) for each new sample.
    • Purpose: This code shows the true sampling distribution of \(\hat{\alpha}\) (the “Histogram of alpha”). You can only do this if you know the true population, as in a simulation.
  • Slide 77 (The R code): This IS Bootstrap.
    • returns <- rmvnorm(...): Outside the loop, this is done only once to get one original sample.
    • for(i in 1:B){...}: This is the bootstrap loop.
    • sample(1:nrow(returns), n, replace = T): This is the key line. It randomly selects row numbers with replacement from the single returns dataset.
    • returns_boot <- returns[sample(...), ]: This creates the bootstrap sample.
    • alpha_bootstrap[i] <- ...: It calculates \(\hat{\alpha}^*\) on the returns_boot sample.
    • Purpose: This code generates the bootstrap distribution (the “Bootstrap” histogram on slide 78) to estimate the true sampling distribution.
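For contrast with the bootstrap sketch above, here is a hedged Python analogue of the slide-75 simulation loop. The population mean and covariance are made-up values; the point is that each iteration draws a brand-new sample from the (known) population, which is only possible in a simulation.

```python
import numpy as np

rng = np.random.default_rng(1)
mean = [0, 0]
cov_true = [[1.0, 0.5], [0.5, 1.25]]     # assumed "true" population (illustrative)
m, n = 1000, 100

alpha_sim = np.empty(m)
for i in range(m):
    sample = rng.multivariate_normal(mean, cov_true, size=n)  # fresh sample from F
    c = np.cov(sample, rowvar=False)
    alpha_sim[i] = (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

# A histogram of alpha_sim approximates the TRUE sampling distribution of
# alpha-hat, the analogue of the slide's "Histogram of alpha".
```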

R: Linear Regression Example (Slides 79 & 81)

  • Slide 79:
    • boot.fn <- function(data, index){ ... }: Defines a function that the boot package needs. It takes data and an index vector.
    • lm(mpg~horsepower, data=data, subset=index): This is the core. It fits a linear model only on the data points specified by the index. The boot function will automatically supply this index as a resampled-with-replacement vector.
    • boot(Auto, boot.fn, R=1000): This runs the bootstrap. It calls boot.fn 1000 times, each time with a new resampled index, and collects the coefficients.
  • Slide 81:
    • summary(lm(...)): Shows the standard output. The “Std. Error” column (e.g., 0.860, 0.006) is calculated using mathematical theory.
    • boot.res: Shows the bootstrap output. The “std. error” column (e.g., 0.841, 0.007) is the standard deviation of the 1000 bootstrap estimates.
    • Main Point: The standard errors from the bootstrap are very close to the theoretical ones, which corroborates the theory-based uncertainty estimates. If the model assumptions were violated, the bootstrap SE would be the more trustworthy of the two.
    • The histograms show the bootstrap distributions for the intercept (t1*) and the slope (t2*). The arrows mark the 95% percentile confidence interval. (A minimal Python analogue of this coefficient bootstrap follows.)
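A minimal Python analogue of this coefficient bootstrap. The synthetic data stands in for Auto's horsepower and mpg (the real dataset and the boot package are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for the Auto data (horsepower -> mpg); illustrative only.
n = 392
x = rng.uniform(50, 200, size=n)
y = 40 - 0.15 * x + rng.normal(0, 4, size=n)

def coef_fn(x, y, index):
    """Fit y ~ x on the rows in `index`; return (intercept, slope), like boot.fn."""
    X = np.column_stack([np.ones(len(index)), x[index]])
    beta, *_ = np.linalg.lstsq(X, y[index], rcond=None)
    return beta

B = 1000
betas = np.array([coef_fn(x, y, rng.integers(0, n, size=n)) for _ in range(B)])
print("bootstrap SEs:", betas.std(axis=0, ddof=1))  # compare to lm()'s Std. Error
```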

Python: KNN Regression Example (Slide 80)

This shows how to get a confidence interval for a single prediction.

  • for i in range(n_bootstraps):: The bootstrap loop.
  • indices = np.random.choice(train_samples.shape[0], train_samples.shape[0], replace=True): This is the key line in Python (like sample in R). It gets a new set of indices with replacement.
  • X_boot, y_boot = ...: Creates the bootstrap sample.
  • model.fit(X_boot, y_boot): A new KNN model is trained on this bootstrap sample.
  • bootstrap_preds.append(model.predict(predict_point)): The model (trained on \(Z^{*i}\)) makes a prediction for the same fixed point. This is repeated 1000 times.
  • Result: You get a distribution of predictions for that one point. The 2.5th and 97.5th percentiles of this distribution give a 95% confidence interval for that specific prediction.

Python: KNN on Auto data (Slide 82)

  • BE CAREFUL: This slide does NOT show Bootstrap. It shows K-Fold Cross-Validation (CV).
  • Purpose: The goal here is not to find uncertainty. The goal is to find the best hyperparameter (the best value for \(k\), the number of neighbors).
  • Method:
    • kf = KFold(n_splits=10): Splits the data into 10 chunks (“folds”).
    • for train_index, test_index in kf.split(X):: It loops 10 times. Each time, it trains on 9 chunks and tests on 1 chunk.
  • Key Difference for Exam:
    • Bootstrap: Samples with replacement to estimate uncertainty/standard error.
    • Cross-Validation: Splits data without replacement into \(K\) folds to estimate model performance/prediction error and tune hyperparameters. (A minimal KFold sketch follows this list.)
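A minimal sketch of that K-Fold workflow for tuning the number of neighbors; the synthetic data and the candidate grid for \(k\) are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
# Synthetic stand-in for the Auto data (X = horsepower, y = mpg).
X = rng.uniform(50, 200, size=(392, 1))
y = 40 - 0.15 * X[:, 0] + rng.normal(0, 4, size=392)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for k in [1, 5, 20, 50]:                       # candidate hyperparameter values
    fold_mse = []
    for train_index, test_index in kf.split(X):
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(X[train_index], y[train_index])    # train on 9 folds
        fold_mse.append(mean_squared_error(y[test_index],
                                           model.predict(X[test_index])))
    print(k, np.mean(fold_mse))                # pick the k with the lowest CV error
```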

7. The mathematical theory of Bootstrap and the extension to Cross-Validation (CV).

1. Code Analysis: Bootstrap for a KNN Prediction (Slide 85)

This Python code shows a different use of bootstrap: finding the confidence interval for a single prediction, not for a model coefficient.

  • Goal: To estimate the uncertainty of a KNN model’s prediction for a specific new data point (predict_point).
  • Process:
    1. Train Full Model: A KNN model (knn) is first trained on the entire dataset. It makes one prediction (knpred) for predict_point. This is our \(\hat{f}(x_0)\).
    2. Bootstrap Loop (for i in range(n_bootstraps)):
      • indices = np.random.choice(...): This is the core bootstrap step. It creates a new list of indices by sampling with replacement from the original data.
      • X_boot, y_boot = ...: This creates the new bootstrap dataset (\(Z^{*i}\)).
      • km.fit(X_boot, y_boot): A new KNN model (km) is trained only on this bootstrap sample.
      • bootstrap_preds.append(km.predict(predict_point)): This newly trained model makes a prediction for the same predict_point. This value is \(\hat{f}^{*i}(x_0)\).
    3. Analyze Distribution: After 1000 loops, bootstrap_preds contains 1000 different predictions for the same point.
    4. Confidence Interval:
      • np.percentile(bootstrap_preds, [2.5, 97.5]): This finds the 2.5th and 97.5th percentiles of the 1000 bootstrap predictions.
      • The resulting [lower_bound, upper_bound] (e.g., [13.70, 15.70]) forms the 95% confidence interval for the prediction.
  • Histogram Plot: The plot on the right visually confirms this. It shows the distribution of the 1000 bootstrap predictions, with the 95% confidence interval marked by the red dashed lines. (A self-contained Python sketch of this procedure follows.)
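A self-contained sketch of this procedure. The data, the choice of 10 neighbors, and the query point are assumptions; the slide's exact setup is not reproduced:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(4)
# Illustrative 1-D regression data standing in for the slide's training set.
X = rng.uniform(0, 10, size=(200, 1))
y = X[:, 0] ** 2 + 15 * np.sin(X[:, 0]) + rng.normal(0, 0.5, size=200)
predict_point = np.array([[5.0]])

knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)
knn_pred = knn.predict(predict_point)          # f-hat(x0) from the full data

n_bootstraps = 1000
bootstrap_preds = []
for i in range(n_bootstraps):
    idx = rng.integers(0, X.shape[0], size=X.shape[0])   # resample with replacement
    km = KNeighborsRegressor(n_neighbors=10).fit(X[idx], y[idx])
    bootstrap_preds.append(km.predict(predict_point)[0]) # f-hat*_i(x0)

lower, upper = np.percentile(bootstrap_preds, [2.5, 97.5])
print(knn_pred[0], (lower, upper))             # 95% bootstrap CI for the prediction
```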

2. Mathematical Understanding: Why Does Bootstrap Work? (Slides 87-88)

This is the theoretical justification for the entire method. It is based on an analogy.

The “True” World (Slide 87, Top)

  • Population: There is a true, unknown population distribution \(F\).

  • Parameter: We want to know a true parameter, \(\theta\), which is a function of \(F\) (e.g., the true population mean).

  • Sample: We get one sample \(X_1, ..., X_n\) from \(F\).

  • Statistic: We calculate our best estimate \(\hat{\theta}\) from our sample (e.g., the sample mean \(\bar{x}\)). \(\hat{\theta}\) is our proxy for \(\theta\).

  • The Problem: We want to know the accuracy of \(\hat{\theta}\). How much would \(\hat{\theta}\) vary if we could draw many samples? We want the sampling distribution of \(\hat{\theta}\) around \(\theta\), specifically the distribution of the error: \((\hat{\theta} - \theta)\).

  • CLT: The Central Limit Theorem states that \(\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{\text{dist}} N(0, \sigma_F^2)\), where \(\sigma_F^2\) is the asymptotic variance, a quantity determined by \(F\).

  • The Catch: This limiting distribution is UNKNOWN, because we don't know \(F\).

The “Bootstrap” World (Slide 87, Bottom)

  • Population: We pretend our original sample is the population. We call its distribution the “empirical distribution,” \(\hat{F}_n\).
  • Parameter: In this new world, the “true” parameter is our original statistic, \(\hat{\theta}\) (which is a function of \(\hat{F}_n\)).
  • Sample: We draw many bootstrap samples \(X_1^*, ..., X_n^*\) from \(\hat{F}_n\) (i.e., sampling with replacement from our original sample).
  • Statistic: From each bootstrap sample, we calculate a bootstrap statistic, \(\hat{\theta}^*\).
  • The Solution: We can now empirically find the distribution of \(\hat{\theta}^*\) around \(\hat{\theta}\). We look at the distribution of the bootstrap error: \((\hat{\theta}^* - \hat{\theta})\).
  • CLT: The CLT likewise gives \(\sqrt{n}(\hat{\theta}^* - \hat{\theta}) \xrightarrow{\text{dist}} N(0, \sigma_{\hat{F}_n}^2)\), with the asymptotic variance now determined by \(\hat{F}_n\).
  • The Power: This distribution is ESTIMABLE! We just run the bootstrap \(B\) times and get \(B\) values of \(\hat{\theta}^*\). We can then calculate their variance, standard deviation, and percentiles directly.

The Core Approximation (Slide 88)

The entire method relies on the assumption that the (knowable) bootstrap distribution is a good approximation of the (unknown) true sampling distribution.

The distribution of the bootstrap error approximates the distribution of the true error.

\[\text{distribution of } \sqrt{n}(\hat{\theta}^* - \hat{\theta}) \approx \text{distribution of } \sqrt{n}(\hat{\theta} - \theta)\]

This is why:

  • The standard deviation of the \(\hat{\theta}^*\) values is our estimate for the standard error of \(\hat{\theta}\).
  • The percentiles of the \(\hat{\theta}^*\) distribution (e.g., 2.5th and 97.5th) can be used to build a confidence interval for the true parameter \(\theta\). (A small simulation illustrating the approximation follows.)
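A small simulation that makes the approximation tangible for the sample mean, using an Exp(1) population as an assumed stand-in for \(F\). The spread of the true errors (computable only in simulation) should be close to the spread of the bootstrap errors computed from one sample:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 50, 2000
theta = 1.0                                   # true mean of Exp(1)

# "True" world: many fresh samples from F (possible only in a simulation).
true_err = [np.sqrt(n) * (rng.exponential(1.0, n).mean() - theta)
            for _ in range(reps)]

# "Bootstrap" world: one observed sample, resampled with replacement.
x = rng.exponential(1.0, n)
theta_hat = x.mean()
boot_err = [np.sqrt(n) * (rng.choice(x, n, replace=True).mean() - theta_hat)
            for _ in range(reps)]

print(np.std(true_err), np.std(boot_err))     # the two spreads should be close
```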

3. Extension: Cross-Validation (CV) Analysis

CV for Hyperparameter Tuning (Slide 84)

This plot is the result of the 10-fold CV code shown in the previous set of slides (slide 82).

  • Purpose: To find the optimal hyperparameter \(k\) (number of neighbors) for the KNN model.
  • X-axis: Number of Neighbors (\(k\)).
  • Y-axis: CV Error (Mean Squared Error).
  • Analysis:
    • Low \(k\) (e.g., \(k=1, 2\)): High error. The model is too complex and overfits the training data.
    • High \(k\) (e.g., \(k>40\)): Error slowly increases. The model is too simple and underfits (it averages over too many neighbors).
    • Optimal \(k\): The “sweet spot” is at the bottom of the “U” shape, around \(k \approx 20\text{-}30\), which gives the lowest CV error.


Why CV Over-Estimates Test Error (Slide 89)

This is a subtle but important theoretical point.

  • Our Goal: We want to know the test error of our final model (\(\hat{f}^{\text{full}}\)), which we will train on the full dataset (all \(n\) observations).
  • What CV Measures: \(k\)-fold CV does not test the final model. It tests \(k\) different models (\(\hat{f}^{(k)}\)), each trained on a smaller dataset (of size \(\frac{k-1}{k} \times n\)).

  • The Logic:
    1. Models trained on less data generally perform worse than models trained on more data.
    2. The CV error is the average error of models trained on \(\frac{k-1}{k} n\) observations.
    3. The “true test error” is the error of the model trained on all \(n\) observations.
  • Conclusion: Since the CV models are trained on smaller datasets, they will, on average, have a slightly higher error than the final model. Therefore, the CV error is a slightly pessimistic estimate of (it over-estimates) the true test error of the final model.

Correction of CV Error (Slides 90-91)

  • Theory (Slide 91): Advanced theory suggests the expected test error \(R(n)\) behaves like \(R(n) = R^* + c/n\), where \(R^*\) is the irreducible error and \(n\) is the sample size. This formula mathematically confirms that the error decreases as the sample size \(n\) increases.

  • R Code (Slide 90): The cv.glm function from the boot library automatically provides this.

    • cv.err$delta: This output vector contains two values.
    • [1] 24.23151 (Raw CV Error): This is the standard Leave-One-Out CV (LOOCV) error.
    • [2] 24.23114 (Adjusted CV Error): This is a bias-corrected estimate that accounts for the overestimation problem. It is slightly lower, representing a more accurate guess for the error of the final model trained on all \(n\) data points. (A Python LOOCV analogue is sketched below.)
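sklearn has no direct counterpart to cv.glm's adjusted delta, but the raw LOOCV error can be reproduced with LeaveOneOut. The synthetic data below stands in for Auto; this is a sketch, not the slide's R code:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
# Synthetic stand-in for the Auto data (horsepower -> mpg).
X = rng.uniform(50, 200, size=(392, 1))
y = 40 - 0.15 * X[:, 0] + rng.normal(0, 4, size=392)

# Each fold holds out exactly one observation; the mean of the per-fold
# squared errors is the LOOCV estimate, analogous to cv.err$delta[1].
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print(-scores.mean())
```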

The “Correction of CV Error” Extension

Summary

This section provides a deeper mathematical look at why k-fold cross-validation (CV) slightly over-estimates the true test error.

  1. The Overestimation: CV trains on \(\frac{k-1}{k}\) of the data, which is less than the full dataset (size \(n\)). Models trained on less data are generally worse. Therefore, the average error from CV (\(CV_k\)) is slightly higher (more pessimistic) than the true error of the final model trained on all \(n\) data points (\(R(n)\)).

  2. A Simple Correction: A mathematical formula, \(\tilde{CV_k} = \frac{k-1}{k} \cdot CV_k\), is proposed to “correct” this overestimation.

  3. The Critical Flaw: This correction is derived assuming the irreducible error (\(R^*\)) is zero.

  4. The Takeaway (Code Analysis): The Python code demonstrates a real-world scenario where there is noise (noise_std = 0.5), meaning \(R^* > 0\). In this case, the simple correction fails: it produces an estimate (0.217) that is less accurate and further from the true error (0.272) than the original raw CV error (0.271).


Exam Conclusion: For most real-world problems (which have noise), the raw \(k\)-fold CV error is a better and more reliable estimate of the true test error than the simple (and flawed) correction.

Mathematical Understanding

This section explains the theory of why \(CV_k > R(n)\) and derives the simple correction.

  1. Assumed Error Behavior: We assume the test error \(R(n)\) for a model trained on \(n\) data points behaves like: \[R(n) = R^* + \frac{c}{n}\]

    • \(R^*\): The irreducible error (the “noise floor” you can never beat).
    • \(c/n\): The model variance, which decreases as the sample size \(n\) increases.
  2. Test Error vs. CV Error:

    • Test Error of Interest: This is the error of our final model trained on all \(n\) points: \[R(n) = R^* + \frac{c}{n}\]
    • k-fold CV Error: This is the average error of \(k\) models, each trained on a smaller sample of size \(n' = (\frac{k-1}{k})n\): \[CV_k \approx R(n') = R\left(\frac{k-1}{k}n\right) = R^* + \frac{c}{\left(\frac{k-1}{k}\right)n} = R^* + \frac{ck}{(k-1)n}\]
  3. The Overestimation: Let's compare \(CV_k\) and \(R(n)\): \[CV_k \approx R^* + \left(\frac{k}{k-1}\right) \frac{c}{n}\] \[R(n) = R^* + \left(\frac{k-1}{k-1}\right) \frac{c}{n}\] Since \(k > k-1\), the factor \(\left(\frac{k}{k-1}\right)\) is greater than 1, so the \(CV_k\) error term is larger than the \(R(n)\) error term. Thus: \(CV_k > \text{test error of interest } R(n)\).

  4. Deriving the (Flawed) Correction: This correction makes a strong assumption: \(R^* \approx 0\) (the model is perfectly specified, and there is no noise).

    • If \(R^* = 0\), then \(R(n) \approx \frac{c}{n}\)
    • If \(R^* = 0\), then \(CV_k \approx \frac{ck}{(k-1)n}\)

    Now, look at the ratio between them: \[\frac{R(n)}{CV_k} \approx \frac{c/n}{ck/((k-1)n)} = \frac{c}{n} \cdot \frac{(k-1)n}{ck} = \frac{k-1}{k}\]

    This gives us the correction formula by isolating \(R(n)\): \[R(n) \approx \left(\frac{k-1}{k}\right) \cdot CV_k\] This corrected version is denoted \(\tilde{CV_k}\). (A quick numeric check of this formula follows.)
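A quick numeric check of the derivation, plugging the assumed values \(k=5\), \(n=100\), \(c=1\) into \(R(n) = R^* + c/n\). When \(R^* = 0\) the correction recovers \(R(n)\) exactly; when \(R^* > 0\) it overshoots downward:

```python
k, n, c = 5, 100, 1.0

for r_star in (0.0, 0.25):                    # noise-free vs. noisy case
    cv_k = r_star + c * k / ((k - 1) * n)     # what k-fold CV estimates
    target = r_star + c / n                   # true test error R(n)
    corrected = (k - 1) / k * cv_k            # the simple correction
    print(r_star, round(cv_k, 4), round(target, 4), round(corrected, 4))
# R* = 0   : CV = 0.0125, R(n) = 0.0100, corrected = 0.0100  (exact)
# R* = 0.25: CV = 0.2625, R(n) = 0.2600, corrected = 0.2100  (worse than raw CV)
```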

Code Analysis (Slides 92-93)

The Python code is an experiment designed to test the correction formula.

  • Goal: Compare the “Raw CV Error” (\(CV_k\)), the “Corrected CV Error” (\(\tilde{CV_k}\)), and the “True Test Error” (\(R(n)\)) in a realistic setting.

  • Key Setup:

    1. def f(x): Defines the true, underlying function \(y = x^2 + 15\sin(x)\).
    2. noise_std = 0.5: This is the most important line. It adds significant random noise to the data. This ensures that the irreducible error \(R^*\) is large and \(R^* > 0\).
    3. y = f(...) + np.random.normal(...): Creates the noisy training data (the blue dots).
  • CV Calculation (Standard K-Fold):

    • kf = KFold(...): Sets up 5-fold CV (\(k=5\)).
    • for train_index, val_index in kf.split(x):: This is the standard loop. It trains on 4 folds and validates on 1 fold.
    • cv_error = np.mean(cv_mse_list): Calculates the raw \(CV_5\) error. This is the first result (e.g., 0.2715).
  • Correction Calculation:

    • correction_factor = (k_splits - 1) / k_splits: This is \(\frac{k-1}{k}\), which is \(4/5 = 0.8\).
    • corrected_cv_error = correction_factor * cv_error: This applies the flawed formula from the math section (\(0.2715 \times 0.8\)). This is the second result (e.g., 0.2172).
  • “True” Test Error Calculation:

    • knn.fit(x, y): Trains the final model on the entire noisy dataset.
    • n_test = 1000: Creates a new, large test set to estimate the true error.
    • true_test_error = mean_squared_error(...): Calculates the error of the final model on this new test set. This is our best estimate of \(R(n)\) (e.g., 0.2725).
  • Analysis of Results (Slide 93):

    • Raw 5-Fold CV MSE: 0.2715
    • True test error: 0.2725
    • Corrected 5-Fold CV MSE: 0.2172

    The Raw CV Error (0.2715) is an excellent estimate of the True Test Error (0.2725); the Corrected Error (0.2172) is much worse. This experiment shows that when noise is present (\(R^* > 0\)), the simple correction formula should not be used. (A condensed, runnable version of the experiment is sketched below.)
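A condensed, runnable version of the experiment under assumed settings (the sample size, neighbor count, and x-range are guesses, so the exact numbers will differ from the slide):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)

def f(x):
    return x ** 2 + 15 * np.sin(x)            # true underlying function

n, noise_std = 200, 0.5                       # noise_std > 0  =>  R* > 0
x = rng.uniform(-3, 3, size=(n, 1))
y = f(x[:, 0]) + rng.normal(0, noise_std, size=n)

k_splits = 5
kf = KFold(n_splits=k_splits, shuffle=True, random_state=0)
cv_mse_list = []
for train_index, val_index in kf.split(x):
    knn = KNeighborsRegressor(n_neighbors=10).fit(x[train_index], y[train_index])
    cv_mse_list.append(mean_squared_error(y[val_index], knn.predict(x[val_index])))
cv_error = np.mean(cv_mse_list)                               # raw CV_5
corrected_cv_error = (k_splits - 1) / k_splits * cv_error     # flawed correction

# "True" test error: final model on all n points, scored on a large fresh test set.
knn = KNeighborsRegressor(n_neighbors=10).fit(x, y)
x_test = rng.uniform(-3, 3, size=(1000, 1))
y_test = f(x_test[:, 0]) + rng.normal(0, noise_std, size=1000)
true_test_error = mean_squared_error(y_test, knn.predict(x_test))

print(cv_error, corrected_cv_error, true_test_error)  # raw CV should track the truth
```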