
第 1 部分:基础数学工具与物理背景

最近在尝试理解扩散模型所依赖的核心数学概念:高斯分布的性质、朗之万动力学(SDE)及其离散化。

1. 核心数学工具:高斯分布 (Gaussian Distribution)

扩散模型(尤其是 DDPM)的全部推导都建立在高斯分布的美妙性质之上。

1.1 定义

一个一维高斯分布(正态分布)由均值 \(\mu\) 和方差 \(\sigma^2\) 定义,其概率密度函数 (PDF) 为: \[ \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \] 对于 \(D\) 维向量 \(\mathbf{x}\)(例如视频中 \(32 \times 32 = 1024\) 维的图片向量),多维高斯分布由均值向量 \(\boldsymbol{\mu}\) 和协方差矩阵 \(\boldsymbol{\Sigma}\) 定义: \[ \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^D \det(\boldsymbol{\Sigma})}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right) \] > 关键简化: 在 DDPM 中,我们通常假设协方差矩阵是对角矩阵 \(\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}\),其中 \(\mathbf{I}\) 是单位矩阵。这意味着向量的每个维度(每个像素)是独立同分布的(IID)。

1.2 关键性质 1:重参数化技巧 (Reparameterization Trick)

问题: 我们如何从 \(\mathcal{N}(\mu, \sigma^2)\) 中采样?直接把“采样”当作一个黑盒操作是不可微的,梯度无法经由采样结果反向传播回 \(\mu\) 和 \(\sigma\)。

技巧: 我们可以从一个标准高斯分布 \(\epsilon \sim \mathcal{N}(0, 1)\) 中采样,然后通过线性变换得到 \(x\):\[ x = \mu + \sigma \cdot \epsilon \]

推导:

  1. 设 \(\epsilon \sim \mathcal{N}(0, 1)\),即 \(E[\epsilon] = 0, \text{Var}(\epsilon) = 1\)。
  2. 构造 \(x = \mu + \sigma \epsilon\),计算 \(x\) 的均值:\(E[x] = E[\mu + \sigma \epsilon] = E[\mu] + \sigma E[\epsilon] = \mu + \sigma \cdot 0 = \mu\)。
  3. 计算 \(x\) 的方差:\(\text{Var}(x) = \text{Var}(\mu + \sigma \epsilon) = \text{Var}(\sigma \epsilon) = \sigma^2 \text{Var}(\epsilon) = \sigma^2 \cdot 1 = \sigma^2\)。
  4. 由于高斯分布的线性变换仍然是高斯分布,因此 \(x \sim \mathcal{N}(\mu, \sigma^2)\)。

意义: 这使得采样过程可微。\(\mu\)\(\sigma\) 可以是神经网络的输出,\(\epsilon\) 作为外部噪声输入,梯度可以反向传播回 \(\mu\)\(\sigma\)
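下面用一小段 numpy 代码对这一性质做数值验证(仅为一维、不涉及梯度的最小示意,参数取值均为示例):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5                 # 目标分布 N(mu, sigma^2) 的参数(示例取值)

eps = rng.standard_normal(100000)    # 先从标准高斯 N(0, 1) 采样
x = mu + sigma * eps                 # 线性变换:x = mu + sigma * eps

print(x.mean(), x.std())             # 应分别接近 mu = 2.0 和 sigma = 0.5
```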

1.3 关键性质 2:两个独立高斯分布之和 (Sum of Independent Gaussians)

问题: 两个独立高斯分布相加会怎样?这是前向加噪过程(Forward Process)的核心。

性质: 如果 \(X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)\) 和 \(X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)\) 相互独立,那么它们的和 \(Y = X_1 + X_2\) 仍然是高斯分布: \[ Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2) \]

推导:

  • 均值: \(E[Y] = E[X_1 + X_2] = E[X_1] + E[X_2] = \mu_1 + \mu_2\)。
  • 方差: 由于 \(X_1\) 和 \(X_2\) 相互独立,\(\text{Cov}(X_1, X_2) = 0\):\(\text{Var}(Y) = \text{Var}(X_1 + X_2) = \text{Var}(X_1) + \text{Var}(X_2) + 2 \text{Cov}(X_1, X_2) = \sigma_1^2 + \sigma_2^2\)。
  • (两个独立高斯变量之和仍为高斯分布,这可以通过矩生成函数或特征函数严格证明,这里我们接受这个结论。)

意义: 这使得我们可以计算前向过程中任意 \(t\) 时刻 \(x_t\)\(x_0\) 开始累积加噪的结果。
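这一性质同样可以用 numpy 做快速的数值验证(最小示意,参数取值仅为示例):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(1.0, 2.0, size=100000)    # X1 ~ N(1, 2^2)
x2 = rng.normal(-3.0, 0.5, size=100000)   # X2 ~ N(-3, 0.5^2),与 X1 独立
y = x1 + x2

print(y.mean())   # 应接近 mu1 + mu2 = -2.0
print(y.var())    # 应接近 sigma1^2 + sigma2^2 = 4.25
```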

2. 物理与SDE:朗之万动力学 (Langevin Dynamics)

这是扩散模型的物理原型。

2.1 随机微分方程 (SDE)

朗之万动力学描述了一个粒子(在我们的例子中是图片向量 \(\mathbf{x}\))在势场 \(U(\mathbf{x})\) 中运动,同时受到随机力(如布朗运动)的影响。其 SDE 形式 为: \[ d\mathbf{x}_t = \mathbf{f}(\mathbf{x}_t, t)dt + g(t)d\mathbf{w}_t \] * \(\mathbf{x}_t\)\(t\) 时刻的粒子位置(图片向量)。 * \(\mathbf{f}(\mathbf{x}_t, t)\)漂移项 (Drift)。代表确定性的力,如视频中 提到的 “吸引回原点的线性运动”(例如 \(\mathbf{f}(\mathbf{x}, t) = -\beta \mathbf{x}\),使分布趋向原点)。它对应能量函数的负梯度 \(-\nabla U(\mathbf{x})\)。 * \(g(t)\)扩散项 (Diffusion)。控制随机噪声的强度。 * \(d\mathbf{w}_t\)维纳过程 (Wiener Process) 或布朗运动。它是一个随机项,其增量 \(d\mathbf{w}_t\)\(dt\) 时间内服从高斯分布 \(d\mathbf{w}_t \sim \mathcal{N}(0, \mathbf{I} dt)\)

2.2 SDE 的离散化:欧拉-丸山法 (Euler-Maruyama)

问题: 计算机无法处理连续时间 \(dt\)。我们如何模拟这个 SDE?

方法: 我们使用欧拉近似法(在 SDE 中称为 Euler-Maruyama)将其离散化为小的时间步 \(\Delta t\)\[ \mathbf{x}_{t+\Delta t} - \mathbf{x}_t \approx \mathbf{f}(\mathbf{x}_t, t)\Delta t + g(t) (\mathbf{w}_{t+\Delta t} - \mathbf{w}_t) \] 根据维纳过程的性质,在 \(\Delta t\) 时间内的增量 \((\mathbf{w}_{t+\Delta t} - \mathbf{w}_t)\) 服从 \(\mathcal{N}(0, \mathbf{I} \Delta t)\)。 根据性质 1.2(重参数化)\(\mathcal{N}(0, \mathbf{I} \Delta t)\) 可以写成 \(\sqrt{\Delta t} \cdot \mathbf{z}\),其中 \(\mathbf{z} \sim \mathcal{N}(0, \mathbf{I})\)

离散迭代公式: \[ \mathbf{x}_{t+\Delta t} \approx \mathbf{x}_t + \mathbf{f}(\mathbf{x}_t, t)\Delta t + g(t) \sqrt{\Delta t} \mathbf{z}_t \] 其中 \(\mathbf{z}_t \sim \mathcal{N}(0, \mathbf{I})\) 是在 \(t\) 时刻采样的标准高斯噪声。

这就是 DDPM 前向加噪过程 (Forward Process) 的数学原型。
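下面是欧拉-丸山离散化的一个最小 numpy 示意(取 \(\mathbf{f}(\mathbf{x}) = -\beta \mathbf{x}\)、\(g\) 为常数,参数取值仅为演示):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, g = 1.0, 0.5          # 漂移系数与扩散系数(示例取值)
dt, n_steps = 0.01, 1000    # 离散时间步长与步数

x = np.array([2.0])         # 初始位置(一维示例)
for _ in range(n_steps):
    z = rng.standard_normal(x.shape)                  # z ~ N(0, I)
    x = x + (-beta * x) * dt + g * np.sqrt(dt) * z    # 欧拉-丸山更新
print(x)                    # 长时间后 x 围绕原点随机波动
```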

3. 核心数学工具:贝叶斯公式 (Bayes’ Theorem)

贝叶斯公式是连接前向过程(加噪)和反向过程(去噪)的桥梁。

对于连续变量(概率密度函数): \[ p(x|y) = \frac{p(y|x) p(x)}{p(y)} \] 其中 \(p(y) = \int p(y|x) p(x) dx\)

在扩散模型中的应用:

  1. 我们定义了一个简单的前向加噪过程 \(q(x_t | x_{t-1})\)(易于计算)。
  2. 我们想要的是反向去噪过程 \(p(x_{t-1} | x_t)\)(难以计算)。
  3. 贝叶斯公式告诉我们:\(p(x_{t-1} | x_t) \propto p(x_t | x_{t-1}) p(x_{t-1})\)。
  4. 在 DDPM 中,我们会看到一个更复杂的形式,它利用了 \(x_0\):\[ q(x_{t-1} | x_t, x_0) = \frac{q(x_t | x_{t-1}, x_0) q(x_{t-1} | x_0)}{q(x_t | x_0)} \]

小结

  1. 高斯分布的性质(重参数化、加法),能精确计算加噪后的分布。
  2. 朗之万动力学与欧拉近似,为 “逐步加噪” 提供了物理和数学模型。
  3. 贝叶斯公式,指明了如何从 “加噪” 倒推出 “去噪”。

2. DDPM 前向过程:从图像到噪声 (The Forward Process)

前向过程的目标是模拟大纲 中描述的“图片在向量空间中逐步噪声化的轨迹”。我们定义一个马尔可夫过程,在该过程中,我们从原始数据 \(\mathbf{x}_0 \sim q(\mathbf{x}_0)\)(即真实图片分布)开始,在 \(T\) 个离散的时间步中逐步向其添加高斯噪声。

2.1 单步加噪过程 \(q(\mathbf{x}_t | \mathbf{x}_{t-1})\)

在每个 \(t\) 步,我们向 \(\mathbf{x}_{t-1}\) 添加少量噪声以生成 \(\mathbf{x}_t\)。这个过程被定义为一个高斯转变(这源于我们第 1 部分中对朗之万动力学的离散化):

\[ q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}) \]

  • \(\{\beta_t\}_{t=1}^T\) 是一个预先设定的方差表 (variance schedule)。它们是 \(T\) 个很小的正常数(例如,\(\beta_1 = 10^{-4}, \beta_T = 0.02\))。
  • \(\sqrt{1 - \beta_t} \mathbf{x}_{t-1}\):这是缩放项(大纲)。我们在添加噪声之前先将前一步的图片向量“缩小”一点。
  • \(\beta_t \mathbf{I}\):这是噪声项\(\beta_t\) 是添加的噪声的方差,\(\mathbf{I}\) 是单位矩阵,表示噪声在所有维度(像素)上是独立同分布的。

重参数化技巧的应用: 我们可以使用第 1 部分中的重参数化技巧来显式地写出这个采样过程: \[ \mathbf{x}_t = \sqrt{1 - \beta_t} \mathbf{x}_{t-1} + \sqrt{\beta_t} \boldsymbol{\epsilon}_{t-1} \] 其中 \(\boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(0, \mathbf{I})\) 是在 \(t-1\) 时刻采样的一个标准高斯噪声。

2.2 累积加噪过程 \(q(\mathbf{x}_t | \mathbf{x}_0)\) (核心推导)

问题: 在训练期间(如大纲 所述),我们希望随机跳到任意 \(t\) 步并生成 \(\mathbf{x}_t\)。如果我们必须从 \(\mathbf{x}_0\) 迭代 \(t\) 次,这将非常缓慢。

目标: 我们需要一个公式,能让我们从 \(\mathbf{x}_0\) 一次性得到 \(\mathbf{x}_t\) 的分布 \(q(\mathbf{x}_t | \mathbf{x}_0)\)。这就是大纲 中提到的“构造出高斯分布的累计变换”。

推导:

  1. 定义新变量: 为了简化推导,我们定义 \(\alpha_t = 1 - \beta_t\) 和 \(\bar{\alpha}_t = \prod_{i=1}^t \alpha_i\)。
    • \(\alpha_t\) 是每一步的缩放因子。
    • \(\bar{\alpha}_t\) 是从第 1 步到第 \(t\) 步的累积缩放因子。

  2. 展开迭代 (Step-by-step Expansion): 让我们从 \(\mathbf{x}_t\) 开始,逐步代入 \(\mathbf{x}_{t-1}\): \[ \mathbf{x}_t = \sqrt{\alpha_t} \mathbf{x}_{t-1} + \sqrt{1 - \alpha_t} \boldsymbol{\epsilon}_{t-1} \] 现在,我们代入 \(\mathbf{x}_{t-1} = \sqrt{\alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}} \boldsymbol{\epsilon}_{t-2}\): \[ \begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t} (\sqrt{\alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}} \boldsymbol{\epsilon}_{t-2}) + \sqrt{1 - \alpha_t} \boldsymbol{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{\alpha_t(1 - \alpha_{t-1})} \boldsymbol{\epsilon}_{t-2} + \sqrt{1 - \alpha_t} \boldsymbol{\epsilon}_{t-1} \end{aligned} \]

  3. 合并高斯噪声 (Merging Gaussians): 注意上式的后两项:\(\sqrt{\alpha_t(1 - \alpha_{t-1})} \boldsymbol{\epsilon}_{t-2}\) 和 \(\sqrt{1 - \alpha_t} \boldsymbol{\epsilon}_{t-1}\)。

    • \(\boldsymbol{\epsilon}_{t-2}\)\(\boldsymbol{\epsilon}_{t-1}\) 是两个独立的标准高斯分布。
    • 我们正在对两个独立的、均值为 0 的高斯分布进行线性组合。
    • 根据第 1 部分的性质 1.3 (两个独立高斯分布之和),它们的和仍然是一个均值为 0 的高斯分布。
    • 这个新的高斯分布的方差是多少? \[ \begin{aligned} \text{Var}(\text{new\_noise}) &= \text{Var}(\sqrt{\alpha_t(1 - \alpha_{t-1})} \boldsymbol{\epsilon}_{t-2}) + \text{Var}(\sqrt{1 - \alpha_t} \boldsymbol{\epsilon}_{t-1}) \\ &= (\alpha_t(1 - \alpha_{t-1})) \mathbf{I} + (1 - \alpha_t) \mathbf{I} \\ &= (\alpha_t - \alpha_t\alpha_{t-1} + 1 - \alpha_t) \mathbf{I} \\ &= (1 - \alpha_t\alpha_{t-1}) \mathbf{I} \end{aligned} \]
    • 根据重参数化技巧,一个方差为 \((1 - \alpha_t\alpha_{t-1}) \mathbf{I}\) 的高斯分布,可以写成 \(\sqrt{1 - \alpha_t\alpha_{t-1}} \cdot \bar{\boldsymbol{\epsilon}}_{t-2}\),其中 \(\bar{\boldsymbol{\epsilon}}_{t-2} \sim \mathcal{N}(0, \mathbf{I})\) 是一个新的标准高斯噪声。
  4. 递归与通项公式: 我们将合并后的噪声代入第 2 步的展开式: \[ \mathbf{x}_t = \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\boldsymbol{\epsilon}}_{t-2} \]

    • \(\mathbf{x}_t\)\(\mathbf{x}_{t-2}\) 的关系,与 \(\mathbf{x}_t\)\(\mathbf{x}_{t-1}\) 的关系(\(\mathbf{x}_t = \sqrt{\alpha_t} \mathbf{x}_{t-1} + \sqrt{1 - \alpha_t} \boldsymbol{\epsilon}_{t-1}\))在形式上是完全一致的!只是 \(\alpha_t\) 变成了 \(\alpha_t \alpha_{t-1}\)
    • 我们可以将这个模式递归地应用 \(t\) 次: \[ \begin{aligned} \mathbf{x}_t &= \sqrt{(\alpha_t \alpha_{t-1} \cdots \alpha_1)} \mathbf{x}_0 + \sqrt{1 - (\alpha_t \alpha_{t-1} \cdots \alpha_1)} \boldsymbol{\epsilon} \\ &= \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon} \end{aligned} \]
    • 其中 \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\) 是一个(合并了 \(t\) 次的)标准高斯噪声。

2.3 前向过程的最终公式

我们得到了前向过程中最关键的累积加噪公式: \[ q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I}) \] 这个公式的重参数化形式为: \[ \mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon} \]

意义:

  • 训练效率: 这个公式是 DDPM 训练效率的关键。我们不需要迭代 \(t\) 次来生成 \(\mathbf{x}_t\)。
  • 随机训练: 在训练神经网络时,我们可以:
    1. 从数据集中拿一张清晰图片 \(\mathbf{x}_0\)。
    2. 随机选择一个时间步 \(t\)(例如 \(t=150\))。
    3. 从 \(\mathcal{N}(0, \mathbf{I})\) 中采样一个噪声 \(\boldsymbol{\epsilon}\)。
    4. 使用上述公式一步计算出 \(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\)。
    5. 将 \((\mathbf{x}_t, t, \boldsymbol{\epsilon})\) 喂给神经网络进行训练。
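下面用 numpy 给出“一步加噪”的最小示意(方差表与图片向量均为示例假设,并非原文代码):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # 方差表 beta_1, ..., beta_T(示例取值)
alphas_bar = np.cumprod(1.0 - betas)      # \bar{alpha}_t = prod_{i<=t} (1 - beta_i)

x0 = rng.uniform(-1, 1, size=(32 * 32,))  # 假想的一张展平图片向量
t = 150                                   # 随机选定的时间步(示例)
eps = rng.standard_normal(x0.shape)       # eps ~ N(0, I)

# 一步得到 x_t = sqrt(bar_alpha_t) * x_0 + sqrt(1 - bar_alpha_t) * eps
x_t = np.sqrt(alphas_bar[t - 1]) * x0 + np.sqrt(1.0 - alphas_bar[t - 1]) * eps
```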

\(t \to T\) 时(例如 \(T=1000\)),\(\bar{\alpha}_T = \prod_{i=1}^T (1 - \beta_i)\)。由于所有的 \(\beta_i > 0\)\(\bar{\alpha}_T\) 会非常接近 0。 此时: \[ \mathbf{x}_T \approx \sqrt{0} \mathbf{x}_0 + \sqrt{1 - 0} \boldsymbol{\epsilon} = \boldsymbol{\epsilon} \] 这意味着,在 \(T\) 步之后,\(\mathbf{x}_T\) 的分布 \(q(\mathbf{x}_T | \mathbf{x}_0) \approx \mathcal{N}(0, \mathbf{I})\),它几乎完全变成了标准高斯噪声,并且与 \(\mathbf{x}_0\) 无关。

成功地将复杂的图片分布 \(q(\mathbf{x}_0)\) 转化为了简单的标准高斯分布 \(q(\mathbf{x}_T)\)

小结

核心问题:如何逆转这个过程?如何从一张纯噪声图片 \(\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})\) 出发,一步步去噪,最终得到一张清晰的图片 \(\mathbf{x}_0\)

这需要推导反向去噪过程 (Reverse Process) \(p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)\)

3. 反向过程:从噪声到图像 (The Reverse Process)

我们的目标是学习反向的马尔可夫链,即 \(p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)\)

3.1 棘手的目标 \(p(\mathbf{x}_{t-1} | \mathbf{x}_t)\)

我们想从 \(\mathbf{x}_t\) 推导出 \(\mathbf{x}_{t-1}\)。根据贝叶斯公式: \[ p(\mathbf{x}_{t-1} | \mathbf{x}_t) = \frac{p(\mathbf{x}_t | \mathbf{x}_{t-1}) p(\mathbf{x}_{t-1})}{p(\mathbf{x}_t)} \] * \(p(\mathbf{x}_t | \mathbf{x}_{t-1})\) 就是前向过程 \(q(\mathbf{x}_t | \mathbf{x}_{t-1})\),我们已知。 * \(p(\mathbf{x}_{t-1})\)\(t-1\) 时刻的边缘分布,需要对所有 \(\mathbf{x}_0\) 积分 \(p(\mathbf{x}_{t-1}) = \int q(\mathbf{x}_{t-1} | \mathbf{x}_0) q(\mathbf{x}_0) d\mathbf{x}_0\),这依赖于 \(q(\mathbf{x}_0)\)(真实数据分布),极其困难 (intractable)。 * \(p(\mathbf{x}_t)\) 同样难以计算。

3.2 DDPM 的核心创见:利用 \(\mathbf{x}_0\)

关键洞察: 虽然 \(p(\mathbf{x}_{t-1} | \mathbf{x}_t)\) 难以计算,但如果我们额外知道 \(\mathbf{x}_0\),这个后验分布 \(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\) 是可计算的

为什么?因为我们定义了所有 \(q\) 的前向步骤。我们再次使用贝叶斯公式 (对应): \[ q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) \cdot q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_t | \mathbf{x}_0)} \] 利用马尔可夫性质 \(q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) = q(\mathbf{x}_t | \mathbf{x}_{t-1})\),我们得到: \[ q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \propto q(\mathbf{x}_t | \mathbf{x}_{t-1}) \cdot q(\mathbf{x}_{t-1} | \mathbf{x}_0) \] 我们已知这三个分布都是高斯分布(在第 2 部分已推导): 1. \(q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{\alpha_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I})\) 2. \(q(\mathbf{x}_{t-1} | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0, (1 - \bar{\alpha}_{t-1}) \mathbf{I})\) 3. \(q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})\)

我们正在用 \(\mathbf{x}_{t-1}\) 作为变量,乘以两个高斯分布的概率密度函数 (PDF)。高斯 PDF 的形式是 \(C \cdot \exp(-\frac{(x - \mu)^2}{2\sigma^2})\)。两个高斯 PDF 相乘的结果仍然是一个高斯分布

通过匹配 \(\mathbf{x}_{t-1}\) 的一次项和二次项系数(一个繁琐但直接的代数过程),我们可以解出这个新高斯分布的均值 \(\tilde{\boldsymbol{\mu}}_t\) 和方差 \(\tilde{\beta}_t\)

\[ q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I}) \] 其中: * 方差 \(\tilde{\beta}_t\):不依赖于 \(\mathbf{x}_t\)\(\mathbf{x}_0\),它是一个固定的超参数。 \[ \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \] * 均值 \(\tilde{\boldsymbol{\mu}}_t\):依赖于 \(\mathbf{x}_t\)\(\mathbf{x}_0\)\[ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t \]
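下面是按上式计算 \(\tilde{\boldsymbol{\mu}}_t\) 与 \(\tilde{\beta}_t\) 的最小 numpy 示意(函数名为本文虚构,并约定 \(\bar{\alpha}_0 = 1\)):

```python
import numpy as np

def posterior_mean_var(x_t, x0, t, betas):
    """计算 q(x_{t-1} | x_t, x_0) 的均值与方差(t 从 1 开始计数)。"""
    alphas = 1.0 - betas
    alphas_bar = np.cumprod(alphas)
    a_bar_t = alphas_bar[t - 1]
    a_bar_prev = alphas_bar[t - 2] if t > 1 else 1.0   # 约定 \bar{alpha}_0 = 1
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]

    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t               # \tilde{beta}_t
    mean = (np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)) * x0 \
         + (np.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)) * x_t
    return mean, var
```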

4. 训练:学习反向过程 (Training)

4.1 神经网络的目标

我们有了一个完美的目标分布 \(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\)。但它有个问题:在推理 (Inference) 时,我们从 \(\mathbf{x}_T\) 开始,并不知道 \(\mathbf{x}_0\)

因此,我们训练一个神经网络 \(p_\theta\)近似这个分布: \[ p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)) \] 我们的目标是让 \(p_\theta\) 尽可能接近 \(q\)。 * 简化1 (固定方差):DDPM 论文发现,将神经网络的方差 \(\boldsymbol{\Sigma}_\theta\) 固定为 \(\tilde{\beta}_t \mathbf{I}\)\(\beta_t \mathbf{I}\) 效果最好。这极大地简化了问题:神经网络只需要学习均值 \(\boldsymbol{\mu}_\theta\)。 * 简化2 (学习目标):我们训练 \(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\) 来预测真实均值 \(\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)\)

4.2 DDPM 的关键改进:预测噪声

\(\tilde{\boldsymbol{\mu}}_t\) 的公式 \(\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t\) 仍然很复杂。

DDPM 论文提出了一个重要的重参数化: 我们回顾第 2 部分的前向公式:\(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\) 我们可以用它来反解 \(\mathbf{x}_0\)(在 \(\mathbf{x}_t\) 和 \(\boldsymbol{\epsilon}\) 已知的情况下): \[ \mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} (\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}) \] 现在,我们将这个 \(\mathbf{x}_0\) 的表达式代入上面 \(\tilde{\boldsymbol{\mu}}_t\) 的复杂公式中: \[ \begin{aligned} \tilde{\boldsymbol{\mu}}_t &= \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \left( \frac{1}{\sqrt{\bar{\alpha}_t}} (\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}) \right) + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t \\ &= \left( \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{(1 - \bar{\alpha}_t)\sqrt{\bar{\alpha}_t}} + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \right) \mathbf{x}_t - \left( \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t \sqrt{1 - \bar{\alpha}_t}}{(1 - \bar{\alpha}_t)\sqrt{\bar{\alpha}_t}} \right) \boldsymbol{\epsilon} \end{aligned} \] (经过一系列基于 \(\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}\) 和 \(\beta_t = 1 - \alpha_t\) 的代数化简) \[ \tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon} \right) \] 分析这个优美的公式: * \(\alpha_t\), \(\beta_t\), \(\bar{\alpha}_t\) 都是预先设定的超参数。 * \(\mathbf{x}_t\) 是神经网络的输入。 * 唯一未知的就是 \(\boldsymbol{\epsilon}\) —— 那个在第 2 部分用于从 \(\mathbf{x}_0\) 生成 \(\mathbf{x}_t\) 的原始噪声!

结论(DDPM 核心思想): 与其让神经网络 \(\boldsymbol{\mu}_\theta\) 预测那个复杂的均值 \(\tilde{\boldsymbol{\mu}}_t\),我们可以让它转而去预测这个噪声 \(\boldsymbol{\epsilon}\)。 我们定义一个神经网络(通常是 U-Net 结构)\(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\),它的目标就是预测 \(\boldsymbol{\epsilon}\)。

4.3 训练流程与损失函数 (对应-[01:24:40])

  1. 从数据集中随机抽取一张清晰图像 \(\mathbf{x}_0\)
  2. 随机选择一个时间步 \(t\)(从 1 到 \(T\))。(对应 随机训练)
  3. 随机采样一个标准高斯噪声 \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\)。(对应)
  4. 使用前向公式一步生成加噪图像:\(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\)
  5. \((\mathbf{x}_t, t)\) 作为输入,喂给神经网络 \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\),得到预测噪声 \(\boldsymbol{\epsilon}_\theta\)
  6. 计算损失函数(均方误差 MSE):(对应) \[ L(\theta) = E_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ || \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) ||^2 \right] \]
  7. 使用梯度下降更新网络参数 \(\theta\)
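上述训练流程可以写成如下最小示意(numpy 伪实现:`eps_model` 是假想的噪声预测网络,真实训练需用深度学习框架做反向传播,这里只演示训练目标的构造):

```python
import numpy as np

def ddpm_training_step(x0, eps_model, alphas_bar, rng):
    """单个样本的训练目标构造(eps_model 为假想的噪声预测网络,不做反向传播)。"""
    T = len(alphas_bar)
    t = int(rng.integers(1, T + 1))                 # 2. 随机时间步 t ∈ {1, ..., T}
    eps = rng.standard_normal(x0.shape)             # 3. 采样 eps ~ N(0, I)
    x_t = np.sqrt(alphas_bar[t - 1]) * x0 \
        + np.sqrt(1.0 - alphas_bar[t - 1]) * eps    # 4. 一步加噪得到 x_t
    eps_pred = eps_model(x_t, t)                    # 5. 网络预测噪声
    return np.mean((eps - eps_pred) ** 2)           # 6. MSE 损失 ||eps - eps_theta||^2
```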

5. 推理:逐步去噪生成图像 (Inference)

当训练好 \(\boldsymbol{\epsilon}_\theta\) 后,就可以从纯噪声生成图像了:

  1. 起始: 从标准高斯分布中采样一张纯噪声图像 \(\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})\)
  2. 迭代: \(t = T\) 循环到 \(t = 1\)
    1. 将当前的 \(\mathbf{x}_t\) 和时间步 \(t\) 输入网络,得到噪声预测:\(\boldsymbol{\epsilon}_\theta = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\)。(对应)
    2. 使用 \(\boldsymbol{\epsilon}_\theta\) 作为我们对 \(\boldsymbol{\epsilon}\) 的最佳估计,代入 4.2 中的均值公式,计算 \(t-1\) 步的均值 \(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\):(对应) \[ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta \right) \]
    3. 计算 \(t-1\) 步的方差。我们使用固定的方差 \(\sigma_t^2 \mathbf{I} = \tilde{\beta}_t \mathbf{I} = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t \mathbf{I}\)
    4. 采样 (Sampling) (对应):从这个高斯分布中采样 \(\mathbf{x}_{t-1}\)\[ \mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z} \] 其中 \(\mathbf{z} \sim \mathcal{N}(0, \mathbf{I})\) 是一个新采样的随机噪声。 (注意:当 \(t=1\) 时,\(\mathbf{z}\) 设为 0,因为 \(\mathbf{x}_0\) 应该是一个确定性的输出,不再添加噪声)。
  3. 结束: 当循环结束时,\(\mathbf{x}_0\) 就是生成的清晰图像。(对应)
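上述推理循环的最小示意如下(`eps_model` 仍为假想的噪声预测网络,并非原文代码):

```python
import numpy as np

def ddpm_sample(eps_model, betas, shape, rng):
    """DDPM 逐步去噪采样(eps_model 为假想的噪声预测网络)。"""
    alphas = 1.0 - betas
    alphas_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                       # 1. x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):                   # 2. t = T, ..., 1
        eps_pred = eps_model(x, t)
        mean = (x - betas[t - 1] / np.sqrt(1.0 - alphas_bar[t - 1]) * eps_pred) \
               / np.sqrt(alphas[t - 1])                  # mu_theta(x_t, t)
        if t > 1:
            var = (1.0 - alphas_bar[t - 2]) / (1.0 - alphas_bar[t - 1]) * betas[t - 1]
            x = mean + np.sqrt(var) * rng.standard_normal(shape)   # 加入新噪声 z
        else:
            x = mean                                     # t = 1 时 z 置 0
    return x                                             # 3. 输出生成的 x_0
```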

小结

完整地推导了 DDPM 的核心数学原理: 1. 前向过程 \(q\):使用 \(q(\mathbf{x}_t | \mathbf{x}_0)\) 高效加噪。 2. 反向过程 \(p_\theta\):通过贝叶斯公式推导出 \(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\) 作为理想目标。 3. 训练 \(p_\theta\):通过让 \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) 预测真实噪声 \(\boldsymbol{\epsilon}\) 来简化训练目标 (MSE Loss)。 4. 推理 \(p_\theta\):从 \(\mathbf{x}_T\) 开始,利用 \(\boldsymbol{\epsilon}_\theta\) 预测的均值,逐步采样 \(\mathbf{x}_{t-1}\)

DDPM 的一个主要缺点是推理速度慢(需要 \(T\) 步,例如 1000 步)。

DDIM (Denoising Diffusion Implicit Models)

DDPM 的效果很好,但它有两个主要缺点: 1. 推理速度慢: 它是一个马尔可夫过程,从 \(\mathbf{x}_T\) 生成 \(\mathbf{x}_0\) 必须执行 \(T\) 步(例如 1000 步)采样,非常耗时。 2. 推理是随机的: (Stochastic) 在每一步采样 \(\mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z}\) 时,都需要加入一个新的随机噪声 \(\mathbf{z}\)。这意味着即使从同一个 \(\mathbf{x}_T\) 出发,两次推理也会得到不同的 \(\mathbf{x}_0\)。这对于需要一致性的任务(如图像编辑)来说是个问题。

DDIM(2020 年提出)巧妙地解决了这两个问题,并且可以直接复用按 DDPM 方式训练好的模型,无需重新训练。


6. DDIM:推理升级

6.1 DDIM 的核心洞察:重新审视反向过程

DDIM 的出发点是重新审视我们推导出的反向过程。DDPM 假设反向过程是 \(p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)\),并用它来近似 \(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\)

DDIM 注意到,我们训练的神经网络 \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) 实际上是在预测噪声 \(\boldsymbol{\epsilon}\)。 回顾我们的前向公式: \[ \mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon} \] 既然我们有了 \(\mathbf{x}_t\)(当前输入)和 \(\boldsymbol{\epsilon}_\theta\)(网络预测的 \(\boldsymbol{\epsilon}\)),我们可以直接反解出对清晰图像 \(\mathbf{x}_0\) 的预测,我们称之为 \(\hat{\mathbf{x}}_0\)

\[ \hat{\mathbf{x}}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right) \] 这个 \(\hat{\mathbf{x}}_0\) 是给定 \(\mathbf{x}_t\) 时,模型对最终结果 \(\mathbf{x}_0\) 的“最佳猜测”。

6.2 升级 1:确定性推理 (Deterministic Inference)

DDPM 的采样公式为: \[ \mathbf{x}_{t-1} = \underbrace{\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)}_{\text{均值}} + \underbrace{\sigma_t \mathbf{z}}_{\text{随机噪声}} \] DDIM 提出,这个过程不一定是随机的。DDIM 引入了一个新的参数 \(\eta\) (eta) 来控制随机性。 * 当 \(\eta=1\) 时,DDIM 的采样过程与 DDPM 完全相同(随机)。 * 当 \(\eta=0\) 时,采样过程中的方差 \(\sigma_t\) 被设为 0。

\(\eta=0\)(方差 \(\sigma_t=0\))时,采样步骤变为: \[ \mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) \] 这是确定性的! 没有随机噪声 \(\mathbf{z}\) 的介入。

这有什么用? 这意味着从一个固定的 \(\mathbf{x}_T\) 出发,无论运行多少次,总会生成完全相同的 \(\mathbf{x}_0\)。这使得扩散模型可用于图像编辑、风格转换等需要保持一致性的任务。

DDIM 论文推导出了一个更通用的采样公式,它不依赖于 \(\boldsymbol{\mu}_\theta\) 而是直接使用 \(\hat{\mathbf{x}}_0\)\(\boldsymbol{\epsilon}_\theta\)。 当 \(\eta=0\) (即 \(\sigma_t = 0\)) 时,从 \(\mathbf{x}_t\)\(\mathbf{x}_{t-1}\)确定性采样公式为:

\[ \mathbf{x}_{t-1} = \underbrace{\sqrt{\bar{\alpha}_{t-1}} \hat{\mathbf{x}}_0}_{\text{指向“预测的” } \mathbf{x}_0} + \underbrace{\sqrt{1 - \bar{\alpha}_{t-1}} \cdot \left( \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \hat{\mathbf{x}}_0}{\sqrt{1 - \bar{\alpha}_t}} \right)}_{\text{指向“当前的” } \mathbf{x}_t \text{ 的方向}} \] (注意:\(\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \hat{\mathbf{x}}_0}{\sqrt{1 - \bar{\alpha}_t}}\) 正好等于 \(\boldsymbol{\epsilon}_\theta\) ) 所以,确定性(\(\eta=0\))的 DDIM 采样步骤也可以写为: \[ \mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \hat{\mathbf{x}}_0 + \sqrt{1 - \bar{\alpha}_{t-1}} \cdot \boldsymbol{\epsilon}_\theta \]

6.3 升级 2:跳步采样 (Skip Sampling)

DDPM 必须一步一步 \(t \to t-1 \to t-2 \dots\) 地采样,因为它是马尔可夫过程。

DDIM 的采样公式(如上所示)是非马尔可夫的。它不依赖于 \(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\),而是直接使用在 \(t\) 时刻预测的 \(\hat{\mathbf{x}}_0\) 来计算 \(\mathbf{x}_{t-1}\)

关键洞察: 既然我们能从 \(\mathbf{x}_t\) 预测出 \(\hat{\mathbf{x}}_0\),我们不仅能计算 \(\mathbf{x}_{t-1}\),我们能计算任意 \(\mathbf{x}_{\tau}\) (其中 \(\tau < t\))。

这使得跳步采样成为可能。我们不再需要完整的 \(T=1000\) 步,我们可以定义一个更短的子序列,例如 \(S=20\) 步: \((\tau_1, \tau_2, \dots, \tau_S) = (1, 51, 101, \dots, 951)\)

我们的推理循环不再是 for t in (T...1),而是 for i in (S...1): * 当前步\(\tau_i\) (例如 \(\tau_{20} = 951\)) * 目标步\(\tau_{i-1}\) (例如 \(\tau_{19} = 901\))

DDIM 跳步采样(确定性)公式:

  1. 输入: 当前噪声图像 \(\mathbf{x}_{\tau_i}\) 和时间步 \(\tau_i\)
  2. 预测 \(\hat{\mathbf{x}}_0\) (与之前相同) \[ \hat{\mathbf{x}}_0 = \frac{1}{\sqrt{\bar{\alpha}_{\tau_i}}} \left( \mathbf{x}_{\tau_i} - \sqrt{1 - \bar{\alpha}_{\tau_i}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_{\tau_i}, \tau_i) \right) \]
  3. 计算 \(\mathbf{x}_{\tau_{i-1}}\) (使用确定性公式,将 \(t-1\) 替换为 \(\tau_{i-1}\)) \[ \mathbf{x}_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}} \hat{\mathbf{x}}_0 + \sqrt{1 - \bar{\alpha}_{\tau_{i-1}}} \cdot \boldsymbol{\epsilon}_\theta(\mathbf{x}_{\tau_i}, \tau_i) \]

结果: 我们不再需要 1000 步计算,而是通过跳步,仅用 20 步就完成了从 \(\mathbf{x}_T\)\(\mathbf{x}_0\) 的生成。这极大地(例如 50 倍)提升了推理速度。
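下面给出确定性(\(\eta=0\))DDIM 跳步采样的最小示意(`eps_model` 与时间步子序列 `taus` 均为示例假设):

```python
import numpy as np

def ddim_sample(eps_model, alphas_bar, shape, taus, rng):
    """确定性 (eta=0) DDIM 跳步采样;taus 为递增的时间步子序列,例如 [1, 51, ..., 951]。"""
    x = rng.standard_normal(shape)                        # 从近似纯噪声出发
    for i in range(len(taus) - 1, 0, -1):                 # 沿子序列从大到小跳步
        t, t_prev = taus[i], taus[i - 1]
        a_t, a_prev = alphas_bar[t - 1], alphas_bar[t_prev - 1]
        eps_pred = eps_model(x, t)
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)   # 预测 \hat{x}_0
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_pred
    return x   # 循环结束得到 x_{tau_1}(例如 t=1),即近似的生成图像
```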

小结

DDIM 是对 DDPM 的一次重大升级,它通过引入 \(\hat{\mathbf{x}}_0\) 预测和非马尔可夫采样,实现了: 1. 确定性推理\(\eta=0\)),增强了模型的可控性。 2. 跳步采样,极大缩短了推理时间。 3. 最重要的是,它复用了 DDPM 训练好的 \(\boldsymbol{\epsilon}_\theta\) 模型,无需额外训练。

接下来,我们探讨扩散模型演进的下一个重要阶段:流匹配 (Flow Matching)。

在 DDPM 和 DDIM 中,我们都依赖于一个离散时间的 SDE(随机微分方程)或其确定性版本。我们模拟了 \(T\) 个离散步骤(例如 \(T=1000\)),这在数学上是有效的,但在概念上有些繁琐,并且依赖于 \(\beta_t\) (或 \(\bar{\alpha}_t\)) 这个人工设计的“噪声表”。

流匹配 (Flow Matching, FM) 模型(2022年及后续工作)提出了一种更简洁、更根本的视角:连续时间常微分方程 (ODE)


7. 流匹配:连续轨迹与速度预测

7.1 核心思想:从 SDE 到 ODE

  • DDPM (SDE):将图像 \(\mathbf{x}_0\) 变为噪声 \(\mathbf{x}_T\) 的过程是随机的(\(\mathbf{x}_t = \sqrt{\alpha_t} \mathbf{x}_{t-1} + \sqrt{\beta_t} \boldsymbol{\epsilon}\))。
  • 流匹配 (ODE):我们构建一个确定性的连续的“流”,将纯噪声 \(\mathbf{z}\)(我们这里称为 \(\mathbf{x}_0\))平滑地转变为清晰图像 \(\mathbf{x}_1\)

我们不再考虑离散步骤 \(t=1, 2, \dots, T\),而是考虑一个连续时间 \(t \in [0, 1]\)

  • \(t=0\)\(\mathbf{x}_0\) 是从 \(\mathcal{N}(0, \mathbf{I})\) 采样的纯噪声。
  • \(t=1\)\(\mathbf{x}_1\) 是我们想要生成的清晰图像。

这个从 \(\mathbf{x}_0\)\(\mathbf{x}_1\) 的连续演变路径 \(\mathbf{x}_t\) 由一个常微分方程 (ODE) 描述:

\[ \frac{d\mathbf{x}_t}{dt} = \mathbf{v}(\mathbf{x}_t, t) \] * \(\mathbf{v}(\mathbf{x}_t, t)\) 是一个速度向量场 (velocity vector field)。 * 它告诉我们:当一个点位于位置 \(\mathbf{x}_t\) 和时间 \(t\) 时,它应该往哪个方向(向量)以多快的速度(模长)移动。 * 训练目标: 我们的神经网络 \(\mathbf{v}_\theta(\mathbf{x}_t, t)\) 的目标就是学习这个速度场 \(\mathbf{v}\),而不是像 DDPM 那样学习噪声 \(\boldsymbol{\epsilon}\)

7.2 训练:学习速度场

问题: 理论上存在一个理想的速度场 \(\mathbf{v}\) 可以将噪声分布“推向”图像分布,但这个理想的 \(\mathbf{v}\) 非常复杂,我们无法知道。

流匹配的创见: 我们不需要知道那个复杂的理想场。我们可以自己定义无数条简单的路径,然后训练网络来学习这些简单路径的平均速度。

1. 定义简单路径(直线模型): 给定一个噪声 \(\mathbf{x}_0 \sim \mathcal{N}(0, \mathbf{I})\) 和一张真实图像 \(\mathbf{x}_1 \sim q(\text{data})\),连接它们的最简单路径是什么?一条直线

\[ \mathbf{x}_t = (1 - t) \mathbf{x}_0 + t \mathbf{x}_1 \]

  • 当 \(t=0\) 时,\(\mathbf{x}_t = \mathbf{x}_0\) (噪声)。
  • 当 \(t=1\) 时,\(\mathbf{x}_t = \mathbf{x}_1\) (图像)。

2. 计算目标速度: 如果我们的“粒子” \(\mathbf{x}_t\) 沿着这条直线路径运动,它在 \(t\) 时刻的速度 \(\mathbf{v}_t\) 是多少?我们对 \(t\) 求导:

\[ \mathbf{v}_t = \frac{d\mathbf{x}_t}{dt} = \frac{d}{dt} \left( (1 - t) \mathbf{x}_0 + t \mathbf{x}_1 \right) \] 即 \[ \mathbf{v}_t = -\mathbf{x}_0 + \mathbf{x}_1 = \mathbf{x}_1 - \mathbf{x}_0 \] 这就是流匹配的训练目标! 沿着这条直线路径,在任何时间 \(t\),目标速度都是恒定的 \(\mathbf{x}_1 - \mathbf{x}_0\)。

3. 训练流程:

  1. 从数据集中随机抽取一张清晰图像 \(\mathbf{x}_1\)
  2. 随机采样一个标准高斯噪声 \(\mathbf{x}_0 \sim \mathcal{N}(0, \mathbf{I})\)
  3. 随机选择一个时间 \(t\)(从 \(U(0, 1)\) 均匀采样)。
  4. 使用直线公式一步计算出路径上的点:\(\mathbf{x}_t = (1 - t) \mathbf{x}_0 + t \mathbf{x}_1\)
  5. \((\mathbf{x}_t, t)\) 作为输入,喂给神经网络 \(\mathbf{v}_\theta(\mathbf{x}_t, t)\),得到预测速度 \(\mathbf{v}_\theta\)
  6. 计算损失函数(均方误差 MSE): \[ L(\theta) = E_{t, \mathbf{x}_0, \mathbf{x}_1} \left[ || (\mathbf{x}_1 - \mathbf{x}_0) - \mathbf{v}_\theta(\mathbf{x}_t, t) ||^2 \right] \]
  7. 使用梯度下降更新网络参数 \(\theta\)。
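上述训练流程的最小示意如下(`v_model` 是假想的速度预测网络,这里只演示单样本损失的构造):

```python
import numpy as np

def flow_matching_loss(x1, v_model, rng):
    """单个样本的流匹配训练目标(v_model 为假想的速度预测网络,不做反向传播)。"""
    x0 = rng.standard_normal(x1.shape)        # 2. 噪声端 x_0 ~ N(0, I)
    t = rng.uniform(0.0, 1.0)                 # 3. t ~ U(0, 1)
    x_t = (1.0 - t) * x0 + t * x1             # 4. 直线路径上的点
    v_target = x1 - x0                        # 目标速度恒为 x_1 - x_0
    v_pred = v_model(x_t, t)                  # 5. 网络预测速度
    return np.mean((v_target - v_pred) ** 2)  # 6. MSE 损失
```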

7.3 推理:求解 ODE

当我们训练好 \(\mathbf{v}_\theta\) 后,我们就有了一个完整的速度场,它知道在时空中的任何点 \((\mathbf{x}, t)\) 应该如何移动。

问题: 如何从 \(\mathbf{x}_0\) 积分到 \(\mathbf{x}_1\)? 我们需要求解 ODE \(\frac{d\mathbf{x}_t}{dt} = \mathbf{v}_\theta(\mathbf{x}_t, t)\),从 \(t=0\) 求解到 \(t=1\)

方法: 我们使用数值积分方法,最简单的就是欧拉近似法(我们在第 1 部分 提到过)。

  1. 起始: 随机采样一张纯噪声图像 \(\mathbf{x}_0 \sim \mathcal{N}(0, \mathbf{I})\)
  2. 离散化: 将时间 \([0, 1]\) 分为 \(N\) 步(例如 \(N=20\)),每一步 \(\Delta t = 1/N\)
  3. 迭代: \(t = 0\) 循环到 \(t = 1 - \Delta t\)
    1. 获取当前位置 \(\mathbf{x}_t\) 和时间 \(t\)
    2. 输入网络,得到当前速度:\(\mathbf{v}_\theta = \mathbf{v}_\theta(\mathbf{x}_t, t)\)
    3. 欧拉法更新: \[ \mathbf{x}_{t + \Delta t} = \mathbf{x}_t + \mathbf{v}_\theta \cdot \Delta t \] (新位置 = 旧位置 + 速度 × 时间)
  4. 结束: 当循环结束时,\(\mathbf{x}_1\) 就是生成的清晰图像。
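上述 ODE 求解循环的最小示意如下(`v_model` 为假想的速度预测网络):

```python
import numpy as np

def flow_matching_sample(v_model, shape, n_steps, rng):
    """用欧拉法求解 ODE,从 t=0 积分到 t=1(v_model 为假想的速度预测网络)。"""
    dt = 1.0 / n_steps
    x = rng.standard_normal(shape)            # 1. x_0 ~ N(0, I)
    for i in range(n_steps):                  # 3. 迭代 n_steps 步
        t = i * dt
        v = v_model(x, t)                     # 当前速度 v_theta(x_t, t)
        x = x + v * dt                        # 欧拉更新:新位置 = 旧位置 + 速度 × 时间
    return x                                  # 4. 近似的 x_1,即生成的图像
```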

优势:

  • 更简单: 训练目标 \(\mathbf{x}_1 - \mathbf{x}_0\) 非常直观,摆脱了 DDPM 中复杂的 \(\bar{\alpha}_t, \beta_t\) 系数。
  • 更高效: ODE 路径通常比 SDE 路径“更直”,因此流匹配通常可以用更少的推理步骤(例如 10-50 步)生成高质量图像。
  • 更灵活: 我们可以使用比欧拉法更高级的 ODE 求解器(如 Runge-Kutta)来进一步提高精度和速度。

总结

扩散模型从基础到前沿的全部核心数学:

  1. 基础 (Part 1):高斯分布、朗之万动力学 (SDE) 和贝叶斯公式。
  2. DDPM (Part 2-5)
    • 前向 (q)\(q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})\)
    • 反向 (p)\(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\) 的推导。
    • 训练:预测噪声 \(L = || \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) ||^2\)
    • 推理\(\mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z}\) (随机,T 步)。
  3. DDIM (Part 6)
    • 核心:预测 \(\hat{\mathbf{x}}_0\)
    • 推理\(\mathbf{x}_{\tau_{i-1}} = \dots\) (确定性,可跳步)。
  4. 流匹配 (Part 7)
    • 核心:ODE 连续流 \(\frac{d\mathbf{x}_t}{dt} = \mathbf{v}(\mathbf{x}_t, t)\)
    • 训练:预测速度 \(L = || (\mathbf{x}_1 - \mathbf{x}_0) - \mathbf{v}_\theta(\mathbf{x}_t, t) ||^2\)
    • 推理\(\mathbf{x}_{t + \Delta t} = \mathbf{x}_t + \mathbf{v}_\theta \cdot \Delta t\) (ODE 求解)。

统计机器学习 Lecture-6

Lecturer: Prof. XIA DONG

1. Linear Model Selection and Regularization 线性模型选择与正则化

Summary of Core Concepts

Chapter 6: Linear Model Selection and Regularization, focusing specifically on Section 6.1: Subset Selection. 第六章:线性模型选择与正则化,6.1 节:子集选择。

  • The Problem: You have a dataset with many potential predictor variables (features). If you include all of them (like Model 1 with \(p\) predictors in slide ...221320.png), you risk including “noise” variables. These irrelevant features can decrease model accuracy (overfitting) and make the model difficult to interpret. 数据集包含许多潜在的预测变量(特征)。如果包含所有这些变量(例如幻灯片“…221320.png”中带有\(p\)个预测变量的模型1),则可能会包含“噪声”变量。这些不相关的特征会降低模型的准确率(过拟合),并使模型难以解释。

  • The Goal: Identify a smaller subset of variables that are truly related to the response. This creates a simpler, more interpretable, and often more accurate model (like Model 2 with \(q\) predictors). 找出一个与响应真正相关的较小变量子集。这将创建一个更简单、更易于解释且通常更准确的模型(例如带有\(q\)个预测变量的模型2)。

  • The Main Method Discussed: Best Subset Selection

  • 主要讨论的方法:最佳子集选择 This is an exhaustive search algorithm. It checks every possible combination of predictors to find the “best” model. With \(p\) variables, this means checking \(2^p\) total models. 这是一种穷举搜索算法。它检查所有可能的预测变量组合,以找到“最佳”模型。对于 \(p\) 个变量,这意味着需要检查总共 \(2^p\) 个模型。

    The algorithm (from slide ...221333.png) works in three steps:

    1. Step 1: Fit the “null model” \(M_0\), which has no predictors (it just predicts the average of the response). 拟合“空模型”\(M_0\),它没有预测变量(它只预测响应的平均值)。

    2. Step 2: For each \(k\) (from 1 to \(p\)):

      • Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors. (e.g., fit all models with 1 predictor, then all models with 2 predictors, etc.).

      • 拟合所有包含 \(k\) 个预测变量的 \(\binom{p}{k}\) 个模型。(例如,先拟合所有包含 1 个预测变量的模型,然后拟合所有包含 2 个预测变量的模型,等等)。

      • From this group, select the single best model for that size \(k\). This “best” model is the one with the highest \(R^2\) (or lowest RSS - Residual Sum of Squares) on the training data. Call this model \(M_k\).

      • 从这组中,选择 对于该规模 \(k\) 的最佳模型。这个“最佳”模型是在 训练数据 上具有最高 \(R^2\)(或最低 RSS - 残差平方和)的模型。将此模型称为 \(M_k\)

    3. Step 3: You now have \(p+1\) models: \(M_0, M_1, \dots, M_p\). You must select the single best one from this list. To do this, you cannot use training \(R^2\) (as it will always pick the biggest model \(M_p\)). Instead, you must use a metric that estimates test error, such as: 现在你有 \(p+1\) 个模型:\(M_0, M_1, \dots, M_p\)。你必须从列表中选择一个最佳模型。为此,你不能使用训练 \(R^2\)(因为它总是会选择最大的模型 \(M_p\))。相反,你必须使用一个能够估计测试误差的指标,例如:

      • Cross-Validation (CV) 交叉验证 (CV) (This is what the Python code uses)
      • AIC (Akaike Information Criterion 赤池信息准则)
      • BIC (Bayesian Information Criterion 贝叶斯信息准则)
      • Adjusted \(R^2\) 调整后的 \(R^2\)
  • Key Takeaway: The slides show this “subset selection” concept can be applied beyond linear models. The Python code demonstrates this by applying best subset selection to a K-Nearest Neighbors (KNN) Regressor, a non-linear model.“子集选择”的概念可以应用于线性模型之外
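A minimal sketch of the selection procedure described above (not the slides' exact code; the KNN settings and helper name are illustrative, and the commented usage lines assume a preprocessed, scaled feature matrix):

```python
from itertools import combinations

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

def best_subset_selection(X: pd.DataFrame, y, model):
    """Exhaustively score every feature subset with 5-fold CV (neg-MSE)."""
    results = []
    features = list(X.columns)
    for k in range(1, len(features) + 1):            # Step 2: for k = 1..p
        for subset in combinations(features, k):     # all C(p, k) subsets of size k
            scores = cross_val_score(model, X[list(subset)], y,
                                     scoring="neg_mean_squared_error", cv=5)
            results.append((subset, scores.mean()))
    # Step 3: pick the subset whose CV score is closest to 0 (highest neg-MSE)
    return max(results, key=lambda r: r[1])

# Usage sketch (X_raw, y assumed to be loaded and dummy-encoded beforehand):
# X = pd.DataFrame(StandardScaler().fit_transform(X_raw), columns=X_raw.columns)
# best_subset, best_score = best_subset_selection(X, y, KNeighborsRegressor(n_neighbors=10))
```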

Mathematical Understanding & Key Questions 数学理解与关键问题

This section directly answers the questions posed on your slides.

How to compare which model is better?

(From slides ...221320.png and ...221326.png)

You cannot use training error (like \(R^2\) or RSS) to compare models with different numbers of predictors. A model with more predictors will almost always have a better training score, even if those extra predictors are just noise. This is called overfitting. 不能使用训练误差(例如 \(R^2\) 或 RSS)来比较具有不同数量预测变量的模型。具有更多预测变量的模型几乎总是具有更好的训练分数,即使这些额外的预测变量只是噪声。这被称为过拟合

To compare models of different sizes (like Model 1 vs. Model 2, or \(M_2\) vs. \(M_5\)), you must use a method that estimates test error (how the model performs on new, unseen data). The slides mention: 要比较不同大小的模型(例如模型 1 与模型 2,或 \(M_2\)\(M_5\)),您必须使用一种估算测试误差(模型在新的、未见过的数据上的表现)的方法。

  • Cross-Validation (CV): This is the gold standard. You split your data into “folds,” train the model on some folds, and test it on the remaining fold. You repeat this and average the test scores. The model with the best (e.g., lowest) average CV error is chosen. 将数据分成“折叠”,在一些折叠上训练模型,然后在剩余的折叠上测试模型。重复此操作并取测试分数的平均值。选择平均 CV 误差最小(例如,最小)的模型。

  • AIC & BIC: These are mathematical adjustments to the training error (like RSS) that add a penalty for having more predictors. They balance model fit with model complexity. 这些是对训练误差(如 RSS)的数学调整,会因预测变量较多而增加惩罚。它们平衡了模型拟合度和模型复杂度

Why use \(R^2\) in Step 2?

(From slide ...221333.png)

In Step 2, you are only comparing models of the same size (i.e., all models that have exactly \(k\) predictors). For models with the same number of parameters, a higher \(R^2\) (or lower RSS) on the training data directly corresponds to a better fit. You don’t need to penalize for complexity because all models being compared have the same complexity. 只比较大小相同的模型(即所有恰好具有 \(k\) 个预测变量的模型)。对于参数数量相同的模型,训练数据上更高的 \(R^2\)(或更低的 RSS)直接对应着更好的拟合度。您不需要对复杂度进行惩罚,因为所有被比较的模型都具有相同的复杂度

Why can’t we use training error in Step 3?

(From slide ...221333.png)

In Step 3, you are comparing models of different sizes (\(M_0\) vs. \(M_1\) vs. \(M_2\), etc.). As you add predictors, the training \(R^2\) will always go up (or stay the same), and the training RSS will always go down (or stay the same). If you used \(R^2\) to pick the best model in Step 3, you would always pick the most complex model \(M_p\), which is almost certainly overfit. 将比较不同大小的模型(例如 \(M_0\) vs. \(M_1\) vs. \(M_2\) 等)。随着您添加预测变量,训练 \(R^2\)始终上升(或保持不变),而训练 RSS 将始终下降(或保持不变)。如果您在步骤 3 中使用 \(R^2\) 来选择最佳模型,那么您始终会选择最复杂的模型 \(M_p\),而该模型几乎肯定会过拟合。

Therefore, you must use a metric that estimates test error (like CV) or penalizes for complexity (like AIC, BIC, or Adjusted \(R^2\)) to find the right balance between fit and simplicity. 因此,您必须使用一个可以估算测试误差(例如 CV)或惩罚复杂度(例如 AIC、BIC 或调整后的 \(R^2\))的指标来找到拟合度和简单性之间的平衡。

Code Analysis

The Python code (slides ...221249.jpg and ...221303.jpg) implements the Best Subset Selection algorithm using KNN Regression.

Key Functions

  • main():
    1. Loads Data: Reads the Credit.csv file.
    2. Preprocesses Data:
      • Converts categorical features (‘Gender’, ‘Student’, ‘Married’, ‘Ethnicity’) into numerical ones (dummy variables). 将分类特征(“性别”、“学生”、“已婚”、“种族”)转换为数值特征(虚拟变量)。
      • Creates the feature matrix X and target variable y (‘Balance’). 创建特征矩阵 X 和目标变量 y(“余额”)。
      • Scales the features using StandardScaler. This is crucial for KNN, which is sensitive to the scale of features. 用 StandardScaler 对特征进行缩放。这对于 KNN 至关重要,因为它对特征的缩放非常敏感。
    3. Adds Noise (in the second example): Slide ...221303.jpg shows code that adds 20 new “noisy” columns to the data. This is to test if the selection algorithm is smart enough to ignore them. 向数据中添加 20 个新的“噪声”列的代码。这是为了测试选择算法是否足够智能,能够忽略它们。
    4. Runs Selection: Calls best_subset_selection_parallel to do the main work.
    5. Prints Results: Finds the best subset (lowest error) and prints the top 20 best-performing subsets. 找到最佳子集(误差最小),并打印出表现最佳的前 20 个子集。
    6. Final Evaluation: It re-trains a KNN model on only the best subset and calculates the final cross-validated RMSE. 仅基于最佳子集重新训练 KNN 模型,并计算最终的交叉验证 RMSE。
  • evaluate_subset(subset, ...):
    • This is the “worker” function. It’s called for every single possible subset.
    • It takes a subset (a list of feature names, e.g., ['Income', 'Limit']).
    • It creates a new X_subset containing only those columns.
    • It runs 5-fold cross-validation (cross_val_score) on a KNN model using this X_subset.
    • It uses 'neg_mean_squared_error' as the metric. This is negative MSE; a higher score (closer to 0) is better. 它会创建一个新的“X_subset”,仅包含这些列。 它会使用此“X_subset”在 KNN 模型上运行 5 倍交叉验证(“cross_val_score”)。 它使用“neg_mean_squared_error”作为度量标准。这是负 MSE;更高*的分数(越接近 0)越好。
    • It returns the subset and its average CV score.
  • best_subset_selection_parallel(model, ...):
    • This is the “manager” function.这是“管理器”函数。
    • It iterates from k=1 up to the total number of features.它从“k=1”迭代到特征总数。
    • For each k, it generates all combinations of features of that size (this is the \(\binom{p}{k}\) part). 对于每个“k”,它会生成该大小的特征的所有组合(这是 \(\binom{p}{k}\) 部分)。
    • It uses Parallel and delayed (from joblib) to run evaluate_subset for all these combinations in parallel, speeding up the process significantly. 它使用 Paralleldelayed(来自 joblib)对所有这些组合并行运行 evaluate_subset,从而显著加快了处理速度。
    • It collects all the results and returns them.它收集所有结果并返回。

Analysis of the Output

  • Slide ...221255.png (Original Data):
    • The code runs subset selection on the original dataset.
    • The “Top 20 Best Feature Subsets” are shown. The CV scores are negative (they are neg_mean_squared_error), so the scores closest to zero (smallest magnitude) are best.
    • The Best feature subset is found to be ('Income', 'Limit', 'Rating', 'Student').
    • The final cross-validated RMSE for this model is 105.41.
  • Slide ...221309.png (Data with 20 Noisy Variables):
    • The code is re-run after adding 20 useless “Noisy” features.
    • The algorithm still works. It correctly identifies that the “Noisy” variables are useless.
    • The Best feature subset is now ('Income', 'Limit', 'Student'). (Note: ‘Rating’ was dropped, likely because it’s highly correlated with ‘Limit’, and the noisy data made the simpler model perform slightly better in CV).
    • The final RMSE is 114.94. This is higher than the original 105.41, which is expected—the presence of so many noise variables makes the selection problem harder, but the final model is still good and, most importantly, it successfully excluded all 20 noisy features. 最终的 RMSE 为 114.94。这比最初的 105.41更高,这是预期的——如此多的噪声变量的存在使得选择问题更加困难,但最终模型仍然很好,最重要的是,它成功地排除了所有 20 个噪声特征

Conceptual Overview: The “Why”

Slides cover Chapter 6: Linear Model Selection and Regularization, which is all about a fundamental trade-off in machine learning: the bias-variance trade-off. 该部分主要讨论机器学习中的一个基本权衡:偏差-方差权衡

  • The Problem (Slide ...221320.png): Imagine you have a dataset with 50 predictors (\(p=50\)). You want to predict a response \(y\). 假设你有一个包含 50 个预测变量(p=50)的数据集。你想要预测响应 \(y\)

    • Model 1 (Full Model): You use all 50 predictors. This model is very flexible. It will fit the training data extremely well, resulting in a low bias. However, it’s highly likely that many of those 50 predictors are just “noise” (random, unrelated variables). By fitting to this noise, the model will be overfit. When you show it new, unseen data (the test data), it will perform poorly. This is called high variance. 你使用了所有 50 个预测变量。这个模型非常灵活。它能很好地拟合训练数据,从而产生较低的偏差。然而,这 50 个预测变量中很可能有很多只是“噪声”(随机的、不相关的变量)。由于拟合这些噪声,模型会过拟合。当你向它展示新的、未见过的数据(测试数据)时,它的表现会很差。这被称为高方差
    • Model 2 (Subset Model): You intelligently select only the 3 predictors (\(q=3\)) that are actually related to \(y\). This model is less flexible. It won’t fit the training data as perfectly as Model 1 (it has higher bias). But, because it’s not fitting the noise, it will generalize much better to new data. It will have a much lower variance, and thus a lower overall test error. 你智能地只选择与 \(y\) 真正相关的 3 个预测变量 (\(q=3\))。这个模型的灵活性较差。它对训练数据的拟合度不如模型 1 完美(它的偏差更高)。但是,由于它没有去拟合噪声,因此对新数据的泛化能力会更好。它的方差会更低,因此总体的测试误差也会更低。
  • The Goal: The goal is to find the model that has the lowest test error. We need a formal method to find the best subset (like Model 2) without just guessing. 目标是找到测试误差最低的模型。我们需要一个正式的方法来找到最佳子集(例如模型 2),而不是仅仅靠猜测。

  • Two Main Strategies (Slide ...221314.png):

    1. Subset Selection (Section 6.1): This is what we’re focused on. It’s an “all-or-nothing” approach. You either keep a variable in the model or you discard it completely. The “Best Subset Selection” algorithm is the most extreme, “brute-force” way to do this. 是我们关注的重点。这是一种“全有或全无”的方法。你要么在模型中“保留”一个变量,要么“彻底丢弃”它。“最佳子集选择”算法是最极端、最“暴力”的做法。

    2. Shrinkage/Regularization (Section 6.2): This is a more subtle approach (e.g., Ridge Regression, LASSO). Instead of discarding variables, you keep all \(p\) variables but add a penalty to the model that “shrinks” the coefficients (\(\beta\)) of the useless variables towards zero. 这是一种更巧妙的方法(例如,岭回归、LASSO)。你不是丢弃变量,而是保留所有 \(p\) 个变量,但会给模型添加一个惩罚项,将无用变量的系数(\(\beta\))“收缩”到零。

Questions 🎯

Q1: “How to compare which model is better?”

(From slides ...221320.png and ...221326.png)

This is the most important question. You cannot use metrics based on training data (like \(R^2\) or RSS - Residual Sum of Squares) to compare models with different numbers of predictors. 这是最重要的问题。您不能使用基于训练数据的指标(例如 R^2 或 RSS - 残差平方和)来比较具有不同数量预测变量的模型。

  • The Trap: A model with more predictors will always have a higher \(R^2\) (or lower RSS) on the data it was trained on. \(R^2\) will always increase as you add variables, even if they are pure noise. If you used \(R^2\) to compare a 3-predictor model to a 10-predictor model, the 10-predictor model would always look better on paper, even if it’s terribly overfit. 具有更多预测变量的模型在其训练数据上总是具有更高的 R^2(或更低的 RSS)。随着变量的增加,R^2 会总是增加,即使这些变量是纯噪声。如果您使用 R^2 来比较 3 个预测变量的模型和 10 个预测变量的模型,那么 10 个预测变量的模型在纸面上总是看起来更好,即使它严重过拟合。

  • The Correct Way: You must use a metric that estimates the test error. The slides and code show two ways:您必须使用一个能够估计测试误差的指标。

    1. Cross-Validation (CV): This is the method used in your Python code. It works by:
      • Splitting your training data into \(k\) “folds” (e.g., 5 folds). 将训练数据拆分成 \(k\) 个“折叠”(例如 5 个折叠)。
      • Training the model on 4 folds and testing it on the 5th fold. 使用其中 4 个折叠训练模型,并使用第 5 个折叠进行测试。
      • Repeating this 5 times, so each fold gets to be the test set once. 重复此操作 5 次,使每个折叠都作为测试集一次。
      • Averaging the 5 test errors. 对 5 个测试误差求平均值。 This gives you a robust estimate of how your model will perform on unseen data. You then choose the model with the best (lowest) average CV error. 这可以让你对模型在未见数据上的表现有一个稳健的估计。然后,你可以选择平均 CV 误差最小(最佳)的模型。
    2. Mathematical Adjustments (AIC, BIC, Adjusted \(R^2\)): These are formulas that take the training error (like RSS) and add a penalty for each predictor (\(k\)) you add.
      • \(AIC \approx RSS + 2k\sigma^2\)
      • \(BIC \approx RSS + \log(n)k\sigma^2\) A model with more predictors (larger \(k\)) gets a bigger penalty. To be chosen, a more complex model must significantly improve the RSS to overcome this penalty. 预测变量越多(k 越大)的模型,惩罚越大。要被选中,更复杂的模型必须显著提升 RSS 以克服此惩罚。

Q2: “Why using \(R^2\) for step 2?”

(From slide ...221333.png)

Step 2 of the “Best Subset Selection” algorithm says: “For \(k = 1, \dots, p\): Fit all \(\binom{p}{k}\) models… Pick the best model, that with the largest \(R^2\), … and call it \(M_k\).” “对于 \(k = 1, \dots, p\):拟合所有 \(\binom{p}{k}\) 个模型……选择具有最大 \(R^2\) 的最佳模型……并将其命名为 \(M_k\)。”

  • The Reason: In Step 2, you are only comparing models of the same size. For example, when \(k=3\), you are comparing all possible 3-predictor models: 在步骤 2 中,您只比较相同大小的模型。例如,当 \(k=3\) 时,您将比较所有可能的 3 预测变量模型:
    • Model A: (\(X_1, X_2, X_3\))
    • Model B: (\(X_1, X_2, X_4\))
    • Model C: (\(X_1, X_3, X_5\))
    • …and so on.
    Since all these models have the exact same complexity (they all have \(k=3\) predictors), there is no risk of unfairly favoring a more complex model. Therefore, you are free to use a training metric like \(R^2\) (or RSS). The model with the highest \(R^2\) is, by definition, the one that best fits the training data for that specific size \(k\). 由于所有这些模型都具有完全相同的复杂度(它们都具有 \(k=3\) 个预测变量),因此不存在不公平地偏向更复杂模型的风险。因此,您可以自由使用像 \(R^2\)(或 RSS)这样的训练指标。根据定义,具有最高 \(R^2\) 的模型就是在特定大小 \(k\)与训练数据拟合度最高的模型。

Q3: “Cannot use training error in Step 3.” Why not? “步骤 3 中不能使用训练误差。” 为什么?

(From slide ...221333.png)

Step 3 says: “Select a single best model from \(M_0, M_1, \dots, M_p\) by cross validation, AIC, or BIC.”“通过交叉验证、AIC 或 BIC,从 \(M_0、M_1、\dots、M_p\) 中选择一个最佳模型。”

  • The Reason: In Step 3, you are now comparing models of different sizes. You are comparing the best 1-predictor model (\(M_1\)) vs. the best 2-predictor model (\(M_2\)) vs. the best 3-predictor model (\(M_3\)), and so on, all the way up to \(M_p\). 在步骤 3 中,您正在比较不同大小的模型。您正在比较最佳的单预测模型 (\(M_1\))、最佳的双预测模型 (\(M_2\)) 和最佳的三预测模型 (\(M_3\)),依此类推,直到 \(M_p\)

    As explained in Q1, if you used a training error metric like \(R^2\) here, the \(R^2\) would just keep going up, and you would always select the largest, most complex model, \(M_p\). This completely defeats the purpose of model selection. 如问题 1 所述,如果您在此处使用像 \(R^2\) 这样的训练误差指标,那么 \(R^2\) 会持续上升,并且您总是会选择最大、最复杂的模型 \(M_p\)。这完全违背了模型选择的目的。

    Therefore, in Step 3, you must use a method that estimates test error (like Cross-Validation) or one that penalizes for complexity (like AIC or BIC) to find the “sweet spot” model that balances fit and simplicity. 因此,在步骤 3 中,您必须使用一种估算测试误差的方法(例如交叉验证)或惩罚复杂性的方法(例如 AIC 或 BIC),以找到在拟合度和简单性之间取得平衡的“最佳点”模型。

Mathematical Deep Dive 🧮

  • \(Y = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p + \epsilon\): The full linear model. The goal of subset selection is to find a subset of \(X_j\)’s where \(\beta_j \neq 0\) and set all other \(\beta\)’s to 0. 完整的线性模型。子集选择的目标是找到 \(X_j\) 的一个子集,其中 \(\beta_j \neq 0\),并将所有其他 \(\beta\) 设置为 0。
  • \(2^p\) combinations: (Slide ...221333.png) This is the total number of models you have to check. For each of the \(p\) variables, you have two choices: either it is IN the model or it is OUT.这是你需要检查的模型总数。对于每个 \(p\) 个变量,你有两个选择:要么它在模型内部,要么它在模型外部
    • Example: \(p=3\) (variables \(X_1, X_2, X_3\))
    • The \(2^3 = 8\) possible models are:
      1. {} (The null model, \(M_0\))
      2. { \(X_1\) }
      3. { \(X_2\) }
      4. { \(X_3\) }
      5. { \(X_1, X_2\) }
      6. { \(X_1, X_3\) }
      7. { \(X_2, X_3\) }
      8. { \(X_1, X_2, X_3\) } (The full model, \(M_3\))
    • This is why this method is called an “exhaustive search”. It literally checks every single one. For \(p=20\), \(2^{20}\) is over a million models!这就是该方法被称为“穷举搜索”的原因。它实际上会检查每一个模型。对于 \(p=20\)\(2^{20}\) 就超过一百万个模型!
  • \(\binom{p}{k} = \frac{p!}{k!(p-k)!}\): (Slide ...221333.png) This is the “combinations” formula. It tells you how many models you fit in Step 2 for a specific \(k\).这是“组合”公式。它告诉你,对于特定的 \(k\)在步骤 2中,你拟合了 多少 个模型。
    • Example: \(p=10\) total predictors.
    • For \(k=1\): You fit \(\binom{10}{1} = 10\) models.
    • For \(k=2\): You fit \(\binom{10}{2} = \frac{10 \times 9}{2 \times 1} = 45\) models.
    • For \(k=3\): You fit \(\binom{10}{3} = \frac{10 \times 9 \times 8}{3 \times 2 \times 1} = 120\) models.
    • …and so on. The sum of all these \(\binom{p}{k}\) from \(k=0\) to \(k=p\) equals \(2^p\).

Detailed Code Analysis 💻

Your slides show Python code that applies the Best Subset Selection algorithm to a KNN Regressor. This is a great example of how the selection algorithm is independent of the model type (as mentioned in slide ...221314.png).

Key Functions

  • main()
    1. Load & Preprocess: Reads Credit.csv. The most important step here is converting categorical text (like ‘Male’/‘Female’) into numbers (1/0).
    2. Scale Data: scaler = StandardScaler() and X_scaled = scaler.fit_transform(X).
      • WHY? This is CRITICAL for KNN. KNN works by measuring distance. If ‘Income’ (e.g., 50,000) is on a vastly different scale than ‘Cards’ (e.g., 3), the ‘Income’ feature will completely dominate the distance calculation, making ‘Cards’ irrelevant. Scaling resizes all features to have a mean of 0 and standard deviation of 1, so they all contribute fairly.
    3. Handle Noisy Data (Slide ...221303.jpg): This version of the code intentionally adds 20 columns of useless, random numbers. This is a test to see if the algorithm is smart enough to ignore them.
    4. Run Selection: results_df = best_subset_selection_parallel(...). This function does all the heavy lifting (explained next).
    5. Find Best Model: results_df.sort_values(by='CV_Score', ascending=False).
      • WHY ascending=False? The code uses the metric 'neg_mean_squared_error'. This is MSE, but negative (e.g., -15000). A better model has an error closer to 0 (e.g., -10000). Since -10000 is greater than -15000, you sort in descending (high-to-low) order to put the best models at the top.
    6. Final Evaluation (Step 3): final_scores = cross_val_score(knn, X_best, y, ...)
      • This is the implementation of Step 3. It takes only the single best subset (X_best) and runs a new cross-validation on it. This gives a final, unbiased estimate of how good that one model is.
    7. Print RMSE: final_rmse = np.sqrt(-final_scores). It converts the negative MSE back into a positive RMSE (Root Mean Squared Error), which is in the same units as the target \(y\) (in this case, ‘Balance’ in dollars).
  • best_subset_selection_parallel(model, ...)
    1. This is the “manager” function. It implements the loop from Step 2.
    2. for k in range(1, n_features + 1): This is the loop “For \(k = 1, \dots, p\)”.
    3. subsets = list(combinations(feature_names, k)): This generates the \(\binom{p}{k}\) combinations for the current \(k\).
    4. results = Parallel(n_jobs=n_jobs)(...): This is a non-core, “speed-up” command. It uses the joblib library to run the evaluations on all your computer’s CPU cores at once (in parallel). Without this, checking millions of models would take days.
    5. subset_scores = ... [delayed(evaluate_subset)(...) ...] This line farms out the actual work to the evaluate_subset function for every single subset.
  • evaluate_subset(subset, ...)
    1. This is the “worker” function. It gets called thousands or millions of times.
    2. Its job is to evaluate one single subset (e.g., ('Income', 'Limit', 'Student')).
    3. X_subset = X[list(subset)]: It slices the data to get only these columns.
    4. scores = cross_val_score(model, X_subset, ...): This is the most important line. It takes the subset and performs a full 5-fold cross-validation on it.
    5. return (subset, np.mean(scores)): It returns the subset and its average CV score.

Summary of Outputs (Slides ...221255.png & ...221309.png)

  • Original Data (Slide ...221255.png):
    • Best Subset: ('Income', 'Limit', 'Rating', 'Student')
    • Final RMSE: ~105.4
  • Data with 20 “Noisy” Variables (Slide ...221309.png):
    • Best Subset: ('Income', 'Limit', 'Student')
    • Result: The algorithm successfully identified that all 20 “Noisy” variables were useless and excluded every single one of them from the best models.
    • Final RMSE: ~114.9
    • Key Takeaway: The RMSE is slightly higher, which makes sense because the selection problem was much harder. But the method worked perfectly. It filtered all the “noise” and found a simple, powerful model, just as the theory on slide ...221320.png predicted.

2. The Core Problem: Training Error vs. Test Error 核心问题:训练误差 vs. 测试误差

The central theme of these slides is finding the “best” model. The problem is that a model with more predictors (more complex) will always fit the data it was trained on better. This is a trap. 寻找“最佳”模型。问题在于,预测因子越多(越复杂)的模型总是能更好地拟合训练数据。这是一个陷阱。

  • Training Error: How well the model fits the data we used to build it. \(R^2\) and \(RSS\) measure this. 模型与我们构建模型时所用数据的拟合程度。\(R^2\)\(RSS\) 衡量了这一点。
  • Test Error: How well the model predicts new, unseen data. This is what we actually care about. A model that is too complex (e.g., has 10 predictors when only 3 are useful) will have low training error but very high test error. This is called overfitting. 模型预测新的、未见过的数据的准确程度。这才是我们真正关心的。过于复杂的模型(例如,有 10 个预测因子,但只有 3 个有用)的训练误差会很低,但测试误差会很高。这被称为过拟合

The goal is to choose a model that has the lowest test error. The metrics below (Adjusted \(R^2\), AIC, BIC) are all attempts to estimate this test error without having to actually collect new data. They do this by adding a penalty for complexity. 目标是选择一个具有最低测试误差的模型。以下指标(调整后的 \(R^2\)、AIC、BIC)都是在无需实际收集新数据的情况下尝试估计此测试误差。他们通过增加复杂度惩罚来实现这一点。

Basic Metrics (Measures of Fit)

These formulas from slide 13 describe how well a model fits the training data.

Residue (Error) 残差(误差)

  • Formula: \(\hat{\epsilon}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_j x_{ij}\)
  • Concept: This is the most basic building block. It’s the difference between the actual observed value (\(y_i\)) and the value your model predicted (\(\hat{y}_i\)). It is the “error” for a single data point. 这是最基本的构建块。它是实际观测值 (\(y_i\)) 与模型*预测值 (\(\hat{y}_i\)) 之间的差值。它是单个数据点的“误差”。

Residual Sum of Squares (RSS) 残差平方和 (RSS)

  • Formula: \(RSS = \sum_{i=1}^{n} \hat{\epsilon}_i^2\)
  • Concept: This is the overall measure of model error. You square all the individual errors (residues) to make them positive and then add them all up. 这是模型误差的总体度量。将所有单个误差(残差)平方,使其为正,然后将它们全部相加。
  • Goal: The entire process of linear regression (called “Ordinary Least Squares”) is designed to find the \(\hat{\beta}\) coefficients that make this RSS value as small as possible. 整个线性回归过程(称为“普通最小二乘法”)旨在找到使 RSS 值尽可能小的系数 \(\hat{\beta}\)。
  • The Flaw 缺陷: \(RSS\) will always decrease (or stay the same) as you add more predictors (\(p\)). A model with all 10 predictors will have a lower \(RSS\) than a model with 9, even if that 10th predictor is useless. Therefore, \(RSS\) is useless for choosing between models of different sizes. 随着预测变量 (\(p\)) 的增加,\(RSS\) 总是会减小(或保持不变)。一个包含所有 10 个预测变量的模型的 \(RSS\) 会低于一个包含 9 个预测变量的模型,即使第 10 个预测变量毫无用处。因此,\(RSS\) 对于在不同规模的模型之间进行选择毫无用处。

R-squared (\(R^2\))

  • Formula: \(R^2 = 1 - \frac{SS_{error}}{SS_{total}} = 1 - \frac{RSS}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\)
  • Concept: This metric reframes \(RSS\) into a more interpretable percentage.此指标将 \(RSS\) 重新定义为更易于解释的百分比。
    • \(SS_{total}\) (the denominator) represents the total variance of the data. It’s the error you would get if your “model” was just guessing the average value (\(\bar{y}\)) for every single observation. (分母)表示数据的总方差。如果你的“模型”只是猜测每个观测值的平均值 (\(\bar{y}\)),那么你就会得到这个误差。
    • \(SS_{error}\) (the \(RSS\)) is the error after using your model. \(SS_{error}\)(即 \(RSS\))是使用模型之后的误差。
    • \(R^2\) is the “proportion of total variance explained by the model.” An \(R^2\) of 0.75 means your model can explain 75% of the variation in the response variable. \(R^2\) 是“模型解释的总方差的比例”。\(R^2\) 为 0.75 意味着你的模型可以解释响应变量 75% 的变异。
  • The Flaw 缺陷: Just like \(RSS\), \(R^2\) will always increase (or stay the same) as you add more predictors. This is visually confirmed in Figure 6.1, where the red line for \(R^2\) only goes up. It will always pick the most complex model. 与 \(RSS\) 一样,随着预测变量的增加,\(R^2\)始终增加(或保持不变)。图 6.1 直观地证实了这一点,其中 \(R^2\) 的红线只会上升。它总是会选择最复杂的模型。

Advanced Metrics (For Model Selection) 高级指标(用于模型选择)

These metrics “fix” the flaw of \(R^2\) by including a penalty for the number of predictors.

Adjusted \(R^2\)

  • Formula: \[ \text{Adjusted } R^2 = 1 - \frac{RSS / (n - p - 1)}{SS_{total} / (n - 1)} \]
  • Mathematical Concept: This formula replaces the “Sum of Squares” (\(SS\)) with “Mean Squares” (\(MS\)).
    • \(MS_{error} = \frac{RSS}{n-p-1}\)
    • \(MS_{total} = \frac{SS_{total}}{n-1}\)
  • The “Penalty” Explained: The penalty is degrees of freedom.
    • \(n\) = number of data points.
    • \(p\) = number of predictors.
    • The term \(n-p-1\) is the degrees of freedom for the residuals. You start with \(n\) data points, but you “use up” one degree of freedom to estimate the intercept (\(\hat{\beta}_0\)) and \(p\) more to estimate the \(p\) slopes.
  • How it Works:
    1. When you add a new predictor (increase \(p\)), \(RSS\) goes down, which makes the numerator (\(MS_{error}\)) smaller.
    2. …But, increasing \(p\) also decreases the denominator (\(n-p-1\)), which makes the numerator (\(MS_{error}\)) larger.
    • This creates a “tug-of-war.” If the new predictor is useful, it will drop \(RSS\) a lot, and Adjusted \(R^2\) will increase. If the new predictor is useless, \(RSS\) will barely change, and the penalty from decreasing the denominator will win, causing Adjusted \(R^2\) to decrease.
  • Goal: You select the model with the highest Adjusted \(R^2\).

Akaike Information Criterion (AIC)

  • General Formula: \(AIC = -2 \log \ell(\hat{\theta}) + 2d\)
  • Concept Breakdown:
    • \(\ell(\hat{\theta})\): This is the Maximized Likelihood Function.
      • The Likelihood Function \(\ell(\theta)\) asks: “Given a set of model parameters \(\theta\), how probable is the data we observed?”
      • The Maximum Likelihood Estimate (MLE) \(\hat{\theta}\) is the specific set of parameters (the \(\hat{\beta}\)’s) that maximizes this probability.
    • \(\log \ell(\hat{\theta})\): The log-likelihood. This is just a number that represents the best possible fit the model can achieve for the data. A higher number is a better fit.
    • \(-2 \log \ell(\hat{\theta})\): This is the Deviance. Since a higher log-likelihood is better, a lower deviance is better. This term measures poorness-of-fit.
    • \(d\): The number of parameters estimated by the model. (e.g., \(p\) predictors + 1 intercept).
    • \(2d\): This is the Penalty Term.
  • How it Works: \(AIC = (\text{Poorness-of-Fit}) + (\text{Complexity Penalty})\). As you add predictors, the fit gets better (the deviance term goes down), but the penalty term (\(2d\)) goes up.
  • Goal: You select the model with the lowest AIC.

Bayesian Information Criterion (BIC)

  • General Formula: \(BIC = -2 \log \ell(\hat{\theta}) + \log(n)d\)
  • Concept: This is mathematically identical to AIC, but the penalty term is different.
    • AIC Penalty: \(2d\)
    • BIC Penalty: \(\log(n)d\)
  • Comparison:
    • \(n\) is the number of observations in your dataset.
    • As long as your dataset has 8 or more observations (\(n \ge 8\)), \(\log(n)\) will be greater than 2.
    • This means BIC applies a much harsher penalty for complexity than AIC.
  • Consequence: BIC will tend to choose simpler models (fewer predictors) than AIC.
  • Goal: You select the model with the lowest BIC.

The Deeper Theory: Why AIC Works

Slide 27 (“Understanding AIC”) gives the deep mathematical justification.

  • Goal: We have a true, unknown process \(p\) that generates our data. We are creating a model \(\hat{p}_j\). We want our model to be as “close” to the truth as possible.
  • Kullback-Leibler (K-L) Distance: This is a function \(K(p, \hat{p}_j)\) that measures the “information lost” when you use your model \(\hat{p}_j\) to approximate the truth \(p\). You want to minimize this distance.
  • The Math:
    1. \(K(p, \hat{p}_j) = \int p(y) \log \left( \frac{p(y)}{\hat{p}_j(y)} \right) dy\)
    2. This splits into: \(K(p, \hat{p}_j) = \underbrace{\int p(y) \log(p(y)) dy}_{\text{Constant}} - \underbrace{\int p(y) \log(\hat{p}_j(y)) dy}_{\text{This is what we need to maximize}}\)
  • The Problem: We can’t calculate that second term because it requires knowing the true function \(p\).
  • Akaike’s Insight: Akaike proved that the log-likelihood we can calculate, \(\log \ell(\hat{\theta})\), is a biased estimator of that target. He also proved that the bias is approximately \(-d\).
  • The Solution: An unbiased estimate of the target is \(\log \ell(\hat{\theta}) - d\).
  • Final Step: For historical and statistical reasons, he multiplied this by \(-2\) to create the final AIC formula.
  • Conclusion: AIC is not just a random formula. It is a carefully derived estimate of how much information your model loses compared to the “truth” (i.e., its expected performance on new data).

AIC/BIC for Linear Regression

Slide 26 shows how these general formulas simplify for linear regression (assuming normal, Gaussian errors).

  • General Formula: \(AIC = -2 \log \ell(\hat{\theta}) + 2d\)
  • Linear Regression Formula: \(AIC = \frac{1}{n\hat{\sigma}^2}(RSS + 2d\hat{\sigma}^2)\)

Key Insight: For linear regression, the “poorness-of-fit” term (\(-2 \log \ell(\hat{\theta})\)) is directly proportional to the \(RSS\).

This makes it much easier to understand. You can just think of the formulas as:

  • AIC \(\approx\) \(RSS + 2d\hat{\sigma}^2\)
  • BIC \(\approx\) \(RSS + \log(n)d\hat{\sigma}^2\)

(Here \(\hat{\sigma}^2\) is an estimate of the error variance, which can often be treated as a constant).

This clearly shows the trade-off: We want a model with a low \(RSS\) (good fit) and a low \(d\) (low complexity). These two goals are in direct competition.

Mallows’ \(C_p\): The slide notes that \(C_p\) is equivalent to AIC for linear regression. The \(C_p\) formula is \(C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2_{full})\), where \(\hat{\sigma}^2_{full}\) is the error variance estimated from the full model. Since \(n\) and \(\hat{\sigma}^2_{full}\) are constants, minimizing \(C_p\) is mathematically identical to minimizing \(RSS + 2d\hat{\sigma}^2_{full}\), which is the same logic as AIC.

To ground these criteria, here is a closer look at the underlying problem (training error vs. test error) and the basic measures of fit that they build on.

The Core Problem: Training Error vs. Test Error

The central theme of these slides is finding the “best” model. The problem is that a model with more predictors (more complex) will always fit the data it was trained on better. This is a trap.

  • Training Error: How well the model fits the data we used to build it. \(R^2\) and \(RSS\) measure this.
  • Test Error: How well the model predicts new, unseen data. This is what we actually care about. A model that is too complex (e.g., has 10 predictors when only 3 are useful) will have low training error but very high test error. This is called overfitting.

The goal is to choose a model that has the lowest test error. The metrics below (Adjusted \(R^2\), AIC, BIC) are all attempts to estimate this test error without having to actually collect new data. They do this by adding a penalty for complexity.

Basic Metrics (Measures of Fit)

These formulas from slide 13 describe how well a model fits the training data.

Residual (Error)

  • Formula: \(\hat{\epsilon}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_j x_{ij}\)
  • Concept: This is the most basic building block. It’s the difference between the actual observed value (\(y_i\)) and the value your model predicted (\(\hat{y}_i\)). It is the “error” for a single data point.

Residual Sum of Squares (RSS)

  • Formula: \(RSS = \sum_{i=1}^{n} \hat{\epsilon}_i^2\)
  • Concept: This is the overall measure of model error. You square all the individual errors (residuals) to make them positive and then add them all up.
  • Goal: The entire process of linear regression (called “Ordinary Least Squares”) is designed to find the \(\hat{\beta}\) coefficients that make this RSS value as small as possible.
  • The Flaw: \(RSS\) will always decrease (or stay the same) as you add more predictors (\(p\)). A model with all 10 predictors will have a lower \(RSS\) than a model with 9, even if that 10th predictor is useless. Therefore, \(RSS\) is useless for choosing between models of different sizes.

R-squared (\(R^2\))

  • Formula: \(R^2 = 1 - \frac{SS_{error}}{SS_{total}} = 1 - \frac{RSS}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\)
  • Concept: This metric reframes \(RSS\) into a more interpretable percentage.
    • \(SS_{total}\) (the denominator) represents the total variance of the data. It’s the error you would get if your “model” was just guessing the average value (\(\bar{y}\)) for every single observation.
    • \(SS_{error}\) (the \(RSS\)) is the error after using your model.
    • \(R^2\) is the “proportion of total variance explained by the model.” An \(R^2\) of 0.75 means your model can explain 75% of the variation in the response variable.
  • The Flaw: Just like \(RSS\), \(R^2\) will always increase (or stay the same) as you add more predictors. This is visually confirmed in Figure 6.1, where the red line for \(R^2\) only goes up. It will always pick the most complex model.
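
A minimal numerical sketch (mine, not from the slides) of these three quantities for a toy straight-line fit:

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=50)

# Least-squares fit of y_hat = b0 + b1 * x
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = b0 + b1 * x

residuals = y - y_hat                   # epsilon_hat_i = y_i - y_hat_i
rss = np.sum(residuals ** 2)            # RSS: sum of squared residuals
ss_total = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r_squared = 1 - rss / ss_total          # proportion of variance explained

print(f"RSS = {rss:.2f}, SS_total = {ss_total:.2f}, R^2 = {r_squared:.3f}")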

3. Variable Selection

Core Concept: The Problem of Variable Selection

In regression, we want to model a response variable \(Y\) using a set of \(p\) predictor variables \(X_1, X_2, ..., X_p\).

  • The “Kitchen Sink” Problem: A common temptation is to include all available predictors in the model: \[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon\] This often leads to overfitting. The model may fit the training data well but will perform poorly on new, unseen data. It’s also hard to interpret a model with dozens of predictors.

  • The Solution: Subset Selection. The goal is to find a smaller subset of the predictors that builds a model that is:

    1. Accurate: Has low prediction error.
    2. Parsimonious: Uses the fewest predictors necessary.
    3. Interpretable: Is simple enough for a human to understand.

Your slides present two main methods to achieve this: Best Subset Selection and Forward Stepwise Selection.

Method 1: Best Subset Selection (BSS)

This is the “brute force” approach. It considers every single possible model.

Conceptual Algorithm

  1. Fit all models with \(k=1\) predictor (there are \(p\) of these). Find the best one (lowest RSS) and call it \(M_1\).
  2. Fit all models with \(k=2\) predictors (there are \(\binom{p}{2}\) of these). Find the best one and call it \(M_2\).
  3. Continue in the same way for \(k = 3, \dots, p\); the final fit is the single model with all \(p\) predictors (the full model), \(M_p\).
  4. You now have a list of \(p\) “best” models: \(M_1, M_2, ..., M_p\).
  5. Use a selection criterion (like Adjusted \(R^2\), BIC, AIC, or \(C_p\)) to choose the single best model from this list.

Mathematical & Computational Cost (from slide 225641.png)

  • For each predictor, there are two possibilities: it’s either IN the model or OUT.
  • With \(p\) predictors, the total number of models to test is \(2 \times 2 \times ... \times 2\) (\(p\) times).
  • Total Models = \(2^p\)
  • This is a “combinatorial explosion.” As the slide notes, if \(p=20\), \(2^{20} = 1,048,576\) models. This is computationally infeasible for large \(p\).

Method 2: Forward Stepwise Selection (FSS)

This is a “greedy” algorithm. It’s an efficient alternative to BSS that does not test every model.

Conceptual Algorithm (from slides 225645.png & 225648.png)

  • Step 1: Start with the null model, \(M_0\), which has no predictors. \[M_0: Y = \beta_0 + \epsilon\] The prediction is just the sample mean of \(Y\).

  • Step 2 (Iterative):

    • For \(k=0\) (to get \(M_1\)): Fit all \(p\) models that add one predictor to \(M_0\). Choose the best one (lowest RSS or highest \(R^2\)). This is \(M_1\). Let’s say it contains \(X_1\).
    • For \(k=1\) (to get \(M_2\)): Keep \(X_1\) in the model. Fit all \(p-1\) models that add one more predictor to \(M_1\) (e.g., \(M_1+X_2\), \(M_1+X_3\), …). Choose the best of these. This is \(M_2\).
    • Repeat: Continue this process, adding one variable at a time, until all \(p\) predictors are in the model \(M_p\).
  • Step 3: You now have a sequence of \(p+1\) models: \(M_0, M_1, ..., M_p\). Choose the single best model from this sequence using Adjusted \(R^2\), AIC, BIC, or \(C_p\).

Mathematical & Computational Cost (from slide 225651.png)

  • To find \(M_1\), you fit \(p\) models.
  • To find \(M_2\), you fit \(p-1\) models.
  • To find \(M_p\), you fit \(1\) model.
  • The null model \(M_0\) is 1 model.
  • Total Models = \(1 + \sum_{k=0}^{p-1} (p-k) = 1 + p + (p-1) + ... + 1 = 1 + \frac{p(p+1)}{2}\)
  • As the slide notes, if \(p=20\), this is only \(1 + 20(21)/2 = 211\) models. This is vastly more efficient than BSS.
  • Key weakness: The method is “greedy.” If it adds \(X_1\) in Step 1, it can never be removed. It’s possible the true best 2-variable model is \((X_2, X_3)\), but if FSS chose \(X_1\) as the best 1-variable model, it will never find \((X_2, X_3)\).
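
Here is a compact sketch of the greedy loop described above (my own illustration, not the slide's code), using statsmodels and plain \(RSS\) as the within-size criterion; names such as forward_stepwise and best_var are mine:

import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(X: pd.DataFrame, y: pd.Series):
    """Return the greedy sequence of models M_1, ..., M_p (chosen by lowest RSS)."""
    remaining = list(X.columns)
    selected = []
    path = []
    while remaining:
        best_rss, best_var = np.inf, None
        # Try adding each remaining predictor to the current model
        for candidate in remaining:
            cols = selected + [candidate]
            fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
            rss = (fit.resid ** 2).sum()
            if rss < best_rss:
                best_rss, best_var = rss, candidate
        selected.append(best_var)       # greedily lock in the best addition
        remaining.remove(best_var)
        path.append((list(selected), best_rss))
    return path   # path[k-1] holds the "best" k-variable model and its RSS

# Usage, assuming a numeric predictor DataFrame X and a response Series y:
# for vars_k, rss_k in forward_stepwise(X, y):
#     print(len(vars_k), vars_k, round(rss_k, 1))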

4. How to Choose the “Best” Model: The Criteria

You can’t use RSS or \(R^2\) to compare models with different numbers of predictors (\(k\)). This is because RSS always decreases (and \(R^2\) always increases) as you add more variables. You must use a criterion that penalizes complexity.

  • RSS (Residual Sum of Squares): Goal is to minimize. \[RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] Good for comparing models of the same size \(k\).

  • Adjusted R-squared (\(Adj. R^2\)): Goal is to maximize. \[Adj. R^2 = 1 - \frac{(1-R^2)(n-1)}{n-p-1}\] This “adjusts” \(R^2\) by adding a penalty for having more predictors (\(p\)). Adding a useless predictor will make \(Adj. R^2\) go down.

  • Mallows’ \(C_p\): Goal is to minimize. \[C_p \approx \frac{1}{n}(RSS + 2p\hat{\sigma}^2)\] Here, \(\hat{\sigma}^2\) is an estimate of the error variance from the full model (with all \(p\) predictors). (In its classical, unscaled form, a model with little bias has \(C_p \approx p\).)

  • AIC (Akaike Information Criterion) & BIC (Bayesian Information Criterion): Goal is to minimize. \[AIC = 2p - 2\ln(\hat{L})\] \[BIC = p\ln(n) - 2\ln(\hat{L})\] Here, \(\hat{L}\) is the maximized likelihood of the model. You don’t need to calculate this by hand; software provides it.

    • Key difference: BIC’s penalty for \(p\) is \(p\ln(n)\), while AIC’s is \(2p\). Since \(\ln(n)\) is almost always \(> 2\) (for \(n>7\)), BIC applies a much heavier penalty for complexity.
    • This means BIC tends to choose smaller, more parsimonious models than AIC or \(Adj. R^2\).

5. Python Code Analysis (Slide 225546.jpg)

This slide shows the Python code for Best Subset Selection (BSS).

# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from itertools import combinations # <-- This is the BSS engine

Block 1: Load the Credit dataset

# 1. Load the Credit dataset
Credit = pd.read_csv('Credit.csv')
Credit['ID'] = Credit['ID'].astype(str)
(num_samples, num_predictors) = Credit.shape

# Convert categorical text data to numerical (dummy variables)
Credit['Gender'] = Credit['Gender'].map({'Male': 1, 'Female': 0})
Credit['Student'] = Credit['Student'].map({'Yes': 1, 'No': 0})
Credit['Married'] = Credit['Married'].map({'Yes': 1, 'No': 0})
Credit['Ethnicity'] = Credit['Ethnicity'].map({'Asian': 1, 'Caucasian': 1, 'African American': 0})
  • pd.read_csv: Reads the data into a pandas DataFrame.
  • .map(): This is a crucial preprocessing step. Regression models require numbers, not text like ‘Yes’ or ‘Male’. This line converts those strings into 1s and 0s.

Block 2: Plot scatterplot matrix

# 2. Plot scatterplot matrix
selected_columns = ['Balance', 'Education', 'Age', 'Cards', 'Rating', 'Limit', 'Income']
sns.set(style="ticks")
sns.pairplot(Credit[selected_columns], diag_kind='kde')
plt.suptitle('Scatterplot Matrix', y=1.02)
plt.show()
  • sns.pairplot: A powerful visualization from the seaborn library. The resulting plot (right side of the slide) is a grid.
    • Diagonal plots (kde): Show the distribution (Kernel Density Estimate) of a single variable (e.g., ‘Balance’ is skewed right).
    • Off-diagonal plots (scatter): Show the relationship between two variables (e.g., ‘Limit’ and ‘Rating’ are almost perfectly linear). This helps you visually spot potentially strong predictors.

Block 3: Best Subset Selection

# 3. Best Subset Selection
# (This code is incomplete on the slide; the logic is filled in below)

# Define target and predictors (drop the non-numeric ID column)
target = 'Balance'
predictors = [col for col in Credit.columns if col not in (target, 'ID')]
nvmax = 10  # Max number of predictors to test (up to 10)

# Initialize a list to store model statistics
model_stats = []

# Iterate over number of predictors from 1 to nvmax
for k in range(1, nvmax + 1):

    # Generate all possible combinations of predictors of size k
    # This is the core of BSS
    for subset in list(combinations(predictors, k)):

        # Get the design matrix (X)
        X_subset = Credit[list(subset)]

        # Add a constant (intercept) term to the model
        # Y = B0 + B1*X1 -> statsmodels needs B0 to be added manually
        X_subset_const = sm.add_constant(X_subset)

        # Get the target variable (y)
        y_target = Credit[target]

        # Fit the Ordinary Least Squares (OLS) model
        model = sm.OLS(y_target, X_subset_const).fit()

        # Calculate RSS
        RSS = ((model.resid) ** 2).sum()

        # Record the statistics for this subset (Adj. R-squared, BIC, etc.)
        model_stats.append({'k': k, 'subset': subset, 'RSS': RSS,
                            'Adj_R_squared': model.rsquared_adj, 'BIC': model.bic})
  • for k in range(1, nvmax + 1): This is the outer loop that iterates from \(k=1\) (1 predictor) to \(k=10\) (10 predictors).
  • list(combinations(predictors, k)): This is the inner loop and the most important line. The itertools.combinations function is a highly efficient way to generate all unique subsets.
    • When \(k=1\), it returns [('Income',), ('Limit',), ('Rating',), ...].
    • When \(k=2\), it returns [('Income', 'Limit'), ('Income', 'Rating'), ('Limit', 'Rating'), ...].
    • This is what generates the \(2^p\) (or in this case, \(\sum_{k=1}^{10} \binom{p}{k}\)) models to test.
  • sm.add_constant(X_subset): Your regression equation is \(Y = \beta_0 + \beta_1X_1\). The \(X_1\) is your X_subset. The sm.add_constant function adds a column of 1s to your data, which allows the statsmodels library to estimate the \(\beta_0\) (intercept) term.
  • sm.OLS(y_target, X_subset_const).fit(): This fits the Ordinary Least Squares (OLS) model, which finds the \(\beta\) coefficients that minimize the RSS.
  • model.resid: This attribute of the fitted model contains the residuals (\(e_i = y_i - \hat{y}_i\)) for each data point.
  • ((model.resid) ** 2).sum(): This line is the direct code implementation of the formula \(RSS = \sum e_i^2\).

Synthesizing the Results (The Plots)

After running the BSS code, you get the data used in the plots and the table.

  • Image 225550.png (Adjusted R-squared)

    • Goal: Maximize.
    • What it shows: The gray dots are all the models tested for each \(k\). The red line connects the single best model for each \(k\).
    • Conclusion: The plot shows a sharp “elbow.” The \(Adj. R^2\) increases dramatically up to \(k=4\), then increases very slowly. The maximum is around \(k=6\) or \(k=7\), but the gain after \(k=4\) is minimal.
  • Image 225554.png (BIC)

    • Goal: Minimize.
    • What it shows: BIC heavily penalizes complexity.
    • Conclusion: The plot shows a very clear minimum. The BIC value plummets from \(k=2\) to \(k=3\) and hits its lowest point at \(k=4\). After \(k=4\), the penalty for adding more variables is larger than the benefit in model fit, so the BIC score starts to rise. This is a very strong vote for the 4-predictor model.
  • Image 225635.png (Mallow’s \(C_p\))

    • Goal: Minimize.
    • What it shows: A very similar story to BIC.
    • Conclusion: The \(C_p\) value drops significantly and hits its minimum at \(k=4\).
  • Image 225638.png (Summary Table)

    • This is the most important image for the final conclusion. It summarizes the red line from all the plots.
    • Look at the row for Num_Predictors = 4. The predictors are (Income, Limit, Cards, Student).
    • Now look at the columns for BIC and Cp.
      • BIC: 4841.615607. This is the lowest value in the entire BIC column (the value at \(k=3\) is 4865.352851).
      • Cp: 7.122228. This is also the lowest value in the Cp column.
    • The Adj_R_squared at \(k=4\) is 0.953580, which is very close to its maximum of ~0.954 at \(k=7-10\).

Final Conclusion: All three “penalized” criteria (Adjusted \(R^2\), BIC, and \(C_p\)) point to the same conclusion. While \(Adj. R^2\) is a bit ambiguous, BIC and \(C_p\) provide a clear signal that the best, most parsimonious model is the 4-predictor model using Income, Limit, Cards, and Student.

6. Subset Selection

Summary of Subset Selection

These slides introduce subset selection, a process in statistical learning used to identify the best subset of predictors (variables) for a regression model. The goal is to find a model that has low prediction error and avoids overfitting by excluding irrelevant variables.

The slides cover two main “greedy” (stepwise) algorithms and the criteria used to select the final best model.

Stepwise Selection Algorithms

Instead of testing all \(2^p\) possible models (which is “best subset selection” and computationally unfeasible), stepwise methods build a single path of models.

Forward Stepwise Selection

This is an additive (bottom-up) approach:

  1. Start with the null model (no predictors).
  2. Find the best 1-variable model (the one that gives the lowest Residual Sum of Squares, or RSS).
  3. Add the single variable that, when added to the current model, results in the new best model (lowest RSS).
  4. Repeat this process until all \(p\) predictors are in the model.
  5. This generates a sequence of \(p+1\) models, from \(\mathcal{M}_0\) to \(\mathcal{M}_p\).

Backward Stepwise Selection

This is a subtractive (top-down) approach:

  1. Start with the full model containing all \(p\) predictors.
  2. Find the best \((p-1)\)-variable model by removing the single variable whose removal leaves the lowest RSS (or highest \(R^2\)). The removed variable is considered the least significant.
  3. Remove the next variable that, when removed from the current best model, gives the new best model.
  4. Repeat until only the null model remains.
  5. This also generates a sequence of \(p+1\) models.

Pros and Cons (Backward Selection)

  • Pro: Computationally efficient compared to best subset. It fits \(1 + \sum_{k=0}^{p-1}(p-k) = \mathbf{1 + p(p+1)/2}\) models, which is much less than \(2^p\). (e.g., for \(p=20\), it’s 211 models vs. >1 million).
  • Con: Cannot be used if \(p > n\) (more predictors than observations), because the initial full model cannot be fit.
  • Con (for both): These methods are greedy. A variable added in forward selection is never removed, and a variable removed in backward selection is never added back. This means they are not guaranteed to find the true best model.

Choosing the Final Best Model

Both forward and backward selection give you a set of candidate models (e.g., the best 1-variable model, best 2-variable model, etc.). You must then choose the single best one. The slides show two main approaches:

A. Direct Error Estimation

Use a validation set or cross-validation (CV) to estimate the test error for each model (e.g., the 1-variable, 2-variable… models). Choose the model with the lowest estimated test error.

B. Adjusted Metrics (Penalizing for Complexity)

Standard RSS and \(R^2\) will always improve as you add variables, leading to overfitting. Instead, use metrics that penalize the model for having too many predictors.

  • Mallows’ \(C_p\): An estimate of test Mean Squared Error (MSE). \[C_p = \frac{1}{n} (RSS + 2d\hat{\sigma}^2)\] (where \(d\) is the number of predictors, and \(\hat{\sigma}^2\) is an estimate of the error variance). You want to find the model with the minimum \(C_p\).

  • BIC (Bayesian Information Criterion): \[BIC = \frac{1}{n} (RSS + \log(n)d\hat{\sigma}^2)\] BIC’s penalty \(\log(n)\) is stronger than \(C_p\)’s (or AIC’s) penalty of \(2\), so it tends to select smaller (more parsimonious) models. You want to find the model with the minimum BIC.

  • Adjusted \(R^2\): \[R^2_{adj} = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}\] (where \(TSS\) is the Total Sum of Squares). Unlike \(R^2\), this metric can decrease if adding a variable doesn’t help enough. You want to find the model with the maximum Adjusted \(R^2\).

Python Code Understanding

The slides use the regsubsets() function from the leaps package in R.

# R Code from slides
library(leaps)
# Forward Selection
regfit.fwd <- regsubsets(Balance~., data=Credit, method="forward", nvmax=11)
# Backward Selection
regfit.bwd <- regsubsets(Balance~., data=Credit, method="backward", nvmax=11)

In Python, the standard tool for this is SequentialFeatureSelector from scikit-learn.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Assume 'Credit' is a pandas DataFrame with 'Balance' as the target
# (all predictor columns must be numeric, as in the preprocessing above)
X = Credit.drop('Balance', axis=1)
y = Credit['Balance']

# Initialize the linear regression estimator
model = LinearRegression()

# --- Forward Selection ---
# direction='forward' starts with 0 features and adds them
# To get the best 4-variable model, for example:
sfs_forward = SequentialFeatureSelector(
    model,
    n_features_to_select=4,
    direction='forward',
    cv=5  # number of cross-validation folds used to score each candidate
)
sfs_forward.fit(X, y)
print("Forward selection best 4 features:")
print(sfs_forward.get_feature_names_out())


# --- Backward Selection ---
# direction='backward' starts with all features and removes them
sfs_backward = SequentialFeatureSelector(
    model,
    n_features_to_select=4,
    direction='backward',
    cv=5
)
sfs_backward.fit(X, y)
print("\nBackward selection best 4 features:")
print(sfs_backward.get_feature_names_out())

# Note: To replicate the plots, you would loop this process,
# changing 'n_features_to_select' from 1 to p,
# record the model scores (e.g., RSS, AIC, BIC) at each step,
# and then plot the results.
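
That replication loop could look roughly like the sketch below (my own, not the slides' code; it assumes the X and y defined above and that all predictor columns are numeric). It refits each selected feature set with statsmodels just to read off \(RSS\), AIC, and BIC for that model size:

import statsmodels.api as sm

results = []
# The full p-feature model needs no selection step, so loop k = 1 .. p-1
for k in range(1, X.shape[1]):
    sfs = SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=k, direction='forward'
    )
    sfs.fit(X, y)
    chosen = list(sfs.get_feature_names_out())
    # Refit with statsmodels to get RSS / AIC / BIC for this model size
    refit = sm.OLS(y, sm.add_constant(X[chosen])).fit()
    results.append({'k': k, 'features': chosen,
                    'RSS': (refit.resid ** 2).sum(),
                    'AIC': refit.aic, 'BIC': refit.bic})

# 'results' can then be plotted against k (e.g., BIC vs. number of variables).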

Important Images

  1. Slide ...230014.png (Forward Selection Plots) & ...230036.png (Backward Selection Plots):

    • What they are: These \(2 \times 2\) plot grids are the most important visuals. They show Residual Sum of Squares (RSS), Adjusted \(R^2\), BIC, and Mallows’ \(C_p\) plotted against the Number of Variables.
    • Why they’re important: They are the decision-making tool. You use these plots to choose the best model.
      • You look for the “elbow” or minimum value for BIC and \(C_p\).
      • You look for the “peak” or maximum value for Adjusted \(R^2\).
      • (RSS is not used for selection as it always decreases).
  2. Slide ...230040.png (Find the best model):

    • What it is: This slide shows a close-up of the \(C_p\), BIC, and Adjusted \(R^2\) plots, with the “best” model (the min/max) marked with a blue ‘x’.
    • Why it’s important: It explicitly states the selection criteria. The text highlights that BIC suggests a 4-variable model, while the other two are “rather flat” after 4, making the choice less obvious but pointing to a simple model.
  3. Slide ...230045.png (BIC vs. Validation vs. CV):

    • What it is: This shows three plots for selecting the best model using different criteria: BIC, Validation Set Error, and Cross-Validation Error.
    • Why it’s important: It shows that different selection criteria can lead to different “best” models. Here, BIC (a mathematical adjustment) picks a 4-variable model, while validation and CV (direct error estimation) both pick a 6-variable model.

The slides use the Credit dataset to demonstrate two key tasks: 1. Running different subset selection algorithms (forward, backward, best). 2. Using various statistical metrics (BIC, \(C_p\), CV error) to choose the single best model.

Comparing Selection Algorithms (The Path)

This part of the example compares the sequence of models selected by “Forward Stepwise” selection versus “Best Subset” selection.

Key Result (from Table 6.1):

This table is the most important result for comparing the algorithms.

| # Variables | Best Subset | Forward Stepwise |
| --- | --- | --- |
| one | rating | rating |
| two | rating, income | rating, income |
| three | rating, income, student | rating, income, student |
| four | cards, income, student, limit | rating, income, student, limit |

Summary of this result:

  • Identical for 1, 2, and 3 variables: Both methods agree on the best one-variable model (rating), the best two-variable model (rating, income), and the best three-variable model (rating, income, student).
  • They Diverge at 4 variables:
    • Forward selection is greedy. It started with rating, income, student and was “stuck” with them. It then added limit, as that was the best variable to add to its existing 3-variable model.
    • Best subset selection is not greedy. It tests all possible 4-variable combinations. It discovered that the model cards, income, student, limit has a slightly lower RSS than the model forward selection found.
  • Main Takeaway: This demonstrates the limitation of a greedy algorithm. Forward selection missed the “true” best 4-variable model because it was locked into its previous choices and couldn’t “swap out” rating for cards.

Choosing the Single Best Model (The Destination)

This is the most critical part of the analysis. After running a selection algorithm (like forward, backward, or best subset), you get a list of the “best” models for each size (best 1-variable, best 2-variable, etc.). Now you must decide: is the best model the 4-variable one, the 6-variable one, or another?

The slides show several plots to help make this decision, all plotted against the “Number of Predictors.”

Summary of Plot Results:

Here’s what each plot tells you:

  • Residual Sum of Squares (RSS) (e.g., in slide ...230014.png, top-left)
    • What it shows: RSS always decreases as you add more variables. It drops sharply until 4 variables, then flattens out.
    • Conclusion: This plot is not useful for picking the best model because it will always pick the full model, which is overfit. It’s only used to see the diminishing returns of adding new variables.
  • Adjusted \(R^2\) (e.g., in slide ...230040.png, right)
    • What it shows: This metric penalizes adding useless variables. The plot rises quickly, then flattens, peaking at its maximum value around 6 or 7 variables.
    • Conclusion: This metric suggests a 6 or 7-variable model.
  • Mallows’ \(C_p\) (e.g., in slide ...230040.png, left)
    • What it shows: This is an estimate of test error. We want the model with the minimum \(C_p\). The plot drops to a low value at 4 variables and stays low, with its absolute minimum around 6 or 7 variables.
    • Conclusion: This metric also suggests a 6 or 7-variable model.
  • BIC (Bayesian Information Criterion) (e.g., in slide ...230040.png, center)
    • What it shows: This is another estimate of test error, but it has a stronger penalty for model complexity. The plot shows a clear “U” shape, reaching its minimum value at 4 variables and then increasing afterward.
    • Conclusion: This metric strongly suggests a 4-variable model.
  • Validation Set & Cross-Validation (CV) Error (Slide ...230045.png)
    • What it shows: These plots show the direct estimate of test error (not a mathematical adjustment like BIC or \(C_p\)). Both the validation set error and the 10-fold CV error show a “U” shape.
    • Conclusion: Both methods reach their minimum error at 6 variables. This is considered a very reliable result.

Final Summary of Results

The analysis of the Credit dataset reveals two strong candidates for the “best” model, depending on your goal:

  1. The 6-Variable Model: This model is supported by the Adjusted \(R^2\), Mallows’ \(C_p\), and (most importantly) the Validation Set and 10-fold Cross-Validation results. These metrics all indicate that the 6-variable model has the lowest prediction error on new data.

  2. The 4-Variable Model: This model is supported by BIC. Because BIC penalizes complexity more heavily, it selects a simpler (more parsimonious) model.

Overall Conclusion: If your primary goal is maximum predictive accuracy, you should choose the 6-variable model. If your goal is a simpler, more interpretable model that is still very good (and avoids any risk of overfitting), the 4-variable model is an excellent choice.

7. Two main strategies for controlling model complexity in linear regression

This presentation covers two main strategies for controlling model complexity in linear regression: Subset Selection (choosing which variables to include) and Shrinkage Methods (keeping all variables but reducing the impact of their coefficients).

Subset Selection

This method involves selecting a subset of the \(p\) total predictors to use in the model.

Key Concepts & Formulas

  • The Model: The standard linear regression model is represented in matrix form: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\] The goal of subset selection is to find a coefficient vector \(\boldsymbol{\beta}\) that is sparse, meaning it has many zero entries.

  • Forward Selection: This is a greedy algorithm that starts with an empty model and iteratively adds the single predictor that most improves the fit.

  • Theoretical Guarantee: Can forward selection find the true sparse set of variables?

    • Yes, if the predictors are not strongly correlated.
    • This is quantified by the Mutual Coherence Condition. Assuming the predictors \(\mathbf{x}_i\) are normalized, the method is guaranteed to work if: \[\mu = \max_{i \neq j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle| < \frac{1}{2s - 1}\] where \(s\) is the number of true non-zero coefficients and \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\) represents the correlation between predictors.
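
A quick way to check this condition on a given design matrix (a sketch I added, not from the slides): normalize every column to unit length, compute all pairwise inner products, and compare the largest one to \(1/(2s-1)\).

import numpy as np

def mutual_coherence(X):
    """Largest |<x_i, x_j>| over distinct, unit-normalized columns of X."""
    Xn = X / np.linalg.norm(X, axis=0)   # scale each column to length 1
    G = np.abs(Xn.T @ Xn)                # absolute inner products (Gram matrix)
    np.fill_diagonal(G, 0.0)             # ignore <x_i, x_i> = 1
    return G.max()

# Example with a random design and an assumed sparsity level s = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
s = 3
mu = mutual_coherence(X)
print(f"mu = {mu:.3f}, threshold 1/(2s-1) = {1 / (2 * s - 1):.3f}, condition met: {mu < 1 / (2 * s - 1)}")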

Practical Application: Finding the Best Model Size

How do you know whether to choose a model with 3, 4, or 5 variables? You use Cross-Validation (CV).

  • Important Image: The plot titled “10-fold CV” (from the first slide) is the most important visual. It plots the estimated test error (CV Error) on the y-axis against the number of variables in the model on the x-axis.

  • The “One Standard Deviation Rule”: Looking at the plot, the error drops sharply and then flattens. The absolute minimum error might be at 6 variables, but it’s only slightly better than the 3-variable model.

    1. Find the model with the lowest CV error.
    2. Calculate the standard error for that error estimate.
    3. Select the simplest model (fewest variables) whose CV error is within one standard error of the minimum.
    4. This follows Occam’s razor: choose the simplest explanation (model) that fits the data well enough. In the example given, this rule selects the 3-variable model.
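
Given the per-size CV errors and their standard errors, the rule itself is only a few lines. Below is a sketch (mine, with made-up numbers): cv_errors[k-1] is assumed to hold the CV error of the best \(k\)-variable model and cv_se[k-1] its standard error.

import numpy as np

def one_standard_error_rule(cv_errors, cv_se):
    """Smallest model size whose CV error is within one SE of the minimum."""
    cv_errors, cv_se = np.asarray(cv_errors), np.asarray(cv_se)
    k_min = int(np.argmin(cv_errors))            # model size with the lowest CV error
    threshold = cv_errors[k_min] + cv_se[k_min]  # the "tolerance" line
    # First (simplest) model whose error falls on or below the tolerance line
    return int(np.argmax(cv_errors <= threshold)) + 1   # +1: index 0 is the 1-variable model

# Hypothetical CV results for models with 1..10 variables
cv_errors = [29.1, 21.5, 18.2, 17.9, 17.8, 17.6, 17.7, 17.8, 17.9, 18.0]
cv_se     = [1.2,  1.0,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9]
print("Chosen model size:", one_standard_error_rule(cv_errors, cv_se))   # -> 3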

Code Interpretation (R vs. Python)

The R code in the first slide performs this 10-fold CV manually for forward selection:

  1. It loops from p = 1 to 10 (model sizes).
  2. Inside the loop, it identifies the p variables chosen by a pre-computed forward selection model (regfit.fwd).
  3. It fits a new model (glm.fit) using only those p variables.
  4. It runs 10-fold CV (cv.glm) on that specific model to get its test error.
  5. It stores the error in CV10.err[p].
  6. Finally, it plots the results.

In Python (with scikit-learn): This entire process is often automated.

  • You would use sklearn.feature_selection.RFECV (Recursive Feature Elimination with Cross-Validation).
  • RFECV automatically performs cross-validation to find the optimal number of features, effectively producing the same plot and result as the R code.
# Conceptual Python equivalent for finding the best model size
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_regression

# X, y = load_your_data()
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=10, random_state=42)

estimator = LinearRegression()
# RFECV will test models with 1 feature, 2 features, etc.,
# and use cross-validation (cv=10) to find the best number.
selector = RFECV(estimator, step=1, cv=10, scoring='neg_mean_squared_error')
selector = selector.fit(X, y)

print(f"Optimal number of features: {selector.n_features_}")
# You can plot selector.cv_results_['mean_test_score'] to get the CV curve

Shrinkage Methods (Regularization)

Instead of explicitly removing variables, shrinkage methods keep all \(p\) variables but shrink their coefficients \(\beta_j\) towards zero.

Ridge Regression

Ridge regression is a prime example of a shrinkage method.

  • Objective Function: It finds the coefficients \(\boldsymbol{\beta}\) that minimize a new quantity: \[\underbrace{\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2}_{\text{RSS (Goodness of Fit)}} + \underbrace{\lambda \sum_{j=1}^{p} \beta_j^2}_{\text{$\ell_2$ Penalty (Shrinkage)}}\]

  • The \(\lambda\) Tuning Parameter: This parameter controls the strength of the penalty:

    • If \(\lambda = 0\): The penalty term disappears. Ridge regression is identical to standard Ordinary Least Squares (OLS).
    • If \(\lambda \to \infty\): The penalty is “infinitely” strong. To minimize the function, all coefficients \(\beta_j\) (for \(j=1...p\)) are forced to be zero. The model becomes an intercept-only model.
    • Note: The intercept \(\beta_0\) is not penalized.
  • The Bias-Variance Trade-off: This is the core concept of regularization.

    • Standard OLS has low bias but can have high variance (it overfits).
    • Ridge regression adds a small amount of bias (the coefficients are “wrong” on purpose) to significantly reduce the model’s variance.
    • This trade-off often leads to a model with a lower overall test error.
  • Matrix Solution: The discussion slide asks “What is the solution?”. While OLS has the solution \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\), the Ridge solution is: \[\hat{\boldsymbol{\beta}}^R = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\] where \(\mathbf{I}\) is the identity matrix. The \(\lambda \mathbf{I}\) term adds a “ridge” to the diagonal, making the matrix invertible even if \(\mathbf{X}^T\mathbf{X}\) is singular (which happens if \(p > n\) or predictors are collinear).

An Essential Step: Standardization

  • Problem: The \(\ell_2\) penalty \(\lambda \sum \beta_j^2\) is applied equally to all coefficients. If predictor \(x_1\) (e.g., house size in sq-ft) is on a much larger scale than \(x_2\) (e.g., number of rooms), its coefficient \(\beta_1\) will naturally be much smaller than \(\beta_2\). The penalty will unfairly punish \(\beta_2\) more.
  • Solution: You must standardize your inputs before fitting a Ridge model.
  • Formula: For each predictor \(X_j\), all its observations \(x_{ij}\) are rescaled: \[\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}\] (where \(\bar{x}_j\) is the mean of the predictor and \(\sigma_j\) is its standard deviation). This puts all predictors on a common scale (mean=0, std=1).

In Python (with scikit-learn):

  • You use sklearn.preprocessing.StandardScaler to standardize your data.
  • You use sklearn.linear_model.Ridge to fit the model.
  • You use sklearn.linear_model.RidgeCV to automatically find the best value for \(\lambda\) (called alpha in scikit-learn) using cross-validation.
# Conceptual Python code for Ridge Regression
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X, y = load_your_data()

# Create a pipeline that first standardizes the data,
# then fits a Ridge model.
# RidgeCV tests a range of alphas (lambdas) automatically.
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0], scoring='neg_mean_squared_error')
)

model.fit(X, y)

print(f"Best alpha (lambda): {model.named_steps['ridgecv'].alpha_}")
print(f"Model coefficients: {model.named_steps['ridgecv'].coef_}")

Subset Selection

This section is about choosing which predictors (variables) to include in your linear model. The main idea is to find a “sparse” model (one with few variables) that performs well.

The Model and The Goal

  • Slide: “Forward selection in Linear Regression”
  • Formula: The standard linear regression model is \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\)
    • \(\mathbf{y}\) is the \(n \times 1\) vector of outcomes.
    • \(\mathbf{X}\) is the \(n \times (p+1)\) matrix of predictors (with a leading column of 1s for the intercept).
    • \(\boldsymbol{\beta}\) is the \((p+1) \times 1\) vector of coefficients (\(\beta_0, \beta_1, ..., \beta_p\)).
    • \(\boldsymbol{\epsilon}\) is the \(n \times 1\) vector of irreducible error.
  • Key Question: “If \(\boldsymbol{\beta}\) is sparse with at most \(s\) non-zero entries, can forward selection find those variables?”
    • Sparse means most coefficients are zero.
    • Forward Selection is a greedy algorithm:
      1. Start with no variables.
      2. Add the one variable that gives the best fit.
      3. Add the next best variable to the existing model.
      4. Repeat until you have a model with \(s\) variables.
    • The slide suggests the answer is yes, but only under certain conditions.

The Condition for Success

  • Slide: “Orthogonal Matching Pursuit”
  • Key Concept: Forward selection can provably find the correct variables if those variables are not strongly correlated.
  • Formula: This is formalized by the Mutual Coherence Condition: \[\mu = \max_{i \neq j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle| < \frac{1}{2s - 1}\]
    • What it means:
      • “Assuming the \(\mathbf{x}_i\)’s are normalized” means we’ve scaled each predictor vector to have a length of 1.
      • \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\) is the dot product, which acts as the correlation between the two predictors once they are normalized (and centered).
      • \(\mu\) (mu) is the largest absolute correlation you can find between any two different predictors.
      • \(s\) is the true number of important variables.
    • In English: If the maximum correlation between any of your predictors is less than this threshold, the greedy forward selection algorithm is guaranteed to find the true, sparse set of variables.

How to Choose the Model Size (Practice)

The theory is nice, but in practice, you don’t know \(s\). How many variables should you pick?

  • Slide: “10-fold CV Errors”

  • This is the most important practical slide for this section.

  • What the plot shows:

    • X-axis: “Number of Variables” (from 1 to 10).
    • Y-axis: “CV Error” (the 10-fold cross-validated Mean Squared Error).
    • The Curve: The error drops very fast as we add the first 2-3 variables. Then, it flattens out. Adding more than 3 variables doesn’t really help much.
  • Slide: “The one standard deviation rule”

  • This rule helps you pick the “best” model from the CV plot.

    1. Find the model with the absolute minimum CV error (in the plot, this looks to be around 6 or 7 variables).
    2. Calculate the standard error of that minimum CV error.
    3. Draw a “tolerance” line at (minimum error) + (one standard error).
    4. Choose the simplest model (fewest variables) whose CV error is below this tolerance line.
    • The slide states this rule “gives the model with 3 variables” for this example. This is because the 3-variable model is much simpler than the 6-variable one, and its error is “good enough” (within one standard error of the minimum). This is an application of Occam’s razor.

Code: R vs. Python

The R code on the “10-fold CV Errors” slide generates that exact plot.

  • R Code Explained:

    • library(boot): Loads the cross-validation library.
    • CV10.err=rep(0,10): Creates an empty vector to store the 10 error scores.
    • for(p in 1:10): A loop that will test model sizes from 1 to 10.
    • x<-which(summary(regfit.fwd)$which[p,]): Gets the names of the \(p\) variables chosen by a pre-run forward selection (regfit.fwd).
    • glm.fit=glm(Balance~.,data=newCred): Fits a model using only those \(p\) variables.
    • cv.err=cv.glm(newCred,glm.fit,K=10): Performs 10-fold CV on that specific \(p\)-variable model.
    • CV10.err[p]<-cv.err$delta[1]: Stores the CV error.
    • plot(...): Plots the 10 errors against the 10 model sizes.
  • Python Equivalent (Conceptual):

    • In scikit-learn, this process is often automated. You wouldn’t write the CV loop yourself.
    • You would use sklearn.feature_selection.RFECV (Recursive Feature Elimination with Cross-Validation). This tool automatically wraps a model (like LinearRegression), performs cross-validation, and finds the optimal number of features, effectively producing the same plot and result.
# --- Python equivalent for 6.1 ---
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Assume X and y are your data

# 1. Create a pipeline
# (Note: It's good practice to scale, even for OLS, if you're comparing)
pipeline = make_pipeline(
    StandardScaler(),
    LinearRegression()
)

# 2. Create the RFECV (Recursive Feature Elimination w/ CV) object
# This is an *alternative* to forward selection, but serves the same purpose
# It will test models with 1, 2, 3... features using 10-fold CV
# Because the estimator is a pipeline, RFECV must be told where to find the
# coefficients it ranks features by (the final step is named 'linearregression')
feature_selector = RFECV(
    estimator=pipeline,
    min_features_to_select=1,
    step=1,
    cv=10,
    scoring='neg_mean_squared_error',  # We want to minimize error
    importance_getter='named_steps.linearregression.coef_'
)

# 3. Fit it
feature_selector.fit(X, y)

print(f"Optimal number of features found: {feature_selector.n_features_}")

# You could then plot feature_selector.cv_results_['mean_test_score']
# to replicate the R plot.

Shrinkage Methods by Regularization

This is a different approach. Instead of removing variables, we keep all \(p\) variables but shrink their coefficients \(\beta_j\) towards 0.

Ridge Regression: The Core Idea

  • Slide: “Ridge regression”
  • Formula: Ridge regression minimizes a new objective function: \[\min_{\boldsymbol{\beta}} \left( \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right)\]
    • Term 1: \(\text{RSS}\) (Residual Sum of Squares). This is the original OLS “goodness of fit” term. We want this to be small.
    • Term 2: \(\lambda \sum \beta_j^2\). This is the \(\ell_2\) penalty or “shrinkage penalty”. It adds a “cost” for having large coefficients.
  • The \(\lambda\) (lambda) Parameter:
    • This is the tuning parameter that controls the trade-off between fit and simplicity.
    • \(\lambda = 0\): No penalty. The objective is just to minimize RSS. The solution \(\hat{\boldsymbol{\beta}}^R\) is identical to the OLS solution \(\hat{\boldsymbol{\beta}}\).
    • \(\lambda \to \infty\): Infinite penalty. The only way to minimize the cost is to make all \(\beta_j = 0\) (for \(j \ge 1\)). The model becomes an intercept-only model.
    • Large \(\lambda\): Heavy penalty, more shrinkage.
    • Crucial Note: The intercept \(\beta_0\) is not penalized. This is because \(\beta_0\) just represents the mean of \(y\) when all \(x\)’s are 0; shrinking it makes no sense.

The Need for Standardization

  • Slide: “Standardize the inputs”
  • Problem: The penalty \(\lambda \sum \beta_j^2\) is applied to all coefficients. But what if \(x_1\) is “house size in sq-ft” (values 1000-5000) and \(x_2\) is “number of bedrooms” (values 1-5)?
    • The coefficient \(\beta_1\) for house size will naturally be tiny, while the coefficient \(\beta_2\) for bedrooms will be large, even if they are equally important.
    • Ridge regression would unfairly and heavily penalize \(\beta_2\) while barely touching \(\beta_1\).
  • Solution: You must standardize all predictors before fitting a Ridge model.
  • Formula: For each observation \(i\) of each predictor \(j\): \[\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{(1/n) \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}\]
    • This formula rescales every predictor to have a mean of 0 and a standard deviation of 1.
    • Now, all coefficients \(\beta_j\) are on a “level playing field” and can be penalized fairly.

Answering the Discussion Questions

  • Slide: “DISCUSSION”
    • What is the solution of Ridge regression?
    • What is the bias and the variance?

1. What is the solution of Ridge regression?

The solution can be written in matrix form, which is very elegant.

  • Standard OLS Solution: The coefficients \(\hat{\boldsymbol{\beta}}^{\text{OLS}}\) that minimize RSS are found by: \[\hat{\boldsymbol{\beta}}^{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

  • Ridge Regression Solution: The coefficients \(\hat{\boldsymbol{\beta}}^{R}\) that minimize the Ridge objective are: \[\hat{\boldsymbol{\beta}}^{R} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

    • Explanation:
      • \(\mathbf{I}\) is the identity matrix (a matrix of 1s on the diagonal, 0s everywhere else).
      • By adding \(\lambda\mathbf{I}\), we are adding a positive value \(\lambda\) to the diagonal of the \(\mathbf{X}^T\mathbf{X}\) matrix.
      • This addition stabilizes the matrix. \(\mathbf{X}^T\mathbf{X}\) might not be invertible (if \(p > n\) or if predictors are perfectly collinear), but \((\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})\) is always invertible for \(\lambda > 0\).
      • This addition is what mathematically “shrinks” the coefficients toward zero.
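
To see that this closed form really is the minimizer, here is a small numerical check (my own sketch): compute it directly and compare against scikit-learn's Ridge on standardized predictors and a centered response, with the intercept disabled because the formula above has no separate intercept term.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p, lam = 200, 5, 10.0
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)                  # standardized predictors
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.0]) + rng.normal(size=n)
y = y - y.mean()                                          # centered response: no intercept needed

# Closed-form ridge solution: (X^T X + lambda I)^{-1} X^T y
beta_closed_form = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# scikit-learn minimizes ||y - X b||^2 + alpha * ||b||^2, the same objective
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.round(beta_closed_form, 4))
print(np.round(beta_sklearn, 4))   # should match the closed-form solution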

2. What is the bias and the variance?

This is the most important concept in regularization. It’s the bias-variance trade-off.

  • Standard OLS (where \(\lambda=0\)):

    • Bias: Low. The OLS estimator is unbiased, meaning that if you took many samples and fit many OLS models, their average \(\hat{\boldsymbol{\beta}}\) would be the true \(\boldsymbol{\beta}\).
    • Variance: High. The OLS solution can be highly sensitive to the training data. If you change a few data points, the coefficients can swing wildly. This is especially true if \(p\) is large or predictors are correlated. This “sensitivity” is high variance, which leads to overfitting.
  • Ridge Regression (where \(\lambda > 0\)):

    • Bias: High(er). Ridge regression is a biased estimator. By adding the penalty, we are purposefully pulling the coefficients away from the OLS solution and towards zero. The average \(\hat{\boldsymbol{\beta}}^R\) from many samples will not equal the true \(\boldsymbol{\beta}\). We have introduced bias into our model.
    • Variance: Low(er). In exchange for this bias, we get a massive reduction in variance. The \(\lambda\mathbf{I}\) term stabilizes the solution. The coefficients won’t change wildly even if the training data changes. The model is more robust and less sensitive.

The Trade-off: The total expected test error of a model is: \(\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\)

By using Ridge regression, we increase the \(\text{Bias}^2\) term a little, but we decrease the \(\text{Variance}\) term a lot. The goal is to find a \(\lambda\) where the total error is minimized. Ridge regression reduces variance at the cost of increased bias.
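
A small simulation (my own sketch, not from the slides) makes the trade-off visible: draw many noisy samples from the same truth with strongly correlated predictors, refit OLS and ridge each time, and compare how much the coefficient estimates wander (variance) and how far their average sits from the truth (bias).

import numpy as np

rng = np.random.default_rng(0)
n, lam, n_sims = 60, 5.0, 500
beta_true = np.array([2.0, -1.0, 0.5])

ols_fits, ridge_fits = [], []
for _ in range(n_sims):
    z = rng.normal(size=(n, 1))
    X = z + 0.1 * rng.normal(size=(n, 3))    # three highly correlated predictors
    y = X @ beta_true + rng.normal(size=n)
    ols_fits.append(np.linalg.lstsq(X, y, rcond=None)[0])
    ridge_fits.append(np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y))

for name, fits in [("OLS", np.array(ols_fits)), ("Ridge", np.array(ridge_fits))]:
    bias = fits.mean(axis=0) - beta_true     # how far the average estimate is from the truth
    variance = fits.var(axis=0)              # how much the estimates wander across samples
    print(f"{name:5s} sum of bias^2 = {np.sum(bias ** 2):.3f}, sum of variances = {np.sum(variance):.3f}")
# Ridge shows noticeably more bias but far less variance than OLS here.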

Python Equivalent for 6.2

# --- Python equivalent for 6.2 ---
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Assume X and y are your data

# 1. Create a pipeline that AUTOMATICALLY
# - Standardizes the data
# - Fits a Ridge Regression model
# - Uses Cross-Validation to find the BEST lambda (alpha in scikit-learn)
alphas_to_test = [0.01, 0.1, 1.0, 10.0, 100.0]

# RidgeCV handles everything for us
pipeline = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=alphas_to_test, scoring='neg_mean_squared_error', cv=10)
)

# 2. Fit the pipeline
pipeline.fit(X, y)

# 3. Get the results
best_lambda = pipeline.named_steps['ridgecv'].alpha_
ridge_coefficients = pipeline.named_steps['ridgecv'].coef_
intercept = pipeline.named_steps['ridgecv'].intercept_

print(f"Best lambda (alpha) found by CV: {best_lambda}")
print(f"Model intercept (beta_0): {intercept}")
print(f"Model coefficients (beta_j): {ridge_coefficients}")

8. The “Why” of Ridge Regression

Core Concepts: The “Why” of Ridge Regression

Your slides explain that ridge regression is a “shrinkage method” designed to solve a major problem with standard Ordinary Least Squares (OLS) regression: high variance.

The Bias-Variance Tradeoff (Slide 3)

This is the most important theoretical concept. In prediction, the total error (Mean Squared Error, or MSE) of a model is composed of three parts: \(\text{Error} = \text{Variance} + \text{Bias}^2 + \text{Irreducible Error}\)

  • Ordinary Least Squares (OLS): Aims to be unbiased (low bias). However, when you have many predictors (\(p\)), especially if they are correlated, or if \(p\) is large compared to the number of samples \(n\) (\(p \approx n\) or \(p > n\)), the OLS model becomes highly unstable. A small change in the training data can cause the coefficients to change wildly. This is high variance. (See Slide 6, “Remarks”).
  • Ridge Regression: By adding a penalty, ridge intentionally introduces a small amount of bias (it pulls coefficients away from their “true” OLS values). In return, it achieves a massive reduction in variance.

As Slide 3 shows:

  • The green line (Variance) starts very high for low \(\lambda\) (left side) and drops quickly.
  • The black line (Squared Bias) starts at zero (for OLS at \(\lambda=0\)) and slowly increases as \(\lambda\) grows.
  • The purple line (Test MSE) is the sum of the two. It’s U-shaped. The goal of ridge is to find the \(\lambda\) (marked by the ‘x’) at the bottom of this “U,” which gives the lowest possible total error.

Why Is It Called “Ridge”? The 3D Spatial Meaning (Slide 5)

This slide explains the problem of collinearity and the origin of the name.

  • Left Plot (Least Squares): Imagine a model with two correlated predictors, \(\beta_1\) and \(\beta_2\). The y-axis (SS1) is the error (RSS). Because the predictors are correlated, there isn’t one single “point” that is the minimum. Instead, there’s a long, flat valley or trough (marked “unstable”). Many different combinations of \(\beta_1\) and \(\beta_2\) along this valley give a similarly low error. The OLS solution is unstable because it can pick any point in this flat-bottomed valley.
  • Right Plot (Ridge): The ridge objective function adds a penalty term: \(\lambda(\beta_1^2 + \beta_2^2)\). This penalty term, by itself, is a perfect circular bowl centered at (0,0). When you add this “bowl” to the OLS “valley,” it stabilizes the function. It pulls the minimum towards (0,0) and creates a single, stable, well-defined minimum.
  • The “Ridge” Name: The penalty \(\lambda\mathbf{I}\) (from the matrix formula) adds a “ridge” of values to the diagonal of the \(\mathbf{X}^T\mathbf{X}\) matrix, which geometrically turns the unstable flat valley into a stable bowl.

Mathematical Formulas

The key difference between OLS and Ridge is the function they try to minimize.

  1. OLS Objective Function: Minimize the Residual Sum of Squares (RSS). \[\text{RSS} = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2\]

  2. Ridge Objective Function (Slide 6): Minimize the RSS plus an L2 penalty term. \[\text{Minimize: } \left[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \right] + \lambda \sum_{j=1}^{p} \beta_j^2\]

    • \(\lambda\) is the tuning parameter controlling the penalty strength.
    • \(\sum_{j=1}^{p} \beta_j^2\) is the L2-norm (squared) of the coefficients. It penalizes large coefficients.
  3. L2 Norm (Slide 1): The L2 norm of a vector \(\mathbf{a}\) is its standard Euclidean length. The plot on Slide 1 uses this to show the total magnitude of the ridge coefficients. \[\|\mathbf{a}\|_2 = \sqrt{\sum_{j=1}^p a_j^2}\]

  4. Matrix Solution (Slide 6): This is the “closed-form” solution for the ridge coefficients \(\hat{\beta}^R\). \[\hat{\beta}^R = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

    • \(\mathbf{I}\) is the identity matrix.
    • The term \(\lambda\mathbf{I}\) is what stabilizes the \(\mathbf{X}^T\mathbf{X}\) matrix, making it invertible even if it’s singular (due to \(p > n\) or collinearity). (A quick NumPy check of this closed form appears right after this list.)
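
Here is that check, a minimal sketch on synthetic data (X is assumed standardized and y centered, so the intercept can be dropped):

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))                       # pretend these are standardized predictors
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.standard_normal(n)
y = y - y.mean()                                      # center y so no intercept is needed

lam = 10.0
# beta_ridge = (X^T X + lambda * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)          # lambda = 0 recovers OLS

print("OLS coefficients:  ", np.round(beta_ols, 2))
print("Ridge coefficients:", np.round(beta_ridge, 2))  # pulled toward zero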

Walkthrough of the “Credit Data” Example (All Slides)

Here is the logical story of the R code, from start to finish.

Step 1: Data Preparation (Slide 8)

  • x=scale(model.matrix(Balance~., Credit)[,-1])
    • model.matrix(...) creates the predictor matrix x.
    • scale(...) is critically important. It standardizes all predictors to have a mean of 0 and a standard deviation of 1. This is necessary because the ridge penalty \(\lambda \sum \beta_j^2\) is unit-dependent. If Income (in 10,000s) and Cards (1-10) were unscaled, the penalty would unfairly crush the Income coefficient. Scaling puts all predictors on a level playing field.
  • y=Credit$Balance
    • This sets the y (target) variable.

Step 2: Fit the Ridge Model (Slide 8)

  • grid=10^seq(4,-2,length=100)
    • This creates a grid of 100 \(\lambda\) values to test, ranging from \(10^4\) (a huge penalty) down to \(10^{-2}\) (a tiny penalty).
  • ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
    • This is the main command. It fits a separate ridge model for every single \(\lambda\) in the grid.
    • alpha=0 is the specific command that tells glmnet to perform Ridge Regression. (Setting alpha=1 would be LASSO).
  • coef(ridge.mod)[,50]
    • This inspects the model. It pulls out the vector of coefficients for the 50th \(\lambda\) in the grid (which is \(\lambda=10.72\)).

Step 3: Visualize the Coefficient “Solution Path” (Slides 1, 4, 9)

These plots all show the same thing: how the coefficients change as \(\lambda\) changes.

  • Slide 9 Plot: This plots the standardized coefficients for 4 predictors (Income, Limit, Rating, Student) against the index (1 to 100). Index 1 (left) is the largest \(\lambda\), and index 100 (right) is the smallest \(\lambda\) (closest to OLS). You can see the coefficients “grow” from 0 as the penalty (\(\lambda\)) gets smaller.
  • Slide 1 (Left Plot): This is the same plot as Slide 9, but more professional. It plots the coefficients against \(\lambda\) on a log scale. You can clearly see all coefficients (gray lines) being “shrunk” toward zero as \(\lambda\) increases (moves right). The key predictors (Income, Rating, etc.) are highlighted.
  • Slide 1 (Right Plot): This is the exact same data again, but with a different x-axis: \(\|\hat{\beta}_\lambda^R\|_2 / \|\hat{\beta}\|_2\).
    • 1.0 on the right means \(\lambda=0\). The ratio of the ridge norm to the OLS norm is 1 (they are the same).
    • 0.0 on the left means \(\lambda=\infty\). The ridge coefficients are all 0, so their norm is 0.
    • This axis shows the “fraction” of the full OLS coefficient magnitude that the model is using.
  • Slide 4 Plot: This plots the total L2 norm of all coefficients (\(\|\hat{\beta}_\lambda^R\|_2\)) against the index. As the index goes from 1 to 100 (i.e., \(\lambda\) gets smaller), the total magnitude of the coefficients gets larger, which is exactly what we expect.

Step 4: Find the Best \(\lambda\) using Cross-Validation (Slides 4 & 7)

We have 100 models. Which one is best?

  • The “Manual” Way (Slide 4):

    • The code splits the data into a train and test set.
    • It fits a model only on the train set.
    • It tests two \(\lambda\) values:
      • s=4: Gives a test MSE of 10293.33.
      • s=10: Gives a test MSE of 168981.1 (much worse!).
    • This shows that \(\lambda=4\) is better than \(\lambda=10\), but we don’t know if it’s the best.
  • The “Automatic” Way (Slide 7):

    • cv.out=cv.glmnet(x[train,], y[train], alpha=0)
    • This runs 10-fold Cross-Validation on the training set. It automatically splits the training set into 10 “folds,” trains on 9, tests on 1, and repeats this 10 times for every \(\lambda\).
    • The Plot: The plot on this slide is the result. It shows the average MSE (y-axis) for each \(\log(\lambda)\) (x-axis). This is the real-data version of the theoretical purple curve from Slide 3.
    • bestlam=cv.out$lambda.min
    • This command finds the \(\lambda\) at the very bottom of the U-shaped curve. The output shows bestlam is 41.6.
    • ridge.pred=predict(ridge.mod, s=bestlam, newx=x[test,])
    • Now, we use this one best \(\lambda\) to make predictions on our held-out test set.
    • mean((ridge.pred-y.test)^2)
    • The final, reliable test MSE is 16129.68. This is our best estimate of how the model will perform on new, unseen data.

Python (scikit-learn) Equivalents

Here is how you would perform the entire R workflow from your slides in Python.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# --- 1. Load and Prepare Data (like Slide 8) ---
# Assuming 'Credit' is a pandas DataFrame
# X = Credit.drop('Balance', axis=1)
# y = Credit['Balance']
# ... (need to handle categorical variables first, e.g., with pd.get_dummies) ...
# For this example, let's assume X and y are already loaded and numeric.

# Standardize the predictors (CRITICAL)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- 2. Train/Test Split (like Slide 4) ---
# test_size=0.5 and random_state=1 mimic the R code
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.5, random_state=1
)

# --- 3. Find Best Lambda (alpha) with Cross-Validation (like Slide 7) ---
# Create the same log-spaced grid of lambdas (sklearn calls it 'alpha')
lambda_grid = np.logspace(4, -2, 100)

# RidgeCV performs cross-validation to find the best alpha.
# Note: store_cv_values=True (needed to plot the CV error curve below) only works
# with RidgeCV's default efficient leave-one-out scheme, so we do NOT pass cv=10
# here; combining cv with store_cv_values raises an error in scikit-learn.
cv_model = RidgeCV(alphas=lambda_grid, store_cv_values=True)
cv_model.fit(X_train, y_train)

# Get the best lambda found
best_lambda = cv_model.alpha_
print(f"Best lambda (alpha) found by CV: {best_lambda}")

# Plot the CV error curve (like Slide 7 plot)
# cv_model.cv_values_ has shape (n_samples, n_alphas)
# We need to average over the samples for each alpha
mse_path = np.mean(cv_model.cv_values_, axis=0)
plt.figure()
plt.plot(np.log10(lambda_grid), mse_path, marker='o')
plt.xlabel("Log(lambda)")
plt.ylabel("Mean Squared Error")
plt.title("Cross-Validation Error Path")
plt.show()

# --- 4. Evaluate on Test Set (like Slide 7) ---
# 'cv_model' is already refit on the full training set using the best_lambda
test_pred = cv_model.predict(X_test)
final_test_mse = mean_squared_error(y_test, test_pred)
print(f"Final Test MSE with best lambda: {final_test_mse}")

# --- 5. Get Final Coefficients (like Slide 7, bottom) ---
# The coefficients from the CV-trained model:
print(f"Intercept: {cv_model.intercept_}")
print("Coefficients:")
for coef, feature in zip(cv_model.coef_, X.columns):
    print(f" {feature}: {coef}")

# --- 6. Plot the Solution Path (like Slide 1) ---
# To do this, we fit a Ridge model for each lambda and store the coefficients
coefs = []
for lam in lambda_grid:
    model = Ridge(alpha=lam)
    model.fit(X_scaled, y)  # Fit on all data
    coefs.append(model.coef_)

# Plot
plt.figure()
plt.plot(np.log10(lambda_grid), coefs)
plt.xlabel("Log(lambda)")
plt.ylabel("Standardized Coefficients")
plt.title("Ridge Solution Path")
plt.show()

7. Shrinkage Methods (Regularization)

These slides cover Shrinkage Methods, also known as Regularization, which are techniques used to improve on the standard least squares model, particularly when dealing with many variables or multicollinearity. The main focus is on LASSO regression.

Key Mathematical Formulas

The slides present two main, but equivalent, ways to formulate these methods.

1. Penalized Formulation (Slide 1)

This is the most common formulation. The goal is to minimize a function that is a combination of the Residual Sum of Squares (RSS) and a penalty term. The penalty discourages large coefficients.

  • LASSO (Least Absolute Shrinkage and Selection Operator): The goal is to find coefficients (\(\beta_0, \beta_j\)) that minimize: \[\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} |\beta_j|\]
    • Penalty: The \(L_1\) norm (\(\|\beta\|_1\)), which is the sum of the absolute values of the coefficients.
    • Key Property: This penalty can force some coefficients to be exactly zero, effectively performing automatic variable selection.

2. Constrained Formulation (Slide 2)

This alternative formulation minimizes the RSS subject to a constraint (a “budget”) on the size of the coefficients.

  • For Lasso: Minimize RSS subject to: \[\sum_{j=1}^{p} |\beta_j| \le s\] (The sum of the absolute values of the coefficients must be less than some budget \(s\).)

  • For Ridge: Minimize RSS subject to: \[\sum_{j=1}^{p} \beta_j^2 \le s\] (The sum of the squares of the coefficients (\(L_2\) norm) must be less than \(s\).)

Equivalence (Slide 3): For any penalty value \(\lambda\) used in the first formulation, there is a corresponding budget \(s\) in the second formulation that will give the exact same set of coefficients. \(\lambda\) and \(s\) are inversely related: a large \(\lambda\) (high penalty) corresponds to a small \(s\) (small budget).

Important Plots and Interpretation

Your slides show the two most important plots for understanding and using LASSO.

1. The Cross-Validation (CV) Plot (Slide 5)

This plot is crucial for choosing the best tuning parameter (\(\lambda\)).

  • X-axis: \(\text{Log}(\lambda)\). This is the penalty strength.
    • Right side (high \(\lambda\)): High penalty, simple model (many coefficients are 0), high bias, high Mean-Squared Error (MSE).
    • Left side (low \(\lambda\)): Low penalty, complex model (like standard linear regression), high variance, MSE starts to increase (overfitting).
  • Y-axis: Mean-Squared Error (MSE) from cross-validation.
  • Goal: Find the \(\lambda\) at the bottom of the “U” shape, which gives the lowest MSE. This is the optimal trade-off between bias and variance. The top axis shows how many variables are included in the model at each \(\lambda\).

2. The Coefficient Path Plot (Slide 6)

This plot is the best visualization for understanding what LASSO does.

  • Left Plot (vs. \(\lambda\)):
    • X-axis: The penalty strength \(\lambda\).
    • Y-axis: The standardized value of each coefficient.
    • How to read it: Start from the right (high \(\lambda\)). All coefficients are 0. As you move left, \(\lambda\) decreases, and the penalty is relaxed. Variables “enter” the model one by one (their coefficients become non-zero). You can see that ‘Rating’, ‘Income’, and ‘Student’ are the most important variables, as they are the first to become non-zero.
  • Right Plot (vs. \(L_1\) Norm Ratio):
    • This shows the exact same information as the left plot, but the x-axis is reversed and rescaled. An axis value of 0.0 means full penalty (all \(\beta=0\)), and 1.0 means no penalty.

Code Understanding (R to Python)

The slides use the glmnet package in R. The equivalent and most popular library in Python is scikit-learn.

1. Finding the Best \(\lambda\) (CV)

The R code cv.out=cv.glmnet(x[train,],y[train],alpha=1) performs cross-validation to find the best \(\lambda\).

  • Python Equivalent: Use LassoCV. It does the same thing: tests many \(\lambda\) values (called alphas in scikit-learn) and picks the best one.
from sklearn.linear_model import LassoCV

# Create the LassoCV object
# cv=5 means 5-fold cross-validation
lasso_cv = LassoCV(cv=5, random_state=0)

# Fit the model to the training data
lasso_cv.fit(X_train, y_train)

# Get the best lambda (called alpha_ in sklearn)
best_lambda = lasso_cv.alpha_
print(f"Best lambda (alpha): {best_lambda}")

# Get the MSEs
# This is what's plotted in the CV plot
print(lasso_cv.mse_path_)

2. Fitting with the Best \(\lambda\) and Getting Coefficients

The R code lasso.coef=predict(out,type="coefficients",s=bestlam) gets the coefficients for the best \(\lambda\).

  • Python Equivalent: The LassoCV object is already refitted on the full training data using the best \(\lambda\). You can also fit a new Lasso model with that specific \(\lambda\).
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# --- Option 1: Use the already-fitted LassoCV object ---
print("Coefficients from LassoCV:")
print(lasso_cv.coef_)

# Make predictions on the test set
y_pred = lasso_cv.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {test_mse}")


# --- Option 2: Fit a new Lasso model with the best lambda ---
final_lasso = Lasso(alpha=best_lambda)
final_lasso.fit(X_train, y_train)

# Get coefficients (Slide 7 shows this)
# Note how some are 0!
print("\nCoefficients from new Lasso model:")
print(final_lasso.coef_)

The Core Problem: Two Equivalent Formulas

The slides show two ways of writing the same problem. Understanding this equivalence is key.

Formulation 1: The Penalized Method (Slides 1 & 4)

  • Formula: \[\min_{\beta} \left( \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \beta)^2 + \lambda \|\beta\|_1 \right)\]

    • \(\sum (y_i - \mathbf{x}_i^T \beta)^2\): This is the normal Residual Sum of Squares (RSS). We want to make this small (fit the data well).
    • \(\lambda \|\beta\|_1\): This is the \(L_1\) penalty.
      • \(\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|\) is the sum of the absolute values of the coefficients.
      • \(\lambda\) (lambda) is a tuning parameter. Think of it as a “penalty knob”.
  • How to think about \(\lambda\):

    • If \(\lambda = 0\): There is no penalty. This is just standard Ordinary Least Squares (OLS) regression. The model will likely overfit.
    • If \(\lambda\) is small: There’s a small penalty. Coefficients will shrink a little bit.
    • If \(\lambda\) is very large: The penalty is severe. The only way to make the penalty term small is to make the coefficients (\(\beta\)) themselves small. The model will eventually shrink all coefficients to exactly 0.

Formulation 2: The Constrained Method (Slides 2 & 3)

  • Formula: \[\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \beta)^2 \quad \text{subject to} \quad \|\beta\|_1 \le s\]

  • How to think about \(s\):

    • This says: “Find the best-fitting model (minimize RSS) but you have a limited ‘budget’ \(s\) for the total size of your coefficients.”
    • If \(s\) is very large: The budget is huge. This constraint does nothing. You get the standard OLS solution.
    • If \(s\) is small: The budget is tight. You must shrink your coefficients to stay under the budget \(s\). To get the best fit, the model will be forced to set unimportant coefficients to 0 and only “spend” its budget on the most important variables.

The Equivalence: These two forms are equivalent. For any \(\lambda\) you pick, there’s a corresponding budget \(s\) that gives the exact same solution.

  • High \(\lambda\) (strong penalty) \(\iff\) Small \(s\) (tight budget)
  • Low \(\lambda\) (weak penalty) \(\iff\) Large \(s\) (loose budget)

This equivalence is why you see plots with both \(\lambda\) and \(L_1\) Norm on the x-axis. They are just two different ways of looking at the same “penalty” spectrum.

Detailed Plot & Code Analysis

Let’s look at the plots and code, which answer the practical questions: (1) How do we pick the best \(\lambda\)? and (2) What does LASSO do to the coefficients?

Question 1: How to pick the best \(\lambda\)? (Slide 5)

This is the Cross-Validation (CV) Plot. Its one and only job is to help you find the optimal \(\lambda\).

  • R Code: cv.out=cv.glmnet(x[train,],y[train],alpha=1)
    • cv.glmnet: This R function automatically does K-fold cross-validation. alpha=1 explicitly tells it to use LASSO (alpha=0 would be Ridge).
    • It tries a whole range of \(\lambda\) values, calculates the Mean-Squared Error (MSE) for each, and stores the results in cv.out.
  • Plot Analysis:
    • X-axis: \(\text{Log}(\lambda)\). The penalty strength. Right = High Penalty (simple model), Left = Low Penalty (complex model).
    • Y-axis: Mean-Squared Error (MSE). Lower is better.
    • Red Dots: The average MSE for each \(\lambda\).
    • Gray Bars: The error bars (standard error).
    • The “U” Shape: This is the classic bias-variance trade-off.
      • Right Side (High \(\lambda\)): The model is too simple (too many coefficients are 0). It’s “underfitting.” The error is high (high bias).
      • Left Side (Low \(\lambda\)): The model is too complex (low penalty, like OLS). It’s “overfitting” the training data. The error on new data is high (high variance).
      • Bottom of the “U”: This is the “sweet spot.” The \(\lambda\) at the very bottom (marked by the left vertical dotted line) gives the lowest possible MSE. This is lambda.min.

Answer: You pick the \(\lambda\) that corresponds to the lowest point on this graph.

Question 2: What does LASSO do? (Slides 5, 6, 7)

These slides all show the effect of LASSO.

A. The Coefficient Path Plots (Slides 5 & 6)

These plots visualize how coefficients change. They show the same information just with different x-axes.

  • Left Plot (Slide 6) vs. \(\lambda\):
    • How to read: Read from RIGHT to LEFT.
    • At the far right (\(\lambda\) is large), all coefficients are 0.
    • As you move left, \(\lambda\) gets smaller, and the penalty is relaxed. Variables “enter” the model one by one as their coefficients become non-zero.
    • You can see ‘Rating’ (red-dashed), ‘Student’ (black-solid), and ‘Income’ (blue-dotted) are the first to enter, suggesting they are the most important predictors.
  • Right Plot (Slide 6) vs. \(L_1\) Norm Ratio:
    • This is the same plot, just flipped and rescaled. The x-axis is \(\|\hat{\beta}_\lambda\|_1 / \|\hat{\beta}_{OLS}\|_1\).
    • How to read: Read from LEFT to RIGHT.
    • At 0.0: This is a “0% budget” (like \(s=0\) or \(\lambda=\infty\)). All coefficients are 0.
    • At 1.0: This is a “100% budget” (like \(s=\infty\) or \(\lambda=0\)). This is the full OLS model.
    • This view clearly shows the coefficients “growing” from 0 as their “budget” (\(L_1\) Norm) increases.

B. The Code Output (Slide 7) - This is the most important “answer”

This slide explicitly demonstrates variable selection by comparing the coefficients from two different \(\lambda\) values.

  • First Block (The “Optimal” Model):

    • bestlam.cv <- cv.out$lambda.min: This gets the \(\lambda\) from the bottom of the “U” in the CV plot.
    • lasso.conf <- predict(out,type="coefficients",s=bestlam.cv)[1:12,]: This gets the coefficients using that best \(\lambda\).
    • lasso.conf[lasso.conf!=0]: This R command filters the list to show only the non-zero coefficients.
    • Result: The optimal model still keeps 10 variables (‘Income’, ‘Limit’, ‘Rating’, etc.). It has shrunk them, but it hasn’t set many to 0.
  • Second Block (The “High Penalty” Model):

    • The slide text says “if we choose a larger regularization parameter.” Here, they’ve picked an arbitrary larger value, s=10. (Note: R’s predict.glmnet can be confusing; s=10 here means \(\lambda=10\)).
    • lasso.conf <- predict(out,type="coefficients",s=10)[1:12,]: This gets the coefficients using a stronger penalty (\(\lambda=10\)).
    • lasso.conf[lasso.conf!=0]: Again, show only the non-zero coefficients.
    • Result: Look! The list is much shorter. The coefficients for ‘Age’, ‘Education’, ‘GenderFemale’, ‘MarriedYes’, and ‘Ethnicity’ are all gone (shrunk to 0.000000). The model has decided these are not important enough to “spend” budget on.

Conclusion: LASSO performs automatic variable selection. By increasing \(\lambda\), you create a sparser (simpler) model. Slide 7 is the concrete proof.

Python Equivalents (in more detail)

Here is how you would replicate the entire workflow from the slides in Python.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV, lasso_path
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# --- Assume X_train, y_train, X_test, y_test are loaded ---
# Example:
# data = pd.read_csv('Credit.csv')
# X = pd.get_dummies(data.drop(['ID', 'Balance'], axis=1), drop_first=True)
# y = data['Balance']
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# It's CRITICAL to scale data before regularization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
feature_names = X.columns


# 1. Replicate the CV Plot (Slide 5: ...000200.png)
# LassoCV does what cv.glmnet does: finds the best lambda (alpha)
print("Running LassoCV to find best lambda (alpha)...")
# 'alphas' is the list of lambdas to try. We can let it choose automatically.
# cv=10 means 10-fold cross-validation.
lasso_cv = LassoCV(cv=10, random_state=1, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)

# The best lambda found
best_lambda = lasso_cv.alpha_
print(f"Best lambda (alpha) found: {best_lambda}")

# --- Plotting the CV (MSE vs. Log(Lambda)) ---
# This recreates the R plot
plt.figure(figsize=(10, 6))
# lasso_cv.mse_path_ is a (n_alphas, n_folds) array of MSEs
# We take the mean across the folds (axis=1)
mean_mses = np.mean(lasso_cv.mse_path_, axis=1)
log_lambdas = np.log10(lasso_cv.alphas_)

plt.plot(log_lambdas, mean_mses, 'r.-')
plt.xlabel('Log(Lambda / Alpha)')
plt.ylabel('Mean-Squared Error')
plt.title('LASSO Cross-Validation Path (Replicating R Plot)')
# Plot a vertical line at the best lambda
plt.axvline(np.log10(best_lambda), linestyle='--', color='k', label=f'Best Lambda (alpha) = {best_lambda:.2f}')
plt.legend()
plt.gca().invert_xaxis() # High lambda is on the right in R plot
plt.show()


# 2. Replicate the Coefficient Path Plot (Slide 6: ...000206.png)
# We can use the lasso_path function, or just use the CV object

# The lasso_cv object already calculated the paths!
coefs = lasso_cv.path(X_train_scaled, y_train, alphas=lasso_cv.alphas_)[1].T

plt.figure(figsize=(10, 6))
for i in range(X_train_scaled.shape[1]):
    plt.plot(log_lambdas, coefs[:, i], label=feature_names[i])

plt.xlabel('Log(Lambda / Alpha)')
plt.ylabel('Standardized Coefficients')
plt.title('LASSO Coefficient Path (Replicating R Plot)')
plt.legend(loc='upper right')
plt.gca().invert_xaxis()
plt.show()


# 3. Replicate the Code Output (Slide 7: ...000202.png)
print("\n--- Replicating R Output ---")

# --- First Block: Coefficients with BEST lambda ---
print(f"Coefficients using best lambda (alpha = {best_lambda:.4f}):")
# The lasso_cv object is already fitted with the best lambda
best_coefs = lasso_cv.coef_
coef_series_best = pd.Series(best_coefs, index=feature_names)
# This is like R's `lasso.conf[lasso.conf != 0]`
print(coef_series_best[coef_series_best != 0])


# --- Second Block: Coefficients with a LARGER lambda ---
# Let's pick a larger lambda, e.g., 10 (like the slide)
large_lambda = 10
lasso_high_penalty = Lasso(alpha=large_lambda)
lasso_high_penalty.fit(X_train_scaled, y_train)

print(f"\nCoefficients using larger lambda (alpha = {large_lambda}):")
high_pen_coefs = lasso_high_penalty.coef_
coef_series_high = pd.Series(high_pen_coefs, index=feature_names)
# This is the second R command: `lasso.conf[lasso.conf != 0]`
print(coef_series_high[coef_series_high != 0])

# --- Final Prediction ---
# This is R's `mean((lasso.pred-y.test)^2)`
y_pred = lasso_cv.predict(X_test_scaled)
test_mse = mean_squared_error(y_test, y_pred)
print(f"\nTest MSE using best lambda: {test_mse:.2f}")

The “Game” of Regularization

First, let’s understand what these plots are showing. This is a “map” of a constrained optimization problem.

  • The Red Ellipses (RSS Contours): Think of these as contour lines on a topographic map.
    • The Center (\(\hat{\beta}\)): This point is the “bottom of the valley.” It represents the perfect, unconstrained solution—the standard Ordinary Least Squares (OLS) coefficients. This point has the lowest possible Residual Sum of Squares (RSS), or error.
    • The Lines: Every point on a single red ellipse has the exact same RSS. As the ellipses get bigger (moving away from the center \(\hat{\beta}\)), the error gets higher.
  • The Blue Shaded Area (Constraint Region): This is the “rule” of the game.
    • This is our “budget.” We are only allowed to pick a solution (\(\beta_1, \beta_2\)) from inside or on the boundary of this blue shape.
    • LASSO: The constraint is \(|\beta_1| + |\beta_2| \le s\). This equation forms a diamond (or a rotated square).
    • Ridge: The constraint is \(\beta_1^2 + \beta_2^2 \le s\). This equation forms a circle.
  • The Goal: Find the “best” point that is inside the blue area.
    • The “best” point is the one with the lowest possible error (RSS).
    • Geometrically, this means we start at the center (\(\hat{\beta}\)) and expand our ellipse outward. The very first point where the ellipse touches the blue constraint region is our solution.

Why LASSO Performs Variable Selection (The Diamond) 🎯

This is the most important concept. Look at the LASSO diagrams.

  • The Shape: The LASSO constraint is a diamond.
  • The Key Feature: This diamond has sharp corners (vertices). And most importantly, these corners lie exactly on the axes.
    • The top corner is at \((\beta_1=0, \beta_2=s)\).
    • The right corner is at \((\beta_1=s, \beta_2=0)\).
  • The “Collision”: Now, imagine the red ellipses (representing our error) expanding from the OLS solution (\(\hat{\beta}\)). They will almost always “hit” the blue diamond at one of its sharp corners.
    • Look at your textbook diagram (slide ...000304.png). The ellipse clearly makes contact with the diamond at the top corner, where \(\beta_1 = 0\).
    • Look at your example (slide ...000259.jpg). The center of the ellipses is at (4, 0.1). The closest point on the diamond that the expanding ellipses will hit is the corner at (2, 0). At this solution, \(y\) is exactly 0.

Conclusion: Because the \(L_1\) “diamond” has corners on the axes, the optimal solution is very likely to land on one of them. When it does, the coefficient for the other axis is set to exactly zero. This is the variable selection property.

Why Ridge Regression Only Shrinks (The Circle) 🤏

Now, look at the Ridge regression diagram.

  • The Shape: The Ridge constraint is a circle.
  • The Key Feature: A circle is perfectly smooth and has no corners.
  • The “Collision”: Imagine the same ellipses expanding and hitting the blue circle. The contact point will be a tangent point.
    • Because the circle is round, this tangent point can be anywhere on its circumference.
    • It is extremely unlikely that the contact point will be exactly on an axis (e.g., at \((\beta_1=0, \beta_2=s)\)). This would only happen if the OLS solution \(\hat{\beta}\) was already perfectly aligned with that axis.
  • Conclusion: The Ridge solution will find a point where both \(\beta_1\) and \(\beta_2\) are non-zero. The coefficients are “shrunk” (pulled in from \(\hat{\beta}\) towards the origin), but they never become zero. This is why Ridge is called a “shrinkage” method, but not a “variable selection” method.

Summary: Diamond vs. Circle

| Feature | LASSO (\(L_1\) Norm) | Ridge (\(L_2\) Norm) |
| --- | --- | --- |
| Constraint Shape | Diamond (or hyper-rhombus) | Circle (or hypersphere) |
| Key Feature | Sharp corners on the axes | Smooth curve with no corners |
| Geometric Solution | Ellipses hit the corners | Ellipses hit a smooth part |
| Result | Forces some coefficients to exactly 0 | Shrinks all coefficients towards 0 |
| Name | Variable Selection | Shrinkage |

The “space meaning” is that the sharp corners of the \(L_1\) diamond are what make variable selection possible. The smooth circle of the \(L_2\) norm does not have these corners and thus cannot force coefficients to zero.
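
If it helps to see the shapes rather than imagine them, here is a minimal matplotlib sketch (not from the slides) that draws the two constraint boundaries for a budget of \(s = 1\):

import numpy as np
import matplotlib.pyplot as plt

s = 1.0  # the coefficient "budget"

# LASSO boundary |b1| + |b2| = s: a diamond whose corners sit on the axes
diamond = np.array([[s, 0], [0, s], [-s, 0], [0, -s], [s, 0]])

# Ridge boundary b1^2 + b2^2 = s: a circle with no corners
theta = np.linspace(0, 2 * np.pi, 200)
circle = np.column_stack([np.sqrt(s) * np.cos(theta), np.sqrt(s) * np.sin(theta)])

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].plot(diamond[:, 0], diamond[:, 1])
axes[0].set_title(r"LASSO: $|\beta_1| + |\beta_2| \leq s$")
axes[1].plot(circle[:, 0], circle[:, 1])
axes[1].set_title(r"Ridge: $\beta_1^2 + \beta_2^2 \leq s$")
for ax in axes:
    ax.axhline(0, color="gray", lw=0.5)
    ax.axvline(0, color="gray", lw=0.5)
    ax.set_aspect("equal")
    ax.set_xlabel(r"$\beta_1$")
    ax.set_ylabel(r"$\beta_2$")
plt.tight_layout()
plt.show()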

8. Shrinkage Methods (Lasso vs. Ridge)

Core Concept: Shrinkage Methods

Both Ridge (L2) and Lasso (L1) are regularization techniques used to improve upon standard Ordinary Least Squares (OLS) regression.

Their main goal is to manage the bias-variance tradeoff. OLS often has low bias but very high variance, especially when you have many predictors (\(p\)) or when predictors are correlated. Ridge and Lasso improve prediction accuracy by shrinking the regression coefficients towards zero. This adds a small amount of bias but significantly reduces the variance, leading to a lower overall Test Mean Squared Error (MSE).

The Key Difference: Math & How They Shrink

The slides show that the two methods use different penalties, which leads to very different mathematical forms and practical outcomes.

  • Ridge Regression (L2 Penalty): Minimizes \(RSS + \lambda \sum_{j=1}^{p} \beta_j^2\)
  • Lasso Regression (L1 Penalty): Minimizes \(RSS + \lambda \sum_{j=1}^{p} |\beta_j|\)

Slide 80 provides the exact formulas for their coefficient estimates in a simple, orthogonal case (where predictors are independent):

Ridge Regression (Proportional Shrinkage)

  • Formula: \(\hat{\beta}_j^R = \hat{\beta}_j^{LSE} / (1 + \lambda)\)
  • What this means: Ridge shrinks every least squares coefficient by a proportional amount. It will make coefficients smaller, but it will never set them to exactly zero (unless \(\lambda\) is \(\infty\)).

Lasso Regression (Soft-Thresholding)

  • Formula: \(\hat{\beta}_j^L = \text{sign}(\hat{\beta}_j^{LSE})(|\hat{\beta}_j^{LSE}| - \lambda/2)_+\)
  • What this means: This is a “soft-thresholding” operator.
    • If the original coefficient \(\hat{\beta}_j^{LSE}\) is small (its absolute value is less than \(\lambda/2\)), Lasso sets it to exactly zero.
    • If the coefficient is large, Lasso subtracts \(\lambda/2\) from its absolute value, shrinking it towards zero.
  • Key Property: Because of this, Lasso performs automatic feature selection by eliminating predictors. (A small numeric check of both shrinkage rules follows below.)
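
As promised, a tiny NumPy check of the two orthogonal-case rules; the least-squares coefficients and \(\lambda\) below are made-up numbers for illustration only:

import numpy as np

beta_lse = np.array([5.0, 1.5, -0.4, 3.0])  # hypothetical least-squares estimates
lam = 4.0

# Ridge (orthogonal case): proportional shrinkage, beta / (1 + lambda)
beta_ridge = beta_lse / (1 + lam)

# Lasso (orthogonal case): soft-thresholding, sign(beta) * max(|beta| - lambda/2, 0)
beta_lasso = np.sign(beta_lse) * np.maximum(np.abs(beta_lse) - lam / 2, 0)

print(beta_ridge)  # every coefficient is smaller, but none is exactly zero
print(beta_lasso)  # the two coefficients with |beta| < lambda/2 are now exactly zero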

Important Images Explained

Most Important: Figure 6.10 (Slide 82)

This is the best visual for understanding the mathematical difference from the formulas above.

  • Left (Ridge): The red line shows the Ridge estimate vs. the OLS estimate. It’s a straight, diagonal line with a slope less than 1. It shrinks everything proportionally.
  • Right (Lasso): The red line shows the Lasso estimate. It’s “flat” at zero for a range, showing it sets small coefficients to zero. Then, it slopes up, but it’s shifted (it shrinks the large coefficients by a fixed amount).

Scenario 1: Figure 6.8 (Slide 76)

This plot shows what happens when all 45 predictors are truly related to the response.

  • Result (Slide 77): Ridge performs slightly better (has a lower minimum MSE, shown by the dotted purple line).
  • Why: Lasso’s assumption (that some coefficients are zero) is wrong in this case. By forcing some relevant predictors to zero, it adds too much bias. Ridge, by just shrinking all of them, finds a better balance.

Scenario 2: Figure 6.9 (Slide 78)

This plot shows the opposite scenario: only 2 out of 45 predictors are truly related (a “sparse” model).

  • Result: Lasso performs much better (its solid purple line has a much lower minimum MSE).
  • Why: Lasso’s assumption is correct. It successfully sets the 43 “noise” predictors to zero, which dramatically reduces variance, while correctly keeping the 2 important ones.

Python & Code Understanding

The slides don’t contain Python code, but they describe the exact concepts you would use, primarily in scikit-learn.

  • Implementing Ridge & Lasso:

    from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    # It's crucial to scale data before regularization
    # alpha is the same as the λ (lambda) in your slides

    # --- Ridge ---
    # The math for Ridge is a "closed-form solution" (Slide 80)
    # ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

    # --- Lasso ---
    # Lasso requires a numerical solver (like coordinate descent)
    # lasso_model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
  • The Soft-Thresholding Formula: The math from Slide 80, \(\text{sign}(y)(|y| - \lambda/2)_+\), is the core operation in the “coordinate descent” algorithm used to solve Lasso. You could write it in Python/Numpy:

    import numpy as np

    def soft_threshold(x, lambda_val):
        """Implements the Lasso soft-thresholding formula."""
        return np.sign(x) * np.maximum(0, np.abs(x) - (lambda_val / 2))

    # Example:
    # ols_coefficient = 1.5
    # threshold = 4.0
    # lasso_coefficient = soft_threshold(ols_coefficient, threshold)
    # print(lasso_coefficient) # Output: 0.0

    # ols_coefficient = 3.0
    # threshold = 4.0
    # lasso_coefficient = soft_threshold(ols_coefficient, threshold)
    # print(lasso_coefficient) # Output: 1.0 (it was 3.0, shrunk by 4/2 = 2)
  • Choosing \(\lambda\) (alpha): Slide 79 says to “Use cross validation to determine which one has better prediction.” In scikit-learn, this is done for you with RidgeCV and LassoCV, which automatically test a range of alpha values.

Summary: Lasso vs. Ridge

| Feature | Ridge (L2) | Lasso (L1) |
| --- | --- | --- |
| Penalty | \(L_2\) norm: \(\lambda \sum \beta_j^2\) | \(L_1\) norm: \(\lambda \sum \lvert\beta_j\rvert\) |
| Coefficient Shrinkage | Proportional; shrinks all coefficients, but never to exactly zero. | Soft-thresholding; can force coefficients to be exactly zero. |
| Feature Selection? | No | Yes, this is its main advantage. |
| Interpretability | Less interpretable (keeps all \(p\) variables). | More interpretable (produces a “sparse” model with fewer variables). |
| Best Used When… | …most predictors are useful. (e.g., Slide 76: 45/45 relevant). | …many predictors are “noise” and only a few are strong. (e.g., Slide 78: 2/45 relevant). |
| Computation | Has a simple, closed-form solution. | Requires numerical optimization (e.g., coordinate descent). |

9. Shrinkage Methods (Ridge & LASSO)

Summary of Shrinkage Methods (Ridge & LASSO)

These slides introduce shrinkage methods, also known as regularization, a technique used in regression (like linear regression) to improve model performance. The main idea is to add a penalty to the model’s loss function to “shrink” the size of the coefficients. This helps to reduce model variance and prevent overfitting, especially when you have many features.

The two main methods discussed are Ridge Regression (\(L_2\) penalty) and LASSO (\(L_1\) penalty).

Key Mathematical Formulas

  1. Standard Linear Model: The problem starts with the standard linear regression model (from slide 1):

    \[\mathbf{y} = \mathbf{X}\beta + \epsilon\]

    • \(\mathbf{y}\) is the \(n \times 1\) vector of observed outcomes.

    • \(\mathbf{X}\) is the \(n \times p\) matrix of \(p\) predictor features for \(n\) observations.
    • \(\beta\) is the \(p \times 1\) vector of coefficients (what we want to find).
    • \(\epsilon\) is the \(n \times 1\) vector of random errors.
    • The goal of standard “Ordinary Least Squares” (OLS) regression is to find the \(\beta\) that minimizes the loss: \(\|\mathbf{X}\beta - \mathbf{y}\|^2_2\).
  2. LASSO (L1 Regularization): LASSO (Least Absolute Shrinkage and Selection Operator) adds a penalty based on the absolute value of the coefficients (the \(L_1\)-norm). This is the key formula from slide 1:

    \[\hat{\beta}(\lambda) \leftarrow \arg \min_{\beta} \left( \|\mathbf{X}\beta - \mathbf{y}\|^2_2 + \lambda\|\beta\|_1 \right)\]

    • \(\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|\)

    • \(\lambda\) (lambda) is the tuning parameter that controls the strength of the penalty. A larger \(\lambda\) means more shrinkage.
    • Key Property (Variable Selection): The \(L_1\) penalty can force some coefficients (\(\beta_j\)) to become exactly zero. This means LASSO simultaneously performs feature selection by automatically removing irrelevant predictors.
    • Support (Slide 1): The question “Can it recover the support of \(\beta\)?” is asking if LASSO can correctly identify the set of true non-zero coefficients (defined as \(S := \{j : \beta_j \neq 0\}\)).
  3. Ridge Regression (L2 Regularization): Ridge regression (mentioned on slide 2, shown on slide 3) adds a penalty based on the squared value of the coefficients (the \(L_2\)-norm).

    \[\hat{\beta}(\lambda) \leftarrow \arg \min_{\beta} \left( \|\mathbf{X}\beta - \mathbf{y}\|^2_2 + \lambda\|\beta\|^2_2 \right)\]

    • \(\|\beta\|^2_2 = \sum_{j=1}^{p} \beta_j^2\)

    • Key Property (Shrinkage): The \(L_2\) penalty shrinks coefficients towards zero but never sets them to exactly zero (unless \(\lambda = \infty\)). It is effective at handling multicollinearity.

Important Images & Concepts

The most important images are the plots from slides 3 and 4. They illustrate the two most critical concepts: how to choose \(\lambda\) and what the penalty does to the coefficients.

Tuning Parameter Selection (Slides 3 & 4, Left Plots)

  • Problem: How do you find the best value for \(\lambda\)?
  • Solution: Cross-Validation (CV). The slides show 10-fold CV.
  • What the Plots Show: The left plots on slides 3 and 4 show the Cross-Validation Error (like MSE) for different values of the penalty.
    • The x-axis represents the penalty strength (either \(\lambda\) itself or a related measure like the shrinkage ratio \(\|\hat{\beta}_\lambda\|_1 / \|\hat{\beta}\|_1\)).
    • The y-axis is the prediction error.
    • The curve is typically U-shaped. The vertical dashed line marks the minimum of this curve. This minimum point corresponds to the optimal \(\lambda\), which provides the best balance between bias and variance, leading to the best-performing model on unseen data.

Coefficient Paths (Slides 3 & 4, Right Plots)

These “trace” plots are crucial for understanding the difference between Ridge and LASSO. They show how the value of each coefficient (y-axis) changes as the penalty strength (x-axis) changes.

  • Slide 3 (Ridge): As \(\lambda\) increases (moving right), all coefficient values are smoothly shrunk towards zero, but none of them actually hit zero.
  • Slide 4 (LASSO): As the penalty increases (moving from right to left, as the ratio \(s\) goes from 1.0 to 0.0), you can see coefficients “drop off” and become exactly zero one by one. The model with the optimal \(\lambda\) (vertical line) has selected only a few non-zero coefficients (the pink and teal lines), while all the grey lines have been set to zero. This is feature selection in action.

Key Discussion Points (Slide 2)

  • Non-linear models: You can apply these methods to non-linear models by first creating non-linear features (e.g., \(x_1^2\), \(x_2^2\), \(x_1 \cdot x_2\)) and then feeding them into a LASSO or Ridge model. The regularization will then select which of these linear or non-linear terms are important.
  • Correlated Features (Multicollinearity): The question “If \(x_j \approx x_k\), how does LASSO behave?” is a key weakness of LASSO.
    • LASSO: Tends to arbitrarily select one of the correlated features and set the others to zero. This can make the model unstable.
    • Ridge: Tends to shrink the coefficients of correlated features together, giving them similar (but smaller) values.
    • Elastic Net (not shown) is a hybrid of Ridge and LASSO that is often used to get the best of both worlds: it can select groups of correlated variables.

Python Code Understanding (using scikit-learn)

Here is how you would implement these concepts in Python.

# Import necessary libraries
import numpy as np
from sklearn.linear_model import Lasso, Ridge, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# --- Assume you have your data ---
# X: your feature matrix (e.g., shape 100, 20)
# y: your target vector (e.g., shape 100,)
# X, y = ... load your data ...

# 1. It's crucial to scale your data before regularization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Find the optimal lambda (alpha) using Cross-Validation
# scikit-learn uses 'alpha' instead of 'lambda' for the tuning parameter.

# --- For LASSO ---
# LassoCV automatically performs cross-validation (e.g., cv=10)
# to find the best alpha.
lasso_cv_model = LassoCV(cv=10, random_state=0)
lasso_cv_model.fit(X_scaled, y)

# Get the best alpha (lambda)
best_alpha_lasso = lasso_cv_model.alpha_
print(f"Optimal alpha (lambda) for LASSO: {best_alpha_lasso}")

# Get the final coefficients
lasso_coeffs = lasso_cv_model.coef_
print(f"LASSO coefficients: {lasso_coeffs}")
# You will see that many of these are exactly 0.0

# --- For Ridge ---
# RidgeCV works similarly. It's often good to test alphas on a log scale.
ridge_alphas = np.logspace(-3, 3, 100) # 100 values from 0.001 to 1000
ridge_cv_model = RidgeCV(alphas=ridge_alphas, store_cv_values=True)
ridge_cv_model.fit(X_scaled, y)

# Get the best alpha (lambda)
best_alpha_ridge = ridge_cv_model.alpha_
print(f"Optimal alpha (lambda) for Ridge: {best_alpha_ridge}")

# Get the final coefficients
ridge_coeffs = ridge_cv_model.coef_
print(f"Ridge coefficients: {ridge_coeffs}")
# You will see these are small, but not exactly zero.

Bias-variance tradeoff

Key Mathematical Formulas & Concepts

LASSO: Sign Consistency

This is the “ideal” scenario for LASSO. Sign consistency means that, with enough data, the LASSO model not only selects the correct set of features (it recovers the “support” \(S\)) but also correctly identifies the sign (positive or negative) of their coefficients.

  • The Goal (Slide 1):

    \[\text{sign}(\hat{\beta}(\lambda)) = \text{sign}(\beta)\]

    This means the signs of our estimated coefficients \(\hat{\beta}(\lambda)\) match the signs of the true underlying coefficients \(\beta\).

  • The “Irrepresentable Condition” (Slide 1): This is the mathematical guarantee required for LASSO to achieve sign consistency.

    \[\left\| \mathbf{X}_{S^c}^\top \mathbf{X}_S (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \text{sign}(\beta_S) \right\|_\infty < 1\]

    • Plain English: This formula is a complex way of saying: The irrelevant features (\(\mathbf{X}_{S^c}\)) cannot be too strongly correlated with the true, relevant features (\(\mathbf{X}_S\)). (A small numeric check is sketched right after this list.)

    • If an irrelevant feature is very similar (highly correlated) to a true feature, LASSO can get “confused” and might pick the wrong one, or its estimate will be unstable. This condition fails.
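
As noted above, here is a minimal NumPy sketch (synthetic data, nothing from the slides) that evaluates the left-hand side of the condition for a toy design: two relevant features and one irrelevant feature strongly correlated with the first:

import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.standard_normal(n)                 # relevant feature 1
x2 = rng.standard_normal(n)                 # relevant feature 2
rho = 0.95                                  # how strongly the irrelevant feature tracks x1
x3 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)

X_S = np.column_stack([x1, x2])             # columns in the true support S
X_Sc = x3.reshape(-1, 1)                    # irrelevant column(s)
sign_beta_S = np.array([1.0, 1.0])          # assume both true coefficients are positive

# Left-hand side of the irrepresentable condition
lhs = np.abs(X_Sc.T @ X_S @ np.linalg.inv(X_S.T @ X_S) @ sign_beta_S).max()
print(f"LHS = {lhs:.2f} (the condition requires < 1)")
# As rho approaches 1 the LHS approaches (and can exceed) 1, which is exactly
# when LASSO's support recovery becomes unreliable.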

Ridge Regression: The Bias-Variance Tradeoff

  • The Formula (Slide 3):

    \[\hat{\beta}_{\text{ridge}}(\lambda) \leftarrow \arg \min_{\beta} \left( \|\mathbf{y} - \mathbf{X}\beta\|^2 + \lambda\|\beta\|^2 \right)\]

    (Note: This is the \(L_2\) penalty, so \(\|\beta\|^2 = \sum \beta_j^2\))

  • The Problem it Solves: Collinearity (Slide 2) When features are strongly correlated (e.g., \(x_i \approx x_j\)), regular methods fail:

    • LSE (OLS): Fails because the matrix \(\mathbf{X}^\top \mathbf{X}\) is “non-invertible” (or singular), so the math for the solution \(\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) breaks down.
    • LASSO: Fails because the Irrepresentable Condition is violated. LASSO will tend to arbitrarily pick one of the correlated features and set the others to zero.
  • The Ridge Solution (Slide 3):

    1. Always has a solution: Adding the \(\lambda\) penalty makes the matrix math work, even if \(\mathbf{X}^\top \mathbf{X}\) is non-invertible.
    2. Groups variables: This is the key takeaway. Instead of arbitrarily picking one feature, Ridge tends to shrink the coefficients of collinear variables together.
    3. Bias-Variance Tradeoff: Ridge introduces bias into the estimates (they are “wrong” on purpose) to massively reduce variance (they are more stable and less sensitive to the specific training data). This trade-off usually leads to a much lower overall error (Mean Squared Error).

Important Images & Key Takeaways

  1. Slide 2 (Collinearity Failures): This is the most important “problem” slide. It clearly explains why you can’t always use standard LSE or LASSO. The fact that all three methods (LSE, LASSO, Forward Selection) fail with strong collinearity motivates the need for Ridge.

  2. Slide 3 (Ridge Properties): This is the most important “solution” slide. The two most critical points are:

    • Always unique solution for λ > 0
    • Collinear variables tend to be grouped! (This is the “fix” for the problem on Slide 2).

Python Code Understanding

Let’s demonstrate the key difference (Slide 3) in how LASSO and Ridge handle collinear features.

We will create two features, x1 and x2, that are nearly identical.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# 1. Create a dataset with 2 strongly correlated features
np.random.seed(0)
n_samples = 100
# x1: a standard feature
x1 = np.random.randn(n_samples)
# x2: almost identical to x1
x2 = x1 + 0.01 * np.random.randn(n_samples)

# Combine into our feature matrix X
X = np.c_[x1, x2]

# y: The target variable (let's say y = 2*x1 + 2*x2)
y = 2 * x1 + 2 * x2 + np.random.randn(n_samples)

# 2. Fit LASSO (alpha is the same as lambda)
# We use a moderate alpha
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X, y)

# 3. Fit Ridge (alpha is the same as lambda)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X, y)

# 4. Compare the coefficients
print("--- Results for Correlated Features ---")
print(f"True Coefficients: [2.0, 2.0]")
print(f"LASSO Coefficients: {np.round(lasso_model.coef_, 2)}")
print(f"Ridge Coefficients: {np.round(ridge_model.coef_, 2)}")

Example Output:

--- Results for Correlated Features ---
True Coefficients: [2.0, 2.0]
LASSO Coefficients: [3.89 0. ]
Ridge Coefficients: [1.95 1.94]

Code Explanation:

  • LASSO: As predicted by the slides, LASSO failed to find the true model. It arbitrarily picked x1, gave it a large coefficient, and set x2 to zero. This is unstable and not what we wanted.
  • Ridge: As predicted by Slide 3, Ridge handled the collinearity perfectly. It identified that both x1 and x2 were important and “grouped” them by assigning them nearly identical, stable coefficients (1.95 and 1.94), which are very close to the true values of 2.0.

10. Elastic Net

Overall Summary

These slides introduce Elastic Net, a modern regression method that solves the major weaknesses of its two predecessors, Ridge and LASSO regression.

  • Ridge is good for collinearity (correlated features) but can’t do variable selection (it can’t set any feature’s coefficient to exactly zero).
  • LASSO is good for variable selection (it creates sparse models by setting coefficients to zero) but behaves unstably when features are correlated (it tends to randomly pick one and discard the others).

Elastic Net combines the L1 penalty of LASSO and the L2 penalty of Ridge. The result is a single, flexible model that:

  1. Performs variable selection (like LASSO).
  2. Handles correlated features stably by grouping them together (like Ridge).
  3. Can select more features than samples (\(p > n\)), which LASSO cannot do.

Slide 1: The Definition and Formula (File: ...020245.png)

This slide explains why Elastic Net was created and defines it mathematically.

  • The Problem: It states the exact trade-off:
    • “Ridge regression can handle collinearity, but cannot perform variable selection;”
    • “LASSO can perform variable selection, but performs poorly when collinearity;”
  • The Solution (The Formula): The core of the method is this optimization formula: \[\hat{\beta}_{eNet}(\lambda, \alpha) \leftarrow \arg \min_{\beta} \left( \underbrace{\|\mathbf{y} - \mathbf{X}\beta\|^2}_{\text{Loss}} + \lambda \left( \underbrace{\alpha\|\beta\|_1}_{\text{L1 Penalty}} + \underbrace{\frac{1-\alpha}{2}\|\beta\|_2^2}_{\text{L2 Penalty}} \right) \right)\]
  • Breaking Down the Formula:
    • \(\|\mathbf{y} - \mathbf{X}\beta\|^2\): This is the standard “Residual Sum of Squares” (RSS). We want to find coefficients (\(\beta\)) that make the model’s predictions (\(X\beta\)) as close as possible to the true values (\(y\)).
    • \(\lambda\) (Lambda): This is the master knob for total regularization strength. A larger \(\lambda\) means a bigger penalty, which “shrinks” all coefficients more.
    • \(\alpha\) (Alpha): This is the mixing parameter that balances L1 and L2. This is the key innovation.
      • \(\alpha\|\beta\|_1\): This is the L1 (LASSO) part. It forces weak coefficients to become exactly zero, thus selecting variables.
      • \(\frac{1-\alpha}{2}\|\beta\|_2^2\): This is the L2 (Ridge) part. It shrinks all coefficients and, crucially, encourages correlated features to have similar coefficients (the grouping effect).
  • The Special Cases:
    • If \(\alpha = 0\), the L1 term vanishes, and the model becomes pure Ridge Regression.
    • If \(\alpha = 1\), the L2 term vanishes, and the model becomes pure LASSO Regression.
    • If \(0 < \alpha < 1\), you get Elastic Net, which “encourages grouping of correlated variables” and “can perform variable selection.”

Slide 2: The Intuition and The Grouping Effect (File: ...020249.jpg)

This slide gives you the visual intuition and the practical proof of why Elastic Net works. It has two parts.

Part 1: The Three Graphs (Geometric Intuition)

These graphs show the constraint region (the shaded shape) for each penalty. The model tries to find the best coefficients (\(\theta_{opt}\)), and the final solution (the green dot) is the first point where the cost function (the blue ellipses) “touches” the constraint region.

  • L1 Norm (LASSO): The region is a diamond. Because of its sharp corners, the ellipses are very likely to hit a corner first. At a corner, one of the coefficients (e.g., \(\theta_1\)) is zero. This is a visual explanation of how LASSO creates sparsity (variable selection).
  • L2 Norm (Ridge): The region is a circle. It has no corners. The ellipses will hit a “smooth” point on the circle, shrinking both coefficients (\(\theta_1\) and \(\theta_2\)) but not setting either to zero. This is weight sharing.
  • L1 + L2 (Elastic Net): The region is a “rounded square”. It’s the perfect compromise.
    • It has “corners” (like LASSO) so it can still set coefficients to zero.
    • It has “curved edges” (like Ridge) so it’s more stable and handles correlated variables by finding a solution on an edge rather than a single sharp corner.

Part 2: The Formula (The Grouping Effect)

The text at the bottom explains Elastic Net’s “grouping effect.”

  • The Implication: “If \(x_j \approx x_k\), then \(\hat{\beta}_j \approx \hat{\beta}_k\).”
  • Meaning: If two features (\(x_j\) and \(x_k\)) are highly correlated (their values are very similar), Elastic Net will force their coefficients (\(\hat{\beta}_j\) and \(\hat{\beta}_k\)) to also be very similar.
  • Why this is good: This is the opposite of LASSO. LASSO would be unstable and might arbitrarily set \(\hat{\beta}_j\) to a large value and \(\hat{\beta}_k\) to zero. Elastic Net “groups” them: it will either keep both in the model with similar importance, or it will shrink both of them out of the model together. This is a much more stable and realistic result.
  • The Warning: “LASSO may be unstable in this case!” This directly highlights the problem that Elastic Net solves.

Slide 3: The Feature Comparison Table (File: ...020255.png)

This table is your “cheat sheet” for choosing the right model. It compares Ridge, LASSO, and Elastic Net on all their key properties.

  • Penalty: Shows the L2, L1, and combined penalties.
  • Sparsity: Can the model set coefficients to 0?
    • Ridge: No ❌
    • LASSO: Yes ✅
    • Elastic Net: Yes ✅
  • Variable Selection: This is a crucial row.
    • LASSO: Yes ✅, BUT it has a major limitation: if you have more features than samples (\(p > n\)), LASSO can select at most \(n\) features.
    • Elastic Net: Yes ✅, and it can select more than \(n\) variables. This makes it the clear choice for “wide” data problems (e.g., in genomics, where \(p=20,000\) features and \(n=100\) samples).
  • Grouping Effect: How does it handle correlated features?
    • Ridge: Strong ✅
    • LASSO: Weak ❌ (it “picks one”)
    • Elastic Net: Strong ✅
  • Solution Uniqueness: Is the answer stable?
    • Ridge: Always ✅
    • LASSO: No ❌ (not if \(X\) is “rank-deficient,” e.g., \(p > n\) or correlated features)
    • Elastic Net: Always ✅ (as long as \(\alpha < 1\), the Ridge component guarantees a unique, stable solution).
  • Use Case: When should you use each?
    • Ridge: For prediction, especially with multicollinearity.
    • LASSO: For interpretability and creating sparse models (when you think only a few features matter).
    • Elastic Net: The best all-arounder. Use it for correlated predictors, when \(p \gg n\), or when you need both sparsity + stability.

Code Understanding (Python scikit-learn)

When you use this in Python, be aware of a common confusion in the parameter names:

| Concept (from your slides) | scikit-learn Parameter | Description |
| --- | --- | --- |
| \(\lambda\) (Lambda) | alpha | The overall strength of regularization. |
| \(\alpha\) (Alpha) | l1_ratio | The mixing parameter between L1 and L2. |

Example: An l1_ratio of 0 is Ridge. An l1_ratio of 1 is LASSO. An l1_ratio of 0.5 is a 50/50 mix.

from sklearn.linear_model import ElasticNet, ElasticNetCV

# 1. Initialize a specific model
# This uses l1_ratio=0.5 for the mixing parameter (the slide's alpha)
# and alpha=0.1 for the overall strength (the slide's lambda)
model = ElasticNet(alpha=0.1, l1_ratio=0.5)

# 2. A much better way: Find the best parameters automatically
# This will test l1_ratios of 0.1, 0.5, and 0.9
# and automatically find the best 'alpha' (strength) for each.
cv_model = ElasticNetCV(
    l1_ratio=[.1, .5, .9],
    cv=5  # 5-fold cross-validation
)

# 3. Fit the model to your data (X_train, y_train)
# cv_model.fit(X_train, y_train)

# 4. See the best parameters it found
# print(f"Best l1_ratio (slide's alpha): {cv_model.l1_ratio_}")
# print(f"Best alpha (slide's lambda): {cv_model.alpha_}")

11. High-Dimensional Data Analysis

The Core Problem: Large \(p\), Small \(n\)

The slides introduce the challenge of high-dimensional data, which is defined by having many more features (predictors) \(p\) than observations (samples) \(n\). This is often written as \(p \gg n\).

  • Example: Predicting blood pressure (the response \(y\)) using millions of genetic markers (SNPs) as features \(X\), but only having data from a few hundred patients.
  • Troubles:
    • Overfitting: Models become “too flexible” and learn the noise in the training data, rather than the true underlying pattern.
    • Non-Unique Solution: When \(p > n\), the standard least squares linear regression model doesn’t even have a unique solution.
    • Misleading Metrics: This leads to a common symptom: a very small training error (or high \(R^2\)) but a very large test error.

Most Important Image: The Overfitting Trap (Figure 6.23)

Figure 6.23 (from the first uploaded image) is the most critical visual for understanding the problem. It shows what happens when you add features (variables) that are completely unrelated to the outcome.

  • Left Plot (R²): The \(R^2\) on the training data increases towards 1. This looks like a perfect fit.
  • Center Plot (Training MSE): The Mean Squared Error on the training data decreases to 0. This also looks perfect.
  • Right Plot (Test MSE): The Mean Squared Error on the test data (new, unseen data) explodes. This reveals the model is garbage and has just memorized the training set.

⚠️ This is the key takeaway: In high dimensions, \(R^2\) and training MSE are useless and misleading metrics for model quality.

The Solution: Regularization & Model Selection

To combat overfitting, we must use less flexible models. The main strategy is regularization (also called shrinkage), which involves adding a penalty term to the cost function to “shrink” the model coefficients (\(\beta\)).

Mathematical Formulas & Python Code 🐍

The standard Least Squares cost function you try to minimize is: \[\text{RSS} = \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 \quad \text{or} \quad \|y - X\beta\|^2_2\] This fails when \(p > n\). The solutions modify this:

A. Ridge Regression (\(L_2\) Penalty)

  • Concept: Shrinks all coefficients towards zero, but never to zero. It’s good when many features are related to the outcome.
  • Math Formula: \[\text{Minimize: } \left( \|y - X\beta\|^2_2 + \lambda \sum_{j=1}^p \beta_j^2 \right)\]
    • The \(\lambda \sum_{j=1}^p \beta_j^2\) is the \(L_2\) penalty.
    • \(\lambda\) (lambda) is a tuning parameter that controls the penalty strength. A larger \(\lambda\) means more shrinkage.
  • Python (Scikit-learn):
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error

    # alpha is the lambda (λ) tuning parameter
    # (in practice, pick it by cross-validation, e.g. with RidgeCV)
    ridge_model = Ridge(alpha=1.0)

    # Fit the model on the training data
    ridge_model.fit(X_train, y_train)

    # Evaluate using test error (MSE on the test set),
    # NOT with training R-squared
    test_mse = mean_squared_error(y_test, ridge_model.predict(X_test))

B. The Lasso (\(L_1\) Penalty)

  • Concept: This is a very important method. The \(L_1\) penalty can force coefficients to be exactly zero. This means Lasso performs automatic feature selection, creating a sparse model.
  • Math Formula: \[\text{Minimize: } \left( \|y - X\beta\|^2_2 + \lambda \sum_{j=1}^p |\beta_j| \right)\]
    • The \(\lambda \sum_{j=1}^p |\beta_j|\) is the \(L_1\) penalty.
    • Again, \(\lambda\) is the tuning parameter.
  • Python (Scikit-learn):
    from sklearn.linear_model import Lasso

    # alpha is the lambda (λ) tuning parameter
    lasso_model = Lasso(alpha=0.1)

    # Fit the model
    lasso_model.fit(X_train, y_train)

    # The model automatically selects features
    # Coefficients that are zero were 'dropped'
    print(lasso_model.coef_)

C. Other Methods

The slides also mention:

  • Forward Stepwise Selection: A different approach where you start with no features and add them one by one, picking the one that improves the model most (based on a criterion like cross-validation error).
  • Principal Components Regression (PCR): A dimensionality reduction technique.
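Neither method is shown in code on the slides, but as a rough sketch of both ideas, scikit-learn’s SequentialFeatureSelector (with direction='forward') performs forward stepwise selection, and a PCA + LinearRegression pipeline gives a basic PCR; the dataset below is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=1.0, random_state=0)

# Forward stepwise selection: add features one at a time, guided by CV score
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward", cv=5
)
sfs.fit(X, y)
print("Selected feature indices:", sfs.get_support(indices=True))

# Principal Components Regression: reduce to a few components, then regress
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
pcr.fit(X, y)
print("PCR training R^2:", pcr.score(X, y))
```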

The Curse of Dimensionality (Figure 6.24)

This example (Figures 6.24 and its description) shows a more subtle problem.

  • Setup: A model with \(n=100\) observations and 20 true features.
  • Plots: They test Lasso by adding more and more irrelevant features:
    • \(p=20\) (Left): Lasso performs well. The lowest test MSE is found with minimal regularization.
    • \(p=50\) (Center): Lasso still works well, but it needs more regularization (a smaller “Degrees of Freedom”) to filter out the 30 junk features.
    • \(p=2000\) (Right): This is the curse of dimensionality. Even with a good method like Lasso, the 1,980 irrelevant features add so much noise that the model performs poorly regardless of the tuning parameter. The true signal is “lost in the noise.”

Summary: Cautions for \(p > n\)

The final slide gives the most important rules to follow:

  1. Beware Extreme Multicollinearity: When \(p > n\), your features are mathematically guaranteed to be linearly related, which breaks standard regression.
  2. Don’t Overstate Results: A model you find (e.g., with Lasso) is just one of many potentially good models.
  3. 🚫 DO NOT USE training \(R^2\), \(p\)-values, or training MSE to justify your model. As Figure 6.23 showed, they are misleading.
  4. ✅ DO USE test error and cross-validation error to choose your model and assess its performance.

The Core Problem: \(p \gg n\) (The “Troubles” Slide)

This slide (filename: ...020259.png) sets up the entire problem. The issue isn’t just “overfitting”; it’s a fundamental mathematical breakdown of standard methods.

  • “Large \(p\) makes our linear regression model too flexible”: This is an understatement. It leads to a problem called an underdetermined system.
  • “If \(p > n\), the LSE is not even uniquely determined”: This is the most important technical point.
    • Mathematical Reason: The standard solution for Ordinary Least Squares (OLS) is \(\hat{\beta} = (X^T X)^{-1} X^T y\).
    • \(X\) is the data matrix with \(n\) rows (observations) and \(p\) columns (features).
    • The matrix \(X^T X\) has dimensions \(p \times p\).
    • When \(p > n\), the \(X^T X\) matrix is singular, which means its determinant is zero and it cannot be inverted. The \((X^T X)^{-1}\) term does not exist.
    • “Extreme multicollinearity” (from slide ...020744.png) is the direct cause. When \(p > n\), the columns of \(X\) (the features) are guaranteed to be linearly dependent. There are infinite combinations of the features that can explain the data.
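A quick numerical sketch (assuming numpy; not from the slides) makes this concrete: with \(p > n\) the matrix \(X^T X\) is rank-deficient, and adding any null-space direction of \(X\) to a least-squares solution leaves the fitted values unchanged, so the minimizer cannot be unique.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                                   # more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX = X.T @ X                                   # the p x p matrix from the normal equations
print("rank of X^T X:", np.linalg.matrix_rank(XtX), "but its size is", XtX.shape)

# Because X^T X is singular, least squares has infinitely many exact minimizers.
# Take one solution and add a direction from the null space of X: the fit is unchanged.
beta1, *_ = np.linalg.lstsq(X, y, rcond=None)
_, _, Vt = np.linalg.svd(X)
null_direction = Vt[-1]                         # satisfies X @ null_direction ≈ 0
beta2 = beta1 + 5.0 * null_direction
print("same fitted values?", np.allclose(X @ beta1, X @ beta2))   # True
```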

The Simplest Example: \(n=2\) (Figure 6.22)

This slide (filename: ...020728.png) is the perfect illustration of the “not uniquely determined” problem.

  • Left Plot (Low-D): Many points (\(n\)), only two parameters (\(p=2\): intercept \(\beta_0\) and slope \(\beta_1\)). The line is a “best fit” that balances the errors. The training error (RSS) is non-zero.
  • Right Plot (High-D): We have \(n=2\) observations and \(p=2\) parameters.
    • You have two equations (one for each point) and two unknowns (\(\beta_0\) and \(\beta_1\)).
    • The model has exactly enough flexibility to pass perfectly through both points.
    • The result is zero training error.
    • This “perfect” fit is an illusion. If you got a new data point, this line would almost certainly be a terrible predictor. This is the essence of overfitting.

The Consequence: Misleading Metrics (Figure 6.23)

This slide (filename: ...020730.png) scales up the problem from \(n=2\) to \(n=20\) and shows why you must be cautious.

  • The Setup: \(n=20\) observations. We start with 1 feature and add more and more irrelevant, junk features.
  • Left Plot (\(R^2\)): The \(R^2\) on the training data steadily increases towards 1 as we add features. This is because, by pure chance, each new junk feature can explain a tiny bit more of the noise in the training set.
  • Center Plot (Training MSE): The training error drops to 0. This is the same as the \(n=2\) plot. Once the number of features (\(p\)) gets close to the number of observations (\(n=20\)), the model can perfectly fit the 20 data points, even if the features are random noise.
  • Right Plot (Test MSE): This is the “truth.” The actual error on new, unseen data gets worse and worse. By adding noise features, we are just “memorizing” the training set, and our model’s ability to generalize is destroyed.
  • Key Lesson: (from slide ...020744.png) This is why you must “Avoid using… \(p\)-values, \(R^2\), or other traditional measures of model on training as evidence of good fit.” They are guaranteed to lie to you when \(p > n\).

The Solutions (The “Deal with…” Slide)

This slide (filename: ...020734.png) lists the strategies to fix this. The core idea is regularization (or shrinkage). We add a “penalty” to the cost function to stop the \(\beta\) coefficients from getting too large or too numerous.

A. Ridge Regression (\(L_2\) Penalty)

  • Concept: Keeps all \(p\) features, but shrinks their coefficients. It’s excellent for handling multicollinearity.
  • Math: \(\text{Minimize: } \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^p \beta_j^2\)
    • The first part is the standard RSS.
    • The \(\lambda \sum \beta_j^2\) is the \(L_2\) penalty. It punishes large coefficient values.
  • \(\lambda\) (Lambda): This is the tuning parameter.
    • If \(\lambda=0\), it’s just OLS (which fails).
    • If \(\lambda \to \infty\), all \(\beta\)’s are shrunk to 0.
    • The right \(\lambda\) is chosen via cross-validation.
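A small sketch (assuming numpy; not from the slides) of why the penalty restores uniqueness: adding \(\lambda \mathbf{I}\) to \(X^T X\) makes the matrix full-rank and invertible even when \(p > n\), which gives the closed-form ridge solution \(\hat{\beta}^{ridge} = (X^TX + \lambda I)^{-1} X^T y\).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 1.0

XtX = X.T @ X
print("rank of X^T X:              ", np.linalg.matrix_rank(XtX))                    # < p, singular
print("rank of X^T X + lam * I:    ", np.linalg.matrix_rank(XtX + lam * np.eye(p)))  # = p, invertible

# Closed-form ridge estimate (unique because the penalized matrix is invertible)
beta_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)
print("ridge coefficient vector has shape:", beta_ridge.shape)
```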

B. The Lasso (\(L_1\) Penalty)

  • Concept: This is often preferred because it performs automatic feature selection. It shrinks many coefficients to be exactly zero.
  • Math: \(\text{Minimize: } \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^p |\beta_j|\)
    • The \(\lambda \sum |\beta_j|\) is the \(L_1\) penalty. This absolute value penalty is what allows coefficients to become exactly 0.
  • Benefit: The final model is sparse (e.g., it might say “out of 2,000 features, only these 15 matter”).

C. Tuning Parameter Choice (The Real Work)

How do you pick the best \(\lambda\)? You must use the data you have. The slides mention this and “cross validation error” (from ...020744.png).

  • Python Code (Scikit-learn): You don’t just guess alpha (which is \(\lambda\) in scikit-learn). You use a tool like LassoCV or GridSearchCV to find the best one.
    import numpy as np
    from sklearn.linear_model import LassoCV
    from sklearn.datasets import make_regression

    # Create a high-dimensional dataset
    X, y = make_regression(n_samples=100, n_features=500, n_informative=10, noise=0.1)

    # LassoCV automatically performs cross-validation to find the best alpha (lambda)
    # cv=10 means 10-fold cross-validation
    lasso_cv_model = LassoCV(cv=10, random_state=0, max_iter=10000)

    # Fit the model
    lasso_cv_model.fit(X, y)

    # This is the best lambda (alpha) it found:
    print(f"Best alpha (lambda): {lasso_cv_model.alpha_}")

    # You can now see the coefficients
    # Most of the 500 coefficients will be 0.0
    print(f"Number of non-zero features: {np.sum(lasso_cv_model.coef_ != 0)}")

A Final Warning: The Curse of Dimensionality (Figure 6.24)

This final set of slides (filenames: ...020738.png and ...020741.jpg) provides a crucial, subtle warning: Regularization is not magic.

  • The Setup: \(n=100\) observations. There are 20 real features that truly affect the response.
  • The Experiment: They run Lasso three times, adding more and more noise features:
    • Left Plot (\(p=20\)): All 20 features are real. The lowest test MSE is found with minimal regularization (high “Degrees of Freedom,” meaning many non-zero coefficients). This makes sense; you want to keep all 20 real features.
    • Center Plot (\(p=50\)): Now we have 20 real features + 30 noise features. Lasso still works! The best model is found with more regularization (fewer “Degrees of Freedom”). Lasso successfully “zeroed out” many of the 30 noise features.
    • Right Plot (\(p=2000\)): This is the curse of dimensionality. We have 20 real features + 1980 noise features. The noise has completely overwhelmed the signal. Lasso fails. The test MSE is high no matter what tuning parameter you choose. The model cannot distinguish the 20 real features from the 1980 junk ones.

Final Takeaway: Even with advanced methods like Lasso, if your \(p \gg n\) problem is too extreme (i.e., the signal-to-noise ratio is too low), it may be impossible to build a good predictive model.

The Goal: “Collaborative Filtering”

The first slide (...021218.png) uses the term Collaborative Filtering. This is the key concept. The model “collaborates” by using the ratings of all users to fill in the blanks for a single user.

  • How it works: The model assumes your “taste” (vector \(\mathbf{u}_i\)) can be described as a combination of \(r\) “latent features” (e.g., \(r=3\): % action, % comedy, % drama). It also assumes each movie (vector \(\mathbf{v}_j\)) has a profile on these same features.
  • Your predicted rating for a movie is the dot product of your taste vector and the movie’s feature vector.
  • The model finds the best “taste” vectors \(\mathbf{U}\) and “movie” vectors \(\mathbf{V}\) that explain all the known ratings simultaneously. It’s collaborative because Lee’s ratings help define the features of “Bullet Train” (\(\mathbf{v}_2\)), which in turn helps predict Yang’s rating for that same movie.
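A toy sketch of that prediction rule (the numbers and the \(r=3\) latent features are made up for illustration): each predicted rating is just the dot product \(\mathbf{u}_i^T \mathbf{v}_j\).

```python
import numpy as np

# Hypothetical latent factors (r = 3: e.g. % action, % comedy, % drama)
U = np.array([[0.9, 0.1, 0.0],    # user 0 ("Lee"):  mostly likes action
              [0.1, 0.2, 0.7]])   # user 1 ("Yang"): mostly likes drama
V = np.array([[0.8, 0.2, 0.0],    # movie 0: an action movie
              [0.3, 0.3, 0.4]])   # movie 1: a mixed-genre movie ("Bullet Train" in the slides)

# Every predicted rating at once: the (i, j) entry is u_i · v_j
R_hat = U @ V.T
print(R_hat)

# A single prediction, e.g. user 1's predicted rating of movie 1
print(U[1] @ V[1])
```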

The Hard Problem (and its 2 Flavors)

The second slide (...021222.png) presents the intuitive, but computationally very hard, way to frame the problem.

Detail 1: Noise vs. No Noise

The slide shows \(\mathbf{Y} = \mathbf{M} + \mathbf{E}\). This is critical.

  • \(\mathbf{M}\) is the “true,” “clean,” underlying low-rank matrix of everyone’s “true” preferences.
  • \(\mathbf{E}\) is a matrix of random noise (e.g., your true rating is 4.3, but you entered a 4; or you were in a bad mood and rated a 3).
  • \(\mathbf{Y}\) is the noisy data we actually observe.

Because of this noise, we don’t expect to find a matrix \(\mathbf{N}\) that perfectly matches our data. Instead, we try to find a low-rank \(\mathbf{N}\) that is as close as possible. This leads to the formula: \[\underset{\text{rank}(\mathbf{N}) \le r}{\text{minimize}} \quad \left\| \mathcal{P}_{\mathcal{O}}(\mathbf{Y} - \mathbf{N}) \right\|_{\text{F}}^2\] This says: “Find a matrix \(\mathbf{N}\) (of rank \(r\) or less) that minimizes the sum of squared errors only on the ratings we observed (\(\mathcal{O}\)).”

Detail 2: Why is \(\text{rank}(\mathbf{N}) \le r\) a “Non-convex constraint”?

This is the “difficult to optimize” part. A convex problem is (simplistically) one with a single valley, making it easy to find the single lowest point. A non-convex problem has many local valleys, and an algorithm can get stuck in a “pretty good” valley instead of the “best” one.

The rank constraint is non-convex. For example, the average of two rank-1 matrices is not necessarily a rank-1 matrix (it could be rank-2). This lack of a “smooth valley” property makes the problem NP-hard.

Detail 3: The Number of Parameters: \(r(d_1 + d_2)\)

The slide asks, “how many entries are needed?” The answer is based on the number of unknown parameters.

  • A rank-\(r\) matrix \(\mathbf{M}\) can be factored into \(\mathbf{U}\) (which is \(d_1 \times r\)) and \(\mathbf{V}^T\) (which is \(r \times d_2\)).
  • The number of entries in \(\mathbf{U}\) is \(d_1 \times r\).
  • The number of entries in \(\mathbf{V}\) is \(d_2 \times r\).
  • Total “unknowns” to solve for: \(d_1 r + d_2 r = r(d_1 + d_2)\).
  • This means we must have at least \(r(d_1 + d_2)\) observed ratings to have any hope of uniquely solving for \(\mathbf{U}\) and \(\mathbf{V}\). If our number of observations \(|\mathcal{O}|\) is less than this, the problem is hopelessly underdetermined.

The “Magic” Solution: Convex Relaxation

The final slide (...021225.png) presents the groundbreaking solution from Candès and Recht. This solution cleverly changes the problem to one that is convex and solvable.

Detail 1: The L1-Norm Analogy (This is the most important concept)

This is the key to understanding why this works.

  • In Vectors (Lasso):
    • Hard Problem: Find the sparsest vector \(\beta\) (fewest non-zeros). This is \(L_0\) norm, \(\text{minimize } \|\beta\|_0\). This is non-convex.
    • Easy Problem: Minimize the \(L_1\) norm, \(\text{minimize } \|\beta\|_1 = \sum |\beta_j|\). This is convex, and it’s a “relaxation” that also produces sparse solutions.
  • In Matrices (Matrix Completion):
    • Hard Problem: Find the lowest-rank matrix \(\mathbf{X}\). Rank is the number of non-zero singular values. This is \(\text{minimize } \text{rank}(\mathbf{X})\). This is non-convex.
    • Easy Problem: Minimize the Nuclear Norm, \(\text{minimize } \|\mathbf{X}\|_* = \sum \sigma_i(\mathbf{X})\) (where \(\sigma_i\) are the singular values). This is convex, and it’s the “matrix equivalent” of the \(L_1\) norm. It’s a relaxation that also produces low-rank solutions.
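To make the relaxation concrete, here is a minimal sketch (assuming numpy; this is a simple soft-thresholding heuristic, not the exact algorithm from the Candès and Recht paper): the nuclear norm is the sum of singular values, and repeatedly soft-thresholding those singular values while re-imposing the observed entries pushes the estimate toward a low-rank completion.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 30, 20, 2
M = rng.normal(size=(d1, r)) @ rng.normal(size=(r, d2))   # true low-rank matrix
mask = rng.random((d1, d2)) < 0.5                          # observed entries O

def nuclear_norm(A):
    # Sum of singular values: the convex surrogate for rank
    return np.linalg.svd(A, compute_uv=False).sum()

def svt_step(X, tau):
    # Soft-threshold the singular values: small ones are set exactly to 0 (low rank)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

X = np.zeros_like(M)
for _ in range(200):
    X = svt_step(X, tau=0.5)
    X[mask] = M[mask]          # re-impose the observed entries (noiseless constraint on O)

print("nuclear norm of estimate:", round(nuclear_norm(X), 2))
print("relative error on the unseen entries:",
      round(np.linalg.norm((X - M)[~mask]) / np.linalg.norm(M[~mask]), 3))
```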

Detail 2: Noiseless vs. Noisy (Again)

Notice the constraint in this new problem: \[\text{Minimize } \quad \|\mathbf{X}\|_*\] \[\text{Subject to } \quad X_{ij} = M_{ij}, \quad (i, j) \in \mathcal{O}\]

This formulation is for the noiseless case. It assumes the \(M_{ij}\) we observed are perfectly accurate. It demands that our solution \(\mathbf{X}\) exactly matches the known ratings. This is different from the optimization problem on the previous slide, which just tried to get close to the noisy data \(\mathbf{Y}\).

(In practice, you solve a noisy-aware version that combines both ideas, but the slide shows the original, “exact completion” problem.)

Detail 3: The Guarantee (What the math at the bottom means)

\[\text{If } \mathcal{O} \text{ is randomly sampled and } |\mathcal{O}| \gg r(d_1+d_2)\log(d_1+d_2), \text{... then the solution is unique and } \mathbf{M} \text{...}\]

This is the punchline. The Candès paper proved that if you have enough (but still very few) randomly sampled ratings, solving this easy convex problem (minimizing the nuclear norm) will magically give you the exact, true, low-rank matrix \(\mathbf{M}\).

  • \(|\mathcal{O}| \gg r(d_1+d_2)\): This part makes sense. We need at least as many observations as our \(r(d_1+d_2)\) degrees of freedom.
  • \(\log(d_1+d_2)\): This “log” factor is the “price” we pay for not knowing where the information is. It’s an astonishingly small price.
  • Example: For a 1,000,000 user x 10,000 movie matrix (like Netflix) with \(r=10\), you don’t need \(\approx 10^{10}\) ratings. You need a number closer to \(10 \times (10^6 + 10^4) \times \log(\dots)\), which is dramatically smaller. This is why this method is practical.

统计机器学习Lecture-5

Lecturer: Prof. XIA DONG

1. Resampling

Resampling is a statistical tool for assessing the accuracy of models. Its main goal is to estimate the test error (a model’s performance on new, unseen data), because the training error is overly optimistic due to overfitting.

重采样是一种统计工具,用于评估模型的准确性,其主要目标是估计测试误差(模型在新的、未见过的数据上的表现),因为由于过拟合导致训练误差过于乐观。

Key Concepts

  • Resampling: The process of repeatedly drawing samples from a dataset. The two main types mentioned are Cross-validation (to estimate model test error) and Bootstrap (to quantify the uncertainty of estimates). 从数据集中反复抽取样本的过程。主要提到的两种类型是交叉验证(用于估计模型测试误差)和自举(用于量化估计的不确定性)。
  • Data Splitting (Ideal Scenario): In a “data-rich” situation, you split your data into three parts: 在“数据丰富”的情况下,您可以将数据拆分为三部分:
    1. Training Data: Used to fit and train the parameters of various models.用于拟合和训练各种模型的参数。
    2. Validation Data: Used to assess the trained models, tune hyperparameters (e.g., choose the polynomial degree), and select the best model. This helps prevent overfitting.用于评估已训练的模型、调整超参数(例如,选择多项式的次数)并选择最佳模型。这有助于防止过度拟合。
    3. Test Data: Used only once on the final, selected model to get an unbiased estimate of its real-world performance. 在最终选定的模型上仅使用一次,以获得其实际性能的无偏估计。
  • Validation vs. Test Data: The slides emphasize this difference (Slide 7). The validation set is part of the model-building and selection process. The test set is kept separate and is only used for the final report card after all decisions are made.验证集是模型构建和选择过程的一部分。测试集是独立的,仅在所有决策完成后用于最终报告。

The Validation Set Approach

This is the simplest cross-validation method.这是最简单的交叉验证方法。

  1. Split: The total dataset is randomly divided into two parts: a training set and a validation set (often a 50/50 or 70/30 split).将整个数据集随机分成两部分:训练集验证集(通常为 50/50 或 70/30 的比例)。
  2. Train: Various models are fit only on the training set.各种模型训练集上进行拟合。
  3. Validate: The performance of each trained model is evaluated using the validation set. 使用验证集评估每个训练模型的性能。
  4. Select: The model with the best performance (e.g., the lowest error) on the validation set is chosen as the final model. 选择在验证集上性能最佳(例如,误差最小)的模型作为最终模型。

Important Image: Schematic (Slide 10)

This diagram clearly shows a set of \(n\) observations being randomly split into a training set (blue, with observations 7, 22, 13) and a validation set (beige, with observation 91). The model learns from the blue set and is tested on the beige set. 此图清晰地展示了一组 \(n\) 个观测值被随机分成训练集(蓝色,观测值编号为 7、22、13)和验证集(米色,观测值编号为 91)。模型从蓝色数据集进行学习,并在米色数据集上进行测试。

Example: Auto Data (Formulas & Code)

The slides use the Auto dataset to decide the best polynomial degree to predict mpg from horsepower.

Mathematical Models

The models being compared are polynomials of different degrees. For example:

  • Linear: \(mpg = \beta_0 + \beta_1(horsepower)\)

  • Quadratic: \(mpg = \beta_0 + \beta_1(horsepower) + \beta_2(horsepower)^2\)

  • Cubic: \(mpg = \beta_0 + \beta_1(horsepower) + \beta_2(horsepower)^2 + \beta_3(horsepower)^3\)

  • 线性\(mpg = \beta_0 + \beta_1(马力)\)

  • 二次\(mpg = \beta_0 + \beta_1(马力) + \beta_2(马力)^2\)

  • 三次\(mpg = \beta_0 + \beta_1(马力) + \beta_2(马力)^2 + \beta_3(马力)^3\)

The performance metric used is the Mean Squared Error (MSE) on the validation set: 使用的性能指标是验证集上的均方误差 (MSE): \[MSE_{val} = \frac{1}{n_{val}} \sum_{i \in val} (y_i - \hat{f}(x_i))^2\] where \(n_{val}\) is the number of observations in the validation set, \(y_i\) is the true mpg value, and \(\hat{f}(x_i)\) is the model’s prediction for the \(i\)-th observation in the validation set. 其中 \(n_{val}\) 是验证集中的观测值数量,\(y_i\) 是真实的 mpg 值,\(\hat{f}(x_i)\) 是模型对验证集中第 \(i\) 个观测值的预测。

Important Image: Polynomial Fits (Slide 8) 多项式拟合(幻灯片 8)

This plot is crucial. It shows the Auto data with linear (red), quadratic (green), and cubic (blue) regression lines.

  • The linear fit is clearly poor.
  • The quadratic and cubic fits follow the data’s curve much better.
  • The inset box shows the MSE calculated on the full dataset (this is training MSE):
    • Linear MSE: ~26.42
    • Quadratic MSE: ~21.60
    • Cubic MSE: ~21.51

This suggests a non-linear fit is necessary, but it doesn’t tell us which one will generalize better.

这张图至关重要。它用线性(红色)、二次(绿色)和三次(蓝色)回归线展示了 Auto 数据。

  • 线性拟合明显较差。
  • 二次和三次拟合更能贴合数据曲线。
  • 插图显示了基于完整数据集计算的均方误差(这是训练均方误差):
    • 线性均方误差:~26.42
    • 二次均方误差:~21.60
    • 三次均方误差:~21.51

这表明非线性拟合是必要的,但它并没有告诉我们哪种拟合方式的泛化效果更好。

Code Analysis

The slides show two different approaches in code:

1. Python Code (Slide 9): Model Selection Criteria

  • What it does: This Python code (using pandas and statsmodels) does not implement the validation set approach. Instead, it fits polynomial models (degrees 1 through 5) to the entire dataset.
  • How it works: It calculates statistical criteria like BIC, Mallow’s \(C_p\), and Adjusted \(R^2\). These are mathematical adjustments to the training error that estimate the test error without needing a validation set. 它计算统计标准,例如BICMallow 的 \(C_p\)** 和调整后的 \(R^2\)。这些是对训练误差的数学调整,无需验证集即可估算测试误差。
  • Key line (logic): sm.OLS(y, X).fit() is used to fit the model, and then metrics like model.bic and model.rsquared_adj are extracted.
  • Result: The table shows that the model with [horsepower, horsepower2] (quadratic) has the lowest BIC and \(C_p\) values, suggesting it’s the best model according to these criteria.
  • 结果:表格显示,带有 [马力, 马力2](二次函数)的模型具有最低的 BIC 和 \(C_p\) 值,这表明根据这些标准,它是最佳模型。

2. R Code (Slides 14 & 15): The Validation Set Approach

  • What it does: This R code directly implements the validation set approach described on Slide 13.
  • How it works:
    1. set.seed(...): Sets a random seed to make the split reproducible.
    2. train=sample(392, 196): Randomly selects 196 indices (out of 392) to be the training set.
    3. lm.fit=lm(mpg~poly(horsepower, 2), ..., subset=train): Fits a quadratic model only using the train data.
    4. mean((mpg-predict(lm.fit,Auto))[-train]^2): This is the key calculation.
      • predict(lm.fit, Auto): Predicts mpg for all data.
      • [-train]: Selects only the predictions for the validation set (the data not in train).
      • mean(...): Calculates the MSE on the validation set.
  • Result: The code is run three times with different seeds (1, 2022, 1997).
    • Seed 1: Quadratic MSE (18.71) is lowest.
    • Seed 2022: Quadratic MSE (19.70) is lowest.
    • Seed 1997: Quadratic MSE (19.08) is lowest.
  • Main Takeaway: In all random splits, the quadratic model gives the lowest validation set MSE. This provides evidence that the quadratic model is the best choice for generalizing to new data. The fact that the MSE values change with each seed also highlights a key disadvantage of this simple method: the results can be variable depending on the random split. 主要结论:在所有随机拆分中,二次模型的验证集 MSE 最低。这证明了二次模型是推广到新数据的最佳选择。MSE 值随每个种子变化的事实也凸显了这种简单方法的一个关键缺点:结果可能会因随机拆分而变化。

2. The Validation Set Approach 验证集方法

This method is a simple way to estimate a model’s performance on new, unseen data (the “test error”). 这种方法是一种简单的方法,用于评估模型在新的、未见过的数据(“测试误差”)上的性能。 The core idea is to randomly split your available data into two parts: 其核心思想是将可用数据随机拆分为两部分:

  1. Training Set: Used to fit (or “train”) your model. 用于拟合(或“训练”)模型。
  2. Validation Set (or Test Set): Used to evaluate the trained model’s performance. You calculate the error (like Mean Squared Error) on this set. 用于评估训练后的模型性能。计算此集合的误差(例如均方误差)。

Python Code Explained (Slide 1)

The first slide shows a Python example using the Auto dataset to predict mpg from horsepower.

  1. Setup & Data Loading:
    • import statements load libraries like pandas (for data), sklearn.model_selection.train_test_split (the key function for this method), and sklearn.linear_model.LinearRegression.
    • Auto = pd.read_csv(...) loads the data.
    • X = Auto['horsepower'].values and y = Auto['mpg'].values select the variables of interest.
  2. The Split:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=007)
    • This is the most important line for this method. It splits the data X and y into training and testing (validation) sets.
    • train_size=0.5 means 50% of the data is for training and 50% is for validation.
    • random_state=007 is meant to make the split “random” but “reproducible” (using the same seed will always produce the same split). Note, however, that an integer literal with a leading zero such as 007 is a syntax error in Python 3, so the seed should be written as 7.
  3. Model Fitting & Evaluation:
    • The code fits three different polynomial models, but it only uses the training data (X_train, y_train) to do so.
    • Linear (Degree 1): A simple LinearRegression.
    • Quadratic (Degree 2): Uses PolynomialFeatures(2) to create \(x\) and \(x^2\) terms, then fits a linear model to them.
    • Cubic (Degree 3): Uses PolynomialFeatures(3) to create \(x\), \(x^2\), and \(x^3\) terms.
    • It then calculates the Mean Squared Error (MSE) for all three models using the test data (X_test, y_test).
  4. Results (from the text on the slide):
    • Linear MSE: \(\approx 23.3\)
    • Quadratic MSE: \(\approx 19.4\)
    • Cubic MSE: \(\approx 19.4\)
    • Conclusion: The quadratic model gives a significantly lower error than the linear model. The cubic model does not offer any real improvement over the quadratic one.
    结果(来自幻灯片上的文字):
    • 线性均方误差:约 23.3
    • 二次均方误差:约 19.4
    • 三次均方误差:约 19.4
    • 结论:二次模型的误差显著低于线性模型。三次模型与二次模型相比并没有任何实质性的改进。
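A minimal reconstruction of that workflow (a sketch under the assumption that a cleaned Auto.csv with numeric horsepower and mpg columns is available; exact MSE values will differ from the slide):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

Auto = pd.read_csv("Auto.csv")                     # file path is an assumption
X = Auto[["horsepower"]].values
y = Auto["mpg"].values

# 50/50 split; the seed is written 7 (a literal like 007 is a SyntaxError in Python 3)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=7)

for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)                    # fit on the training half only
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree}: validation MSE = {mse:.2f}")
```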

Key Images: The Problem with a Single Split

The most important images are on slide 9 (labeled “Figure” and “Page 20”).

  • Plot on the Left (Single Split): This graph shows the validation MSE for polynomial degrees 1 through 10, based on the single random split from the R code (slide 2). Just like the Python example, it shows that the MSE drops sharply from degree 1 to 2, and then stays relatively low. Based on this one chart, you might pick degree 2 (quadratic) as the best model.

此图显示了多项式次数为 1 至 10 的验证均方误差,基于 R 代码(幻灯片 2)中的单次随机分割。与 Python 示例一样,它显示 MSE 从 1 阶到 2 阶急剧下降,然后保持在相对较低的水平。基于这张图,您可能会选择 2 阶(二次)作为最佳模型。

  • Plot on the Right (Ten Splits): This is the most critical plot. It shows the results of repeating the entire process 10 times, each with a new random split (from R code on slide 3).
    • You can see 10 different error curves.
    • While they all agree that degree 1 (linear) is bad, they do not agree on the best model. Some curves suggest degree 2 is best, others suggest 3, 4, or even 6.
    这是最关键的图表**。它显示了重复整个过程 10 次的结果,每次都使用新的随机分割(来自幻灯片 3 上的 R 代码)。
    • 您可以看到 10 条不同的误差曲线。
    • 虽然他们都认为 1 阶(线性)模型不好,但他们对最佳模型的看法并不一致。有些曲线表明 2 阶最佳,而另一些则表明 3 阶、4 阶甚至 6 阶最佳。

Summary of Drawbacks (Slides 7, 8, 9, 23, 25)

The slides repeatedly emphasize the two main drawbacks of this simple validation set approach:

  1. High Variability 高变异性: The estimated test MSE can be highly variable, depending on which observations happen to land in the training set versus the validation set. The plot with 10 curves (slide 9, right) proves this perfectly. 估计的测试 MSE 可能高度变异,具体取决于哪些观测值恰好落在训练集和验证集中。包含 10 条曲线的图表(幻灯片 9,右侧)完美地证明了这一点。

  2. Overestimation of Test Error 高估测试误差:

    • The model is only trained on a subset (e.g., 50%) of the available data. The validation data is “wasted” and not used for model building.
    • Statistical methods tend to perform worse when trained on fewer observations.
    • Therefore, the model trained on just the training set is likely worse than a model trained on the entire dataset.
    • This “worse” model will have a higher error rate on the validation set. This means the validation set MSE tends to overestimate the true test error you would get from a model trained on all your data.
    • 该模型仅基于可用数据的子集(例如 50%)进行训练。验证数据被“浪费”了,并未用于模型构建。
    • 统计方法在较少的观测值上进行训练时往往表现较差。
    • 因此,仅基于训练集训练的模型可能比基于整个数据集训练的模型更差
    • 这个“更差”的模型在验证集上的错误率会更高。这意味着验证集的 MSE 倾向于高估基于所有数据训练的模型的真实测试误差。

3. Cross-Validation: The Solution 交叉验证:解决方案

The slides introduce Cross-Validation (CV) as the method to overcome these drawbacks. The core idea is to use all data points for both training and validation, just at different times. 交叉验证 (CV),以此来克服这些缺点。其核心思想是将所有数据点用于训练和验证,只是使用的时间不同。

Leave-One-Out Cross-Validation (LOOCV) 留一法交叉验证 (LOOCV)

This is the first type of CV introduced (slide 10, page 26). For a dataset with \(n\) data points:

  1. Hold out the 1st data point (this is your validation set). 保留第一个数据点(这是你的验证集)。
  2. Train the model on the other \(n-1\) data points. 使用其他 \(n-1\) 个数据点训练模型。
  3. Calculate the error (e.g., \(\text{MSE}_1\)) using only that 1st held-out point. 仅使用第一个保留点计算误差(例如,\(\text{MSE}_1\))。
  4. Repeat this \(n\) times, holding out the 2nd point, then the 3rd, and so on, until every point has been used as the validation set exactly once. 重复此操作 \(n\) 次,保留第二个点,然后是第三个点,依此类推,直到每个点都作为验证集使用一次。
  5. Your final test error estimate is the average of all \(n\) errors. 最终的测试误差估计是所有 \(n\) 个误差的平均值

Key Formula (from Slide 10)

The formula for the \(n\)-fold LOOCV error estimate is: \(n\) 倍 LOOCV 误差估计公式为: \[\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{MSE}_i\]

Where:

  • \(n\) is the total number of data points. 是数据点的总数。
  • \(\text{MSE}_i\) is the Mean Squared Error calculated on the \(i\)-th data point when it was held out. 是保留第 \(i\) 个数据点时计算的均方误差。

3. What is LOOCV (Leave-One-Out Cross-Validation)

Leave-One-Out Cross Validation (LOOCV) is a method for estimating the test error of a model. For a dataset with \(n\) observations, you: 留一交叉验证 (LOOCV) 是一种估算模型测试误差的方法。对于包含 \(n\) 个观测值的数据集,您需要:

  1. Fit the model \(n\) times. 对模型进行 \(n\) 次拟合
  2. For each fit \(i\) (from \(1\) to \(n\)), you train the model on all data points except for observation \(i\). 对于每个拟合 \(i\) 个样本(从 \(1\)\(n\)),您需要在除观测值 \(i\) 之外的所有数据点上训练模型。
  3. You then use this trained model to make a prediction for the single observation \(i\) that was left out. 然后,您需要使用这个训练好的模型对被遗漏的单个观测值 \(i\) 进行预测。
  4. The final LOOCV error is the average of the \(n\) prediction errors (typically the Mean Squared Error, or MSE). 最终的 LOOCV 误差是 \(n\) 个预测误差的平均值(通常为均方误差,简称 MSE)。

This process is shown visually in the slide titled “LOOCV” (slide 27), which is a key image for understanding the concept.

Pros & Cons (from slide 28):

  • Pro: It has low bias because the training set (\(n-1\) samples) is almost identical to the full dataset. 由于训练集(\(n-1\) 个样本)与完整数据集几乎完全相同,因此偏差较低。
  • Pro: It produces a stable, non-random error estimate (unlike \(k\)-fold CV, which depends on the random fold assignments). 它能产生稳定的非随机误差估计(不同于 k 倍交叉验证,后者依赖于随机折叠分配)。
  • Con: It can be extremely computationally expensive, as the model must be refit \(n\) times. 由于模型必须重新拟合 \(n\) 次,计算成本极其高昂。
  • Con: The \(n\) error estimates can be highly correlated, which can sometimes lead to high variance in the final \(CV\) estimate. 这 \(n\) 个误差估计可能高度相关,有时会导致最终 \(CV\) 估计值出现较大方差。

Key Mathematical Formulas

The main challenge of LOOCV (being computationally expensive) has a very efficient solution for linear models. LOOCV 的主要挑战(计算成本高昂)对于线性模型来说,有一个非常有效的解决方案。

1. The Standard (Slow) Formula

As defined on slide 33, the LOOCV estimate of the MSE is:

\[CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i^{(i)})^2\]

  • \(y_i\) is the true value of the \(i\)-th observation. 是第 \(i\) 个观测值的真实值。
  • \(\hat{y}_i^{(i)}\) is the predicted value for \(y_i\) from a model trained on all data except observation \(i\). 是使用除观测值 \(i\) 之外的所有数据训练的模型对 \(y_i\) 的预测值。

Calculating \(\hat{y}_i^{(i)}\) requires refitting the model \(n\) times. 计算 \(\hat{y}_i^{(i)}\) 需要重新拟合模型 \(n\) 次。

2. The Shortcut (Fast) Formula

Slide 34 provides a much simpler formula that only requires fitting the model once on the entire dataset: 只需对整个数据集进行一次模型拟合:

\[CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2\]

  • \(\hat{y}_i\) is the prediction for \(y_i\) from the model trained on all \(n\) data points. 是使用所有 \(n\) 个数据点训练的模型对 \(y_i\) 的预测值。
  • \(h_i\) is the leverage of the \(i\)-th observation. 是第 \(i\) 个观测值的杠杆率

3. What is Leverage (\(h_i\))?

Slide 35 defines leverage:

  • Hat Matrix (\(\mathbf{H}\)): In a linear model, the fitted values \(\hat{\mathbf{y}}\) are related to the true values \(\mathbf{y}\) by the hat matrix: \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\).

  • Formula: The hat matrix is defined as \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\).

  • Leverage (\(h_i\)): The leverage for the \(i\)-th observation is simply the \(i\)-th diagonal element of the hat matrix, \(h_{ii}\) (often just written as \(h_i\)).

    • \(h_i = \mathbf{x}_i^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_i\)
  • Meaning: Leverage measures how “influential” an observation’s \(x_i\) value is in determining its own predicted value \(\hat{y}_i\). A high leverage score means that point has a lot of influence on the model’s fit.

  • 帽子矩阵 (\(\mathbf{H}\)):在线性模型中,拟合值 \(\hat{\mathbf{y}}\) 与真实值 \(\mathbf{y}\) 之间存在帽子矩阵关系:\(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\)

  • 公式:帽子矩阵定义为 \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)

  • 杠杆率 (\(h_i\)):\(i\) 个观测值的杠杆率就是帽子矩阵的第 \(i\) 个对角线元素 \(h_{ii}\)(通常写为 \(h_i\))。

  • \(h_i = \mathbf{x}_i^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_i\)

  • 含义:杠杆率衡量观测值的 \(x_i\) 值对其自身预测值 \(\hat{y}_i\) 的“影响力”。杠杆率得分高意味着该点对模型拟合有很大影响。

This shortcut formula is extremely important because it makes LOOCV as fast to compute as a single model fit.这个快捷公式非常重要,因为它使得 LOOCV 的计算速度与单个模型拟合一样快。

Python Code Explained (Slide 29)

This slide shows how to use LOOCV to select the best polynomial degree for predicting mpg from horsepower.

  1. Imports: It imports standard libraries (pandas, matplotlib) and key modules from sklearn:
    • LinearRegression: The model to be fit.
    • PolynomialFeatures: A tool to create polynomial terms (e.g., \(x, x^2, x^3\)).
    • LeaveOneOut: The LOOCV cross-validation strategy object.
    • cross_val_score: A function that automatically runs a cross-validation test.
  2. Setup:
    • It loads the Auto.csv data.
    • It defines \(X\) (horsepower) and \(y\) (mpg).
    • It creates a LeaveOneOut object: loo = LeaveOneOut().
  3. Looping through Degrees:
    • The code loops degree from 1 to 10.
    • make_pipeline: For each degree, it creates a model using make_pipeline. This pipeline is a crucial concept:
      • It first runs PolynomialFeatures(degree) to transform \(X\) into \([X, X^2, ..., X^{\text{degree}}]\).
      • It then feeds those features into LinearRegression() to fit the model.
    • cross_val_score: This is the most important line.
      • scores = cross_val_score(model, X, y, cv=loo, scoring='neg_mean_squared_error')
      • This function automatically does the entire LOOCV process. It takes the model (the pipeline), the data \(X\) and \(y\), and the CV strategy (cv=loo).
      • Note that sklearn’s cross_val_score treats the pipeline as a black box, so with cv=loo it really does refit the model \(n\) times (once per left-out observation); the “fast” leverage shortcut from the later slides is a way to compute LOOCV by hand for a linear model, not something this generic routine exploits.
      • It uses scoring='neg_mean_squared_error' because the scoring function assumes “higher is better.” By calculating the negative MSE, the best model will have the highest score (i.e., closest to 0).
    • Storing Results: It calculates the mean of the scores (which is the \(CV_{(n)}\)) and stores it.
  4. Visualization:
    • The code then plots the final cv_errors (after flipping the sign back to positive) against the degree.
    • The resulting plot (also on slide 32) shows the test MSE, allowing you to visually pick the best degree (where the error is minimized).
    • 生成的图(也在幻灯片 32 上)显示了测试 MSE,让您可以直观地选择最佳 degree(误差最小化的 degree)。
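Putting those steps together, a minimal sketch of the described loop (same Auto.csv assumption as before; this refits the pipeline once per left-out observation, so it takes a moment to run):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

Auto = pd.read_csv("Auto.csv")                     # file path is an assumption
X = Auto[["horsepower"]].values
y = Auto["mpg"].values

loo = LeaveOneOut()
cv_errors = []
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # cross_val_score runs the full LOOCV loop for this pipeline
    scores = cross_val_score(model, X, y, cv=loo, scoring="neg_mean_squared_error")
    cv_errors.append(-scores.mean())               # flip the sign back to a positive MSE

best_degree = int(np.argmin(cv_errors)) + 1
print("LOOCV MSE by degree:", [round(e, 2) for e in cv_errors])
print("best degree:", best_degree)
```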

Important Images

  • Slide 27 (.../103628.png): This is the best conceptual image. It visually demonstrates how LOOCV splits the data \(n\) times, with each observation getting one turn as the validation set. 这是最佳概念图。它直观地展示了 LOOCV 如何将数据拆分 \(n\) 次,每个观测值轮流充当一次验证集。

  • Slide 34 (.../103711.png): This slide presents the most important formula: the “Easy formula” or shortcut, \(CV_{(n)} = \frac{1}{n} \sum (\frac{y_i - \hat{y}_i}{1 - h_i})^2\). This is the key takeaway for computing LOOCV efficiently in linear models. 这张幻灯片展示了最重要的公式:“简单公式”即快捷公式,\(CV_{(n)} = \frac{1}{n} \sum (\frac{y_i - \hat{y}_i}{1 - h_i})^2\)。这是在线性模型中高效计算 LOOCV 的关键要点。

  • Slide 32 (.../103701.jpg): This is the key results image. It contrasts the LOOCV error curve (left) with the 10-fold CV error curves (right). It clearly shows that LOOCV produces a single, stable error curve, while 10-fold CV results vary slightly each time it’s run due to the random data splits. 这是关键结果图。它将 LOOCV 误差曲线(左)与 10 倍 CV 误差曲线(右)进行了对比。它清楚地表明,LOOCV 产生了单一、稳定的误差曲线,而由于数据分割的随机性,10 倍 CV 的结果每次运行时都会略有不同。

4. Cross-Validation Overview

These slides explain Cross-Validation (CV), a method used to estimate the test error of a model, helping to select the best level of flexibility (e.g., the best polynomial degree). It’s an improvement over a single validation set because it uses all the data for both training and validation at different times. 这是一种用于估算模型测试误差的方法,有助于选择最佳的灵活性(例如,最佳多项式次数)。它比单个验证集有所改进,因为它在不同时间使用所有数据进行训练和验证。

The two main types discussed are K-fold CV and Leave-One-Out CV (LOOCV). 主要讨论的两种类型是K 折交叉验证留一法交叉验证 (LOOCV)

K-Fold Cross-Validation K 折交叉验证

This is the most common method.

The Process

As shown in the slides, the K-fold CV process is:

  1. Divide the dataset randomly into \(K\) non-overlapping groups (or “folds”), usually of equal size. Common choices are \(K=5\) or \(K=10\). 将数据集随机划分为 \(K\) 个不重叠的组(或“折”),通常大小相等。常见的选择是 \(K=5\) 或 \(K=10\)。
  2. Iterate \(K\) times: In each iteration \(i\), use the \(i\)-th fold as the validation set and all other \(K-1\) folds combined as the training set. 迭代 \(K\) 次:在每次迭代 \(i\) 中,使用第 \(i\) 个样本集作为验证集,并将所有其他 \(K-1\) 个样本集合并作为训练集。
  3. Calculate the Mean Squared Error (\(MSE_i\)) on the validation fold. 计算验证集的均方误差 (\(MSE_i\))。
  4. Average all \(K\) error estimates to get the final CV score. 平均所有 \(K\) 个误差估计值,得到最终的 CV 分数。

Key Formula

The final K-fold CV error estimate is the average of the errors from each fold: 最终的 K 折 CV 误差估计值是每个样本集误差的平均值: \[CV_{(K)} = \frac{1}{K} \sum_{i=1}^{K} MSE_i\]

Important Image: The Concept

The diagram in slide 104145.png is the most important for understanding the concept of K-fold CV. It shows a dataset split into 5 folds (\(K=5\)). The process is repeated 5 times, with a different fold (in beige) held out as the validation set in each run, while the rest (in blue) is used for training. 它展示了一个被分成 5 个样本集 (\(K=5\)) 的数据集。该过程重复 5 次,每次运行都会保留一个不同的折叠(米色)作为验证集,其余折叠(蓝色)用于训练。

Leave-One-Out Cross-Validation (LOOCV)

LOOCV is just a special case of K-fold CV where \(K = n\) (the total number of observations). LOOCV 只是 K 折交叉验证的一个特例,其中 \(K = n\)(观测值总数)。

  • You create \(n\) “folds,” each containing just one data point. 创建 \(n\) 个“折叠”,每个折叠仅包含一个数据点。
  • You train the model \(n\) times, each time leaving out a single different observation and then calculating the error for that one point. 对模型进行 \(n\) 次训练,每次都省略一个不同的观测值,然后计算该点的误差。

Key Formulas

  1. Standard Definition: The LOOCV error is the average of the \(n\) squared errors: \[CV = \frac{1}{N} \sum_{i=1}^{N} e_{[i]}^2\] where \(e_{[i]} = y_i - \hat{y}_{[i]}\) is the prediction error for the \(i\)-th observation, calculated from a model that was trained on all data except the \(i\)-th observation. This looks computationally expensive. LOOCV 误差是 \(n\) 个平方误差的平均值: \[CV = \frac{1}{N} \sum_{i=1}^{N} e_{[i]}^2\] 其中 \(e_{[i]} = y_i - \hat{y}_{[i]}\) 是第 \(i\) 个观测值的预测误差,该误差由一个使用除第 \(i\) 个观测值以外的所有数据训练的模型计算得出。这看起来计算成本很高。

  2. Fast Computation (for Linear Regression): A key point from the slides is that for linear regression, you don’t need to re-fit the model \(N\) times. You can fit the model once on all \(N\) data points and use the following shortcut: \[CV = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{e_i}{1 - h_i} \right)^2\]

    • \(e_i = y_i - \hat{y}_i\) is the standard residual (from the model fit on all data).
    • \(h_i\) is the leverage statistic for the \(i\)-th observation (the \(i\)-th diagonal entry of the “hat matrix” \(H\)). This makes LOOCV as fast to compute as a single model fit. 对于线性回归,您无需重新拟合模型 \(N\) 次。您可以对所有 \(N\) 个数据点一次性地拟合模型,并使用以下快捷方式: \[CV = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{e_i}{1 - h_i} \right)^2\]
    • \(e_i = y_i - \hat{y}_i\) 是标准残差(来自对所有数据的模型拟合)。
    • \(h_i\) 是第 \(i\) 个观测值(“帽子矩阵”\(H\) 的第 \(i\) 个对角线元素)的杠杆统计量。 这使得 LOOCV 的计算速度与单次模型拟合一样快。

Python Code & Results

The Python code in slide 104156.jpg shows how to use 10-fold CV to find the best polynomial degree for a model.

Code Understanding (Slide 104156.jpg)

Here’s a breakdown of the key sklearn parts:

  1. from sklearn.pipeline import make_pipeline: This is used to chain steps. The pipeline make_pipeline(PolynomialFeatures(degree), LinearRegression()) first creates polynomial features (like \(x\), \(x^2\), \(x^3\)) and then fits a linear model to them.
  2. from sklearn.model_selection import KFold: This object is used to define the \(K\)-fold split strategy. kf = KFold(n_splits=10, shuffle=True, random_state=1) creates a 10-fold splitter that shuffles the data first.
  3. from sklearn.model_selection import cross_val_score: This is the most important function.
    • scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
    • This one function does all the work: it takes the model (the pipeline), the data X and y, and the CV splitter kf. It automatically trains and evaluates the model 10 times and returns an array of 10 scores (one for each fold).
    • scoring='neg_mean_squared_error' is used because cross_val_score expects a higher score to be better. Since we want to minimize MSE, we use negative MSE.
  4. avg_mse = -scores.mean(): The code averages the 10 scores and flips the sign back to positive to get the final CV (MSE) estimate for that polynomial degree.

Important Image: The Results

The plots in slides 104156.jpg (Python) and 104224.png (R) show the key result.

  • X-axis: Degree of Polynomial (model complexity).多项式的次数(模型复杂度)。
  • Y-axis: Estimated Test Error (CV Error / MSE).估计测试误差(CV 误差 / MSE)。
  • Interpretation: The plot shows a clear “U” shape. The error is high for degree 1 (a simple line), drops to its minimum at degree 2 (a quadratic \(ax^2 + bx + c\)), and then starts to rise again for higher degrees. This rise indicates overfitting—the more complex models are fitting the training data’s noise, leading to worse performance on unseen validation data. 该图呈现出清晰的“U”形。1 次(一条简单的直线)时误差较大,在2 次(二次 \(ax^2 + bx + c\))时降至最小,然后随着次数的增加,误差再次上升。这种上升表明过拟合——更复杂的模型会拟合训练数据的噪声,导致在未见过的验证数据上的性能下降。
  • Conclusion: The 10-fold CV analysis suggests that a quadratic model (degree 2) is the best choice, as it provides the lowest estimated test error. 10 倍 CV 分析表明二次模型(2 次)是最佳选择,因为它提供了最低的估计测试误差。

Let’s dive into the details of that proof.

Detailed Summary: The “Fast Computation of LOOCV” Proof

The most mathematically dense and important part of your slides is the proof (spanning slides 104126.jpg, 104132.png, and 104136.png) that LOOCV, which seems computationally very expensive, can be calculated quickly for linear regression. LOOCV 虽然计算成本看似非常高,但对于线性回归来说,它可以快速计算。

The Goal

The goal is to prove that the LOOCV statistic, which is defined as: \[CV = \frac{1}{N} \sum_{i=1}^{N} e_{[i]}^2 \quad \text{where } e_{[i]} = y_i - \hat{y}_{[i]}\] (Here, \(\hat{y}_{[i]}\) is the prediction for \(y_i\) from a model trained on all data except point \(i\)).(其中,\(\hat{y}_{[i]}\) 表示基于除点 \(i\) 之外的所有数据训练的模型对 \(y_i\) 的预测)。

…can be computed without re-fitting the model \(N\) times, using this “fast” formula: 无需重新拟合模型 \(N\) 次即可计算,使用以下“快速”公式: \[CV = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{e_i}{1 - h_i} \right)^2\] (Here, \(e_i\) is the standard residual and \(h_i\) is the leverage, both from a single model fit on all data).

The entire proof boils down to showing one identity: \(e_{[i]} = e_i / (1 - h_i)\).

Key Definitions (The Matrix Algebra Setup) (矩阵代数设置)

  • Model 模型: \(\mathbf{Y} = \mathbf{X}\beta + \mathbf{e}\)
  • Full Data Estimate 完整数据估计 (\(\hat{\beta}\)): \(\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\)
  • Hat Matrix 帽子矩阵 (\(\mathbf{H}\)): \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)
  • Full Data Residual 完整数据残差 (\(e_i\)): \(e_i = y_i - \hat{y}_i = y_i - \mathbf{x}_i^T\hat{\beta}\)
  • Leverage (\(h_i\)) 杠杆 (\(h_i\)): The \(i\)-th diagonal element of \(\mathbf{H}\). \(h_i = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\)
  • Leave-One-Out Estimate (\(\hat{\beta}_{[i]}\)): \(\hat{\beta}_{[i]} = (\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1}\mathbf{X}_{[i]}^T\mathbf{Y}_{[i]}\)
    • \(\mathbf{X}_{[i]}\) and \(\mathbf{Y}_{[i]}\) are the data with the \(i\)-th row removed.
  • LOOCV Residual LOOCV 残差 (\(e_{[i]}\)): \(e_{[i]} = y_i - \mathbf{x}_i^T\hat{\beta}_{[i]}\)

The Proof Step-by-Step

Here is the logic from your slides, broken down:

Step 1: Relating the Matrices (Slide 104132.png)

The proof’s “trick” is to relate the “full data” matrix \((\mathbf{X}^T\mathbf{X})\) to the “leave-one-out” matrix \((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})\). 证明的“技巧”是将“全数据”矩阵 \((\mathbf{X}^T\mathbf{X})\) 与“留一法”矩阵 \((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})\) 关联起来。

  • The full sum-of-squares matrix is just the leave-one-out matrix plus the one observation’s contribution: 完整的平方和矩阵就是留一法矩阵加上一个观测值的贡献:

    \[\mathbf{X}^T\mathbf{X} = \mathbf{X}_{[i]}^T\mathbf{X}_{[i]} + \mathbf{x}_i\mathbf{x}_i^T\]

  • This means: \(\mathbf{X}_{[i]}^T\mathbf{X}_{[i]} = \mathbf{X}^T\mathbf{X} - \mathbf{x}_i\mathbf{x}_i^T\)

Step 2: The Key Matrix Trick (Slide 104132.png)

We need the inverse \((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1}\) to calculate \(\hat{\beta}_{[i]}\). Finding this inverse directly is hard. Instead, we use the Sherman-Morrison-Woodbury formula, which tells us how to find the inverse of a matrix that’s been “updated” (in this case, by subtracting \(\mathbf{x}_i\mathbf{x}_i^T\)).

我们需要逆\((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1}\) 来计算 \(\hat{\beta}_{[i]}\)。直接求这个逆矩阵很困难。因此,我们使用 Sherman-Morrison-Woodbury 公式,它告诉我们如何求一个“更新”后的矩阵的逆矩阵(在本例中,是通过减去 \(\mathbf{x}_i\mathbf{x}_i^T\) 来实现的)。

The slide applies this formula to get: \[(\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_i}\]

  • This is the most complex step, but it’s a standard matrix identity. It’s crucial because it expresses the “leave-one-out” inverse in terms of the “full data” inverse \((\mathbf{X}^T\mathbf{X})^{-1}\), which we already have.

Step 3: Finding \(\hat{\beta}_{[i]}\) (Slide 104136.png)

Now we can write a new formula for \(\hat{\beta}_{[i]}\) by substituting the result from Step 2. We also note that \(\mathbf{X}_{[i]}^T\mathbf{Y}_{[i]} = \mathbf{X}^T\mathbf{Y} - \mathbf{x}_i y_i\).

\[\hat{\beta}_{[i]} = (\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1} (\mathbf{X}_{[i]}^T\mathbf{Y}_{[i]})\] \[\hat{\beta}_{[i]} = \left[ (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_i} \right] (\mathbf{X}^T\mathbf{Y} - \mathbf{x}_i y_i)\]

The slide then shows the algebra to simplify this big expression. When you expand and simplify everything, you get a much cleaner result:

\[\hat{\beta}_{[i]} = \hat{\beta} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \frac{e_i}{1 - h_i}\]

  • This is a beautiful result! It says the LOOCV coefficient vector is just the full coefficient vector minus a small adjustment term related to the \(i\)-th observation’s residual (\(e_i\)) and leverage (\(h_i\)).
  • 这是一个非常棒的结果!它表明 LOOCV 系数向量就是完整的系数向量减去一个与第 \(i\) 个观测值的残差 (\(e_i\)) 和杠杆率 (\(h_i\)) 相关的小调整项。

Step 4: Finding \(e_{[i]}\) (Slide 104136.png)

This is the final step. We use the definition of \(e_{[i]}\) and the result from Step 3. 这是最后一步。我们使用 \(e_{[i]}\) 的定义和步骤 3 的结果。

  • Start with the definition: \(e_{[i]} = y_i - \mathbf{x}_i^T\hat{\beta}_{[i]}\)
  • Substitute \(\hat{\beta}_{[i]}\): \(e_{[i]} = y_i - \mathbf{x}_i^T \left[ \hat{\beta} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \frac{e_i}{1 - h_i} \right]\)
  • Distribute \(\mathbf{x}_i^T\): \(e_{[i]} = (y_i - \mathbf{x}_i^T\hat{\beta}) + \left( \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \right) \frac{e_i}{1 - h_i}\)
  • Recognize the terms!
    • The first term is just the standard residual: \((y_i - \mathbf{x}_i^T\hat{\beta}) = e_i\)
    • The second term in parentheses is the definition of leverage: \((\mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i) = h_i\)
  • Substitute back: \(e_{[i]} = e_i + h_i \left( \frac{e_i}{1 - h_i} \right)\)
  • Get a common denominator: \(e_{[i]} = \frac{e_i(1 - h_i) + h_i e_i}{1 - h_i}\)
  • Simplify the numerator: \(e_{[i]} = \frac{e_i - e_ih_i + e_ih_i}{1 - h_i}\)

This gives the final, simple relationship: \[e_{[i]} = \frac{e_i}{1 - h_i}\]

Conclusion

By proving this identity, the slides show that to get all \(N\) of the “leave-one-out” errors, you only need to:

  1. Fit one linear regression model on all the data.
  2. Calculate the standard residuals \(e_i\) and the leverage values \(h_i\) for all \(N\) points.
  3. Apply the formula \(e_i / (1 - h_i)\) for each point.

This turns a procedure that looked like it would take \(N\) times the work into a procedure that takes only 1 model fit. This is why LOOCV is a practical and efficient method for linear regression.

通过证明这个恒等式,幻灯片显示,要获得所有 \(N\) 个“留一法”误差,您只需:

  1. 对所有数据拟合一个线性回归模型。
  2. 计算所有 \(N\) 个点的标准残差 \(e_i\) 和杠杆值 \(h_i\)。
  3. 对每个点应用公式 \(e_i / (1 - h_i)\)。

这将一个看似需要 \(N\) 倍工作量的过程变成了只需 1 次模型拟合的过程。这就是为什么 LOOCV 是一种实用且高效的线性回归方法。

5. Main Goal of Cross-Validation 交叉验证的主要目标

The central purpose of cross-validation is to estimate the true test error of a machine learning model. This is crucial for:

  1. Model Assessment: Evaluating how well a model will perform on new, unseen data. 评估模型在新的、未见过的数据上的表现。
  2. Model Selection: Choosing the best level of model flexibility (e.g., the degree of a polynomial or the value of \(K\) in KNN) to avoid overfitting. 选择最佳的模型灵活性水平(例如,多项式的次数或 KNN 中的 \(K\) 值),以避免过拟合

As the slides show, training error (the error on the data the model was trained on) consistently decreases as model complexity increases. However, the test error follows a U-shape: it first decreases (as the model learns the true signal) and then increases (as the model starts fitting the noise, or “overfitting”). CV helps find the minimum point of this U-shaped test error curve. 训练误差(模型训练数据的误差)随着模型复杂度的增加而持续下降。然而,测试误差呈现 U 形:它先下降(当模型学习真实信号时),然后上升(当模型开始拟合噪声,即“过拟合”时)。交叉验证有助于找到这条 U 形测试误差曲线的最小值。

Important Images 🖼️

The most important image is on Slide 61.

These two plots perfectly illustrate the concept:

  • Blue Line (Training Error): Always goes down.
  • Brown Line (True Test Error): Forms a “U” shape. This is what we want to find the minimum of, but it’s unknown in practice.
  • Black Line (10-fold CV Error): This is our estimate of the test error. Notice how closely it tracks the brown line. The minimum of the CV curve (marked with an ‘x’) is very close to the minimum of the true test error.

This shows why CV works: it provides a reliable estimate to guide our choice of model (e.g., polynomial degree 3-4 for logistic regression, or \(K \approx 10\) for KNN).

  • 蓝线(训练误差):始终向下。
  • 棕线(真实测试误差):呈“U”形。这正是我们想要找到的最小值,但在实际应用中无法确定。
  • 黑线(10 倍 CV 误差):这是我们对测试误差的估计。注意它与棕线的吻合程度。CV 曲线的最小值(标有“x”)非常接近真实测试误差的最小值。

这说明了 CV 的原因:它提供了可靠的估计值来指导我们选择模型(例如,逻辑回归的多项式次数为 3-4,KNN 的 \(K \approx 10\))。

Key Formulas for Classification

For regression, we often use Mean Squared Error (MSE). For classification, the slides introduce the classification error rate.

For Leave-One-Out Cross-Validation (LOOCV), the error for a single observation \(i\) is: \[Err_i = I(y_i \neq \hat{y}_i^{(i)})\]

  • \(y_i\) is the true label for observation \(i\).
  • \(\hat{y}_i^{(i)}\) is the model’s prediction for observation \(i\) when the model was trained on all other observations except \(i\).
  • \(I(\dots)\) is an indicator function: it’s \(1\) if the condition is true (prediction is wrong) and \(0\) if false (prediction is correct).

The total CV error is simply the average of these individual errors, which is the overall fraction of incorrect classifications: \[CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i\] The slides also show examples using Log Loss (Slide 64), which is another common and sensitive metric for classification. The logistic regression model itself is defined by: \[P(Y=1 | X) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 X_1 - \beta_2 X_2 - \dots)}\]

对于回归,我们通常使用均方误差 (MSE)。对于分类,幻灯片介绍了分类错误率

对于留一交叉验证 (LOOCV),单个观测值 \(i\) 的误差为: \[Err_i = I(y_i \neq \hat{y}_i^{(i)})\]

  • \(y_i\) 是观测值 \(i\) 的真实标签。
  • \(\hat{y}_i^{(i)}\) 是模型在除 \(i\) 之外的所有其他观测值上进行训练后,对观测值 \(i\) 的预测。
  • \(I(\dots)\) 是一个指示函数:如果条件为真(预测错误),则为 \(1\);如果条件为假(预测正确),则为 \(0\)。

CV误差只是这些单个误差的平均值,也就是错误分类的总体比例: \[CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i\] 幻灯片还展示了使用对数损失(幻灯片64)的示例,这是另一个常见且敏感的分类指标。逻辑回归模型本身的定义如下: \[P(Y=1 | X) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 X_1 - \beta_2 X_2 - \dots)}\]

Python Code Explained 🐍

The slides provide two key Python examples. Both manually implement K-fold cross-validation to show how it works.

1. KNN Regression (Slide 52) KNN 回归

  • Goal: Find the best n_neighbors (K) for a KNeighborsRegressor. 为 KNeighborsRegressor 找到最佳的 n_neighbors (K)。
  • Logic:
    1. It creates a KFold object to split the data into 10 folds (n_splits=10). 创建一个 KFold 对象,将数据拆分成 10 个折叠(n_splits=10)。
    2. It has an outer loop that iterates through different values of \(K\) (from 1 to 10). 它有一个 外循环,迭代不同的 \(K\) 值(从 1 到 10)。
    3. It has an inner loop that iterates through the 10 folds (for train_index, test_index in kfold.split(X)). 它有一个 内循环,迭代这 10 个折叠(for train_index, test_index in kfold.split(X))。
    4. Inside the inner loop:
      • It trains a KNeighborsRegressor on the 9 training folds (X_train, y_train).
      • It makes predictions on the 1 held-out test fold (X_test).
      • It calculates the mean squared error for that fold and stores it.
      • 在 9 个训练折叠(X_train, y_train)上训练 KNeighborsRegressor
      • 它对第一个保留的测试集 (X_test) 进行预测。
      • 它计算该集的均方误差并存储。
    5. After the inner loop: It averages the 10 error scores (one from each fold) to get the final CV error for that specific \(K\). 对 10 个误差分数(每个集一个)求平均值,得到该特定 \(K\) 的最终 CV 误差。
    6. The final plot shows this CV error vs. \(K\), allowing us to pick the \(K\) with the lowest error (a minimal sketch of this loop follows the list). 最终图表显示了 CV 误差与 \(K\) 的关系,使我们能够选择误差最小的 \(K\)(下面给出该循环的简短示例)。
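
A minimal sketch of the loop just described (not the slides’ exact code); the synthetic regression data are an illustrative assumption.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))                 # hypothetical data
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
neighbors = range(1, 11)                              # outer loop: candidate K values
cv_errors = []
for n_k in neighbors:
    fold_mse = []
    for train_index, test_index in kfold.split(X):    # inner loop: 10 folds
        model = KNeighborsRegressor(n_neighbors=n_k)
        model.fit(X[train_index], y[train_index])     # train on the 9 training folds
        pred = model.predict(X[test_index])           # predict on the held-out fold
        fold_mse.append(mean_squared_error(y[test_index], pred))
    cv_errors.append(np.mean(fold_mse))               # average over the 10 folds

best_k = list(neighbors)[int(np.argmin(cv_errors))]
print("CV errors:", np.round(cv_errors, 4), "best K:", best_k)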

2. Logistic Regression with Polynomials (Slide 64) 使用多项式的逻辑回归

  • Goal: Find the best degree for PolynomialFeatures used with LogisticRegression.
  • Logic: This is very similar to the KNN example but uses a different model and error metric.
    1. It sets up a 10-fold split (kf = KFold(...)).
    2. An outer loop iterates through the degree \(d\) (from 1 to 10).
    3. An inner loop iterates through the 10 folds.
    4. Inside the inner loop:
      • It creates PolynomialFeatures of degree \(d\).
      • It transforms the 9 training folds (X_train) into polynomial features (X_train_poly).
      • It trains a LogisticRegression model on X_train_poly.
      • It transforms the 1 held-out test fold (X_test) using the same polynomial transformer.
      • It calculates the log_loss on the test fold.
    5. After the inner loop: It averages the 10 log_loss scores to get the final CV error for that degree.
    6. The plot shows CV error vs. degree, and the minimum is clearly at degree = 3 (a minimal sketch of this loop follows the list).
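
A minimal sketch of the same pattern with PolynomialFeatures, LogisticRegression, and log loss (not the slides’ exact code); the synthetic two-feature dataset is an illustrative assumption.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)) > 1).astype(int)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
cv_errors = []
for degree in range(1, 11):                                # outer loop: polynomial degree
    fold_losses = []
    for train_index, test_index in kf.split(X):            # inner loop: 10 folds
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(X[train_index])  # fit the transformer on the training folds
        X_test_poly = poly.transform(X[test_index])        # reuse the same transformer on the held-out fold
        clf = LogisticRegression(max_iter=1000).fit(X_train_poly, y[train_index])
        fold_losses.append(log_loss(y[test_index], clf.predict_proba(X_test_poly)))
    cv_errors.append(np.mean(fold_losses))

print("best degree:", int(np.argmin(cv_errors)) + 1)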

The Bias-Variance Trade-off in CV CV 中的偏差-方差权衡

This is a key theoretical point from Slide 54 that answers the questions on Slide 65. It compares LOOCV (\(K=n\)) with K-fold CV (\(K=5\) or \(10\)). 这是幻灯片 54中的一个关键理论点,它回答了幻灯片 65 中的问题。它比较了 LOOCV(K=n)和 K 倍 CV(K=5 或 10)。

  • LOOCV (K=n):
    • Bias: Very low. The model is trained on \(n-1\) samples, which is almost the full dataset. The resulting error estimate is nearly unbiased for the true test error. 该模型基于 \(n-1\) 个样本进行训练,这几乎是整个数据集。得到的误差估计对于真实测试误差几乎没有偏差。
    • Variance: Very high. You are training \(n\) models that are almost identical to each other (they only differ by one data point). Averaging these highly correlated error estimates doesn’t reduce the variance much, making the CV estimate unstable. 非常高。您正在训练 \(n\) 个彼此几乎相同的模型(它们仅相差一个数据点)。对这些高度相关的误差估计求平均值并不能显著降低方差,从而导致 CV 估计不稳定。
  • K-Fold CV (K=5 or 10):
    • Bias: Slightly higher than LOOCV. The models are trained on, for example, 90% of the data. Since they are trained on less data, they might perform slightly worse. This means K-fold CV tends to slightly overestimate the true test error (Slide 66).
    • Variance: Much lower than LOOCV. The 10 models are trained on more different “chunks” of data (they overlap less), so their error estimates are less correlated. Averaging less-correlated estimates significantly reduces the overall variance.

Conclusion: We generally prefer 10-fold CV over LOOCV. It gives a much more stable (low-variance) estimate of the test error, even if it’s slightly more biased (overestimating the error, which is a safe/conservative estimate). 我们通常更喜欢10 倍交叉验证而不是 LOOCV。它能给出更稳定(低方差)的测试误差估计值,即使它的偏差略大(高估了误差,这是一个安全/保守的估计值)。

The Core Problem & Scenarios (Slides 47-51)

These slides use three scenarios to show why we need cross-validation (CV). The goal is to pick the right level of model flexibility (e.g., the degree of a polynomial or the complexity of a spline) to minimize the Test MSE (Mean Squared Error), which we can’t see in real life. 这些幻灯片使用了三种场景来说明为什么我们需要交叉验证 (CV)。目标是选择合适的模型灵活性(例如,多项式的次数或样条函数的复杂度),以最小化测试均方误差(Mean Squared Error),而这在现实生活中是无法观察到的。

  • The Curves (Slide 47): This slide is central.

    • True Test MSE (Blue) 真实测试均方误差(蓝色): This is the real error on new data. It has a U-shape. Error is high for simple models (high bias), drops as the model fits the data, and rises again for overly complex models (high variance, or overfitting). 这是新数据的真实误差。它呈 U 形。对于简单模型(高偏差),误差较高;随着模型拟合数据的深入,误差会下降;对于过于复杂的模型(高方差或过拟合),误差会再次上升。

    • LOOCV (Black Dashed) & 10-Fold CV (Orange) LOOCV(黑色虚线)和 10 倍 CV(橙色): These are our estimates of the true test MSE. Notice how closely they track the blue curve. The ‘x’ marks the minimum of the CV curve, which is our best guess for the model with the minimum test MSE. 这些是我们对真实测试 MSE 的估计。请注意它们与蓝色曲线的吻合程度。“x”标记 CV 曲线的最小值,这是我们对具有最小测试 MSE 的模型的最佳猜测

  • Scenario 1 (Slide 48): The true relationship is non-linear. The right-hand plot shows that the test MSE (red curve) is high for the simple linear model (blue square), but lower for the more flexible smoothing splines (teal squares). CV helps us find the “sweet spot.” 真实的关系是非线性的。右侧图表显示,对于简单的线性模型(蓝色方块),测试 MSE(红色曲线)较高,而对于更灵活的平滑样条函数(蓝绿色方块),测试 MSE 较低。CV 帮助我们找到“最佳点”。

  • Scenario 2 (Slide 49): The true relationship is linear. Here, the test MSE (red curve) is lowest for the simplest model (the linear one, blue square). CV correctly identifies this, and its error estimate (blue square) is lowest for that model. 真实的关系是线性的。在这里,对于最简单的模型(线性模型,蓝色方块),测试 MSE(红色曲线)最低。CV 正确地识别了这一点,并且其误差估计(蓝色方块)是该模型中最低的。

  • Scenario 3 (Slide 50): The true relationship is highly non-linear. The linear model (orange) is a very poor fit. The test MSE (red curve) is minimized by the most flexible model (teal square). CV again finds this. 真实的关系是高度非线性的。线性模型(橙色)拟合度很差。测试 MSE(红色曲线)被最灵活的模型(蓝绿色方块)最小化。CV 再次发现了这一点。

  • Key Takeaway (Slide 51): We use CV to find the tuning parameter (like polynomial degree) that minimizes the test error. We care less about the actual value of the CV error and more about where its minimum is. 我们使用 CV 来找到最小化测试误差的调整参数(例如多项式次数)。我们不太关心 CV 误差的实际值,而更关心它的最小值

CV for Classification (Slides 55-61)

This section shifts from regression (predicting a number, using MSE) to classification (predicting a category, like “blue” or “orange”). 本节从回归(使用 MSE 预测数字)转向分类(预测类别,例如“蓝色”或“橙色”)。

  • New Error Metric (Slide 55): We can’t use MSE. A natural choice is the classification error rate. 我们不能使用 MSE。一个自然的选择是分类错误率
    • \(Err_i = I(y_i \neq \hat{y}_i^{(i)})\)
    • This is an indicator function: it is 1 if the prediction for the \(i\)-th data point (when trained without it) is wrong, and 0 if it’s correct. 如果对第 \(i\) 个数据点的预测(在没有它的情况下训练时)错误,则为 1;如果正确,则为 0
    • The final CV error is just the average of these 0s and 1s, giving the total fraction of misclassified points: \(CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i\) 最终的 CV 误差就是这些 0 和 1 的平均值,即错误分类点的总比例:\(CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i\)
  • The Example (Slides 56-61):
    • Slides 56-58: We are shown a “true” (but unknown) non-linear boundary (purple dashed line) separating two classes. We then try to estimate this boundary using logistic regression with different polynomial degrees (degree 1, 2, 3, 4). 我们看到了一条“真实”(但未知)的非线性边界(紫色虚线),它将两个类别分开。然后,我们尝试使用不同次数(1、2、3、4 次)的逻辑回归来估计这条边界。
    • Slides 59-60: This is a crucial point. In this simulated example, we do know the true test error rates. The true errors are [0.201, 0.197, 0.160, 0.162]. The lowest error is for the 3rd-degree polynomial. But in a real-world problem, we can never know these true errors. 这一点至关重要。在这个模拟示例中,我们确实知道真实的测试错误率。真实误差为 [0.201, 0.197, 0.160, 0.162]。最小误差出现在三次多项式中。但在实际问题中,我们永远无法知道这些真实误差
    • Slide 61 (The Solution): This is the most important image. It shows how CV solves the problem from slide 60.展示了 CV 如何解决幻灯片 60 中的问题。
      • Brown Curve (Test Error): This is the true test error (from slide 59). We can’t see this in practice. Its minimum is at degree 3. 这是真实的测试误差(来自幻灯片 59)。我们在实践中看不到它。它的最小值在次数为 3 处。

      • Black Curve (10-fold CV Error): This is what we can calculate. It’s our estimate of the test error. Crucially, its minimum is also at degree 3.

      • 黑色曲线(10 倍 CV 误差):这是我们可以计算出来的。这是我们对测试误差的估计。至关重要的是,它的最小值也在次数为 3 处。

      • This proves that CV successfully found the best model (degree 3) without ever seeing the true test error. The same logic is shown for the KNN classifier on the right.

      • 这证明 CV 成功地找到了最佳模型(3 次多项式),而从未看到真实的测试误差。右侧的 KNN 分类器也显示了相同的逻辑。

Python Code Explained (Slides 52, 63, 64)

The slides show how to manually implement K-fold CV. This is great for understanding, even though libraries like GridSearchCV can do this automatically.

  • KNN Regression (Slide 52):
    1. kfold = KFold(n_splits=10, ...): Creates an object that knows how to split the data into 10 folds.
    2. for n_k in neighbors:: This is the outer loop to test different \(K\) values (e.g., \(K\)=1, 2, 3…).
    3. for train_index, test_index in kfold.split(X):: This is the inner loop. For a single \(K\), it loops 10 times.
    4. Inside the inner loop:
      • It splits the data into a 9-fold training set (X_train) and a 1-fold test set (X_test).
      • It trains a KNeighborsRegressor on X_train.
      • It makes predictions on X_test and calculates the error (mean_squared_error).
    5. cv_errors.append(np.mean(mse_errors_k)): After the inner loop finishes 10 runs, it averages the 10 error scores for that \(K\) and stores it.
    6. The final plot shows cv_errors vs. neighbors, letting you pick the \(K\) with the lowest average error.
  • Logistic Regression Classification (Slides 63-64):
    • This code is almost identical, but with three key differences:
      1. The model is LogisticRegression.
      2. It uses PolynomialFeatures to create new features (\(X^2, X^3,\) etc.) inside the loop.
      3. The error metric is log_loss (a common, more sensitive metric than the simple 0/1 error rate).
    • The plot on slide 64 shows the 10-fold CV error (using Log Loss) vs. the Degree of the Polynomial. The minimum is clearly at Degree = 3, matching the finding from slide 61.

Answering the Key Questions (Slides 54 & 65)

Slide 65 asks two critical questions, which are answered directly by the concepts on Slide 54 (Bias and variance trade-off).

Q1: How does K affect the bias and variance of the CV error?

This refers to \(K\) in K-fold CV (not to be confused with \(K\) in KNN). K 如何影响 CV 误差的偏差和方差?

  • Bias:
    • LOOCV (K = n): This has very low bias. The model is trained on \(n-1\) samples, which is almost the full dataset. So, the error estimate \(CV_{(n)}\) is an almost-unbiased estimate of the true test error. 它的偏差非常低。该模型基于 \(n-1\) 个样本进行训练,这几乎是整个数据集。因此,误差估计 \(CV_{(n)}\) 是对真实测试误差的几乎无偏估计。

    • K-Fold (K < n, e.g., K=10): This has slightly higher bias. The models are trained on, for example, 90% of the data. Because they are trained on less data, they might perform slightly worse than a model trained on 100% of the data. This “pessimism” is the source of the bias. 偏差略高。例如,这些模型是基于 90% 的数据进行训练的。由于它们基于较少的数据进行训练,因此它们的性能可能会比基于 100% 数据进行训练的模型略差。这种“悲观”正是偏差的根源。

  • Variance:
    • LOOCV (K = n): This has very high variance. You are training \(n\) models that are almost identical (they only differ by one data point). Averaging \(n\) highly-correlated error estimates doesn’t reduce the variance much. This makes the final \(CV_{(n)}\) estimate unstable. 这种模型的方差非常高。您正在训练 \(n\) 个几乎相同的模型(它们只有一个数据点不同)。对 \(n\) 个高度相关的误差估计取平均值并不能显著降低方差。这使得最终的 \(CV_{(n)}\) 估计值不稳定。

    • K-Fold (K < n, e.g., K=10): This has much lower variance. The 10 models are trained on more different “chunks” of data (they overlap less). Their error estimates are less correlated, and averaging 10 less-correlated numbers gives a much more stable (low-variance) final estimate. 这种模型的方差要低得多。这 10 个模型基于更多不同的数据“块”进行训练(它们重叠较少)。它们的误差估计值相关性较低,对 10 个相关性较低的数取平均值可以得到更稳定(低方差)的最终估计值。

Conclusion (The Trade-off): We prefer K-fold CV (K=5 or 10) over LOOCV. It gives a much more stable (low-variance) estimate, and we are willing to accept a tiny increase in bias to get it. 我们更倾向于 K 倍交叉验证(K=5 或 10),而不是留一交叉验证 (LOOCV)。它能给出更稳定(低方差)的估计值,并且我们愿意接受偏差的轻微增加来获得它。

Q2: Does Cross Validation over-estimate or under-estimate the true test error?

交叉验证会高估还是低估真实测试误差?

Based on the bias discussion above:

Cross-validation (especially K-fold) generally over-estimates the true test error. 交叉验证(尤其是 K 倍交叉验证)通常会高估真实测试误差

Reasoning:

  1. The “true test error” is the error of a model trained on the entire dataset (\(n\) samples).
  2. K-fold CV trains its models on subsets of the data (e.g., \(n \times (K-1)/K\) samples).
  3. Since these models are trained on less data, they are (on average) slightly worse than the final model trained on all the data.
  4. Because the CV models are slightly worse, their error rates will be slightly higher.
  5. Therefore, the final CV error score is a slightly “pessimistic” or high estimate. This is considered a good thing, as it’s a conservative estimate of how our model will perform.

理由:

  1. “真实测试误差”是指在整个数据集(\(n\) 个样本)上训练的模型的误差。
  2. K 折交叉验证 (K-fold CV) 在数据子集上训练其模型(例如,\(n \times (K-1)/K\) 个样本)。
  3. 由于这些模型基于较少的数据进行训练,因此它们(平均而言)比基于所有数据训练的最终模型略差。
  4. 由于 CV 模型略差,其错误率会略高。
  5. 因此,最终的 CV 错误率是一个略微“悲观”或偏高的估计。这被认为是一件好事,因为它是对模型性能的保守估计。

6. Summary of Bootstrap

Bootstrap is a resampling technique used to estimate the uncertainty (like standard error or confidence intervals) of a statistic. Its key idea is to treat your original data sample as a proxy for the true population. It then simulates the process of drawing new samples by instead sampling with replacement from your original sample. Bootstrap 是一种重采样技术,用于估计统计数据的不确定性(例如标准误差或置信区间)。其核心思想是将原始数据样本视为真实总体的替代样本。然后,它通过从原始样本中进行有放回的抽样来模拟抽取新样本的过程。

The Problem

You have a single data sample (e.g., \(n=100\) people) and you calculate a statistic, like the sample mean (\(\bar{x}\)) or a regression coefficient (\(\hat{\beta}\)). You want to know how accurate this statistic is. How much would it vary if you could repeat your experiment many times? This variation is measured by the standard error (SE). 您有一个数据样本(例如,\(n=100\) 人),并计算一个统计数据,例如样本均值 (\(\bar{x}\)) 或回归系数 (\(\hat{\beta}\))。您想知道这个统计数据的准确度。如果可以多次重复实验,它会有多少变化?这种变化可以用标准误差 (SE) 来衡量。

The Bootstrap Solution

Since you can’t re-run the whole experiment, you simulate it using the one sample you have. 由于您无法重新运行整个实验,因此您可以使用现有的一个样本进行“模拟”。

The Process:

  1. Original Sample (\(Z\)) 原始样本 (\(Z\)): You have your one dataset with \(n\) observations.
  2. Bootstrap Sample (\(Z^{*1}\)) Bootstrap 样本 (\(Z^{*1}\)): Create a new dataset of size \(n\) by randomly pulling observations from your original sample with replacement. (This means some original observations will be picked multiple times, and some not at all.)
  3. Calculate Statistic (\(\hat{\theta}^{*1}\)) 计算统计量 (\(\hat{\theta}^{*1}\)): Calculate your statistic of interest (e.g., the mean, \(\hat{\alpha}\), regression coefficients) on this new bootstrap sample.
  4. Repeat 重复: Repeat steps 2 and 3 a large number of times (\(B\), e.g., \(B=1000\)). This gives you \(B\) bootstrap statistics: \(\hat{\theta}^{*1}, \hat{\theta}^{*2}, ..., \hat{\theta}^{*B}\).
  5. Analyze the Bootstrap Distribution 分析自举分布: This collection of \(B\) statistics is your “bootstrap distribution.”
    • Standard Error 标准误差: The standard deviation of this bootstrap distribution is your estimate of the standard error of your original statistic.
    • Confidence Interval 置信区间: A 95% confidence interval can be found by taking the 2.5th and 97.5th percentiles of this bootstrap distribution.

Why use it? It’s powerful because it doesn’t rely on strong theoretical assumptions (like data being normally distributed). It can be applied to almost any statistic, even very complex ones (like the prediction from a KNN model), for which a simple mathematical formula for standard error doesn’t exist. 它非常强大,因为它不依赖于严格的理论假设(例如数据服从正态分布)。它几乎可以应用于任何统计数据,即使是非常复杂的统计数据(例如 KNN 模型的预测),因为这些统计数据没有简单的标准误差数学公式。

Mathematical Understanding

The core idea is to use the empirical distribution (your sample) as an estimate for the true population distribution. 其核心思想是使用经验分布(你的样本)来估计真实的总体分布

Example: Estimating \(\alpha\)

Your slides provide an example of finding the \(\alpha\) that minimizes the variance of a portfolio, \(var(\alpha X + (1-\alpha)Y)\). 用于计算使投资组合方差最小化的 \(\alpha\),即 \(var(\alpha X + (1-\alpha)Y)\)

  1. True Population Parameter (\(\alpha\)) 真实总体参数 (\(\alpha\)): The true \(\alpha\) is a function of the population variances and covariance: 真实 \(\alpha\)总体方差和协方差的函数: \[\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}\] We can never know this value exactly unless we know the entire population. 除非我们了解整个总体,否则我们永远无法准确知道这个值。

  2. Sample Statistic (\(\hat{\alpha}\)) 样本统计量 (\(\hat{\alpha}\)): We estimate \(\alpha\) using our sample, creating the statistic \(\hat{\alpha}\) by plugging in our sample variances and covariance: 我们使用样本估计 \(\alpha\),通过代入样本方差和协方差来创建统计量 \(\hat{\alpha}\)\[\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}\] This \(\hat{\alpha}\) is just one number from our single sample. How confident are we in it? We need its standard error, \(SE(\hat{\alpha})\). 这个 \(\hat{\alpha}\) 只是我们单个样本中的一个数字。我们对它的置信度有多高?我们需要它的标准误差,\(SE(\hat{\alpha})\)

  3. Bootstrap Statistic (\(\hat{\alpha}^*\)) 自举统计量 (\(\hat{\alpha}^*\)): We apply the bootstrap process:

    • Create a bootstrap sample (by resampling with replacement). 创建一个自举样本(通过放回重采样)。
    • Calculate \(\hat{\alpha}^*\) using the sample (co)variances of this new bootstrap sample. 使用这个新自举样本的样本(协)方差计算 \(\hat{\alpha}^*\)
    • Repeat \(B\) times to get \(B\) values: \(\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, ..., \hat{\alpha}^{*B}\). 重复 \(B\) 次,得到 \(B\) 个值:\(\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, ..., \hat{\alpha}^{*B}\)
  4. Estimating the Standard Error 估算标准误差: The standard error of our original estimate \(\hat{\alpha}\) is estimated by the standard deviation of all our bootstrap estimates: 我们原始估计值 \(\hat{\alpha}\) 的标准误差是通过所有自举估计值的标准差来“估算”的: \[SE_{boot}(\hat{\alpha}) = \sqrt{\frac{1}{B-1} \sum_{j=1}^{B} (\hat{\alpha}^{*j} - \bar{\alpha}^*)^2}\] where \(\bar{\alpha}^*\) is the average of all \(B\) bootstrap estimates. \(\bar{\alpha}^*\) 是所有 \(B\) 个自举估计值的平均值。

The slides (p. 73, 77-78) show this visually. The “sampling from population” histogram (left) is the true sampling distribution, which we can only create in a simulation. The “Bootstrap” histogram (right) is the bootstrap distribution created from one sample. They look very similar, which shows the method works. “从总体抽样”直方图(左图)是真实的抽样分布,我们只能在模拟中创建它。“自举”直方图(右图)是从一个样本创建的自举分布。它们看起来非常相似,这表明该方法有效。

Code Analysis

R: \(\alpha\) Example (Slides 75 & 77)

  • Slide 75 (The R code): This is a SIMULATION, not Bootstrap.
    • for(i in 1:m){...}: This loop runs m=1000 times.
    • returns <- rmvnorm(...): Inside the loop, it draws a brand new sample from the true population every time.
    • alpha[i] <- ...: It calculates \(\hat{\alpha}\) for each new sample.
    • Purpose: This code shows the true sampling distribution of \(\hat{\alpha}\) (the “Histogram of alpha”). You can only do this if you know the true population, as in a simulation.
  • Slide 77 (The R code): This IS Bootstrap.
    • returns <- rmvnorm(...): Outside the loop, this is done only once to get one original sample.
    • for(i in 1:B){...}: This is the bootstrap loop.
    • sample(1:nrow(returns), n, replace = T): This is the key line. It randomly selects row numbers with replacement from the single returns dataset.
    • returns_boot <- returns[sample(...), ]: This creates the bootstrap sample.
    • alpha_bootstrap[i] <- ...: It calculates \(\hat{\alpha}^*\) on the returns_boot sample.
    • Purpose: This code generates the bootstrap distribution (the “Bootstrap” histogram on slide 78) to estimate the true sampling distribution. (A Python sketch contrasting the simulation and the bootstrap follows this list.)
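
The R code itself is not reproduced here, but the contrast between the two slides can be sketched in Python; the population mean and covariance below are made-up values, not the slides’ parameters.

# (1) Simulation: the true sampling distribution of alpha-hat (only possible when the population is known).
# (2) Bootstrap: the bootstrap distribution built from a single observed sample.
import numpy as np

rng = np.random.default_rng(0)
mean = [0.0, 0.0]
cov = [[1.0, 0.5], [0.5, 1.25]]                       # hypothetical population covariance
n, m, B = 100, 1000, 1000

def alpha_hat(returns):
    # alpha = (var(Y) - cov(X,Y)) / (var(X) + var(Y) - 2*cov(X,Y)), using sample estimates
    S = np.cov(returns, rowvar=False)
    return (S[1, 1] - S[0, 1]) / (S[0, 0] + S[1, 1] - 2 * S[0, 1])

# (1) Simulation: draw a brand-new sample from the population each time
alpha_sim = [alpha_hat(rng.multivariate_normal(mean, cov, size=n)) for _ in range(m)]

# (2) Bootstrap: one original sample, then resample its rows with replacement
returns = rng.multivariate_normal(mean, cov, size=n)
alpha_boot = []
for _ in range(B):
    idx = rng.choice(n, size=n, replace=True)
    alpha_boot.append(alpha_hat(returns[idx]))

print("SD from simulation:", np.std(alpha_sim, ddof=1))
print("SE from bootstrap :", np.std(alpha_boot, ddof=1))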

R: Linear Regression Example (Slides 79 & 81)

  • Slide 79:
    • boot.fn <- function(data, index){ ... }: Defines a function that the boot package needs. It takes data and an index vector.
    • lm(mpg~horsepower, data=data, subset=index): This is the core. It fits a linear model only on the data points specified by the index. The boot function will automatically supply this index as a resampled-with-replacement vector.
    • boot(Auto, boot.fn, R=1000): This runs the bootstrap. It calls boot.fn 1000 times, each time with a new resampled index, and collects the coefficients.
  • Slide 81:
    • summary(lm(...)): Shows the standard output. The “Std. Error” column (e.g., 0.860, 0.006) is calculated using mathematical theory.
    • boot.res: Shows the bootstrap output. The “std. error” column (e.g., 0.841, 0.007) is the standard deviation of the 1000 bootstrap estimates.
    • Main Point: The standard errors from the bootstrap are very close to the theoretical ones. This confirms the uncertainty. If the model assumptions were violated, the bootstrap SE would be more trustworthy.
    • The histograms show the bootstrap distributions for the intercept (t1*) and the slope (t2*). The arrows show the 95% percentile confidence interval. (A Python sketch of the same resampling idea follows this list.)
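
A Python analogue of the boot.fn / boot() workflow above, sketched under the assumption that a DataFrame named Auto with numeric columns mpg and horsepower is available (the name and columns follow the R example).

import numpy as np
from sklearn.linear_model import LinearRegression

def boot_fn(df, index):
    # Fit mpg ~ horsepower on the rows given by `index` and return (intercept, slope).
    sub = df.iloc[index]
    lr = LinearRegression().fit(sub[['horsepower']], sub['mpg'])
    return lr.intercept_, lr.coef_[0]

def boot_coefs(df, R=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(df)
    # Each call resamples the row indices with replacement, mirroring boot() in R.
    coefs = np.array([boot_fn(df, rng.choice(n, size=n, replace=True)) for _ in range(R)])
    return coefs.std(axis=0, ddof=1)          # bootstrap SEs of intercept and slope

# Usage (assuming Auto is already loaded as a DataFrame):
# se_intercept, se_slope = boot_coefs(Auto, R=1000)
# print(se_intercept, se_slope)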

Python: KNN Regression Example (Slide 80)

This shows how to get a confidence interval for a single prediction.

  • for i in range(n_bootstraps):: The bootstrap loop.
  • indices = np.random.choice(train_samples.shape[0], train_samples.shape[0], replace=True): This is the key line in Python (like sample in R). It gets a new set of indices with replacement.
  • X_boot, y_boot = ...: Creates the bootstrap sample.
  • model.fit(X_boot, y_boot): A new KNN model is trained on this bootstrap sample.
  • bootstrap_preds.append(model.predict(predict_point)): The model (trained on \(Z^{*i}\)) makes a prediction for the same fixed point. This is repeated 1000 times.
  • Result: You get a distribution of predictions for that one point. The 2.5th and 97.5th percentiles of this distribution give you a 95% confidence interval for that specific prediction. 你会得到该点的预测分布。该分布的 2.5 和 97.5 百分位数为该特定预测提供了 95% 的置信区间。

Python: KNN on Auto data (Slide 82)

  • BE CAREFUL: This slide does NOT show Bootstrap. It shows K-Fold Cross-Validation (CV).
  • Purpose: The goal here is not to find uncertainty. The goal is to find the best hyperparameter (the best value for \(k\), the number of neighbors).
  • Method:
    • kf = KFold(n_splits=10): Splits the data into 10 chunks (“folds”).
    • for train_index, test_index in kf.split(X):: It loops 10 times. Each time, it trains on 9 chunks and tests on 1 chunk.
  • Key Difference for Exam:
    • Bootstrap: Samples with replacement to estimate uncertainty/standard error.
    • Cross-Validation: Splits data without replacement into \(K\) folds to estimate model performance/prediction error and tune hyperparameters.
    • 自举法:使用有放回的样本来估计不确定性/标准误差
    • 交叉验证:将数据无放回地分成 \(K\) 份,以估计模型性能/预测误差并调整超参数。

7. The mathematical theory of Bootstrap and the extension to Cross-Validation (CV).

1. Code Analysis: Bootstrap for a KNN Prediction (Slide 85)

This Python code shows a different use of bootstrap: finding the confidence interval for a single prediction, not for a model coefficient. A self-contained sketch follows the walkthrough below.

  • Goal: To estimate the uncertainty of a KNN model’s prediction for a specific new data point (predict_point).
  • Process:
    1. Train Full Model: A KNN model (knn) is first trained on the entire dataset. It makes one prediction (knpred) for predict_point. This is our \(\hat{f}(x_0)\).
    2. Bootstrap Loop (for i in range(n_bootstraps)):
      • indices = np.random.choice(...): This is the core bootstrap step. It creates a new list of indices by sampling with replacement from the original data.
      • X_boot, y_boot = ...: This creates the new bootstrap dataset (\(Z^{*i}\)).
      • km.fit(X_boot, y_boot): A new KNN model (km) is trained only on this bootstrap sample.
      • bootstrap_preds.append(km.predict(predict_point)): This newly trained model makes a prediction for the same predict_point. This value is \(\hat{f}^{*i}(x_0)\).
    3. Analyze Distribution: After 1000 loops, bootstrap_preds contains 1000 different predictions for the same point.
    4. Confidence Interval:
      • np.percentile(bootstrap_preds, [2.5, 97.5]): This finds the 2.5th and 97.5th percentiles of the 1000 bootstrap predictions.
      • The resulting [lower_bound, upper_bound] (e.g., [13.70, 15.70]) forms the 95% confidence interval for the prediction.
  • Histogram Plot: The plot on the right visually confirms this. It shows the distribution of the 1000 bootstrap predictions, with the 95% confidence interval marked by the red dashed lines.
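
Pulling the steps together, here is a self-contained sketch of the procedure; the training data, the KNN setting, and predict_point are illustrative assumptions rather than the slides’ values.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = X[:, 0] ** 2 + 15 * np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)
predict_point = np.array([[5.0]])                    # the single point we want a CI for

n_bootstraps = 1000
bootstrap_preds = []
for i in range(n_bootstraps):
    indices = rng.choice(X.shape[0], X.shape[0], replace=True)  # resample rows with replacement
    X_boot, y_boot = X[indices], y[indices]                     # bootstrap dataset Z^{*i}
    km = KNeighborsRegressor(n_neighbors=10).fit(X_boot, y_boot)
    bootstrap_preds.append(km.predict(predict_point)[0])        # f-hat^{*i}(x_0)

lower, upper = np.percentile(bootstrap_preds, [2.5, 97.5])
print(f"95% bootstrap CI for the prediction at x_0 = 5: [{lower:.2f}, {upper:.2f}]")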

2. Mathematical Understanding: Why Does Bootstrap Work? (Slides 87-88)

This is the theoretical justification for the entire method. It’s based on an analogy. 这是整个方法的理论依据。它基于一个类比。

The “True” World (Slide 87, Top)

  • Population: There is a true, unknown population distribution \(F\). 存在一个真实的、未知的总体分布 \(F\)

  • Parameter: We want to know a true parameter, \(\theta\), which is a function of \(F\) (e.g., the true population mean). 我们想知道一个真实的参数 \(\theta\),它是 \(F\) 的函数(例如,真实的总体均值)。

  • Sample: We get one sample \(X_1, ..., X_n\) from \(F\). 我们从 \(F\) 中获取一个样本 \(X_1, ..., X_n\)

  • Statistic: We calculate our best estimate \(\hat{\theta}\) from our sample. (e.g., the sample mean \(\bar{x}\)). \(\hat{\theta}\) is our proxy for \(\theta\). 我们从样本中计算出最佳估计值 \(\hat{\theta}\)。(例如,样本均值 \(\bar{x}\))。\(\hat{\theta}\)\(\theta\) 的替代值。

  • The Problem: We want to know the accuracy of \(\hat{\theta}\). How much would \(\hat{\theta}\) vary if we could draw many samples? We want the sampling distribution of \(\hat{\theta}\) around \(\theta\), specifically the distribution of the error: \((\hat{\theta} - \theta)\). 我们想知道 \(\hat{\theta}\) 的准确率。如果我们可以抽取多个样本,\(\hat{\theta}\) 会有多少变化?我们想要 \(\hat{\theta}\) 围绕 \(\theta\)抽样分布,具体来说是误差的分布:\((\hat{\theta} - \theta)\)

  • CLT: The Central Limit Theorem states that \(\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{\text{dist}} N(0, Var_F(\theta))\).

  • 中心极限定理\(\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{\text{dist}} N(0, Var_F(\theta))\)

  • The Catch: This is UNKNOWN because we don’t know \(F\).这是未知的,因为我们不知道 \(F\)

The “Bootstrap” World (Slide 87, Bottom)

  • Population: We pretend our original sample is the population. We call its distribution the “empirical distribution,” \(\hat{F}_n\). 我们假设原始样本就是总体。我们称其分布为“经验分布”,即 \(\hat{F}_n\)
  • Parameter: In this new world, the “true” parameter is our original statistic, \(\hat{\theta}\) (which is a function of \(\hat{F}_n\)). 在这个新世界中,“真实”参数是我们原始的统计量 \(\hat{\theta}\)(它是 \(\hat{F}_n\) 的函数)。
  • Sample: We draw many bootstrap samples \(X_1^*, ..., X_n^*\) from \(\hat{F}_n\) (i.e., sampling with replacement from our original sample). 我们从 \(\hat{F}_n\) 中抽取许多自举样本 \(X_1^*, ..., X_n^*\)(即从原始样本中进行有放回抽样)。
  • Statistic: From each bootstrap sample, we calculate a bootstrap statistic, \(\hat{\theta}^*\). 从每个自举样本中,我们计算一个 自举统计量,即 \(\hat{\theta}^*\)
  • The Solution: We can now empirically find the distribution of \(\hat{\theta}^*\) around \(\hat{\theta}\). We look at the distribution of the bootstrap error: \((\hat{\theta}^* - \hat{\theta})\). 我们现在可以 凭经验 找到 \(\hat{\theta}^*\) 围绕 \(\hat{\theta}\) 的分布。我们来看看自举误差的分布:\((\hat{\theta}^* - \hat{\theta})\)
  • CLT: The CLT also states that \(\sqrt{n}(\hat{\theta}^* - \hat{\theta}) \xrightarrow{\text{dist}} N(0, Var_{\hat{F}_n}(\theta))\).
  • The Power: This distribution is ESTIMABLE! We just run the bootstrap \(B\) times and we get \(B\) values of \(\hat{\theta}^*\). We can then calculate their variance, standard deviation, and percentiles directly. 这个分布是可估计的!我们只需运行 \(B\) 次自举程序,就能得到 \(B\)\(\hat{\theta}^*\) 值。然后我们可以直接计算它们的方差、标准差和百分位数。

The Core Approximation (Slide 88)

The entire method relies on the assumption that the (knowable) bootstrap distribution is a good approximation of the (unknown) true sampling distribution. 整个方法依赖于以下假设:(已知的)自举分布能够很好地近似(未知的)真实抽样分布

The distribution of the bootstrap error approximates the distribution of the true error. 自举误差的分布近似于真实误差的分布。

\[\text{distribution of } \sqrt{n}(\hat{\theta}^* - \hat{\theta}) \approx \text{distribution of } \sqrt{n}(\hat{\theta} - \theta)\]

This is why:

  • The standard deviation of the \(\hat{\theta}^*\) values is our estimate for the standard error of \(\hat{\theta}\). \(\hat{\theta}^*\) 值的标准差是我们对 \(\hat{\theta}\) 的标准误差的估计值。
  • The percentiles of the \(\hat{\theta}^*\) distribution (e.g., 2.5th and 97.5th) can be used to build a confidence interval for the true parameter \(\theta\). \(\hat{\theta}^*\) 分布的百分位数(例如,第 2.5 个和第 97.5 个)可用于为真实参数 \(\theta\) 建立置信区间。

3. Extension: Cross-Validation (CV) Analysis

CV for Hyperparameter Tuning (Slide 84) 超参数调优的 CV

This plot is the result of the 10-fold CV code shown in the previous set of slides (slide 82).

  • Purpose: To find the optimal hyperparameter \(k\) (number of neighbors) for the KNN model.
  • X-axis: Number of Neighbors (\(k\)).
  • Y-axis: CV Error (Mean Squared Error).
  • Analysis:
    • Low \(k\) (e.g., \(k=1, 2\)): High error. The model is too complex and overfitting to the training data.
    • High \(k\) (e.g., \(k>40\)): Error slowly increases. The model is too simple and underfitting (e.g., averaging too many neighbors).
    • Optimal \(k\): The “sweet spot” is at the bottom of the “U” shape, around \(k \approx 20-30\), which gives the lowest CV error.

  • 目的:为 KNN 模型找到最优超参数 \(k\)(邻居数)。
  • X 轴:邻居数 (\(k\))。
  • Y 轴:CV 误差(均方误差)。
  • 分析:
  • \(k\)(例如,\(k=1, 2\)):误差较大。模型过于复杂,并且与训练数据过拟合
  • \(k\)(例如,\(k>40\)):误差缓慢增加。模型过于简单且欠拟合(例如,对太多邻居进行平均)。
  • 最优 \(k\)“最佳点”位于“U”形的底部,大约为\(k \approx 20-30\),此时 CV 误差最低。

Why CV Over-Estimates Test Error (Slide 89)

This is a subtle but important theoretical point.

  • Our Goal: We want to know the test error of our final model (\(\hat{f}^{\text{full}}\)), which we will train on the full dataset (all \(n\) observations). 我们想知道最终模型 (\(\hat{f}^{\text{full}}\)) 的测试误差,我们将在完整数据集(所有 \(n\) 个观测值)上训练该模型。
  • What CV Measures: \(k\)-fold CV does not test the final model. It tests \(k\) different models (\(\hat{f}^{(k)}\)), each trained on a smaller dataset (of size \(\frac{k-1}{k} \times n\)). \(k\) 倍 CV 并不测试最终模型。它测试了 \(k\) 个不同的模型 (\(\hat{f}^{(k)}\)),每个模型都基于一个较小的数据集(大小为 \(\frac{k-1}{k} \times n\))进行训练。

  • The Logic:
    1. Models trained on less data generally perform worse than models trained on more data. 基于较少数据训练的模型通常比基于较多数据训练的模型表现更差
    2. The CV error is the average error of models trained on \(\frac{k-1}{k} n\) observations. CV 误差是使用 \(\frac{k-1}{k} n\) 个观测值训练的模型的平均误差。
    3. The “true test error” is the error of the model trained on \(n\) observations. “真实测试误差”是使用 \(n\) 个观测值训练的模型的误差。
  • Conclusion: Since the CV models are trained on smaller datasets, they will, on average, have a slightly higher error than the final model. Therefore, the CV error score is a slightly pessimistic estimate (it over-estimates) the true test error of the final model. 由于 CV 模型是在较小的数据集上训练的,因此它们的平均误差会略高于最终模型。因此,CV 误差分数是一个略微悲观的估计(它高估了)最终模型的真实测试误差。

Correction of CV Error (Slides 90-91)

  • Theory (Slide 91): Advanced theory suggests the expected test error \(R(n)\) behaves like \(R(n) = R^* + c/n\), where \(R^*\) is the irreducible error and \(n\) is the sample size. This formula mathematically confirms that error decreases as sample size \(n\) increases. 高级理论表明,预期测试误差 \(R(n)\) 的行为类似于 \(R(n) = R^* + c/n\),其中 \(R^*\) 是不可约误差,\(n\) 是样本量。该公式从数学上证实了误差会随着样本量 \(n\) 的增加而减小

  • R Code (Slide 90): The cv.glm function from the boot library automatically provides this.

    • cv.err$delta: This output vector contains two values.
    • [1] 24.23151 (Raw CV Error): This is the standard Leave-One-Out CV (LOOCV) error.
    • [2] 24.23114 (Adjusted CV Error): This is a bias-corrected estimate that accounts for the overestimation problem. It’s slightly lower, representing a more accurate guess for the error of the final model trained on all \(n\) data points.

# The “Correction of CV Error” extension.

Summary

This section provides a deeper mathematical look at why k-fold cross-validation (CV) slightly over-estimates the true test error. 本节从数学角度更深入地阐述了 为什么 k 折交叉验证 (CV) 会略微高估真实测试误差。

  1. The Overestimation 高估: CV trains on \(\frac{k-1}{k}\) of the data, which is less than the full dataset (size \(n\)). Models trained on less data are generally worse. Therefore, the average error from CV (\(CV_k\)) is slightly higher (more pessimistic) than the true error of the final model trained on all \(n\) data (\(R(n)\)). CV 训练的数据为 \(\frac{k-1}{k}\),小于完整数据集(大小为 \(n\))。使用较少数据训练的模型通常更差。因此,CV 的平均误差 (\(CV_k\)) 略高于(更悲观地)基于所有 \(n\) 个数据训练的最终模型的真实误差 (\(R(n)\))。

  2. A Simple Correction 简单修正: A mathematical formula, \(\tilde{CV_k} = \frac{k-1}{k} \cdot CV_k\), is proposed to “correct” this overestimation.

  3. The Critical Flaw 关键缺陷: This correction is derived assuming the irreducible error (\(R^*\)) is zero.此修正是在假设不可约误差 (\(R^*\)) 为零的情况下得出的。

  4. The Takeaway 要点 (Code Analysis): The Python code demonstrates a real-world scenario where there is noise (noise_std = 0.5), meaning \(R^* > 0\). In this case, the simple correction fails—it produces an error (0.217) that is less accurate and further from the true error (0.272) than the original raw CV error (0.271).

Python 代码演示了一个存在噪声(noise_std = 0.5)的真实场景,即 \(R^* > 0\)。在这种情况下,简单修正失败——它产生的误差 (0.217) 精度较低,并且与真实误差 (0.272) 的距离比原始 CV 误差 (0.271) 更远。

Exam Conclusion: For most real-world problems (which have noise), the raw \(k\)-fold CV error is a better and more reliable estimate of the true test error than the simple (and flawed) correction. 对于大多数实际问题(包含噪声),原始 \(k\) 倍 CV 误差比简单(且有缺陷的)修正方法更能准确、可靠地估计真实测试误差

Mathematical Understanding

This section explains the theory of why \(CV_k > R(n)\) and derives the simple correction. 本节解释了为什么 \(CV_k > R(n)\),并推导出简单的修正方法。

  1. Assumed Error Behavior 假设误差行为: We assume the test error \(R(n)\) for a model trained on \(n\) data points behaves like: 我们假设基于 \(n\) 个数据点训练的模型的测试误差 \(R(n)\) 的行为如下: \[R(n) = R^* + \frac{c}{n}\]

    • \(R^*\): The irreducible error (the “noise floor” you can never beat). 不可约误差(即你永远无法克服的“本底噪声”)。
    • \(c/n\): The model variance, which decreases as sample size \(n\) increases. 模型方差,随着样本量 \(n\) 的增加而减小。
  2. Test Error vs. CV Error 测试误差 vs. CV 误差:

    • Test Error of Interest: This is the error of our final model trained on all \(n\) points: \[R(n) = R^* + \frac{c}{n}\]
    • 感兴趣的测试误差:这是我们在所有 \(n\) 个点上训练的最终模型的误差:
    • k-fold CV Error: This is the average error of \(k\) models, each trained on a smaller sample of size \(n' = (\frac{k-1}{k})n\).
    • k 倍 CV 误差:这是 \(k\) 个模型的平均误差,每个模型都使用一个较小的样本(大小为 \(n' = (\frac{k-1}{k})n\))进行训练。 \[CV_k \approx R(n') = R\left(\frac{k-1}{k}n\right) = R^* + \frac{c}{\left(\frac{k-1}{k}\right)n} = R^* + \frac{ck}{(k-1)n}\]
  3. The Overestimation 高估: Let’s compare \(CV_k\) and \(R(n)\): \[CV_k \approx R^* + \left(\frac{k}{k-1}\right) \frac{c}{n}\] \[R(n) = R^* + \left(\frac{k-1}{k-1}\right) \frac{c}{n}\] Since \(k > (k-1)\), the factor \(\left(\frac{k}{k-1}\right)\) is greater than 1. This means the \(CV_k\) error term is larger than the \(R(n)\) error term. Thus: \(CV_k > \text{Test error of interest } R(n)\) 由于 \(k > (k-1)\),因子 \(\left(\frac{k}{k-1}\right)\) 大于 1。这意味着 \(CV_k\) 误差项大于 \(R(n)\) 误差项。因此: \(CV_k > \text{目标测试误差 } R(n)\)

  4. Deriving the (Flawed) Correction 推导(有缺陷的)修正: This correction makes a strong assumption: \(R^* \approx 0\) (the model is perfectly specified, and there is no noise). 此修正基于一个强假设:\(R^* \approx 0\)(模型完全正确,且无噪声)。

    • If \(R^* = 0\), then \(R(n) \approx \frac{c}{n}\)
    • If \(R^* = 0\), then \(CV_k \approx \frac{ck}{(k-1)n}\)

    Now, look at the ratio between them: \[\frac{R(n)}{CV_k} \approx \frac{c/n}{ck/((k-1)n)} = \frac{c}{n} \cdot \frac{(k-1)n}{ck} = \frac{k-1}{k}\]

    This gives us the correction formula by isolating \(R(n)\): 通过分离 \(R(n)\),我们得到了校正公式: \[R(n) \approx \left(\frac{k-1}{k}\right) \cdot CV_k\] This corrected version is denoted \(\tilde{CV_k}\).这个校正版本表示为 \(\tilde{CV_k}\)

Code Analysis (Slides 92-93)

The Python code is an experiment designed to test the correction formula. A small reconstruction of the same experiment is sketched after this walkthrough.

  • Goal: Compare the “Raw CV Error” (\(CV_k\)), the “Corrected CV Error” (\(\tilde{CV_k}\)), and the “True Test Error” (\(R(n)\)) in a realistic setting.

  • Key Setup:

    1. def f(x): Defines the true, underlying function \(y = x^2 + 15\sin(x)\).
    2. noise_std = 0.5: This is the most important line. It adds significant random noise to the data. This ensures that the irreducible error \(R^*\) is large and \(R^* > 0\).
    3. y = f(...) + np.random.normal(...): Creates the noisy training data (the blue dots).
  • CV Calculation (Standard K-Fold):

    • kf = KFold(...): Sets up 5-fold CV (\(k=5\)).
    • for train_index, val_index in kf.split(x):: This is the standard loop. It trains on 4 folds and validates on 1 fold.
    • cv_error = np.mean(cv_mse_list): Calculates the raw \(CV_5\) error. This is the first result (e.g., 0.2715).
  • Correction Calculation:

    • correction_factor = (k_splits - 1) / k_splits: This is \(\frac{k-1}{k}\), which is \(4/5 = 0.8\).
    • corrected_cv_error = correction_factor * cv_error: This applies the flawed formula from the math section (\(0.2715 \times 0.8\)). This is the second result (e.g., 0.2172).
  • “True” Test Error Calculation:

    • knn.fit(x, y): Trains the final model on the entire noisy dataset.
    • n_test = 1000: Creates a new, large test set to estimate the true error.
    • true_test_error = mean_squared_error(...): Calculates the error of the final model on this new test set. This is our best estimate of \(R(n)\) (e.g., 0.2725).
  • Analysis of Results (Slide 93):

    • Raw 5-Fold CV MSE: 0.2715
    • True test error: 0.2725
    • Corrected 5-Fold CV MSE: 0.2172

    The Raw CV Error (0.2715) is an excellent estimate of the True Test Error (0.2725). The Corrected Error (0.2172) is much worse. This experiment proves that when noise (\(R^*\)) is present, the simple correction formula should not be used.
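
For reference, here is a rough reconstruction of this experiment (sample sizes, the x-range, and the KNN setting are assumptions, so the printed numbers will differ from the slide’s).

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def f(x):
    return x ** 2 + 15 * np.sin(x)                 # true underlying function from the slides

rng = np.random.default_rng(0)
noise_std = 0.5                                    # noise ensures the irreducible error R* > 0
n = 200
x = rng.uniform(-3, 3, size=(n, 1))
y = f(x[:, 0]) + rng.normal(scale=noise_std, size=n)

# Raw k-fold CV error (k = 5)
k_splits = 5
kf = KFold(n_splits=k_splits, shuffle=True, random_state=0)
cv_mse_list = []
for train_index, val_index in kf.split(x):
    knn = KNeighborsRegressor(n_neighbors=10).fit(x[train_index], y[train_index])
    cv_mse_list.append(mean_squared_error(y[val_index], knn.predict(x[val_index])))
cv_error = np.mean(cv_mse_list)

# "Corrected" CV error using the (flawed) factor (k-1)/k
corrected_cv_error = (k_splits - 1) / k_splits * cv_error

# Approximate true test error of the final model trained on all n points
knn_full = KNeighborsRegressor(n_neighbors=10).fit(x, y)
x_test = rng.uniform(-3, 3, size=(1000, 1))
y_test = f(x_test[:, 0]) + rng.normal(scale=noise_std, size=1000)
true_test_error = mean_squared_error(y_test, knn_full.predict(x_test))

print(f"Raw 5-fold CV MSE:       {cv_error:.4f}")
print(f"Corrected 5-fold CV MSE: {corrected_cv_error:.4f}")
print(f"True test error:         {true_test_error:.4f}")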

统计机器学习 Lecture-4

Lecturer: Prof. XIA DONG

1. What is Classification?

Classification is a type of supervised machine learning where the goal is to predict a categorical or qualitative response. Unlike regression where you predict a continuous numerical value (like a price or temperature), classification assigns an input to a specific category or class. 分类是一种监督式机器学习,其目标是预测分类或定性响应。与预测连续数值(例如价格或温度)的回归不同,分类将输入分配到特定的类别或类别。

Key characteristics:

  • Goal: Predict the class of a subject based on input features.

  • Output (Response): The output is a category, such as ‘Yes’/‘No’, ‘Spam’/‘Not Spam’, or ‘High’/‘Medium’/‘Low’.

  • Applications: Common examples include email spam detectors, medical diagnosis (e.g., virus carrier vs. non-carrier), and fraud detection.

    • 目标:根据输入特征预测主题的类别。
    • 输出(响应):输出是一个类别,例如“是”/“否”、“垃圾邮件”/“非垃圾邮件”或“高”/“中”/“低”。
    • 应用:常见示例包括垃圾邮件检测器、医学诊断(例如,病毒携带者与非病毒携带者)和欺诈检测。

The example used in the slides is a credit card Default dataset. The goal is to predict whether a customer will default (‘Yes’ or ‘No’) on their payments based on their monthly income and account balance.

## Why Not Use Linear Regression?为什么不使用线性回归?

At first, it might seem possible to use linear regression for classification. For a binary (two-class) problem like the default dataset, you could code the outcomes as numbers, for example:

  • Default = ‘No’ => \(y = 0\)
  • Default = ‘Yes’ => \(y = 1\)

You could then fit a standard linear regression model: \(Y \approx \beta_0 + \beta_1 X\). In this context, we would interpret the prediction \(\hat{y}\) as the probability of default, so we’d be modeling \(P(Y=1|X) = \beta_0 + \beta_1 X\).

However, this approach has two major problems: 然而,这种方法有两个主要问题:

1. The Output Is Not a Probability. A linear model can produce outputs that are less than 0 or greater than 1. This doesn’t make sense for a probability, which must always be between 0 and 1.

The image below is the most important one for understanding this issue. The left plot shows a linear regression line fit to the 0/1 default data. You can see the line goes below 0 and would eventually go above 1 for higher balances. The right plot shows a logistic regression curve, which always stays between 0 and 1.

  • Left (Linear Regression): The straight blue line predicts probabilities < 0 for low balances.
  • Right (Logistic Regression): The S-shaped blue curve correctly constrains the probability output between 0 and 1.

2. It Doesn’t Work for Multi-Class Problems. If you have more than two categories (e.g., ‘mild’, ‘moderate’, ‘severe’), you might code them as 0, 1, and 2. A linear regression model would incorrectly assume that the “distance” between ‘mild’ and ‘moderate’ is the same as the distance between ‘moderate’ and ‘severe’, which is usually not a valid assumption.

1. 输出不是概率 线性模型可以产生小于 0 或大于 1 的输出。这对于概率来说毫无意义,因为概率必须始终介于 0 和 1 之间。

下图是理解这个问题最重要的图。左图显示了与 0/1 默认数据拟合的线性回归线。您可以看到,该线低于 0,并且最终会随着余额的增加而高于 1。右图显示了逻辑回归曲线,它始终保持在 0 和 1 之间。

  • 左图(线性回归):蓝色直线预测低余额的概率小于 0。
  • 右图(逻辑回归):S 形蓝色曲线正确地将概率输出限制在 0 和 1 之间。

2.它不适用于多类别问题 如果您有两个以上的类别(例如,“轻度”、“中度”、“重度”),您可能会将它们编码为 0、1 和 2。线性回归模型会错误地假设“轻度”和“中度”之间的“距离”与“中度”和“重度”之间的距离相同,这通常不是一个有效的假设。

## The Solution: Logistic Regression

Instead of modeling the response \(y\) directly, logistic regression models the probability that \(y\) belongs to a particular class. To solve the issue of the output not being a probability, it uses the logistic function (also known as the sigmoid function).

This function takes any real-valued input and squeezes it into an output between 0 and 1.

The formula for the probability in a logistic regression model is: \[P(Y=1|X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\] This S-shaped function, shown in the right-hand plot above, ensures that the output is always a valid probability. We can then set a threshold (e.g., 0.5) to make the final class prediction. If \(P(Y=1|X) > 0.5\), we predict ‘Yes’; otherwise, we predict ‘No’.

## 解决方案:逻辑回归

逻辑回归不是直接对响应 \(y\) 进行建模,而是对 \(y\) 属于特定类别的概率进行建模。为了解决输出不是概率的问题,它使用了逻辑函数(也称为 S 型函数)。

此函数接受任何实值输入,并将其压缩为介于 0 和 1 之间的输出。

逻辑回归模型中的概率公式为: \[P(Y=1|X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\] 如上图右侧所示,这个 S 形函数确保输出始终是有效概率。然后,我们可以设置一个阈值(例如 0.5)来进行最终的类别预测。如果 \(P(Y=1|X) > 0.5\),则预测“是”;否则,预测“否”。

## Data Visualization & Code in Python

The slides use R to visualize the data. The boxplots are particularly important because they show which variable is a better predictor.

  • Balance vs. Default: The boxplots for balance show a clear difference. The median balance for those who default (‘Yes’) is much higher than for those who do not (‘No’). This suggests balance is a strong predictor.

  • Income vs. Default: The boxplots for income show a lot of overlap. The median incomes for both groups are very similar. This suggests income is a weak predictor.

  • 余额 vs. 违约:余额的箱线图显示出明显的差异。违约者(“是”)的余额中位数远高于未违约者(“否”)。这表明余额是一个强有力的预测指标

  • 收入 vs. 违约:收入的箱线图显示出很大的重叠。两组的收入中位数非常相似。这表明收入是一个弱的预测指标

Here’s how you could perform similar analysis and modeling in Python using seaborn and scikit-learn.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assume 'default_data.csv' has columns: 'default' (Yes/No), 'balance', 'income'
# You would load your data like this:
# df = pd.read_csv('default_data.csv')

# For demonstration, let's create some sample data
data = {
'balance': [1200, 2100, 800, 1800, 500, 1600, 2200, 1900],
'income': [45000, 60000, 30000, 55000, 25000, 48000, 70000, 65000],
'default': ['No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes']
}
df = pd.DataFrame(data)

# --- 1. Data Visualization (like the slides) ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('Predictor Analysis for Default')

# Boxplot for Balance
sns.boxplot(ax=axes[0], x='default', y='balance', data=df)
axes[0].set_title('Balance vs. Default Status')

# Boxplot for Income
sns.boxplot(ax=axes[1], x='default', y='income', data=df)
axes[1].set_title('Income vs. Default Status')

plt.show()

# --- 2. Logistic Regression Modeling ---

# Convert categorical 'default' column to 0s and 1s
df['default_encoded'] = df['default'].apply(lambda x: 1 if x == 'Yes' else 0)

# Define features (X) and target (y)
X = df[['balance', 'income']]
y = df['default_encoded']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on new data
# For example, a person with a $2000 balance and $50,000 income
new_customer = [[2000, 50000]]
predicted_prob = model.predict_proba(new_customer)
prediction = model.predict(new_customer)

print(f"Customer data: Balance=2000, Income=50000")
print(f"Probability of No Default vs. Default: {predicted_prob}") # [[P(No), P(Yes)]]
print(f"Final Prediction (0=No, 1=Yes): {prediction}")

2. the mathematical foundation of logistic regression

This set of slides explains the mathematical foundation of logistic regression, how its parameters are estimated using Maximum Likelihood Estimation (MLE), and how an iterative algorithm called Newton-Raphson is used to perform this estimation.

逻辑回归的数学基础、如何使用最大似然估计 (MLE) 估计其参数,以及如何使用名为 Newton-Raphson 的迭代算法进行估计。

2.1 The Logistic Regression Model: From Probabilities to Log-Odds逻辑回归模型:从概率到对数几率

The core of logistic regression is transforming a linear model into a valid probability. This is done using the logistic function, also known as the sigmoid function. 逻辑回归的核心是将线性模型转换为有效的概率。这可以通过逻辑函数(也称为 S 型函数)来实现。

#### Key Mathematical Formulas

  1. Probability of Class 1: The model assumes the probability of an observation \(\mathbf{x}\) belonging to class 1 is given by the sigmoid function: \[ P(y=1|\mathbf{x}) = \frac{1}{1 + \exp(-\beta^T \mathbf{x})} = \frac{\exp(\beta^T \mathbf{x})}{1 + \exp(\beta^T \mathbf{x})} \] This function always outputs a value between 0 and 1, making it perfect for modeling probabilities.

  2. Odds: The odds are the ratio of the probability of an event happening to the probability of it not happening. \[ \text{Odds} = \frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \exp(\beta^T \mathbf{x}) \]

  3. Log-Odds (Logit): By taking the natural logarithm of the odds, we get a linear relationship with the predictors. This is called the logit transformation. \[ \text{logit}(P(y=1|\mathbf{x})) = \log\left(\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})}\right) = \beta^T \mathbf{x} \] This final equation is the heart of the model. It states that the log-odds of the outcome are a linear function of the predictors. This provides a great interpretation: a one-unit increase in a predictor \(x_j\) changes the log-odds by \(\beta_j\).

  4. 类别 1 的概率:该模型假设观测值 \(\mathbf{x}\) 属于类别 1 的概率由 S 型函数给出: \[ P(y=1|\mathbf{x}) = \frac{1}{1 + \exp(-\beta^T \mathbf{x})} = \frac{\exp(\beta^T \mathbf{x})}{1 + \exp(\beta^T \mathbf{x})} \] 此函数的输出值始终介于 0 和 1 之间,非常适合用于概率建模。

  5. 几率:几率是事件发生的概率与不发生的概率之比。 \[ \text{Odds} = \frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \exp(\beta^T \mathbf{x}) \]

  6. 对数几率 (Logit):通过对几率取自然对数,我们可以得到其与预测变量之间的线性关系。这被称为 logit 变换: \[ \text{logit}(P(y=1|\mathbf{x})) = \log\left(\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})}\right) = \beta^T \mathbf{x} \] 最后一个方程是模型的核心。它指出结果的对数几率是预测变量的线性函数。这提供了一个很好的解释:预测变量 \(x_j\) 每增加一个单位,对数几率就会改变 \(\beta_j\)。

2.2 Fitting the Model: Maximum Likelihood Estimation (MLE) 拟合模型:最大似然估计 (MLE)

Unlike linear regression, which uses least squares to find the best-fit line, logistic regression uses Maximum Likelihood Estimation (MLE). The goal of MLE is to find the parameter values (the \(\beta\) coefficients) that maximize the probability of observing the actual data that we have. 与使用最小二乘法寻找最佳拟合线的线性回归不同,逻辑回归使用最大似然估计 (MLE)。MLE 的目标是找到使观测到实际数据的概率最大化的参数值(\(\beta\) 系数)。

  1. Likelihood Function: This is the joint probability of observing all the data points in our sample. Assuming each observation is independent, it’s the product of the individual probabilities: 1.似然函数:这是观测到样本中所有数据点的联合概率。假设每个观测值都是独立的,它是各个概率的乘积: \[ L(\beta) = \prod_{i=1}^{n} P(y_i|\mathbf{x}_i) \] A clever way to write this for a binary (0/1) outcome is: \[ L(\beta) = \prod_{i=1}^{n} \frac{\exp(y_i \beta^T \mathbf{x}_i)}{1 + \exp(\beta^T \mathbf{x}_i)} \]

  2. Log-Likelihood Function: Products are difficult to work with mathematically, so we work with the logarithm of the likelihood, which turns the product into a sum. Maximizing the log-likelihood is the same as maximizing the likelihood.

  3. 对数似然函数:乘积在数学上很难处理,所以我们使用似然的对数,将乘积转化为和。最大化对数似然与最大化似然相同。 \[ \ell(\beta) = \log(L(\beta)) = \sum_{i=1}^{n} \left[ y_i \beta^T \mathbf{x}_i - \log(1 + \exp(\beta^T \mathbf{x}_i)) \right] \] Key Takeaway: The slides correctly state that there is no explicit formula to solve for the \(\hat{\beta}\) that maximizes this function. We must find it using a numerical optimization algorithm. 没有明确的公式来求解最大化该函数的\(\hat{\beta}\)。我们必须使用数值优化算法来找到它。

2.3 The Algorithm: Newton-Raphson 算法:牛顿-拉夫森算法

The slides introduce the Newton-Raphson algorithm as the method to find the optimal \(\hat{\beta}\). It’s an efficient iterative algorithm for finding the roots of a function (i.e., where \(f(x)=0\)).

How does this apply to logistic regression? To maximize the log-likelihood function \(\ell(\beta)\), we need to find the point where its derivative (gradient) is equal to zero. So, Newton-Raphson is used to solve \(\frac{d\ell(\beta)}{d\beta} = 0\).

它是一种高效的迭代算法,用于求函数的根(即,当\(f(x)=0\)时)。

这如何应用于逻辑回归? 为了最大化对数似然函数 \(\ell(\beta)\),我们需要找到其导数(梯度)等于零的点。因此,牛顿-拉夫森法用于求解 \(\frac{d\ell(\beta)}{d\beta} = 0\)

The General Newton-Raphson Method

The algorithm starts with an initial guess, \(x^{old}\), and iteratively refines it using the following update rule, which is based on a Taylor series approximation: \[ x^{new} = x^{old} - \frac{f(x^{old})}{f'(x^{old})} \] where \(f'(x)\) is the derivative of \(f(x)\). You repeat this step until the value of \(x\) converges.

该算法从初始估计 \(x^{old}\) 开始,并使用以下基于泰勒级数近似的更新规则迭代地对其进行优化: \[ x^{new} = x^{old} - \frac{f(x^{old})}{f'(x^{old})} \] 其中 \(f'(x)\)\(f(x)\) 的导数。重复此步骤,直到 \(x\) 的值收敛。

Important Image: Newton-Raphson Example (\(x^3 - 4 = 0\))

[Image showing iterations of Newton-Raphson]

This slide is a great illustration of the algorithm’s power.

  • Goal: Find \(x\) such that \(f(x) = x^3 - 4 = 0\).
  • Function: \(f(x) = x^3 - 4\)
  • Derivative: \(f'(x) = 3x^2\)
  • Update Rule: \(x^{new} = x^{old} - \frac{(x^{old})^3 - 4}{3(x^{old})^2}\)

Starting with a guess of \(x^{old} = 2\), the algorithm converges to the true answer (\(4^{1/3} \approx 1.5874\)) in just 4 steps.

  • 目标:找到 \(x\),使得 \(f(x) = x^3 - 4 = 0\)
  • 函数\(f(x) = x^3 - 4\)
  • 导数\(f'(x) = 3x^2\)
  • 更新规则\(x^{new} = x^{old} - \frac{(x^{old})^3 - 4}{3(x^{old})^2}\)\(x^{old} = 2\) 的猜测开始,该算法仅用 4 步就收敛到真实答案 (\(4^{1/3} \approx 1.5874\))。

Code Understanding (Python)

The slides show Python code implementing Newton-Raphson. Let’s break down the key function.

import numpy as np

# Define the function we want to find the root of
def f(x):
    return np.exp(x) - x*x + 3 * np.sin(x)

# Define its derivative
def f_prime(x):
    return np.exp(x) - 2*x + 3 * np.cos(x)

# Newton-Raphson method
def newton_raphson(x0, tol=1e-10, max_iter=100):
    x = x0  # Start with the initial guess
    for i in range(max_iter):
        fx = f(x)         # Calculate f(x_old)
        fpx = f_prime(x)  # Calculate f'(x_old)

        if fpx == 0:  # Cannot divide by zero
            print("Zero derivative. No solution found.")
            return None

        # This is the core update rule
        x_new = x - fx / fpx

        # Check if the change is small enough to stop
        if abs(x_new - x) < tol:
            print(f"Converged to {x_new} after {i+1} iterations.")
            return x_new

        # Update x for the next iteration
        x = x_new

    print("Exceeded maximum iterations. No solution found.")
    return None

# Initial guess and execution
x0 = 0.5
root = newton_raphson(x0)

The slides show that with a good initial guess (x0 = 0.5), the algorithm converges quickly. With a bad one (x0 = 50), it still converges but takes many more steps. This highlights the importance of the starting point. The slides also show an implementation of Gradient Descent, another popular optimization algorithm which uses the update rule x_new = x - learning_rate * gradient.
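
As a point of comparison, the gradient-descent update mentioned above can be sketched in a few lines on a simple convex function; the objective and learning rate here are illustrative choices, not the slides’ example.

import numpy as np

def g(x):
    return (x - 2.0) ** 2            # simple convex objective with minimum at x = 2

def g_grad(x):
    return 2.0 * (x - 2.0)           # its derivative

x = 10.0                             # initial guess
learning_rate = 0.1
for i in range(200):
    x_new = x - learning_rate * g_grad(x)   # x_new = x - learning_rate * gradient
    if abs(x_new - x) < 1e-10:
        break
    x = x_new

print(f"Converged to x = {x:.6f} after {i + 1} iterations (true minimum at 2).")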

The slides provide a great case study on logistic regression, particularly on the important concept of confounding variables. Here’s a summary covering the math, code, and key insights.

  1. Core Concept: Logistic Regression 📈 核心概念:逻辑回归 📈

Logistic regression is a statistical method used for binary classification, which means predicting an outcome that can only be one of two things (e.g., Yes/No, True/False, 1/0).

In this example, the goal is to predict the probability that a customer will default on a loan (Yes or No) based on factors like their account balance, income, and whether they are a student.

The core of logistic regression is the sigmoid (or logistic) function, which takes any real-valued number and squishes it to a value between 0 and 1, representing a probability.

\[ \hat{P}(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + ... + \beta_p X_p)}} \]

  • \(\hat{P}(Y=1|X)\) is the predicted probability of the outcome being “Yes” (e.g., default).
  • \(\beta_0\) is the intercept.
  • \(\beta_1, ..., \beta_p\) are the coefficients for each input variable (\(X_1, ..., X_p\)). The model’s job is to find the best values for these \(\beta\) coefficients.

逻辑回归是一种用于二元分类的统计方法,这意味着预测结果只能是两种情况之一(例如,是/否、真/假、1/0)。

在本例中,目标是根据客户账户“余额”、“收入”以及是否为“学生”等因素,预测客户拖欠贷款(是或否)的概率。

逻辑回归的核心是Sigmoid(或逻辑)函数,它将任何实数压缩为介于 0 和 1 之间的值,以表示概率。

\[ \hat{P}(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + ... + \beta_p X_p)}} \]

  • \(\hat{P}(Y=1|X)\) 是结果为“是”(例如,默认)的预测概率。
  • \(\beta_0\) 是截距。
  • \(\beta_1, ..., \beta_p\) 是每个输入变量 (\(X_1, ..., X_p\)) 的系数。模型的任务是找到这些 \(\beta\) 系数的最佳值。

3.1 How the Model “Learns” (Mathematical Foundation)

The slides show that the model’s coefficients (\(\beta\)) are found using an algorithm like Newton-Raphson. This is an iterative process to find the values that maximize the log-likelihood function. Think of this as finding the coefficient values that make the observed data most probable.这是一个迭代过程,用于查找最大化对数似然函数的值。可以将其视为查找使观测数据概率最大的系数值。

The key slide for this is the one titled “Newton-Raphson Iterative Algorithm”. It shows the formulas for:

  • The Gradient (\(\nabla\ell\)): The direction of the steepest ascent of the log-likelihood function.
  • The Hessian (\(H\)): The curvature of the log-likelihood function.

  • 梯度 (\(\nabla\ell\)):对数似然函数最陡上升的方向。
  • 黑森矩阵 (\(H\)):对数似然函数的曲率。

The updating rule is given by: \[ \beta^{new} = \beta^{old} - H^{-1}\nabla\ell \] This formula is used repeatedly until the coefficient values stop changing significantly, meaning the algorithm has converged to the best fit. This process is also referred to as Iteratively Reweighted Least Squares (IRLS). 此公式反复使用,直到系数值不再发生显著变化,这意味着算法已收敛到最佳拟合值。此过程也称为迭代重加权最小二乘法 (IRLS)
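
To see what one such update looks like in code, here is a NumPy sketch of the Newton-Raphson / IRLS iteration for logistic regression on synthetic data (the data-generating coefficients and sizes are assumptions, not the slides’ example).

import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept column + 2 predictors
beta_true = np.array([-1.0, 2.0, -0.5])                      # assumed data-generating coefficients
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(X.shape[1])                                  # initial guess
for it in range(25):
    p = 1 / (1 + np.exp(-X @ beta))            # P(y=1 | x) under the current beta
    grad = X.T @ (y - p)                        # gradient of the log-likelihood
    W = p * (1 - p)                             # weights; the Hessian is -X^T W X
    hessian = -(X * W[:, None]).T @ X
    step = np.linalg.solve(hessian, grad)       # H^{-1} * gradient
    beta = beta - step                          # beta_new = beta_old - H^{-1} * gradient
    if np.max(np.abs(step)) < 1e-10:
        break

print("estimated beta:", np.round(beta, 3), "after", it + 1, "iterations")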


3.2 The Puzzle: A Tale of Two Models 🕵️‍♂️

The most important story in these slides is how the effect of being a student changes depending on the model. This is a classic example of a confounding variable.

Model 1: Simple Logistic Regression (Default vs. Student)

When predicting default using only student status, the model is: default ~ student

From the slides, the coefficients are:

  • Intercept (\(\beta_0\)): -3.5041
  • student[Yes] (\(\beta_1\)): 0.4049 (positive)

The equation for the log-odds is: \[ \log\left(\frac{P(\text{default})}{1-P(\text{default})}\right) = -3.5041 + 0.4049 \times (\text{is\_student}) \]

Conclusion: The positive coefficient (0.4049) suggests that students are more likely to default than non-students. The slides calculate the probabilities:

  • Student Default Probability: 4.31%
  • Non-Student Default Probability: 2.92%

学生身份的影响如何根据模型而变化。这是一个典型的混杂变量的例子。

模型 1:简单逻辑回归(违约 vs. 学生)

仅使用学生身份预测违约时,模型为: default ~ student

幻灯片中显示的系数为:

  • 截距 (\(\beta_0\)): -3.5041
  • 学生[是] (\(\beta_1\)): 0.4049(正)

对数概率公式为: \[ \log\left(\frac{P(\text{default})}{1-P(\text{default})}\right) = -3.5041 + 0.4049 \times (\text{is\_student}) \]

结论:正系数 (0.4049) 表明学生比非学生更有可能违约。幻灯片计算了以下概率:

  • 学生违约概率:4.31%
  • 非学生违约概率:2.92%
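
As a quick check, plugging the fitted coefficients into the logistic function reproduces the two probabilities quoted above:

\[ \hat{P}(\text{default} \mid \text{student}) = \frac{e^{-3.5041 + 0.4049}}{1 + e^{-3.5041 + 0.4049}} \approx 0.0431, \qquad \hat{P}(\text{default} \mid \text{non-student}) = \frac{e^{-3.5041}}{1 + e^{-3.5041}} \approx 0.0292 \]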

3.3 Model 2: Multiple Logistic Regression (Default vs. All Variables) 模型 2:多元逻辑回归(违约 vs. 所有变量)

When we add balance and income to the model, it becomes: default ~ student + balance + income

From the slides, the new coefficients are:

  • Intercept (\(\beta_0\)): -10.8690
  • balance (\(\beta_1\)): 0.0057
  • income (\(\beta_2\)): 0.0030
  • student[Yes] (\(\beta_3\)): -0.6468 (negative)

The Shocking Twist! The coefficient for student[Yes] is now negative.

Conclusion: When we control for balance and income, students are actually less likely to default than non-students with the same balance and income.

Why the Change? The Confounding Variable Explained

The key insight, explained on the slide with multi-colored text bubbles, is that students, on average, have higher credit card balances.

  • In the simple model, the student variable was inadvertently capturing the risk associated with having a high balance. The model mistakenly concluded “being a student causes default.”
  • In the multiple model, the balance variable properly accounts for the risk from a high balance. With that effect isolated, the student variable can show its true, underlying relationship with default, which is negative.

This demonstrates why it’s crucial to consider multiple relevant variables to avoid drawing incorrect conclusions. The most important slides are the ones that present this paradox and its explanation.

令人震惊的转折! student[Yes] 的系数现在为负。

结论:当我们控制余额和收入时,学生实际上比具有相同余额和收入的非学生更不容易违约。

为什么会有变化?混杂变量解释

幻灯片上用彩色文字气泡解释了关键的见解,即学生平均拥有更高的信用卡余额

  • 在简单模型中,“学生”变量无意中捕捉到了高余额带来的风险。该模型错误地得出了“学生身份导致违约”的结论。
  • 在多元模型中,“余额”变量正确地解释了高余额带来的风险。在分离出这一影响后,“学生”变量可以显示其与违约之间真实的潜在关系,即负相关关系。

这说明了为什么考虑多个相关变量以避免得出错误结论至关重要。


Code Implementation: R vs. Python

The slides use R’s glm() (Generalized Linear Model) function. Here’s how you would replicate this in Python.

R Code (from slides)

# Simple Model
glmod2 <- glm(default ~ student, data=Default, family=binomial)
summary(glmod2)

# Multiple Model
glmod3 <- glm(default ~ ., data=Default, family=binomial) # '.' means all other variables
summary(glmod3)

Python Equivalent

We can use two popular libraries: statsmodels (which gives R-style summaries) and scikit-learn (the standard for machine learning).

import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# Assume 'Default' is a pandas DataFrame with columns:
# 'default' (0/1), 'student' (0/1), 'balance', 'income'

# --- Using statsmodels (recommended for interpretation) ---

# Prepare the data
# For statsmodels, we need to manually add the intercept
X_simple = Default[['student']]
X_simple = sm.add_constant(X_simple)
y = Default['default']

X_multiple = Default[['student', 'balance', 'income']]
X_multiple = sm.add_constant(X_multiple)

# Simple Model: default ~ student
model_simple = sm.Logit(y, X_simple).fit()
print("--- Simple Model ---")
print(model_simple.summary())

# Multiple Model: default ~ student + balance + income
model_multiple = sm.Logit(y, X_multiple).fit()
print("\n--- Multiple Model ---")
print(model_multiple.summary())


# --- Using scikit-learn (recommended for prediction tasks) ---

# Prepare the data (scikit-learn adds intercept by default)
X_simple_sk = Default[['student']]
y_sk = Default['default']

X_multiple_sk = Default[['student', 'balance', 'income']]

# Simple Model
# Note: scikit-learn applies L2 regularization by default; a very large C
# effectively turns it off so the estimates are comparable to R's glm().
clf_simple = LogisticRegression(C=1e6).fit(X_simple_sk, y_sk)
print(f"\nSimple Model Intercept (scikit-learn): {clf_simple.intercept_}")
print(f"Simple Model Coefficient (scikit-learn): {clf_simple.coef_}")

# Multiple Model
clf_multiple = LogisticRegression(C=1e6).fit(X_multiple_sk, y_sk)
print(f"\nMultiple Model Intercept (scikit-learn): {clf_multiple.intercept_}")
print(f"Multiple Model Coefficients (scikit-learn): {clf_multiple.coef_}")

4. Making Predictions and the Decision Boundary 🎯 进行预测和决策边界

Once the model is trained (i.e., we have the coefficients \(\hat{\beta}\)), we can make predictions. 一旦模型训练完成(即,我们有了系数 \(\hat{\beta}\)),我们就可以进行预测了。

## Math Behind Predictions

The model outputs the log-odds, which can be converted into a probability. A key concept is the decision boundary, which is the threshold where the model is uncertain (probability = 50%). 模型输出对数概率,它可以转换为概率。一个关键概念是决策边界,它是模型不确定的阈值(概率 = 50%)。

  1. The Estimated Odds: The core output of the linear part of the model is the exponential of the linear equation, which gives the odds of the outcome being ‘Yes’ (or 1). 估计几率 (Odds):模型线性部分的核心输出是线性方程的指数,它给出了结果为“是”(或 1)的几率。

    \[ \frac{\hat{P}(y=1|\mathbf{x}_0)}{\hat{P}(y=0|\mathbf{x}_0)} = \exp(\hat{\beta}^\top \mathbf{x}_0) \]

  2. The Decision Rule: We classify a new observation \(\mathbf{x}_0\) by comparing its predicted odds to a threshold \(\delta\). 决策规则:我们通过比较新观测值 \(\mathbf{x}_0\) 的预测几率与阈值 \(\delta\) 来对其进行分类。

    • Predict \(y=1\) if \(\exp(\hat{\beta}^\top \mathbf{x}_0) > \delta\)
    • Predict \(y=0\) if \(\exp(\hat{\beta}^\top \mathbf{x}_0) < \delta\) A common default is \(\delta=1\), which means we predict ‘Yes’ if the probability is greater than 0.5.
  3. The Linear Boundary: The decision boundary itself is where the odds are exactly equal to the threshold. By taking the logarithm, we see that this boundary is a linear equation. This is why logistic regression is called a linear classifier. 线性边界:决策边界本身就是几率恰好等于阈值的地方。取对数后,我们发现这个边界是一个线性方程。这就是逻辑回归被称为线性分类器的原因。

     \[ \hat{\beta}^\top \mathbf{x} = \log(\delta) \]

     For \(\delta=1\), the boundary is simply \(\hat{\beta}^\top \mathbf{x} = 0\).

This concept is visualized perfectly in the slide titled “Linear Classifier,” which shows a straight line neatly separating two classes of data points. 题为“线性分类器”的幻灯片完美地展示了这一概念,它展示了一条直线,将两类数据点巧妙地分隔开来。
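A small sketch (with made-up coefficients and points, not from the slides) showing that thresholding the odds at \(\delta\) is the same linear rule as thresholding \(\hat{\beta}^\top \mathbf{x}\) at \(\log(\delta)\):

```python
import numpy as np

# Hypothetical fitted coefficients [intercept, b1, b2] and a few new observations;
# the first column of X0 is the constant 1 for the intercept.
beta_hat = np.array([-1.0, 2.0, -0.5])
X0 = np.array([[1.0, 0.3, 1.2],
               [1.0, 1.5, 0.4],
               [1.0, 0.8, 0.9]])

delta = 1.0                                   # odds threshold (delta = 1  <=>  p > 0.5)
odds = np.exp(X0 @ beta_hat)

pred_by_odds = odds > delta                   # exp(beta^T x) > delta
pred_by_line = X0 @ beta_hat > np.log(delta)  # equivalent linear rule beta^T x > log(delta)
print(pred_by_odds, pred_by_line)             # identical boolean arrays
```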

Visualizing the Confounding Effect

The most important image in this set is Figure 4.3, as it visually explains the confounding puzzle from the first set of slides.

  • Right Panel (Boxplots): This shows that students (Yes) tend to have higher credit card balances than non-students (No). This is the source of the confounding.
  • Left Panel (Default Rates):
    • The dashed lines show the overall default rates. The orange line (students) is higher than the blue line (non-students). This matches our simple model (default ~ student).
    • The solid S-shaped curves show the probability of default as a function of balance. For any given balance, the blue curve (non-students) is slightly higher than the orange curve (students). This means that at the same level of debt, students are less likely to default. This matches our multiple regression model (default ~ student + balance + income).

This single figure brilliantly illustrates how a variable can appear to have one effect in isolation but the opposite effect when controlling for a confounding factor.

  • 右侧面板(箱线图):这表明学生(是)的信用卡余额往往高于非学生(否)。这就是混杂效应的根源。
  • 左侧面板(违约率):
    • 虚线显示总体违约率。橙色线(学生)高于蓝色线(非学生)。这与我们的简单模型(default ~ student)相符。
    • S 形实线显示违约概率与余额的关系。对于任何给定的余额,蓝色曲线(非学生)略高于橙色曲线(学生)。这意味着在相同的债务水平下,学生违约的可能性较小。这与我们的多元回归模型(default ~ student + balance + income)相符。

这张图巧妙地说明了为什么一个变量在单独使用时似乎会产生一种影响,但在控制混杂因素后却会产生相反的影响。

An Important Edge Case: Perfect Separation ⚠️

What happens if the data can be perfectly separated by a straight line? 如果数据可以用一条直线完美分离,会发生什么?

One might think this is the ideal scenario, but it causes a problem for the logistic regression algorithm. The model will try to find coefficients that make the probabilities for each class as close to 1 and 0 as possible. To do this, the magnitude of the coefficients (\(\hat{\beta}\)) must grow infinitely large. 人们可能认为这是理想情况,但它会给逻辑回归算法带来问题。模型会尝试找到使每个类别的概率尽可能接近 1 和 0 的系数。为此,系数 (\(\hat{\beta}\)) 的大小必须无限大。

The slide “Non-convergence for perfectly separated case” demonstrates this:

  • The Code: It generates two distinct, non-overlapping clusters of data points using Python’s scikit-learn.

  • Parameter Estimates Graph: It shows the Intercept, Coefficient 1, and Coefficient 2 values increasing or decreasing without limit as the algorithm runs through more iterations. They never converge to a stable value.

  • Decision Boundary Graph: The decision boundary itself might look reasonable, but the underlying coefficients are unstable.

  • 代码:它使用 Python 的 scikit-learn 生成两个不同的、不重叠的数据点聚类。

  • 参数估计图:它显示“截距”、“系数 1”和“系数 2”的值随着算法迭代次数的增加或减少而无限增大或减小。它们永远不会收敛到一个稳定的值。

  • 决策边界图:决策边界本身可能看起来合理,但底层系数是不稳定的。

Key Takeaway: If your logistic regression model fails to converge, the first thing you should check for is perfect separation in your training data. 关键要点:如果您的逻辑回归模型未能收敛,您应该检查的第一件事就是训练数据是否完美分离。
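To see the non-convergence concretely, here is a minimal sketch (not the slide's code) on perfectly separated 1-D data: plain gradient ascent on the log-likelihood keeps inflating the coefficients instead of settling down.

```python
import numpy as np

# Two perfectly separated 1-D clusters: class 0 on the left, class 1 on the right
x = np.array([-3.0, -2.5, -2.0, 2.0, 2.5, 3.0])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])      # add an intercept column

beta = np.zeros(2)
lr = 0.5
for it in range(1, 5001):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += lr * X.T @ (y - p)                 # gradient ascent on the log-likelihood
    if it % 1000 == 0:
        print(it, np.round(beta, 3), f"||beta|| = {np.linalg.norm(beta):.3f}")

# ||beta|| keeps creeping upward (roughly logarithmically): with perfectly
# separated data there is no finite maximizer of the likelihood.
```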

Code Understanding

The slides provide useful code snippets in both R and Python.

R Code (Plotting Predictions)

This code generates the plot with the two S-shaped curves (one for students, one for non-students) showing the probability of default as balance increases.

# Create a data frame for prediction with a range of balances
# One version for students, one for non-students
Default.st <- data.frame(balance=seq(500, 2500, by=1), student="Yes")
Default.nonst <- data.frame(balance=seq(500, 2500, by=1), student="No")

# Use the trained multiple regression model (glmod3) to predict probabilities
pred.st <- predict(glmod3, Default.st, type="response")
pred.nonst <- predict(glmod3, Default.nonst, type="response")

# Plot the results
plot(Default.st$balance, pred.st, type="l", col="red", ...)   # Students
lines(Default.nonst$balance, pred.nonst, col="blue", ...)     # Non-students

Python Code (Visualizing the Decision Boundary)

This Python code uses scikit-learn and matplotlib to create the plot showing the linear decision boundary.

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 1. Generate synthetic data with two classes
X, y = make_classification(...)

# 2. Initialize and fit the logistic regression model
model = LogisticRegression()
model.fit(X, y)

# 3. Create a mesh grid of points to make predictions over the entire plot area
xx, yy = np.meshgrid(...)

# 4. Predict the probability for each point on the grid
probs = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

# 5. Plot the decision boundary where the probability is 0.5
plt.contour(xx, yy, probs.reshape(xx.shape), levels=[0.5], ...)

# 6. Scatter plot the actual data points
plt.scatter(X[:, 0], X[:, 1], c=y, ...)
plt.show()

Other Important Remarks

The “Remarks” slide briefly mentions some key extensions:

  • Probit Model: An alternative to logistic regression that uses the cumulative distribution function (CDF) of the standard normal distribution instead of the sigmoid function. The results are often very similar.

  • Softmax Regression: An extension of logistic regression used for multi-class classification (when there are more than two possible outcomes).

  • Probit 模型:逻辑回归的替代方法,它使用标准正态分布的累积分布函数 (CDF) 代替 S 型函数。结果通常非常相似。

  • Softmax 回归:逻辑回归的扩展,用于多类分类(当存在两个以上可能结果时)。

5. Here is a summary of the slides on Linear Discriminant Analysis (LDA), including the key mathematical formulas, visual explanations, and how to implement it in Python.

The Main Idea: Classification Using Probabilities 使用概率进行分类

Linear Discriminant Analysis (LDA) is a classification method. For a given input x, it calculates the probability that x belongs to each class and then assigns x to the class with the highest probability.

It does this using Bayes’ Theorem, which provides a formula for the posterior probability \(P(Y=k | X=x)\), or the probability that the class is \(k\) given the input \(x\). 线性判别分析 (LDA) 是一种分类方法。对于给定的输入 x,它计算 x 属于每个类别的概率,然后将 x 分配给概率最高的类别。

它使用贝叶斯定理来实现这一点,该定理提供了后验概率 \(P(Y=k | X=x)\) 的公式,即给定输入 \(x\),该类别属于 \(k\) 的概率。 \[ p_k(x) = P(Y=k|X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)} \]

  • \(p_k(x)\) is the posterior probability we want to maximize.
  • \(\pi_k = P(Y=k)\) is the prior probability of class \(k\) (how common the class is overall).
  • \(f_k(x) = f(x|Y=k)\) is the class-conditional probability density function of observing input \(x\) if it belongs to class \(k\).

To classify a new observation \(x\), we simply find the class \(k\) that makes \(p_k(x)\) the largest. 为了对新的观察值 \(x\) 进行分类,我们只需找到使 \(p_k(x)\) 最大的类别 \(k\) 即可。


Key Assumptions of LDA

LDA’s power comes from a specific, simplifying assumption about the data’s distribution. LDA 的强大之处在于它对数据分布进行了特定的简化假设。

  1. Gaussian Distribution: LDA assumes that the data within each class \(k\) follows a p-dimensional multivariate normal (or Gaussian) distribution, denoted as \(X|Y=k \sim \mathcal{N}(\mu_k, \Sigma)\).

  2. Common Covariance: A crucial assumption is that all classes share the same covariance matrix \(\Sigma\). This means that while the classes may have different centers (means, \(\mu_k\)), their shape and orientation (covariance, \(\Sigma\)) are identical.

  3. 高斯分布:LDA 假设每个类 \(k\) 中的数据服从 p 维多元正态(或高斯)分布,表示为 \(X|Y=k \sim \mathcal{N}(\mu_k, \Sigma)\)

  4. 共同协方差:一个关键假设是所有类共享相同的协方差矩阵 \(\Sigma\)。这意味着虽然类可能具有不同的中心(均值,\(\mu_k\)),但它们的形状和方向(协方差,\(\Sigma\))是相同的。

The probability density function for a class \(k\) is: \[ f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp \left( -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right) \]

The image above (from your slide “Knowing normal distribution”) illustrates this. The two “bells” have different centers (different \(\mu_k\)) but similar shapes. The one on the right is “tilted,” indicating correlation between variables, which is captured in the shared covariance matrix \(\Sigma\). 上图(摘自幻灯片“了解正态分布”)说明了这一点。两个“钟”形的中心不同(\(\mu_k\) 不同),但形状相似。右边的钟形“倾斜”,表示变量之间存在相关性,这体现在共享协方差矩阵 \(\Sigma\) 中。


The Math Behind LDA: The Discriminant Function 判别函数

Since we only need to find the class \(k\) that maximizes the posterior probability \(p_k(x)\), we can simplify the math. The denominator in Bayes’ theorem is the same for all classes, so we only need to maximize the numerator: \(\pi_k f_k(x)\). 由于我们只需要找到使后验概率 \(p_k(x)\) 最大化的类别 \(k\),因此可以简化数学计算。贝叶斯定理中的分母对于所有类别都是相同的,因此我们只需要最大化分子:\(\pi_k f_k(x)\)。 Taking the logarithm (which doesn’t change which class is maximal) and removing constant terms gives us the linear discriminant function, \(\delta_k(x)\): 取对数(这不会改变哪个类别是最大值)并移除常数项,得到线性判别函数\(\delta_k(x)\)

\[ \delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k) \]

This function is linear in \(x\), which is why the method is called Linear Discriminant Analysis. The decision boundary between any two classes, say class \(k\) and class \(l\), is the set of points where \(\delta_k(x) = \delta_l(x)\), which defines a linear hyperplane. 该函数关于 \(x\)线性的,因此该方法被称为线性判别分析。任意两个类别(例如类别 \(k\) 和类别 \(l\))之间的决策边界是满足 \(\delta_k(x) = \delta_l(x)\) 的点的集合,这定义了一个线性超平面。

The image above (from your “Graph of LDA” slide) is very important. * Left: The ellipses show the true 95% probability contours for three Gaussian classes. The dashed lines are the ideal Bayes decision boundaries, which are perfectly linear because the assumption of common covariance holds. * Right: This shows a sample of data points drawn from those distributions. The solid lines are the LDA decision boundaries calculated from the sample. They are a very good estimate of the ideal boundaries. 上图(来自您的“LDA 图”幻灯片)非常重要。 * 左图:椭圆显示了三个高斯类别的真实 95% 概率轮廓。虚线是理想的贝叶斯决策边界,由于共同协方差假设成立,因此它们是完美的线性。 * 右图:这显示了从这些分布中抽取的数据点样本。实线是根据样本计算出的 LDA 决策边界。它们是对理想边界的非常好的估计。 ***

Practical Implementation: Estimating the Parameters 实际应用:估计参数

In a real-world scenario, we don’t know the true parameters (\(\mu_k\), \(\Sigma\), \(\pi_k\)). Instead, we estimate them from our training data (\(n\) total samples, with \(n_k\) samples in class \(k\)). 在实际场景中,我们不知道真正的参数(\(\mu_k\)\(\Sigma\)\(\pi_k\))。相反,我们根据训练数据(\(n\) 个样本,\(n_k\) 个样本属于 \(k\) 类)来估计它们。

  • Prior Probability (\(\hat{\pi}_k\)): The proportion of training samples in class \(k\). \[\hat{\pi}_k = \frac{n_k}{n}\]
  • Class Mean (\(\hat{\mu}_k\)): The average of the training samples in class \(k\). \[\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i=k} x_i\]
  • Common Covariance (\(\hat{\Sigma}\)): A weighted average of the sample covariance matrices for each class. This is often called the “pooled” covariance. \[\hat{\Sigma} = \frac{1}{n-K} \sum_{k=1}^{K} \sum_{i: y_i=k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T\]
  • 先验概率 (\(\hat{\pi}_k\)):训练样本在 \(k\) 类中的比例。 \[\hat{\pi}_k = \frac{n_k}{n}\]
  • 类别均值 (\(\hat{\mu}_k\)):训练样本在 \(k\) 类中的平均值。 \[\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i=k} x_i\]
  • 公共协方差 (\(\hat{\Sigma}\)):每个类的样本协方差矩阵的加权平均值。这通常被称为“合并”协方差。 \[\hat{\Sigma} = \frac{1}{n-K} \sum_{k=1}^{K} \sum_{i: y_i=k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T\]

We then plug these estimates into the discriminant function to get \(\hat{\delta}_k(x)\) and classify a new observation \(x\) to the class with the largest score. 然后,我们将这些估计值代入判别函数,得到 \(\hat{\delta}_k(x)\),并将新的观测值 \(x\) 归类到得分最高的类别。 ***
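A minimal NumPy sketch (not from the slides) of these plug-in estimates and the resulting discriminant scores; the two-cluster toy data is only for illustration.

```python
import numpy as np

def fit_lda(X, y):
    """Plug-in estimates: priors pi_k, class means mu_k and the pooled covariance."""
    classes = np.unique(y)
    n, p = X.shape
    pis, mus = [], []
    Sigma = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        pis.append(len(Xk) / n)                   # hat{pi}_k = n_k / n
        mu_k = Xk.mean(axis=0)                    # hat{mu}_k
        mus.append(mu_k)
        Sigma += (Xk - mu_k).T @ (Xk - mu_k)      # accumulate within-class scatter
    Sigma /= (n - len(classes))                   # pooled ("common") covariance
    return classes, np.array(pis), np.array(mus), Sigma

def lda_predict(x, classes, pis, mus, Sigma):
    """Assign x to the class with the largest discriminant score delta_k(x)."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = [x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
              for pi, mu in zip(pis, mus)]
    return classes[int(np.argmax(scores))]

# Toy usage: two Gaussian classes sharing a covariance (illustration only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([2.0, 2.0], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
classes, pis, mus, Sigma = fit_lda(X, y)
print(lda_predict(np.array([1.8, 2.1]), classes, pis, mus, Sigma))  # most likely 1
```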

Evaluating Performance

After training the model, we evaluate its performance using a confusion matrix. 训练模型后,我们使用混淆矩阵来评估其性能。

This matrix shows the true classes versus the predicted classes. * Diagonal elements (9644, 81) are correct predictions. * Off-diagonal elements (23, 252) are errors. 该矩阵显示了真实类别与预测类别的对比。 * 对角线元素 (9644, 81) 表示正确预测。 * 非对角线元素 (23, 252) 表示错误预测。

From this matrix, we can calculate key metrics:

  • Overall Error Rate: Total incorrect predictions / Total predictions. Example: \((252 + 23) / 10000 = 2.75\%\).
  • Sensitivity (True Positive Rate): Correctly predicted positives / Total actual positives. It answers: “Of all the people who actually defaulted, what fraction did we catch?” Example: \(81 / 333 = 24.3\%\). The sensitivity is \(1 - 75.7\% = 24.3\%\).
  • Specificity (True Negative Rate): Correctly predicted negatives / Total actual negatives. It answers: “Of all the people who did not default, what fraction did we correctly identify?” Example: \(9644 / 9667 = 99.8\%\). The specificity is \(1 - 0.24\% = 99.8\%\).

The example in your slides shows a high error rate for “default” people (75.7%) because the classes are unbalanced—there are far fewer defaulters. This highlights the importance of looking at class-specific metrics, not just the overall error rate.
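The metrics quoted above can be reproduced directly from the four cells of the confusion matrix:

```python
# Confusion matrix cells from the slides (Default data):
#                 true No   true Yes
# predicted No      9644        252
# predicted Yes       23         81
tn, fn = 9644, 252
fp, tp = 23, 81
n_total = tn + fn + fp + tp              # 10000

error_rate  = (fn + fp) / n_total        # (252 + 23) / 10000 = 2.75%
sensitivity = tp / (tp + fn)             # 81 / 333   = 24.3%
specificity = tn / (tn + fp)             # 9644 / 9667 = 99.8%

print(f"error rate  = {error_rate:.4f}")
print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
```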


Python Code Understanding

In Python, you can easily implement LDA using the scikit-learn library. The code conceptually mirrors the steps we discussed.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Assume you have your data X (features) and y (labels)
# X = features (e.g., balance, income)
# y = labels (e.g., 0 for 'no-default', 1 for 'default')

# 1. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Create an instance of the LDA model
lda = LinearDiscriminantAnalysis()

# 3. Fit the model to the training data
# This is where the model calculates the estimates:
# - Prior probabilities (pi_k)
# - Class means (mu_k)
# - Pooled covariance matrix (Sigma)
lda.fit(X_train, y_train)

# 4. Make predictions on new, unseen data
predictions = lda.predict(X_test)

# 5. Evaluate the model's performance
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))

print("\nClassification Report:")
print(classification_report(y_test, predictions))
  • LinearDiscriminantAnalysis() creates the classifier object.
  • lda.fit(X_train, y_train) is the core training step where the model learns the \(\hat{\pi}_k\), \(\hat{\mu}_k\), and \(\hat{\Sigma}\) parameters from the data.
  • lda.predict(X_test) uses the learned discriminant function \(\hat{\delta}_k(x)\) to classify each sample in the test set.
  • confusion_matrix and classification_report are tools to evaluate the results, just like in the slides.

6. Here is a summary of the provided slides on Linear Discriminant Analysis (LDA), focusing on mathematical concepts, Python code interpretation, and key visuals.

Core Concept: LDA for Classification

Linear Discriminant Analysis (LDA) is a classification method that models the probability that an observation belongs to a certain class. It works by finding a linear combination of features that best separates two or more classes.

The decision is based on Bayes’ theorem. For a given observation with features \(X=x\), LDA calculates the posterior probability, \(p_k(x) = Pr(Y=k|X=x)\), for each class \(k\). This is the probability that the observation belongs to class \(k\) given its features. 线性判别分析 (LDA) 是一种分类方法,它对观测值属于某个类别的概率进行建模。它的工作原理是找到能够最好地区分两个或多个类别的特征的线性组合。

该决策基于贝叶斯定理。对于特征为 \(X=x\) 的给定观测值,LDA 会计算每个类别 \(k\)后验概率\(p_k(x) = Pr(Y=k|X=x)\)。这是给定观测值的特征后,该观测值属于类别 \(k\) 的概率。

By default, the Bayes classifier assigns an observation to the class with the highest posterior probability. For a binary (two-class) problem like ‘Yes’ vs. ‘No’, this means: 默认情况下,贝叶斯分类器将观测值分配给后验概率最高的类别。对于像“是”与“否”这样的二分类问题,这意味着:

  • Assign to ‘Yes’ if \(Pr(Y=\text{Yes}|X=x) > 0.5\)
  • Assign to ‘No’ otherwise

Modifying the Decision Threshold

The default 0.5 threshold isn’t always optimal. In many real-world scenarios, the cost of one type of error is much higher than another. For example, in credit card default prediction: 默认的 0.5 阈值并非总是最优的。在许多实际场景中,一种错误的代价远高于另一种。例如,在信用卡违约预测中:

  • False Negative: Incorrectly classifying a person who will default as someone who won’t. (The bank loses money).
  • False Positive: Incorrectly classifying a person who won’t default as someone who will. (The bank loses a potential customer).

A bank might decide that missing a defaulter is much worse than denying a good customer. To catch more potential defaulters, they can lower the probability threshold. 银行可能会认为错过一个违约者比拒绝一个优质客户更糟糕。为了捕捉更多潜在的违约者,他们可以降低概率阈值

A modified rule could be: \[ Pr(\text{default}=\text{Yes}|X=x) > 0.2 \] This makes the model more “sensitive” to flagging potential defaulters, even at the cost of misclassifying more non-defaulters. 降低阈值会提高敏感度,但会降低特异性

This decision leads to a trade-off between two key performance metrics: * Sensitivity (True Positive Rate): The ability to correctly identify positive cases. (e.g., Correctly identified defaulters / Total actual defaulters). * Specificity (True Negative Rate): The ability to correctly identify negative cases. (e.g., Correctly identified non-defaulters / Total actual non-defaulters).

这一决策会导致两个关键绩效指标之间的权衡: * 敏感度(真阳性率):正确识别阳性案例的能力。(例如,“正确识别的违约者/实际违约者总数”)。 * 特异性(真阴性率):正确识别阴性案例的能力。(例如,“正确识别的非违约者/实际非违约者总数”)。

Lowering the threshold increases sensitivity but decreases specificity.

## Python Code Explained

The slides show how to implement and adjust LDA using Python’s scikit-learn library.

Basic LDA Implementation

# Import the necessary library
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Initialize and train the LDA model
lda = LinearDiscriminantAnalysis()
lda_train = lda.fit(X, y)

# Get predictions using the default 0.5 threshold
y_pred = lda.predict(X)

This code trains an LDA model and makes predictions using the standard 50% probability boundary.

Adjusting the Prediction Threshold

To use a custom threshold (e.g., 0.2), you don’t use the .predict() method. Instead, you get the class probabilities with .predict_proba() and apply the threshold manually.

import numpy as np

# 1. Get the probabilities for each class
# lda.predict_proba(X) returns an array like [[P(No), P(Yes)], ...]
# We select the second column [:, 1] for the 'Yes' class probability
lda_probs = lda.predict_proba(X)[:, 1]

# 2. Define a custom threshold
threshold = 0.2

# 3. Apply the threshold to get new predictions
# This creates a boolean array (True where prob > 0.2, else False)
# We then convert True/False to 'Yes'/'No' labels
lda_pred1 = np.where(lda_probs > threshold, "Yes", "No")

This is the core technique for tuning the classifier’s behavior to meet specific business needs, as demonstrated on slides 55 and 56 for both LDA and Logistic Regression.

Important Images to Understand

  1. Confusion Matrix (Slide 49): This table is crucial. It breaks down the model’s predictions into True Positives, True Negatives, False Positives, and False Negatives. All key metrics like error rate, sensitivity, and specificity are calculated from this matrix. 混淆矩阵(幻灯片 49):这张表至关重要。它将模型的预测分解为真阳性、真阴性、假阳性和假阴性。所有关键指标,例如错误率、灵敏度和特异性,都基于此矩阵计算得出。
  2. LDA Decision Boundaries (Slide 51): This plot provides a powerful visual intuition. It shows the data points for two classes and the decision boundary line. The different parallel lines show how changing the threshold from 0.5 to 0.1 or 0.9 shifts the boundary, making the model classify more or fewer points into the minority class. LDA 决策边界(幻灯片 51):这张图提供了强大的视觉直观性。它展示了两个类别的数据点和决策边界线。不同的平行线显示了将阈值从 0.5 更改为 0.1 或 0.9 时边界如何移动,从而使模型将更多或更少的点归入少数类
  3. Error Rate Tradeoff Curve (Slide 53): This graph is the most important for understanding the business implication of changing the threshold. It clearly shows that as the threshold changes, the error rate for one class goes down while the error rate for the other goes up. The overall error is minimized at a certain point, but that may not be the optimal point from a business perspective. 错误率权衡曲线(幻灯片 53):这张图对于理解更改阈值的业务含义至关重要。它清楚地表明,随着阈值的变化,一个类别的错误率下降,而另一个类别的错误率上升。总体误差在某个点达到最小,但从业务角度来看,这可能并非最佳点。
  4. ROC Curve (Slides 54 & 55): The Receiver Operating Characteristic (ROC) curve plots Sensitivity vs. (1 - Specificity) for all possible thresholds. An ideal classifier has a curve that “hugs” the top-left corner, indicating high sensitivity and high specificity. It’s a standard way to visualize and compare the overall performance of different classifiers. ROC 曲线(幻灯片 54 和 55): 接收者操作特性 (ROC) 曲线绘制了所有可能阈值的灵敏度与(1 - 特异性)的关系。理想的分类器曲线“紧贴”左上角,表示高灵敏度和高特异性。这是可视化和比较不同分类器整体性能的标准方法。

7. Here is a summary of the provided slides on Linear and Quadratic Discriminant Analysis, including the key formulas, Python code equivalents, and explanations of the important concepts.

Key Goal: Classification

Both Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are classification algorithms. Their main goal is to find a decision boundary to separate different classes (e.g., “default” vs. “not default”) in the data. 线性判别分析 (LDA)二次判别分析 (QDA) 都是分类算法。它们的主要目标是找到一个决策边界来区分数据中的不同类别(例如,“默认”与“非默认”)。

## Linear Discriminant Analysis (LDA)

LDA creates a linear decision boundary between classes. LDA 在类别之间创建线性决策边界。

Core Idea (Fisher’s Interpretation)

Imagine you have data points for different classes in a 3D space. Fisher’s idea is to find the best angle to shine a “flashlight” on the data to project its shadow onto a 2D wall (or a 1D line). The “best” projection is the one where the shadows of the different classes are as far apart from each other as possible, while the shadows within each class are as tightly packed as possible. 想象一下,你在三维空间中拥有不同类别的数据点。Fisher 的思想是找到最佳角度,用“手电筒”照射数据,将其阴影投射到二维墙壁(或一维线上)。 “最佳”投影是不同类别的阴影彼此之间尽可能远,而每个类别内的阴影尽可能紧密的投影。

  • Maximize: The distance between the means of the projected classes (Between-Class Variance). 投影类别均值之间的距离(类间方差)。
  • Minimize: The spread or variance within each projected class (Within-Class Variance). 每个投影类别内的扩散或方差(类内方差)。

This is the most important image for understanding the intuition behind LDA. It shows how projecting the data onto a specific line (defined by vector w) can make the two classes clearly separable. 这是理解LDA背后直觉的最重要图像。它展示了如何将数据投影到特定直线(由向量 w 定义)上,从而使两个类别清晰可分。

Key Mathematical Formulas

To achieve this, LDA maximizes a ratio called the Rayleigh quotient. LDA最大化一个称为瑞利商的比率。

  1. Within-Class Covariance (\(\hat{\Sigma}_W\)): Measures the spread of data inside each class. 类内协方差 (\(\hat{\Sigma}_W\)):衡量每个类别内部数据的扩散程度。 \[\hat{\Sigma}_W = \frac{1}{n-K} \sum_{k=1}^{K} \sum_{i: y_i=k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top\]
  2. Between-Class Covariance (\(\hat{\Sigma}_B\)): Measures the spread between the means of different classes. 类间协方差 (\(\hat{\Sigma}_B\)):衡量不同类别均值之间的差异。 \[\hat{\Sigma}_B = \sum_{k=1}^{K} n_k (\hat{\mu}_k - \hat{\mu})(\hat{\mu}_k - \hat{\mu})^\top\]
  3. Objective Function: Find the projection vector \(w\) that maximizes the ratio of between-class variance to within-class variance. 目标函数:找到投影向量 \(w\),使类间方差与类内方差之比最大化。 \[\max_w \frac{w^\top \hat{\Sigma}_B w}{w^\top \hat{\Sigma}_W w}\]
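A small sketch (not from the slides) of Fisher's criterion: the optimal \(w\) is the leading eigenvector of the generalized eigenproblem \(\hat{\Sigma}_B w = \lambda \hat{\Sigma}_W w\), which scipy.linalg.eigh can solve directly; the toy clusters are made up for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_direction(X, y):
    """Projection vector w maximizing w^T S_B w / w^T S_W w
    (leading eigenvector of the generalized eigenproblem S_B w = lambda S_W w)."""
    classes = np.unique(y)
    n, p = X.shape
    mu = X.mean(axis=0)
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        S_W += (Xk - mu_k).T @ (Xk - mu_k)            # within-class scatter
        d = (mu_k - mu).reshape(-1, 1)
        S_B += len(Xk) * (d @ d.T)                     # between-class scatter
    S_W /= (n - len(classes))
    eigvals, eigvecs = eigh(S_B, S_W)                  # generalized symmetric eigenproblem
    return eigvecs[:, -1]                              # eigenvector of the largest eigenvalue

# Toy usage: two elongated, overlapping Gaussian clusters (illustration only)
rng = np.random.default_rng(2)
X = np.vstack([rng.multivariate_normal([0, 0], [[3, 1], [1, 1]], 200),
               rng.multivariate_normal([3, 3], [[3, 1], [1, 1]], 200)])
y = np.array([0] * 200 + [1] * 200)
print("projection direction w:", np.round(fisher_direction(X, y), 3))
```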

LDA’s Main Assumption

The key assumption of LDA is that all classes share the same covariance matrix (\(\Sigma\)). They can have different means (\(\mu_k\)), but their spread and orientation must be identical. This assumption is what results in a linear decision boundary. LDA 的关键假设是所有类别共享相同的协方差矩阵 (\(\Sigma\))。它们可以具有不同的均值 (\(\mu_k\)),但它们的散度和方向必须相同。正是这一假设导致了线性决策边界。

## Quadratic Discriminant Analysis (QDA)

QDA is a more flexible extension of LDA that creates a quadratic (curved) decision boundary. QDA 是 LDA 的更灵活的扩展,它创建了二次(曲线)决策边界。

#### Core Idea & Key Assumption

QDA starts with the same principles as LDA but drops the key assumption. QDA assumes that each class has its own unique covariance matrix (\(\Sigma_k\)). QDA 的原理与 LDA 相同,但放弃了关键假设。QDA 假设每个类别都有自己独特的协方差矩阵 (\(\Sigma_k\))

This means each class can have its own spread, shape, and orientation. This additional flexibility allows for a more complex, curved decision boundary. 这意味着每个类别可以拥有自己的散度、形状和方向。这种额外的灵活性使得决策边界更加复杂、曲线化。

Key Mathematical Formula

The classification is made using a discrimination function, \(\delta_k(x)\). We assign a data point \(x\) to the class \(k\) for which \(\delta_k(x)\) is largest. The function for QDA is: \[\delta_k(x) = -\frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log(|\Sigma_k|) + \log \pi_k\] The term containing \(x^\top \Sigma_k^{-1} x\) makes this function a quadratic function of \(x\).

## LDA vs. QDA: The Trade-Off

The choice between LDA and QDA is a classic bias-variance trade-off. 在 LDA 和 QDA 之间进行选择是典型的偏差-方差权衡

  • Use LDA when:

    • The assumption of a common covariance matrix is reasonable (the classes have similar shapes).
    • You have a small amount of training data, as LDA is less prone to overfitting.
    • Simplicity is preferred. LDA is less flexible (high bias) but has lower variance.
    • 假设共同协方差矩阵是合理的(类别具有相似的形状)。
    • 训练数据量较少,因为 LDA 不易过拟合。
    • 简洁是首选。LDA 灵活性较差(偏差较大),但方差较小。
  • Use QDA when:

    • The classes have clearly different shapes and spreads (different covariance matrices).
    • You have a large amount of training data to properly estimate the separate covariance matrices for each class.
    • QDA is more flexible (low bias) but can have high variance, meaning it might overfit on smaller datasets.
    • 类别具有明显不同的形状和分布(不同的协方差矩阵)。
    • 拥有大量训练数据,可以正确估计每个类别的独立协方差矩阵。
    • QDA 更灵活(偏差较小),但方差较大,这意味着它可能在较小的数据集上过拟合。

Rule of Thumb: If the class variances are equal or close, LDA is better. Otherwise, QDA is better. 经验法则:如果类别方差相等或接近,则 LDA 更佳。否则,QDA 更好。

## Code Understanding (Python Equivalent)

The slides show code in R. Here’s how you would perform LDA and evaluate it in Python using the popular scikit-learn library.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc
import matplotlib.pyplot as plt

# Assume 'df' is your DataFrame with features and a 'target' column
# X = df.drop('target', axis=1)
# y = df['target']

# 1. Split data into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Fit an LDA model (equivalent to lda() in R)
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# 3. Make predictions (equivalent to predict() in R)
y_pred_lda = lda.predict(X_test)

# To fit a QDA model, the process is identical:
# qda = QuadraticDiscriminantAnalysis()
# qda.fit(X_train, y_train)
# y_pred_qda = qda.predict(X_test)

# 4. Create a confusion matrix (equivalent to table())
print("LDA Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_lda))

# 5. Plot the ROC Curve (equivalent to the R code for ROC)
# Get prediction probabilities for the positive class
y_pred_proba = lda.predict_proba(X_test)[:, 1]

# Calculate ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Calculate Area Under the Curve (AUC)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'LDA ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--') # Random guess line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

Understanding the ROC Curve

The ROC Curve is another important image. It helps you visualize a classifier’s performance across all possible classification thresholds. ROC 曲线 是另一个重要的图像。它可以帮助您直观地了解分类器在所有可能的分类阈值下的性能。

  • The Y-axis is the True Positive Rate (Sensitivity): “Of all the actual positives, how many did we correctly identify?”
  • The X-axis is the False Positive Rate: “Of all the actual negatives, how many did we incorrectly label as positive?”
  • A perfect classifier would have a curve that goes straight up to the top-left corner (100% TPR, 0% FPR). The diagonal line represents a random guess. The Area Under the Curve (AUC) summarizes the model’s performance; a value closer to 1.0 is better.
  • Y 轴 表示真阳性率(敏感度):“在所有实际的阳性样本中,我们正确识别了多少个?”
  • X 轴 表示假阳性率:“在所有实际的阴性样本中,我们错误地将多少个标记为阳性?”
  • 一个完美的分类器应该有一条直线上升到左上角的曲线(真阳性率 100%,假阳性率 0%)。对角线表示随机猜测。曲线下面积 (AUC) 概括了模型的性能;该值越接近 1.0 越好。

8. Here is a summary of the provided slides on Quadratic Discriminant Analysis (QDA), including the key formulas, code explanations with Python equivalents, and a guide to the most important images.

## Core Concept: QDA vs. LDA

The main difference between Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) lies in their assumptions about the data. 线性判别分析 (LDA) 和二次判别分析 (QDA) 的主要区别在于它们对数据的假设。

  • LDA assumes that all classes share the same covariance matrix (\(\Sigma\)). It models each class as a normal distribution with a different mean (\(\mu_k\)) but the same shape and orientation. This results in a linear decision boundary between classes. LDA 假设所有类别共享相同的协方差矩阵 (\(\Sigma\))。它将每个类别建模为均值不同 (\(\mu_k\)) 但形状和方向相同的正态分布。这会导致类别之间出现线性决策边界。
  • QDA is more flexible. It assumes that each class \(k\) has its own, separate covariance matrix (\(\Sigma_k\)). This allows each class’s distribution to have a unique shape, size, and orientation. This flexibility results in a quadratic decision boundary (like a parabola, hyperbola, or ellipse). QDA 更灵活。它假设每个类别 \(k\) 都有其独立的协方差矩阵 (\(\Sigma_k\))。这使得每个类别的分布具有独特的形状、大小和方向。这种灵活性导致了二次决策边界(类似于抛物线、双曲线或椭圆)。

Analogy 💡: Imagine you’re drawing boundaries around different clusters of stars. LDA gives you only straight lines to separate the clusters. QDA gives you curved lines (circles, ellipses), which can create a much better fit if the clusters themselves are elliptical and point in different directions. 想象一下,你正在围绕不同的星团绘制边界。LDA 只提供直线来分隔星团。QDA 提供曲线(圆形、椭圆形),如果星团本身是椭圆形且指向不同的方向,则可以产生更好的拟合效果。

## The Math Behind QDA

QDA classifies a new observation \(x\) to the class \(k\) that has the highest discriminant score, \(\delta_k(x)\). The formula for this score is what makes the boundary quadratic. QDA 将新的观测值 \(x\) 归类到具有最高判别分数 \(\delta_k(x)\) 的类 \(k\) 中。该分数的公式使得边界具有二次项。

The discriminant function for class \(k\) is: \[\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log(|\Sigma_k|) + \log(\pi_k)\]

Let’s break it down:

  • \((x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)\): This is a quadratic term (since it involves \(x^T \Sigma_k^{-1} x\)). It measures the squared Mahalanobis distance from \(x\) to the class mean \(\mu_k\), scaled by that class’s specific covariance \(\Sigma_k\).
  • \(\log(|\Sigma_k|)\): A term that penalizes classes with larger variance.
  • \(\log(\pi_k)\): The prior probability of class \(k\). This is our initial belief about how likely class \(k\) is, before seeing the data.
    • \((x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)\):这是一个二次项(因为它涉及 \(x^T \Sigma_k^{-1} x\))。它测量从 \(x\) 到类均值 \(\mu_k\) 的平方马氏距离,并根据该类的特定协方差 \(\Sigma_k\) 进行缩放。
    • \(\log(|\Sigma_k|)\):用于惩罚方差较大的类的项。
    • \(\log(\pi_k)\):类 \(k\) 的先验概率。这是我们在看到数据之前对类 \(k\) 可能性的初始信念。

Because each class \(k\) has its own \(\Sigma_k\), the quadratic term doesn’t cancel out when comparing scores between classes, leading to a quadratic boundary. 由于每个类 \(k\) 都有其自己的 \(\Sigma_k\),因此在比较类之间的分数时,二次项不会抵消,从而导致二次边界。

Key Trade-off:
  • If the class variances (\(\Sigma_k\)) are truly different, QDA is better.
  • If the class variances are similar, LDA is often better because it’s less flexible and less likely to overfit, especially with a small number of training samples.
  • 如果类方差 (\(\Sigma_k\)) 确实不同,QDA 更好
  • 如果类方差相似,LDA 通常更好,因为它的灵活性较差,并且不太可能过拟合,尤其是在训练样本数量较少的情况下。
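A minimal sketch (not from the slides) of evaluating \(\delta_k(x)\) with class-specific covariances; the means, covariances and priors below are hypothetical.

```python
import numpy as np

def qda_score(x, mu_k, Sigma_k, pi_k):
    """Discriminant score delta_k(x) with a class-specific covariance Sigma_k."""
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)             # log|Sigma_k| (numerically stable)
    maha = diff @ np.linalg.solve(Sigma_k, diff)       # (x-mu_k)^T Sigma_k^{-1} (x-mu_k)
    return -0.5 * maha - 0.5 * logdet + np.log(pi_k)

# Two hypothetical classes with different covariances (illustration only)
mu = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = [np.array([[1.0, 0.0], [0.0, 1.0]]),
         np.array([[2.0, 0.8], [0.8, 0.5]])]
pi = [0.7, 0.3]

x_new = np.array([1.5, 1.0])
scores = [qda_score(x_new, mu[k], Sigma[k], pi[k]) for k in range(2)]
print("scores:", np.round(scores, 3), "-> class", int(np.argmax(scores)))
```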

## Code Implementation: R and Python

The slides provide R code for fitting a QDA model and evaluating it. Below is an explanation of the R code and its equivalent in Python using the popular scikit-learn library.

R Code (from the slides)

The code uses the MASS library for QDA and the ROCR library for evaluation.

# ######## QDA ##########

# 1. Fit the model on the training data
# This formula `Default~.` means "predict 'Default' using all other variables".
qda.fit.mod2 <- qda(Default~., data=Default, subset=train.ids)

# 2. Make predictions on the test data
# We are interested in the posterior probabilities for the ROC curve
qda.fit.pred3 <- predict(qda.fit.mod2, Default_test)$posterior[,2]

# 3. Evaluate using ROC and AUC
# 'prediction' and 'performance' are functions from the ROCR library
perf <- performance(prediction(qda.fit.pred3, Default_test$Default), "auc")

# 4. Get the AUC value
auc_value <- perf@y.values[[1]]
# Result from slide: 0.9638683

Python Equivalent (scikit-learn)

Here’s how you would perform the same steps in Python.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Assume 'Default' is your DataFrame and 'default' is the target column
# (preprocessing 'student' and 'default' columns to numbers)
# Default['default_num'] = Default['default'].apply(lambda x: 1 if x == 'Yes' else 0)
# X = Default[['balance', 'income', ...]]
# y = Default['default_num']

# 1. Split data into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# 2. Initialize and fit the QDA model
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

# 3. Predict probabilities on the test set
# We need the probability of the positive class ('Yes') for the AUC calculation
y_pred_proba = qda.predict_proba(X_test)[:, 1]

# 4. Calculate the AUC score
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"AUC Score for QDA: {auc_score:.7f}")

# You can also plot the ROC curve
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
# plt.plot(fpr, tpr)
# plt.show()

## Model Evaluation: ROC and AUC

The slides correctly emphasize using the ROC curve and the Area Under the Curve (AUC) to compare model performance.

  • ROC Curve (Receiver Operating Characteristic): This plot shows how well a model can distinguish between two classes. It plots the True Positive Rate (y-axis) against the False Positive Rate (x-axis) at all possible classification thresholds. A better model has a curve that is closer to the top-left corner.

  • AUC (Area Under the Curve): This is a single number that summarizes the entire ROC curve.

    • AUC = 1: Perfect classifier.
    • AUC = 0.5: A useless classifier (equivalent to random guessing).
    • AUC > 0.7: Generally considered an acceptable model.
  • ROC 曲线(接收者操作特征):此图显示了模型区分两个类别的能力。它绘制了所有可能的分类阈值下的 真阳性率(y 轴)与 假阳性率(x 轴)的对比图。更好的模型的曲线越靠近左上角,效果就越好。

    • AUC(曲线下面积):这是一个概括整个 ROC 曲线的数值。

    • AUC = 1:完美的分类器。

    • AUC = 0.5:无用的分类器(相当于随机猜测)。

    • AUC > 0.7:通常被认为是可接受的模型。

The slides show that for the Default dataset, LDA’s AUC (0.9647) was slightly higher than QDA’s (0.9639). This suggests that the assumption of a common covariance matrix (LDA) was a slightly better fit for this particular test set, possibly because QDA’s extra flexibility wasn’t needed and it may have slightly overfit the training data. 这表明,对于这个特定的测试集,公共协方差矩阵 (LDA) 的假设拟合度略高,可能是因为 QDA 的额外灵活性并非必需,并且可能对训练数据略微过拟合。

## Key Takeaways and Important Images

Here’s a ranking of the most important visual aids in your slides:

  1. Slide 68/69 (Model Assumption & Formula): These are the most critical slides. They present the core theoretical difference between LDA and QDA and provide the mathematical foundation (the discriminant function formula). Understanding these is key to understanding QDA.

  2. Slide 73 (ROC Comparison): This is the most important image for practical evaluation. It visually compares the performance of LDA and QDA side-by-side, making it easy to see which one performs better on this specific dataset. The concept of AUC is introduced here as the method for comparison.

  3. Slide 71 (Decision Boundaries with Different Thresholds): This is an excellent conceptual image. It shows how the quadratic decision boundary (the curved lines) separates the data points. It also illustrates how changing the probability threshold (from 0.1 to 0.5 to 0.9) shifts the boundary, trading off between precision and recall.

Of course. Here is a summary of the remaining slides, which compare QDA to other popular classification models like Logistic Regression and K-Nearest Neighbors (KNN).


Visualizing the Core Trade-off: LDA vs. QDA

This is the most important concept in these slides. The choice between LDA and QDA depends entirely on the underlying structure of your data.

The slide shows two scenarios:

  1. Left Plot (\(\Sigma_1 = \Sigma_2\)): When the true covariance matrices of the classes are the same, the optimal decision boundary (the Bayes classifier) is a straight line. LDA, which assumes equal covariances, creates a linear boundary that approximates this optimal boundary very well. QDA’s flexible, curved boundary is unnecessarily complex and might overfit the training data. In this case, LDA is better.
  2. Right Plot (\(\Sigma_1 \neq \Sigma_2\)): When the true covariance matrices are different, the optimal decision boundary is a curve. QDA’s quadratic model can capture this non-linearity much better than LDA’s rigid linear model. In this case, QDA is better.

This perfectly illustrates the bias-variance tradeoff. LDA has higher bias (it’s less flexible) but lower variance. QDA has lower bias (it’s more flexible) but higher variance.


Comparing Performance on the “Default” Dataset

The slides compare four different models on the same classification task. Let’s look at their performance using the Area Under the Curve (AUC), where a higher score is better.

  • LDA AUC: 0.9647
  • QDA AUC: 0.9639
  • Logistic Regression AUC: 0.9645
  • K-Nearest Neighbors (KNN): The plot shows test error vs. K. The error is lowest around K=4, but it’s not directly converted to an AUC score in the slides.

Interestingly, for this particular dataset, LDA, QDA, and Logistic Regression perform almost identically. This suggests that the decision boundary for this problem is likely very close to linear, meaning the extra flexibility of QDA isn’t providing much benefit.


Pros and Cons: Which Model to Choose?

The final slide asks for a comparison of the models. Here’s a summary of their key characteristics:

| Model | Type | Decision Boundary | Key Pro | Key Con |
| :--- | :--- | :--- | :--- | :--- |
| Logistic Regression | Parametric | Linear | Highly interpretable, no strong assumptions about data distribution. | Inflexible; cannot capture non-linear relationships. |
| Linear Discriminant Analysis (LDA) | Parametric | Linear | More stable than Logistic Regression when classes are well-separated. | Assumes data is normally distributed with equal covariance matrices for all classes. |
| Quadratic Discriminant Analysis (QDA) | Parametric | Quadratic (Curved) | More flexible than LDA; can model non-linear boundaries. | Requires more data to estimate parameters and is more prone to overfitting. Assumes normality. |
| K-Nearest Neighbors (KNN) | Non-Parametric | Highly Non-linear | Extremely flexible; makes no assumptions about the data’s distribution. | Can be slow on large datasets and suffers from the “curse of dimensionality.” Less interpretable. |

Summary of the Comparison:

  • Linear Models (Logistic Regression & LDA): Choose these for simplicity, interpretability, and when you believe the relationship between predictors and the class is linear. LDA often outperforms Logistic Regression if its normality assumptions are met.
  • Non-Linear Models (QDA & KNN): Choose these when the decision boundary is likely more complex. QDA is a good middle ground, offering more flexibility than LDA without being as completely data-driven as KNN. KNN is the most flexible but requires careful tuning of the parameter K to avoid overfitting or underfitting.

9. Here is a more detailed, slide-by-slide analysis of the presentation.

4.6 Four Classification Methods: Comparison by Simulation

This section (slides 81-87) introduces four classification methods and systematically compares their performance on six different simulated datasets. The goal is to see which method works best under different conditions (e.g., linear vs. non-linear boundaries, normal vs. non-normal data).

The four methods being compared are: * Logistic Regression: A linear method that models the log-odds as a linear function of the predictors. * Linear Discriminant Analysis (LDA): Another linear method. It also assumes a linear decision boundary but makes stronger assumptions than logistic regression (e.g., that data within each class is normally distributed with a common covariance matrix). * Quadratic Discriminant Analysis (QDA): A non-linear method. It assumes the log-odds are a quadratic function, which creates a more flexible, curved decision boundary. It assumes data within each class is normally distributed, but without a common covariance matrix. * K-Nearest Neighbors (KNN): A non-parametric, highly flexible method. Two versions are tested: * KNN-1 (\(K=1\)): A very flexible (high variance) model. * KNN-CV: A tuned model where the best \(K\) is chosen via cross-validation.

比较的四种方法是: * 逻辑回归:一种将对数概率建模为预测变量线性函数的线性方法。 * 线性判别分析 (LDA):另一种线性方法。它也假设线性决策边界,但比逻辑回归做出更强的假设(例如,每个类中的数据呈正态分布,且具有共同的协方差矩阵)。 * 二次判别分析 (QDA):一种非线性方法。它假设对数概率为二次函数,从而创建一个更灵活、更弯曲的决策边界。它假设每个类中的数据服从正态分布,但没有共同的协方差矩阵。 * K最近邻 (KNN):一种非参数化、高度灵活的方法。测试了两个版本: * KNN-1 (\(K=1\)):一个非常灵活(高方差)的模型。 * KNN-CV:一个经过调整的模型,通过交叉验证选择最佳的\(K\)
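A hedged sketch (not the slides' code) that reproduces the flavor of these simulations for one QDA-friendly setup, two Gaussian classes with different covariance matrices, and compares the test error of the same methods with scikit-learn; the sample sizes and covariances are made up for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def simulate(n_per_class):
    """Two Gaussian classes with different covariance matrices (a QDA-friendly setup)."""
    X0 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], n_per_class)
    X1 = rng.multivariate_normal([1, 1], [[1.0, -0.5], [-0.5, 1.0]], n_per_class)
    return np.vstack([X0, X1]), np.array([0] * n_per_class + [1] * n_per_class)

X, y = simulate(500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "Logistic": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN-1": KNeighborsClassifier(n_neighbors=1),
    # KNN-CV: choose K by cross-validation on the training set
    "KNN-CV": GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(1, 30))}, cv=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name:10s} test error = {1.0 - model.score(X_test, y_test):.3f}")
```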

Analysis of Simulation Scenarios

The performance is measured by the test error rate (lower is better), shown in the boxplots for each scenario. 性能通过测试错误率(越低越好)来衡量,每个场景的箱线图都显示了该错误率。

  • Scenario 1 (Slide 82):
    • Setup: A linear decision boundary. Data is normally distributed with uncorrelated predictors.
    • Result: LDA and Logistic Regression perform best. Their test error rates are low and similar. This is expected, as the setup perfectly matches their core assumption (linear boundary). QDA is slightly worse because its extra flexibility (being quadratic) is unnecessary. KNN-1 is the worst, as its high flexibility leads to high variance (overfitting).
    • 结果: LDA 和逻辑回归表现最佳。它们的测试错误率较低且相似。这是意料之中的,因为设置完全符合它们的核心假设(线性边界)。QDA 略差,因为其额外的灵活性(二次方)是不必要的。KNN-1 最差,因为其高灵活性导致方差较大(过拟合)。
  • Scenario 2 (Slide 83):
    • Setup: Same as Scenario 1 (linear boundary, normal data), but now the two predictors have a correlation of 0.5.
    • Result: Almost no change from Scenario 1. LDA and Logistic Regression are still the best. This shows that these linear methods are robust to correlation between predictors.
    • 结果:与场景 1 相比几乎没有变化LDA 和逻辑回归仍然是最佳。这表明这些线性方法对预测因子之间的相关性具有鲁棒性。
  • Scenario 3 (Slide 84):
    • Setup: A linear decision boundary, but the data is drawn from a t-distribution (which is non-normal and has “heavy tails,” or more extreme outliers).
    • Result: Logistic Regression is the clear winner. LDA’s performance gets worse because its assumption of normality is violated by the t-distribution. QDA’s performance deteriorates significantly due to the non-normality. This highlights a key difference: logistic regression is more robust to violations of the normality assumption.
    • 结果:逻辑回归明显胜出。LDA 的性能会变差,因为 t 分布违反了其正态性假设。QDA 的性能由于非正态性而显著下降。这凸显了一个关键区别:逻辑回归对违反正态性假设的情况更稳健。
  • Scenario 4 (Slide 85):
    • Setup: A quadratic decision boundary. Data is normally distributed with different correlations in each class.
    • Result: QDA is the clear winner by a large margin. This setup perfectly matches QDA’s assumption (quadratic boundary from normal data with different covariance structures). All other methods (LDA, Logistic, KNN) are linear or not flexible enough, so they perform poorly.
    • 结果:QDA 明显胜出,且遥遥领先。此设置完全符合 QDA 的假设(来自具有不同协方差结构的正态数据的二次边界)。所有其他方法(LDA、Logistic、KNN)都是线性的或不够灵活,因此性能不佳。
  • Scenario 5 (Slide 86):
    • Setup: Another quadratic boundary, but generated in a different way (using a logistic function of quadratic terms).
    • Result: QDA performs best again, closely followed by the flexible KNN-CV. The linear methods (LDA, Logistic) have poor performance because they cannot capture the curve.
    • 结果:QDA 再次表现最佳,紧随其后的是灵活的KNN-CV。线性方法(LDA、Logistic)性能较差,因为它们无法捕捉曲线。
  • Scenario 6 (Slide 87):
    • Setup: A complex, non-linear decision boundary (more complex than a simple quadratic curve).
    • Result: The flexible KNN-CV method is the winner. Its non-parametric nature allows it to approximate the complex shape. QDA is not flexible enough and performs worse. This slide highlights the bias-variance trade-off: the overly simple KNN-1 is the worst, but the tuned KNN-CV is the best.
    • 结果:灵活的 KNN-CV 方法胜出。其非参数特性使其能够近似复杂的形状。QDA 不够灵活,性能较差。这张幻灯片重点介绍了偏差-方差权衡:过于简单的 KNN-1 最差,而调整后的 KNN-CV 最好。

4.7 R Example on Smarket Data

This section (slides 88-93) applies Logistic Regression and LDA to the Smarket dataset from the ISLR package to predict the stock market’s Direction (Up or Down). 本节(幻灯片 88-93)将逻辑回归和 LDA 应用于“ISLR”包中的“Smarket”数据集,以预测股市的“方向”(上涨或下跌)。 ### Data Preparation (Slides 88, 89, 90)

  1. Load Data: The ISLR library is loaded, and the Smarket dataset is explored. It contains daily percentage returns (Lag1Lag5 for the previous 5 days, Today), Volume, and the Year.
  2. Explore Data: A correlation matrix (cor(Smarket[,-9])) is computed, and a plot of Volume over time is generated.
  3. Split Data: The data is split into a training set (Years 2001-2004) and a test set (Year 2005).
    • train <- (Year<2005)
    • Smarket.2005 <- Smarket[!train,]
    • Direction.2005 <- Direction[!train]
    • The test set has 252 observations.
  4. 加载数据:加载“ISLR”库,并探索“Smarket”数据集。该数据集包含每日百分比收益率(前 5 天的“Lag1”…“Lag5”,“今日”)、“成交量”和“年份”。
  5. 探索数据:计算相关矩阵 (cor(Smarket[,-9])),并生成“成交量”随时间变化的图表。
  6. 拆分数据:将数据拆分为训练集(年份 2001-2004)和测试集(年份 2005)。
    • train <- (Year<2005)
    • Smarket.2005 <- Smarket[!train,]
    • Direction.2005 <- Direction[!train]
    • 测试集包含 252 个观测值。

Model 1: Logistic Regression (All Predictors) (Slide 90)

  • Model: A logistic regression model is fit on the training data using all predictors.
    • glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=Smarket, family=binomial, subset=train)
  • Prediction: The model is used to predict the direction for the 2005 test data.
    • glm.probs <- predict(glm.fit, Smarket.2005, type="response")
    • A threshold of 0.5 is used to classify: if \(P(\text{Up}) > 0.5\), predict “Up”.
  • Results:
    • Test Error Rate: 0.5198 (or 48.0% accuracy).
    • Conclusion: This is “not good!”—it’s worse than flipping a coin. This suggests the model is either too complex or the predictors are not useful.

Model 2: Logistic Regression (Lag1 & Lag2) (Slide 91)

  • Model: Based on the poor results, a simpler model is tried, using only Lag1 and Lag2.
    • glm.fit <- glm(Direction ~ Lag1 + Lag2, data=Smarket, family=binomial, subset=train)
  • Prediction: Predictions are made on the 2005 test set.
  • Results:
    • Test Error Rate: 0.4404 (or 55.95% accuracy). This is an improvement.
    • Confusion Matrix:

      |           | True Down | True Up |
      | :-------- | :-------- | :------ |
      | Pred Down | 77        | 69      |
      | Pred Up   | 35        | 71      |
    • ROC and AUC: The ROC (Receiver Operating Characteristic) curve is plotted, and the AUC (Area Under the Curve) is calculated.
    • AUC Value: 0.5584. This is very close to 0.5 (which represents a random-chance model), indicating that the model has very weak predictive power, even though its accuracy is above 50%.

Model 3: LDA (Lag1 & Lag2) (Slide 92)

  • Model: LDA is now performed using the same setup: Lag1 and Lag2 as predictors, trained on the 2001-2004 data.
    • library(MASS)
    • lda.fit <- lda(Direction ~ Lag1 + Lag2, data=Smarket, subset=train)
  • Prediction: Predictions are made on the 2005 test set.
    • lda.pred <- predict(lda.fit, Smarket.2005)
  • Results:
    • Test Error Rate: 0.4404 (or 55.95% accuracy).
    • Confusion Matrix:

      |           | True Down | True Up |
      | :-------- | :-------- | :------ |
      | Pred Down | 77        | 69      |
      | Pred Up   | 35        | 71      |
    • Observation: The confusion matrix and accuracy are identical to the logistic regression model.

Final Comparison (Slide 93)

  • ROC and AUC for LDA: The ROC curve for the LDA model is plotted.
  • AUC Value: 0.5584.
  • Main Conclusion: As highlighted in the green box, “LDA has identical performance as Logistic regression!” In this specific practical example, using these two predictors, both linear methods produce the exact same confusion matrix, the same accuracy (56%), and the same AUC (0.558). This reinforces the theoretical idea that both are fitting a linear boundary.

最终比较(幻灯片 93)

  • LDA 的 ROC 和 AUC:绘制了 LDA 模型的 ROC 曲线。
  • AUC 值:0.5584。
  • 主要结论:如绿色方框所示,“LDA 的性能与 Logistic 回归相同!” 在这个具体的实际示例中,使用这两个预测变量,两种线性方法都产生了完全相同的混淆矩阵、相同的准确率(56%)和相同的 AUC(0.558)。这强化了两者均拟合线性边界的理论观点。

4.7 R Example on Smarket Data (Continued)

The previous slides showed that Logistic Regression and Linear Discriminant Analysis (LDA) had identical performance on the Smarket dataset (using Lag1 and Lag2), both achieving 56% test accuracy and an AUC of 0.558. The analysis now tests a more flexible method, QDA.

Model 3: QDA (Lag1 & Lag2) (Slides 94-95)

  • Model: A Quadratic Discriminant Analysis (QDA) model is fit on the same training data (2001-2004) using only the Lag1 and Lag2 predictors.
    • qda.fit <- qda(Direction ~ Lag1 + Lag2, data=Smarket, subset=train)
  • Prediction: The model is used to predict the market direction for the 2005 test set.
  • Results:
    • Test Accuracy: The model achieves a test accuracy of 0.5992 (or 60%).
    • AUC: The Area Under the Curve (AUC) for the QDA model is 0.562.
  • Conclusion: As the slide highlights, “QDA has better test performance than LDA and Logistic regression!”

Smarket Example Summary

| Method | Model Type | Test Accuracy | AUC |
| :--- | :--- | :--- | :--- |
| Logistic Regression | Linear | ~56% | 0.558 |
| LDA | Linear | ~56% | 0.558 |
| QDA | Quadratic | ~60% | 0.562 |

This practical example reinforces the lessons from the simulations (Section 4.6). The two linear methods (LDA, Logistic) had identical performance. The more flexible, non-linear QDA model performed better, suggesting that the true decision boundary between “Up” and “Down” (based on Lag1 and Lag2) is not perfectly linear.
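For reference, a Python sketch mirroring this R workflow, assuming the Smarket data has been exported to a CSV file (the file name "Smarket.csv" and the export step are assumptions, not from the slides):

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical export of the ISLR Smarket data, e.g. from R:
#   write.csv(Smarket, "Smarket.csv", row.names = FALSE)
smarket = pd.read_csv("Smarket.csv")
smarket["Up"] = (smarket["Direction"] == "Up").astype(int)

train = smarket["Year"] < 2005                    # 2001-2004 for training
test = ~train                                     # 2005 held out as the test set
X_train, y_train = smarket.loc[train, ["Lag1", "Lag2"]], smarket.loc[train, "Up"]
X_test, y_test = smarket.loc[test, ["Lag1", "Lag2"]], smarket.loc[test, "Up"]

for name, model in [("Logistic", LogisticRegression()),
                    ("LDA", LinearDiscriminantAnalysis()),
                    ("QDA", QuadraticDiscriminantAnalysis())]:
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name:8s} accuracy = {acc:.3f}, AUC = {auc:.3f}")
```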

4.8 Kernel LDA

This new section introduces an even more advanced non-linear method, Kernel LDA.

The Problem: Linear Inseparability (Slide 97)

The section starts with a clear visual example. A dataset of two concentric circles (a “donut” shape) is linearly inseparable. It is impossible to draw a single straight line to separate the inner (purple) class from the outer (yellow) class.

The Solution: The Kernel Trick (Slides 97, 99)

  1. Nonlinear Transformation: The data is “lifted” into a higher-dimensional feature space using a nonlinear transformation, \(x \mapsto \phi(x)\). In the example on the slide, the 2D data is transformed, and in this new space, the two classes become linearly separable.
  2. The “Kernel Trick”: The main idea (from slide 99) is that we don’t need to explicitly compute this complex transformation \(\phi(x)\). LDA (based on Fisher’s approach) only requires inner products of the data points. The “kernel trick” allows us to replace the inner product in the high-dimensional feature space (\(x_i^T x_j\)) with a simple kernel function, \(k(x_i, x_j)\), computed in the original, low-dimensional space.
    • An example of such a kernel is the Gaussian (RBF) kernel: \(k(x_i, x_j) \propto e^{-\|x_i - x_j\|^2 / \sigma^2}\).

Academic Foundations (Slide 98)

This method is based on foundational academic papers that generalized linear methods using kernels: * Fisher discriminant analysis with kernels (Mika, 1999) * Generalized Discriminant Analysis Using a Kernel Approach (Baudat, 2000) * Kernel principal component analysis (Schölkopf, 1997)

In short, Kernel LDA is an extension of LDA that uses the kernel trick to find a linear boundary in a high-dimensional feature space, which corresponds to a highly non-linear boundary in the original space.
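scikit-learn has no built-in Kernel LDA, but the lifting idea can be illustrated with an explicit map: on concentric circles, plain LDA fails, while adding the nonlinear feature \(x_1^2 + x_2^2\) (a hand-picked \(\phi(x)\)) makes the classes linearly separable; a kernel method achieves the same effect implicitly through \(k(x_i, x_j)\).

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# Plain LDA in the original 2-D space: the two circles are not linearly separable
lda_2d = LinearDiscriminantAnalysis().fit(X, y)
print("LDA accuracy in the original space:", lda_2d.score(X, y))        # around 0.5

# Explicit nonlinear lift phi(x) = (x1, x2, x1^2 + x2^2): a plane now separates them
X_lift = np.column_stack([X, (X ** 2).sum(axis=1)])
lda_lift = LinearDiscriminantAnalysis().fit(X_lift, y)
print("LDA accuracy after the lift:      ", lda_lift.score(X_lift, y))  # close to 1.0
```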

1. QM9 数据集的XYZ格式详解

这个数据集使用的 “XYZ-like” 格式是一种扩展的、非标准的XYZ格式

| 行号 | 内容 | 解释 |
| :--- | :--- | :--- |
| 第 1 行 | na | 一个整数,代表分子中的原子总数。 |
| 第 2 行 | Properties 1-17 | 包含17个理化性质的数值,用制表符或空格分隔。 |
| 第 3 到 na+2 行 | Element x y z charge | 每行代表一个原子。依次是:元素符号、x/y/z坐标(单位:埃)、Mulliken部分电荷(单位:e)。 |
| 第 na+3 行 | Frequencies | 分子的振动频率(3na-5或3na-6个)。 |
| 第 na+4 行 | SMILES_GDB9 SMILES_relaxed | 来自GDB9的SMILES字符串和弛豫后几何构型的SMILES字符串。 |
| 第 na+5 行 | InChI_GDB9 InChI_relaxed | 对应的InChI字符串。 |

与标准XYZ格式对比: * 标准格式只有第1行(原子数)、第2行(注释)和后续的原子坐标行(仅含元素和xyz坐标)。 * QM9格式在第2行插入了大量属性数据,在原子坐标行增加了电荷列,并在文件末尾附加了频率、SMILES和InChI信息。
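As a quick illustration of this layout (a sketch; the path is the same example file used in the visualization code later in this post):

```python
# Read one QM9 file and slice it according to the table above.
file_path = "/root/QM9/QM9/Data_for_6095_constitutional_isomers_of_C7H10O2.xyz/dsC7O2H10nsd_0001.xyz"

with open(file_path) as f:
    lines = f.readlines()

na = int(lines[0])                       # line 1: number of atoms
properties = lines[1].split()            # line 2: the 17-field property record
atom_lines = lines[2:2 + na]             # lines 3 .. na+2: element, x, y, z, charge
frequencies = lines[2 + na].split()      # line na+3: vibrational frequencies
smiles = lines[3 + na].split()           # line na+4: SMILES_GDB9, SMILES_relaxed
inchi = lines[4 + na].split()            # line na+5: InChI_GDB9, InChI_relaxed
print(na, len(properties), len(frequencies))
```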

2. readme

  1. Core contents of the dataset:
    • It contains quantum-chemistry data for 133,885 small organic molecules composed of H, C, N, O, and F.
    • All molecular geometries were optimized at the DFT/B3LYP/6-31G(2df,p) level.
    • dsC7O2H10nsd.xyz.tar.bz2 is a subset of the dataset containing the 6,095 constitutional isomers of C₇H₁₀O₂, whose energetics were computed at the higher-accuracy G4MP2 level of theory.
  2. File structure and format:
    • Each molecule is stored in its own .xyz file, using the extended, non-standard XYZ format described above.
    • The 17 physicochemical properties recorded on line 2 are listed in detail, including the rotational constants (A, B, C), dipole moment (mu), HOMO/LUMO energies, zero-point vibrational energy (zpve), internal energy (U), enthalpy (H), and Gibbs free energy (G).
  3. Data source and computational methods:
    • The molecules originate from the GDB-9 chemical database.
    • Two levels of theory are used: B3LYP for most properties, and G4MP2 for the energies of the C₇H₁₀O₂ subset.
  4. Citation requirement:
    • The readme explicitly asks users of the dataset to cite the 2014 Scientific Data paper by Raghunathan Ramakrishnan et al.
  5. Other information:
    • It documents a few auxiliary files (e.g. validation.txt, uncharacterized.txt).
    • It notes that a handful of molecules were difficult to converge during geometry optimization.

3. Visualization

import ase.io
import nglview as nv
import io

def parse_qm9_xyz(file_path):
    """
    Parses a QM9 extended XYZ file and returns a standard XYZ string.
    """
    with open(file_path, 'r') as f:
        lines = f.readlines()

    # First line is the number of atoms
    num_atoms = int(lines[0].strip())

    # The next line is properties (skip it)
    # The next num_atoms lines are the coordinates
    coord_lines = lines[2:2 + num_atoms]

    # Rebuild a standard XYZ format string in memory
    standard_xyz = f"{num_atoms}\n"
    standard_xyz += "Comment line\n"  # Add a standard comment line
    for line in coord_lines:
        parts = line.split()
        # Keep only the element and the x, y, z coordinates
        standard_xyz += f"{parts[0]} {parts[1]} {parts[2]} {parts[3]}\n"

    return standard_xyz

# Path to your data file
file_path = "/root/QM9/QM9/Data_for_6095_constitutional_isomers_of_C7H10O2.xyz/dsC7O2H10nsd_0001.xyz"

# 1. Parse the special file format into a standard XYZ string
standard_xyz_data = parse_qm9_xyz(file_path)

# 2. ASE reads the standard XYZ data from the string variable
# We use io.StringIO to make the string behave like a file
atoms = ase.io.read(io.StringIO(standard_xyz_data), format="xyz")

# 3. Create the nglview visualization widget
view = nv.show_ase(atoms)
view.add_ball_and_stick()

# Display the widget in the notebook output
view

  1. The parsing function parse_qm9_xyz:
    • Purpose: a dedicated helper for the QM9-specific format; the body is clear and easy to reuse.
    • Reading the file: with open(...) opens the file safely, and f.readlines() reads every line into the list lines at once.
    • Extracting the atom count: num_atoms = int(lines[0].strip()) reads the first line (lines[0]), strips any surrounding whitespace (.strip()), and converts it to an integer. This is needed to build the standard XYZ output.
    • Extracting the coordinates: coord_lines = lines[2:2+num_atoms]. The coordinate block starts on line 3 (index 2) and spans num_atoms lines, so the slice extracts exactly the atom lines while skipping the property record on line 2.
    • Building the standard XYZ string:
      • Create a new string named standard_xyz.
      • First write the atom count and a newline.
      • Then add a standard comment line ("Comment line"), which the standard XYZ format requires.
      • Finally, iterate over coord_lines. Each line is split with .split() into its parts (e.g. ['C', 'x', 'y', 'z', 'charge']); only the first four (element symbol and xyz coordinates) are kept and rejoined into a new line, dropping the trailing Mulliken charge column.
    • Return value: the function returns a clean string in standard XYZ format.
  2. Main program flow:
    • Call the function: standard_xyz_data = parse_qm9_xyz(file_path) converts the file into a standard-format string.
    • Read from memory: ase.io.read(io.StringIO(standard_xyz_data), format="xyz"). io.StringIO wraps the string standard_xyz_data so it behaves like an in-memory text file, letting ase.io.read consume it directly without first writing the cleaned data to a temporary file, saving disk I/O.
    • Visualization: the remaining code (nv.show_ase etc.) then works as originally intended, because atoms has been created from clean, standard data.

C7O2H10

Fusing Sequence and Structural Information for Unified Protein Representation Learning

FusionProt

1 Protein representation learning:

  • Content:

FusionProt: a learnable fusion token plus iterative, bidirectional information exchange lets the sequence and structure representations learn together dynamically, rather than being concatenated statically.

2. 1D amino-acid sequence and 3D spatial structure:

  • Single-modality reliance: ProteinBERT and ESM-2 are sequence-only.

  • Static-fusion limitation: ESM-GearNet and SaProt combine sequence and structure, but only through a "one-way / one-shot" fusion.


3. Model overview

(Figure: the FusionProt fusion architecture)

1
2
3
4
5
6
7
8
@R.register("models.FusionNetwork")
class FusionNetwork(nn.Module, core.Configurable):

    def __init__(self, sequence_model, structure_model, fusion="series", cross_dim=None):
        super(FusionNetwork, self).__init__()
        self.sequence_model = sequence_model
        self.structure_model = structure_model
        self.output_dim = sequence_model.output_dim + structure_model.output_dim
        self.inject_step = 5  # (sequence_layers / structure_layers) layers

  • class FusionNetwork(...): defines the model class, which inherits from PyTorch's base module nn.Module.
  • __init__(...): the constructor receives already-initialized sequence_model and structure_model instances as inputs.
  • self.output_dim: the dimension of the model's final output features; since the two models' features are concatenated at the end, it is the sum of their output dimensions.
  • self.inject_step = 5: the frequency of information "injection"/exchange. With the value 5, an exchange is performed after every 5 layers of the sequence model.
# Structure embeddings layer
raw_input_dim = 21 # amino acid tokens
self.structure_embed_linear = nn.Linear(raw_input_dim, structure_model.input_dim)
self.embedding_batch_norm = nn.BatchNorm1d(structure_model.input_dim)
  • self.structure_embed_linear: a linear layer that maps the raw structural input (e.g. one-hot encodings over 21 amino-acid tokens) to the input dimension expected by the structure model (a GNN).
  • self.embedding_batch_norm: a batch-normalization layer that stabilizes training of the structure embedding.
# Normal Initialization of the 3D structure token
structure_token = nn.Parameter(torch.Tensor(structure_model.input_dim).unsqueeze(0))
nn.init.normal_(structure_token, mean=0.0, std=0.01)
self.structure_token = nn.Parameter(structure_token.squeeze(0))
  • self.structure_token: a learnable vector (nn.Parameter). This "token" does not correspond to any real atom or residue; it is an abstract carrier. During training it learns to encode the protein's global 3D structural information, acting as an information messenger.
# Linear Transformation between structure to sequential spaces
self.structure_linears = nn.ModuleList([...])
self.seq_linears = nn.ModuleList([...])
  • self.structure_linears / self.seq_linears: the sequence model and the structure model may use internal feature vectors of different dimensions. When the 3D token is passed between the two models, these linear layers translate its representation from one model's feature space to the other's, as sketched below.
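The exact construction of these projection lists is elided in the snippet above; one plausible sketch, with all dimensions and the number of exchange points chosen purely for illustration:

```python
import torch.nn as nn

num_exchanges = 4                  # hypothetical: one pair of projections per injection point
seq_dim, struct_dim = 1280, 512    # hypothetical hidden sizes of the ESM and GNN models

# sequence space -> structure space (used when the 3D token leaves the Transformer)
seq_linears = nn.ModuleList([nn.Linear(seq_dim, struct_dim) for _ in range(num_exchanges)])
# structure space -> sequence space (used when the updated token is inserted back)
structure_linears = nn.ModuleList([nn.Linear(struct_dim, seq_dim) for _ in range(num_exchanges)])
```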

4. Forward pass

def forward(self, graph, input, all_loss=None, metric=None):
    # Build a new protein graph with the 3D token (the last node)
    new_graph = self.build_protein_graph_with_3d_token(graph)
  • The forward pass first calls a helper that rewrites the input protein graph: it adds a new node representing the 3D token and connects that node to every other node in the graph.
Sequence model initialization
# Sequence (ESM) model initialization
sequence_input = self.sequence_model.mapping[graph.residue_type]
sequence_input[sequence_input == -1] = graph.residue_type[sequence_input == -1]
size = graph.num_residues

# Check if sequence size is not bigger than max seq length
if (size > self.sequence_model.max_input_length).any():
    starts = size.cumsum(0) - size
    size = size.clamp(max=self.sequence_model.max_input_length)
    ends = starts + size
    mask = functional.multi_slice_mask(starts, ends, graph.num_residues)
    sequence_input = sequence_input[mask]
    graph = graph.subresidue(mask)
size_ext = size

# BOS == CLS
if self.sequence_model.alphabet.prepend_bos:
    bos = torch.ones(graph.batch_size, dtype=torch.long, device=self.sequence_model.device) * self.sequence_model.alphabet.cls_idx
    sequence_input, size_ext = functional._extend(bos, torch.ones_like(size_ext), sequence_input, size_ext)

if self.sequence_model.alphabet.append_eos:
    eos = torch.ones(graph.batch_size, dtype=torch.long, device=self.sequence_model.device) * self.sequence_model.alphabet.eos_idx
    sequence_input, size_ext = functional._extend(sequence_input, size_ext, eos, torch.ones_like(size_ext))

# Padding
tokens = functional.variadic_to_padded(sequence_input, size_ext, value=self.sequence_model.alphabet.padding_idx)[0]
repr_layers = [self.sequence_model.repr_layer]
assert tokens.ndim == 2
padding_mask = tokens.eq(self.sequence_model.model.padding_idx)  # B, T
  • Standard preprocessing of the sequence data for a Transformer model such as ESM.
  • This includes adding beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens and padding all sequences to a common length so they can be batched.
Model initialization and the first fusion
# Sequence embedding layer
x = self.sequence_model.model.embed_scale * self.sequence_model.model.embed_tokens(tokens)

if self.sequence_model.model.token_dropout:
    x.masked_fill_((tokens == self.sequence_model.model.mask_idx).unsqueeze(-1), 0.0)
    # x: B x T x C
    mask_ratio_train = 0.15 * 0.8
    src_lengths = (~padding_mask).sum(-1)
    mask_ratio_observed = (tokens == self.sequence_model.model.mask_idx).sum(-1).to(x.dtype) / src_lengths
    x = x * (1 - mask_ratio_train) / (1 - mask_ratio_observed)[:, None, None]

# Structure model initialization
structure_hiddens = []
batch_size = graph.batch_size
structure_embedding = self.embedding_batch_norm(self.structure_embed_linear(input))
structure_token_batched = self.structure_token.unsqueeze(0).expand(batch_size, -1)
structure_input = torch.cat([structure_embedding.squeeze(1), structure_token_batched], dim=0)

# Add the 3D token representation
structure_token_expanded = self.structure_token.unsqueeze(0).expand(x.size(0), -1).unsqueeze(1)
x = torch.cat((x[:, :-1], structure_token_expanded, x[:, -1:]), dim=1)
padding_mask = torch.cat([padding_mask[:, :-1],
                          torch.zeros(padding_mask.size(0), 1).to(padding_mask), padding_mask[:, -1:]], dim=1)
size_ext += 1

if padding_mask is not None:
    x = x * (1 - padding_mask.unsqueeze(-1).type_as(x))

repr_layers = set(repr_layers)
hidden_representations = {}
if 0 in repr_layers:
    hidden_representations[0] = x

# (B, T, E) => (T, B, E)
x = x.transpose(0, 1)
if not padding_mask.any():
    padding_mask = None
  • Inserting the 3D token into the sequence.
    1. Generate the initial token embeddings x for the sequence.
    2. Insert the initial state of self.structure_token into the sequence embedding x, placed just before the end-of-sequence (EOS) token.
    3. The sequence model therefore sees an input of the form [BOS, residue 1, residue 2, ..., residue N, **3D token**, EOS].
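A toy illustration of this insertion step in plain PyTorch (not FusionProt code; the shapes are made up):

```python
import torch

B, T, C = 2, 6, 8                              # batch, length (incl. BOS/EOS), hidden size
x = torch.randn(B, T, C)                       # embeddings for [BOS, r1, r2, r3, r4, EOS]
token = torch.randn(C)                         # in the real model this is an nn.Parameter

token_expanded = token.unsqueeze(0).expand(B, -1).unsqueeze(1)    # (B, 1, C)
x_new = torch.cat((x[:, :-1], token_expanded, x[:, -1:]), dim=1)  # insert just before EOS
print(x_new.shape)                             # torch.Size([2, 7, 8])
```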
Fusion loop
for seq_layer_idx, seq_layer in enumerate(self.sequence_model.model.layers):
    x, attn = seq_layer(
        x,
        self_attn_padding_mask=padding_mask,
        need_head_weights=False,
    )
    if (seq_layer_idx + 1) in repr_layers:
        hidden_representations[seq_layer_idx + 1] = x.transpose(0, 1)
  • The model iterates over every layer of the sequence model (the Transformer encoder layers); x is updated at each layer.
if seq_layer_idx > 0 and seq_layer_idx % self.inject_step == 0:
  • Information injection point: whenever the layer index is divisible by inject_step (i.e. 5), one round of information exchange is triggered.
# 1. Extract the 3D token's representation from the sequence
if structure_layer_index == 0:
    structure_input = torch.cat((structure_input[:-1 * batch_size], x[-2, :, :]), dim=0)
else:
    structure_input = torch.cat((structure_input[:-1 * batch_size],
                                 self.seq_linears[structure_layer_index](x[-2, :, :])), dim=0)

# 2. Run one layer of the structure model
hidden = self.structure_model.layers[structure_layer_index](new_graph, structure_input)
if self.structure_model.short_cut and hidden.shape == structure_input.shape:
    hidden = hidden + structure_input
if self.structure_model.batch_norm:
    hidden = self.structure_model.batch_norms[structure_layer_index](hidden)

structure_hiddens.append(hidden)
structure_input = hidden

# 3. Insert the updated 3D token representation back into the sequence
updated_structure_token = self.structure_linears[...](structure_input[-1 * batch_size:])
x = torch.cat((x[:-2, :, :], updated_structure_token.unsqueeze(0), x[-1:, :, :]), dim=0)
structure_layer_index += 1
  • Information flow
    1. Sequence → structure: the model extracts the latest vector of the 3D token from the sequence representation x; by this point it has already absorbed contextual information from the preceding sequence layers. After a transformation (seq_linears), it is written into the structure model's input.
    2. Structure processing: one layer of the structure model (a GNN) is run. The GNN updates the representations of all nodes according to the graph connectivity, including the special 3D-token node.
    3. Structure → sequence: the 3D token's vector is extracted again from the GNN output, now carrying the updated structural information. After a transformation (structure_linears), it is inserted back into the sequence representation x, replacing the old version.

This loop keeps repeating as the remaining layers are processed.

Output
# Structural Output
if self.structure_model.concat_hidden:
    structure_node_feature = torch.cat(structure_hiddens, dim=-1)[:-1 * batch_size]
else:
    structure_node_feature = structure_hiddens[-1][:-1 * batch_size]

structure_graph_feature = self.structure_model.readout(graph, structure_node_feature)

# Sequence Output
x = self.sequence_model.model.emb_layer_norm_after(x)
x = x.transpose(0, 1)  # (T, B, E) => (B, T, E)

# last hidden representation should have layer norm applied
if (seq_layer_idx + 1) in repr_layers:
    hidden_representations[seq_layer_idx + 1] = x
x = self.sequence_model.model.lm_head(x)

output = {"logits": x, "representations": hidden_representations}

# Sequence (ESM) model outputs
residue_feature = output["representations"][self.sequence_model.repr_layer]
residue_feature = functional.padded_to_variadic(residue_feature, size_ext)
starts = size_ext.cumsum(0) - size_ext
if self.sequence_model.alphabet.prepend_bos:
    starts = starts + 1
ends = starts + size
mask = functional.multi_slice_mask(starts, ends, len(residue_feature))
residue_feature = residue_feature[mask]
graph_feature = self.sequence_model.readout(graph, residue_feature)

# Combine both models outputs
node_feature = torch.cat(...)
graph_feature = torch.cat(...)

return {"graph_feature": graph_feature, "node_feature": node_feature}
  • Extract outputs: after the loop, the final feature representations are extracted from both models.
  • Readout: a readout function (such as sum or mean) aggregates the node-level features into a graph-level feature that represents the whole protein.
  • Final combination: the node features (node_feature) and graph features (graph_feature) from the sequence model and the structure model are concatenated.
  • Return: a dictionary with the combined features, which can be used for downstream tasks (function prediction, property regression, etc.).

PHYS 5120 - Computational Energy Materials and Electronic Structure Simulation, Lecture 4

Lecturer: Prof. PAN Ding

1 Monte Carlo (MC) Method:

  • Content:

This whiteboard provides a concise but detailed overview of two important and related simulation techniques in computational physics and chemistry: the Metropolis Monte Carlo (MC) method and Hamiltonian (or Hybrid) Monte Carlo (HMC). Here is a detailed breakdown of the concepts presented.

1. Metropolis Monte Carlo (MC) Method

The heading “Metropolis MC method” introduces a foundational algorithm in statistical mechanics. Metropolis Monte Carlo is a method used to generate a sequence of states for a system, allowing for the calculation of average properties. 左上角的这一部分介绍了基础的 Metropolis Monte Carlo 算法。它是一种生成状态序列的方法,使得处于任何状态的概率都符合期望的概率分布(在物理学中通常是玻尔兹曼分布)。

  • Conceptual Diagram: The small box with numbered sites (0-5) and an arrow showing a move from state 0 to 2, and then to 3, illustrates a “random walk.” In Metropolis MC, the system transitions from one state to another by making small, random changes. 小方框中标有编号的位点(0-5),箭头表示从状态 0 到状态 2,再到状态 3 的移动,代表“随机游走”。在 Metropolis MC 中,系统通过进行微小的随机变化从一个状态过渡到另一个状态。
  • Random Number Generation: The notation rand t \in (0,1) indicates the use of a random number \(t\) drawn from a uniform distribution between 0 and 1. This is a core component of the algorithm, used to decide whether to accept or reject a proposed new state. 符号 rand t \in (0,1) 表示使用从 0 到 1 之间的均匀分布中抽取的随机数 \(t\)。这是算法的核心部分,用于决定是否接受或拒绝提议的新状态。
  • Detailed Balance Condition: The equation \(P_o T(o \to n) = P_n T(n \to o)\) is the principle of detailed balance. It states that in a system at equilibrium, the probability of being in an old state (\(o\)) and transitioning to a new state (\(n\)) is equal to the probability of being in the new state and transitioning back to the old one. This condition is crucial because it ensures that the simulation will eventually sample states according to their correct thermodynamic probabilities (the Boltzmann distribution). 方程 \(P_o T(o \to n) = P_n T(n \to o)\) 是详细平衡的原理。它指出,在平衡系统中,处于旧状态 (\(o\)) 并转变为新状态 (\(n\)) 的概率等于处于新状态并转变回旧状态的概率。此条件至关​​重要,因为它确保模拟最终将根据正确的热力学概率(玻尔兹曼分布)对状态进行采样。
  • Acceptance Rate: The note \sim 30\%? likely refers to the target acceptance rate for an efficient Metropolis MC simulation. If new states are accepted too often or too rarely, the exploration of the system’s possible configurations is inefficient. While the famous optimal acceptance rate for certain high-dimensional problems is around 23.4%, a range of 20-50% is often considered effective. 注释“30%?”指的是高效 Metropolis 蒙特卡罗模拟的目标接受率。如果新状态接受过于频繁或过于稀少,系统对可能配置的探索就会变得低效。虽然某些高维问题的最佳接受率约为 23.4%,但通常认为 20-50% 的范围是有效的。
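A minimal Metropolis sketch for a single particle in the 1D harmonic potential \(U(x)=x^2/2\), sampling the Boltzmann distribution at \(\beta = 1\); the step size is an illustrative choice, to be tuned toward the acceptance-rate range mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, step, n_steps = 1.0, 1.5, 50_000

def U(x):
    return 0.5 * x**2          # toy potential energy

x, samples, accepted = 0.0, [], 0
for _ in range(n_steps):
    x_trial = x + rng.uniform(-step, step)                    # small random move (random walk)
    if rng.random() < np.exp(-beta * (U(x_trial) - U(x))):    # accept with min(1, e^{-beta*dU})
        x, accepted = x_trial, accepted + 1
    samples.append(x)

print("acceptance rate:", accepted / n_steps)                 # tune `step` toward ~20-50%
print("<x^2> (should be ~1/beta):", np.mean(np.array(samples) ** 2))
```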

2. Hamiltonian / Hybrid Monte Carlo (HMC)

The second topic, “Hamiltonian/Hybrid MC (HMC),” is a more advanced Monte Carlo method that uses principles from classical mechanics to propose new states more intelligently than the simple random-walk approach of the standard Metropolis method. This often leads to a much higher acceptance rate and more efficient exploration of the state space. 第二个主题“哈密顿/混合蒙特卡罗 (HMC)”是一种更先进的蒙特卡罗方法,它利用经典力学原理,比标准 Metropolis 方法中简单的随机游走方法更智能地提出新状态。这通常会带来更高的接受率和更高效的状态空间探索。

The whiteboard outlines a four-step HMC algorithm:

Step 1: Randomize Velocities The first step is to randomize the velocities: \(\vec{v}_i \sim \mathcal{N}(0, k_B T)\). 第一步是随机化速度:\(\vec{v}_i \sim \mathcal{N}(0, k_B T)\)。 * This step introduces momentum into the system. For each particle \(i\), a velocity vector \(\vec{v}_i\) is randomly drawn from a normal (Gaussian) distribution with a mean of 0 and a variance related to the temperature \(T\) and the Boltzmann constant \(k_B\). 此步骤将动量引入系统。对于每个粒子 \(i\),速度矢量 \(\vec{v}_i\) 会随机地从正态(高斯)分布中抽取,该分布的均值为 0,方差与温度 \(T\) 和玻尔兹曼常数 \(k_B\) 相关。 * The full formula for this probability distribution, \(f(\vec{v})\), is the Maxwell-Boltzmann distribution, which is written out further down the board. 该概率分布的完整公式 \(f(\vec{v})\)麦克斯韦-玻尔兹曼分布

Step 2: Molecular Dynamics (MD) Integration The board notes this as t=0 \to h \text{ or } mh MD and mentions the Verlet algorithm.

  • This is the “Hamiltonian dynamics” part of the algorithm. Starting from the current positions and the newly randomized velocities, the system’s trajectory is calculated for a short period of time (\(h\) or \(mh\)) using Molecular Dynamics (MD). 这是算法的“哈密顿动力学”部分。从当前位置和新随机化的速度开始,使用分子动力学 (MD) 计算系统在短时间内(\(h\)\(mh\))的轨迹。
  • The name Verlet refers to the Verlet integration algorithm, a numerical method used to solve Newton’s equations of motion. It is popular in MD simulations because it is time-reversible and conserves energy well over long simulations. 指的是 Verlet 积分算法,这是一种用于求解牛顿运动方程的数值方法。它在 MD 模拟中很受欢迎,因为它具有时间可逆性,并且在长时间模拟中能量守恒效果良好。

Step 3: Calculate Total Energy The third step is to calculate total energy: \(E_n = K_n + V_n\). 第三步是“计算总能量”:\(E_n = K_n + V_n\)。 * After the MD trajectory, the system is in a new state \(n\). The total energy of this new state, \(E_n\), is calculated as the sum of its kinetic energy (\(K_n\), from the velocities) and its potential energy (\(V_n\), from the positions). MD 轨迹之后,系统处于新状态 \(n\)。新状态的总能量 \(E_n\) 等于其动能 (\(K_n\),由速度计算得出)和势能 (\(V_n\),由位置计算得出)之和。

Step 4: Acceptance Test The final step is the acceptance criterion: \(\text{acc}(o \to n) = \min(1, e^{-\beta(E_n - E_o)})\). 最后一步是验收标准:\(\text{acc}(o \to n) = \min(1, e^{-\beta(E_n - E_o)})\)。 * This is the Metropolis acceptance criterion. The algorithm decides whether to accept the new state \(n\) or reject it and stay in the old state \(o\). 这是 Metropolis 验收标准。算法决定是接受新状态 \(n\) 还是拒绝它并保持旧状态 \(o\)。 * The probability of acceptance depends on the change in total energy (\(E_n - E_o\)). If the new energy is lower, the move is always accepted. If the new energy is higher, it might still be accepted with a probability \(e^{-\beta(E_n - E_o)}\), where \(\beta = 1/(k_B T)\). This allows the system to escape from local energy minima. 验收概率取决于总能量的变化 (\(E_n - E_o\))。如果新能量较低,则始终接受该移动。如果新的能量更高,它仍然可能以概率 \(e^{-\beta(E_n - E_o)}\) 被接受,其中 \(\beta = 1/(k_B T)\)。这使得系统能够摆脱局部能量最小值。
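The four steps translate almost line for line into code. Below is a compact sketch for the same toy potential \(U(x)=x^2/2\) with unit mass and \(k_B T = 1\); the trajectory length and time step are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, L, h = 5_000, 20, 0.1          # HMC moves, MD steps per move, time step

U = lambda x: 0.5 * x**2                  # potential energy
dUdx = lambda x: x                        # its gradient (the force is -dUdx)

x, samples, accepted = 0.0, [], 0
for _ in range(n_samples):
    v = rng.normal()                      # Step 1: draw a velocity from N(0, k_B T / m)
    x_new, v_new = x, v

    v_new -= 0.5 * h * dUdx(x_new)        # Step 2: velocity-Verlet (leapfrog) MD trajectory
    for i in range(L):
        x_new += h * v_new
        if i != L - 1:
            v_new -= h * dUdx(x_new)
    v_new -= 0.5 * h * dUdx(x_new)

    E_old = 0.5 * v**2 + U(x)             # Step 3: total energies E = K + V
    E_new = 0.5 * v_new**2 + U(x_new)

    if rng.random() < np.exp(-(E_new - E_old)):   # Step 4: Metropolis test with beta = 1
        x, accepted = x_new, accepted + 1
    samples.append(x)

print("acceptance rate:", accepted / n_samples)   # typically much higher than plain MC
print("<x^2> (should be ~1):", np.mean(np.array(samples) ** 2))
```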

Key Formulas and Notations

  • Maxwell-Boltzmann Distribution麦克斯韦-玻尔兹曼分布: The formula for the velocity distribution is given as: \(f(\vec{v}) = \left(\frac{m}{2\pi k_B T}\right)^{3/2} \exp\left(-\frac{m v^2}{2 k_B T}\right)\) This gives the probability density for a particle of mass \(m\) to have a velocity \(\vec{v}\) at a given temperature \(T\).质量为 \(m\) 的粒子速度为 的概率密度

  • Energy Conservation and Acceptance Rate: The notes \(E_n \approx E_o\) and \(75\%\) highlight a key advantage of HMC. Because the Verlet integrator approximately conserves energy, the final energy \(E_n\) after the MD trajectory is usually very close to the initial energy \(E_o\). This means the term \((E_n - E_o)\) is small, and the acceptance probability is high. The \(75\%\) indicates a typical or target acceptance rate for HMC, which is significantly higher than for standard Metropolis MC. 注释 \(E_n \approx E_o\)\(75\%\) 凸显了 HMC 的一个关键优势。由于 Verlet 积分器近似地守恒能量,MD 轨迹后的最终能量 \(E_n\) 通常非常接近初始能量 \(E_o\)。这意味着 \((E_n - E_o)\) 项很小,接受概率很高。\(75\%\) 表示 HMC 的典型或目标接受率,明显高于标准 Metropolis MC。

  • Hamiltonian Operator: The symbol \(\hat{H}\) written on the adjacent board represents the Hamiltonian operator, which gives the total energy of the system. The note Δ Adiabatic suggests that the MD evolution is ideally an adiabatic process (no heat exchange), during which the total energy (the Hamiltonian) is conserved. 相邻板上的符号 \(\hat{H}\) 代表哈密顿算符,它给出了系统的总能量。注释“Δ Adiabatic”表明 MD 演化在理想情况下是一个绝热过程(无热交换),在此过程中总能量(哈密顿量)守恒。

This whiteboard displays the fundamental equation of quantum chemistry: the time-dependent Schrödinger equation, along with the detailed breakdown of the molecular Hamiltonian operator. This equation is the starting point for almost all ab initio (first-principles) quantum mechanical calculations of molecular systems. 这块白板展示了量子化学的基本方程:含时薛定谔方程,以及分子哈密顿算符的详细分解。该方程是几乎所有分子系统从头算(第一性原理)量子力学计算的起点。

3. The Time-Dependent Schrödinger Equation

At the top of the board, the fundamental equation governing the evolution of a quantum mechanical system is presented: 白板顶部显示了控制量子力学系统演化的基本方程: \(i\hbar \frac{\partial \Psi}{\partial t} = \hat{\mathcal{H}} \Psi\)

  • \(\Psi\) (Psi) is the wave function of the system. It contains all the information that can be known about the system (e.g., the positions and momenta of all particles). 是系统的波函数。它包含了关于系统的所有已知信息(例如,所有粒子的位置和动量)。

  • \(\hat{\mathcal{H}}\) is the Hamiltonian operator, which represents the total energy of the system. 是哈密顿算符,表示系统的总能量。

  • \(i\) 是虚数单位。

  • \(i\) is the imaginary unit.

  • \(\hbar\) is the reduced Planck constant.是约化普朗克常数

  • \(\frac{\partial \Psi}{\partial t}\) represents how the wave function changes over time.表示波函数随时间的变化。

This equation states that the time evolution of the quantum state is dictated by the system’s total energy operator, the Hamiltonian. The note “Δ Adiabatic process” likely connects to the context of the Born-Oppenheimer approximation, where the electronic Schrödinger equation is solved for fixed nuclear positions, assuming the electrons adjust adiabatically (instantaneously) to the motion of the nuclei. 该方程表明,量子态的时间演化由系统的总能量算符——哈密顿算符决定。注释“Δ绝热过程”与玻恩-奥本海默近似相关,在该近似中,电子薛定谔方程是针对固定原子核位置求解的,假设电子以绝热方式(瞬时)调整以适应原子核的运动。

4. The Full Molecular Hamiltonian (\(\hat{\mathcal{H}}\))

The main part of the whiteboard is the detailed expression for the non-relativistic, time-independent molecular Hamiltonian. It is the sum of the kinetic and potential energies of all the nuclei and electrons in the system. The equation can be broken down into five distinct terms: 白板的主要部分是非相对论性、时间无关的分子哈密顿量的详细表达式。它是系统中所有原子核和电子的动能和势能之和。

该方程可以分解为五个不同的项:

\(\hat{\mathcal{H}} = -\sum_{I=1}^{P} \frac{\hbar^2}{2M_I}\nabla_I^2 - \sum_{i=1}^{N} \frac{\hbar^2}{2m}\nabla_i^2 + \frac{e^2}{2}\sum_{I=1}^{P}\sum_{J \neq I}^{P} \frac{Z_I Z_J}{|\vec{R}_I - \vec{R}_J|} + \frac{e^2}{2}\sum_{i=1}^{N}\sum_{j \neq i}^{N} \frac{1}{|\vec{r}_i - \vec{r}_j|} - e^2\sum_{I=1}^{P}\sum_{i=1}^{N} \frac{Z_I}{|\vec{R}_I - \vec{r}_i|}\)

Let’s analyze each component:

A. Kinetic Energy Terms 动能项

  1. Kinetic Energy of the Nuclei 原子核的动能: \(-\sum_{I=1}^{P} \frac{\hbar^2}{2M_I}\nabla_I^2\) This term is the sum of the kinetic energy operators for all the nuclei in the system.此项是系统中所有原子核的动能算符之和。
    • The sum is over all nuclei, indexed by \(I\) from 1 to \(P\).该和涵盖所有原子核,索引为 \(I\),从 1 到 \(P\)
    • \(M_I\) is the mass of nucleus \(I\).是原子核 \(I\) 的质量。
    • \(\nabla_I^2\) is the Laplacian operator, which involves the second spatial derivatives with respect to the coordinates of nucleus \(I\).是拉普拉斯算符,它涉及原子核 \(I\) 坐标的二阶空间导数。
  2. Kinetic Energy of the Electrons 电子的动能: \(-\sum_{i=1}^{N} \frac{\hbar^2}{2m}\nabla_i^2\) This is the corresponding sum of the kinetic energy operators for all the electrons.这是所有电子的动能算符的对应和。
    • The sum is over all electrons, indexed by \(i\) from 1 to \(N\).该和是针对所有电子的,索引为 \(i\),从 1 到 \(N\)
    • \(m\) is the mass of an electron.是电子的质量。
    • \(\nabla_i^2\) is the Laplacian operator with respect to the coordinates of electron \(i\).是关于电子 \(i\) 坐标的拉普拉斯算符。

B. Potential Energy Terms (Electrostatic Interactions) 势能项(静电相互作用)

  1. Nuclear-Nuclear Repulsion 核间排斥力: \(+\frac{e^2}{2}\sum_{I=1}^{P}\sum_{J \neq I}^{P} \frac{Z_I Z_J}{|\vec{R}_I - \vec{R}_J|}\) This term represents the potential energy from the electrostatic (Coulomb) repulsion between all pairs of positively charged nuclei.该项表示所有带正电原子核对之间静电(库仑)排斥力产生的势能。
    • The double summation runs over all unique pairs of nuclei (\(I, J\)).对所有唯一的原子核对 (\(I, J\)) 进行双重求和。
    • \(Z_I\) is the atomic number (i.e., the charge) of nucleus \(I\).是原子核 \(I\) 的原子序数(即电荷)。
    • \(\vec{R}_I\) is the position vector of nucleus \(I\).是原子核 \(I\) 的位置矢量。
    • \(e\) is the elementary charge.是基本电荷。
  2. Electron-Electron Repulsion 电子间排斥力: \(+\frac{e^2}{2}\sum_{i=1}^{N}\sum_{j \neq i}^{N} \frac{1}{|\vec{r}_i - \vec{r}_j|}\) This term represents the potential energy from the electrostatic repulsion between all pairs of negatively charged electrons.该项表示所有带负电的电子对之间静电排斥的势能。
    • The double summation runs over all unique pairs of electrons (\(i, j\)).对所有不同的电子对 (\(i, j\)) 进行双重求和。
    • \(\vec{r}_i\) is the position vector of electron \(i\).是电子 \(i\) 的位置矢量。
  3. Nuclear-Electron Attraction 核-电子引力: \(-e^2\sum_{I=1}^{P}\sum_{i=1}^{N} \frac{Z_I}{|\vec{R}_I - \vec{r}_i|}\) This final term represents the potential energy from the electrostatic attraction between the nuclei and the electrons.这最后一项表示原子核和电子之间静电引力的势能。
    • The summation runs over all nuclei and all electrons.该求和适用于所有原子核和所有电子。

5. Notations and Conventions

  • Atomic Units: The note \(\frac{1}{4\pi\epsilon_0} = k = 1\) is a key indicator of the convention being used. This sets the Coulomb constant to 1, which is a hallmark of Hartree atomic units. In this system, the elementary charge (\(e\)), electron mass (\(m\)), and reduced Planck constant (\(\hbar\)) are also set to 1. This simplifies the Hamiltonian significantly, removing the physical constants and making the equations easier to work with computationally. 是所用约定的关键指标。这将库仑常数设置为 1,这是Hartree 原子单位的标志。在这个系统中,基本电荷 (\(e\))、电子质量 (\(m\)) 和​​约化普朗克常数 (\(\hbar\)) 也设为 1。这显著简化了哈密顿量,消除了物理常数,使方程更易于计算。
  • Interaction Terms: The notations \(\{i, j\}\), \(\{i, j, k\}\), etc., refer to the “many-body” problem. The Hamiltonian contains two-body terms (interactions between pairs of particles), and solving the Schrödinger equation exactly is extremely difficult because the motion of every particle is correlated with every other particle. Computational methods are designed to approximate these interactions. 符号 \(\{i, j\}\)\(\{i, j, k\}\) 等指的是“多体”问题。哈密顿量包含二体项(粒子对之间的相互作用),而精确求解薛定谔方程极其困难,因为每个粒子的运动都与其他粒子相关。计算方法旨在近似这些相互作用。

This whiteboard presents the mathematical foundation for non-adiabatic molecular dynamics, a sophisticated method in theoretical chemistry and physics used to simulate processes where the Born-Oppenheimer approximation breaks down. This typically occurs in photochemistry, electron transfer reactions, and when molecules interact with intense laser fields. 这块白板展示了非绝热分子动力学的数学基础,这是理论化学和物理学中一种复杂的方法,用于模拟玻恩-奥本海默近似失效的过程。这通常发生在光化学、电子转移反应以及分子与强激光场相互作用时。

6. Topic: Non-Adiabatic Molecular Dynamics (MD) 非绝热分子动力学 (MD)

The title “Δ non-adiabatic MD” indicates that the topic moves beyond the standard Born-Oppenheimer approximation. In this approximation, it is assumed that the light electrons adjust instantaneously to the motion of the heavy nuclei, allowing the system to be described by a single potential energy surface. Non-adiabatic methods, by contrast, account for the quantum mechanical coupling between multiple electronic states.

标题“Δ 非绝热 MD”表明该主题超越了标准的玻恩-奥本海默近似。在该近似中,假设轻电子会根据重原子核的运动进行瞬时调整,从而使系统可以用单个势能面来描述。相比之下,非绝热方法则考虑了多个电子态之间的量子力学耦合。

7. The Born-Huang Ansatz 玻恩-黄拟设

The starting point for this method is the “ansatz” (an educated guess for the form of the solution). This is the Born-Huang expansion for the total molecular wave function, \(\Psi\). 该方法的起点是“拟设”(对解形式的合理猜测)。这是分子总波函数 \(\Psi\) 的玻恩-黄展开式。

\(\Psi(\vec{R}, \vec{r}, t) = \sum_{n} \Theta_n(\vec{R}, t) \Phi_n(\vec{R}, \vec{r})\)

  • \(\Psi(\vec{R}, \vec{r}, t)\) is the total wave function for the entire molecule. It depends on the coordinates of all nuclei (\(\vec{R}\)), all electrons (\(\vec{r}\)), and time (\(t\)). 是整个分子的总波函数。它取决于所有原子核 (\(\vec{R}\))、所有电子 (\(\vec{r}\)) 和时间 (\(t\)) 的坐标。

  • \(\Phi_n(\vec{R}, \vec{r})\) are the electronic wave functions. They are the solutions to the electronic Schrödinger equation for a fixed nuclear geometry \(\vec{R}\) and form a complete basis set. The index \(n\) labels the electronic state (e.g., ground state, first excited state, etc.). 它们是给定原子核几何构型 \(\vec{R}\) 的电子薛定谔方程的解,并构成一个完整的基组。下标 \(n\) 标记电子态(例如,基态、第一激发态等)。

  • \(\Theta_n(\vec{R}, t)\) are the nuclear wave functions. Each \(\Theta_n\) describes the motion of the nuclei on the potential energy surface of the corresponding electronic state, \(\Phi_n\). Crucially, they depend on time. 是核波函数。每个 \(\Theta_n\) 描述原子核在相应电子态 \(\Phi_n\) 势能面上的运动。至关重要的是,它们依赖于时间。

This ansatz expresses the total molecular state as a superposition of electronic states, where the coefficients of the superposition are the nuclear wave functions. 该拟设将总分子态表示为电子态的叠加,其中叠加的系数是核波函数。

8. The Partitioned Molecular Hamiltonian 分割分子哈密顿量

The total molecular Hamiltonian, \(\hat{\mathcal{H}}\), is partitioned into terms that act on the nuclei and electrons separately. 总分子哈密顿量 \(\hat{\mathcal{H}}\) 被分割成分别作用于原子核和电子的项。

\(\hat{\mathcal{H}} = -\sum_{I} \frac{\hbar^2}{2M_I}\nabla_I^2 + \hat{\mathcal{H}}_e + \hat{V}_{nn}\)

  • \(-\sum_{I} \frac{\hbar^2}{2M_I}\nabla_I^2\): This is the kinetic energy operator for the nuclei, often denoted as \(\hat{T}_n\).这是原子核的动能算符,通常表示为 \(\hat{T}_n\)

  • \(\hat{\mathcal{H}}_e\): This is the electronic Hamiltonian, which includes the kinetic energy of the electrons and the potential energy of electron-electron and electron-nuclear interactions. 这是电子哈密顿量,包含电子的动能以及电子-电子和电子-核相互作用的势能。

  • \(\hat{V}_{nn}\): This is the potential energy operator for nuclear-nuclear repulsion.这是核-核排斥的势能算符。

9. The Electronic Schrödinger Equation 电子薛定谔方程

The electronic basis functions, \(\Phi_n\), are defined as the eigenfunctions of the electronic Hamiltonian (plus the nuclear repulsion term) for a fixed nuclear configuration \(\vec{R}\). 电子基函数 \(\Phi_n\) 定义为对于固定的核构型 \(\vec{R}\),电子哈密顿量(加上核排斥项)的本征函数。

\((\hat{\mathcal{H}}_e + \hat{V}_{nn}) \Phi_n(\vec{R}, \vec{r}) = E_n(\vec{R}) \Phi_n(\vec{R}, \vec{r})\)

  • \(E_n(\vec{R})\) are the eigenvalues, which are the potential energy surfaces (PES). Each electronic state \(n\) has its own PES, which dictates the forces acting on the nuclei when the molecule is in that electronic state. 是特征值,即势能面 (PES)。每个电子态 \(n\) 都有其自身的势能面,它决定了分子处于该电子态时作用于原子核的力。

10. Deriving the Equations of Motion for the Nuclei 推导原子核运动方程

The final part of the whiteboard begins the derivation of the time-dependent Schrödinger equation for the nuclear wave functions, \(\Theta_k\). The process starts with the full time-dependent Schrödinger equation, \(i\hbar \frac{\partial \Psi}{\partial t} = \hat{\mathcal{H}} \Psi\). To find the equation for a specific nuclear wave function \(\Theta_k\), this main equation is projected onto the corresponding electronic basis state \(\Phi_k\). 白板的最后一部分开始推导原子核波函数 \(\Theta_k\) 的含时薛定谔方程。该过程从完整的含时薛定谔方程 \(i\hbar \frac{\partial \Psi}{\partial t} = \hat{\mathcal{H}} \Psi\) 开始。为了找到特定原子核波函数 \(\Theta_k\) 的方程,需要将这个主方程投影到相应的电子基态 \(\Phi_k\) 上。

This is done by multiplying from the left by the complex conjugate of the electronic wave function, \(\Phi_k^*\), and integrating over all electronic coordinates, \(d\vec{r}\). 可以通过从左边乘以电子波函数 \(\Phi_k^*\) 的复共轭,然后在所有电子坐标 \(d\vec{r}\) 上积分来实现。

\(\int \Phi_k^* i\hbar \frac{\partial}{\partial t} \Psi \,d\vec{r} = \int \Phi_k^* \hat{\mathcal{H}} \Psi \,d\vec{r}\)

The board then shows the result of substituting the Born-Huang ansatz for \(\Psi\) and the partitioned Hamiltonian for \(\hat{\mathcal{H}}\) into this projected equation: 然后,黑板显示将 Born-Huang 拟设式代入 \(\Psi\),将分块哈密顿量代入以下投影方程的结果:

\(i\hbar \frac{\partial}{\partial t} \Theta_k(\vec{R}, t) = \int \Phi_k^* \left( -\sum_{I} \frac{\hbar^2}{2M_I}\nabla_I^2 + \hat{\mathcal{H}}_e + \hat{V}_{nn} \right) \sum_n \Theta_n \Phi_n \,d\vec{r}\)

  • Left Hand Side: The left side of the projection has been simplified. Because the electronic basis functions \(\Phi_n\) form an orthonormal set (\(\int \Phi_k^* \Phi_n d\vec{r} = \delta_{kn}\)), the sum collapses to a single term for \(n=k\). 投影左侧已简化。由于电子基函数 \(\Phi_n\) 构成一个正交集 (\(\int \Phi_k^* \Phi_n d\vec{r} = \delta_{kn}\),因此当 \(n=k\) 时,和将折叠为一个项。

  • Right Hand Side: This complex integral is the core of non-adiabatic dynamics. When the nuclear kinetic energy operator, \(\nabla_I^2\), acts on the product \(\Theta_n \Phi_n\), it acts on both functions (via the product rule). The terms that arise from \(\nabla_I\) acting on the electronic wave functions \(\Phi_n\) are known as non-adiabatic coupling terms. These terms are responsible for enabling transitions between different electronic potential energy surfaces, which is the essence of non-adiabatic dynamics. 这个复积分是非绝热动力学的核心。当核动能算符 \(\nabla_I^2\) 作用于乘积 \(\Theta_n \Phi_n\) 时,它会作用于这两个函数(通过乘积规则)。由 \(\nabla_I\) 作用于电子波函数 \(\Phi_n\) 而产生的项称为非绝热耦合项。这些术语负责实现不同电子势能面之间的转变,这是非绝热动力学的本质。

This whiteboard continues the mathematical derivation for non-adiabatic molecular dynamics started in the previous image. It focuses on expanding the nuclear kinetic energy term to reveal the crucial couplings between different electronic states.这块白板延续了上一张图片中非绝热分子动力学的数学推导。它着重于扩展核动能项,以揭示不同电子态之间的关键耦合。

11. Starting Point: The Projected Schrödinger Equation 起点:投影薛定谔方程

The derivation picks up from the equation for the time evolution of the nuclear wave function, \(\Theta_k\). The right-hand side of this equation is being evaluated. 推导过程取自核波函数 \(\Theta_k\) 的时间演化方程。该方程的右边正在求值。

\(= \int \Phi_k^* \left( -\sum_{I} \frac{\hbar^2}{2M_I}\nabla_I^2 \right) \sum_n \Theta_n \Phi_n \,d\vec{r} + E_k \Theta_k\)

This equation separates the total energy into two parts 该方程将总能量分为两部分 : * The first term is the contribution from the nuclear kinetic energy operator, \(-\sum_{I} \frac{\hbar^2}{2M_I}\nabla_I^2\). 第一项是核动能算符的贡献 * The second term, \(E_k \Theta_k\), is the contribution from the potential energy. This term arises from the action of the electronic Hamiltonian part \((\hat{\mathcal{H}}_e + \hat{V}_{nn})\) on the basis functions. Due to the orthonormality of the electronic wavefunctions (\(\int \Phi_k^* \Phi_n \,d\vec{r} = \delta_{kn}\)), the sum over \(n\) collapses to a single term for the potential energy. 第二项,\(E_k \Theta_k\),是势能的贡献。这一项源于电子哈密顿量部分 \((\hat{\mathcal{H}}_e + \hat{V}_{nn})\) 对基函数的作用。由于电子波函数(\(\int \Phi_k^* \Phi_n \,d\vec{r} = \delta_{kn}\))的正交性,\(n\)项的和会坍缩为势能的一项。

The challenge, and the core of the physics, lies in evaluating the first term, as the nuclear derivative \(\nabla_I\) acts on both the nuclear wave function \(\Theta_n\) and the electronic wave function \(\Phi_n\). 难点在于,也是物理的核心在于如何计算第一项,因为核导数 \(\nabla_I\) 同时作用于核波函数 \(\Theta_n\) 和电子波函数 \(\Phi_n\)

12. Applying the Product Rule for the Laplacian 应用拉普拉斯算子的乘积规则

To expand the kinetic energy term, the product rule for the Laplacian operator acting on two functions (A and B) is used. The board writes this rule as: 为了展开动能项,我们利用了拉普拉斯算子作用于两个函数(A 和 B)的乘积规则。棋盘上将这条规则写成: \(\nabla^2(AB) = (\nabla^2 A)B + 2(\nabla A)\cdot(\nabla B) + A(\nabla^2 B)\)

In our case, \(A = \Theta_n(\vec{R}, t)\) and \(B = \Phi_n(\vec{R}, \vec{r})\). The derivative \(\nabla_I\) is with respect to the nuclear coordinates \(\vec{R}_I\). 在我们的例子中,\(A = \Theta_n(\vec{R}, t)\)\(B = \Phi_n(\vec{R}, \vec{r})\)。导数 \(\nabla_I\) 是关于原子核坐标 \(\vec{R}_I\) 的。

13. Expanding the Kinetic Energy Term 展开动能项

Applying this rule, the integral containing the kinetic energy operator is expanded: 应用此规则,展开包含动能算符的积分: \(= -\sum_I \frac{\hbar^2}{2M_I} \int \Phi_k^* \sum_n \left( (\nabla_I^2 \Theta_n)\Phi_n + 2(\nabla_I \Theta_n)\cdot(\nabla_I \Phi_n) + \Theta_n(\nabla_I^2 \Phi_n) \right) d\vec{r} + E_k \Theta_k\)

This step explicitly shows how the nuclear kinetic energy operator gives rise to three distinct types of terms.此步骤明确展示了核动能算符如何产生三种不同类型的项。

14. Final Result and Identification of Coupling Terms 最终结果及耦合项的识别

The final step is to take the integral over the electronic coordinates (\(d\vec{r}\)) and rearrange the terms. The expression is simplified by again using the orthonormality of the electronic wave functions, \(\int \Phi_k^* \Phi_n \, d\vec{r} = \delta_{kn}\). 最后一步是对电子坐标 (\(d\vec{r}\)) 进行积分,并重新排列各项。再次利用电子波函数的正交性简化表达式,\(\int \Phi_k^* \Phi_n \, d\vec{r} = \delta_{kn}\)

\(= -\sum_I \frac{\hbar^2}{2M_I} \left( \nabla_I^2 \Theta_k + \sum_n 2 \left( \int \Phi_k^* \nabla_I \Phi_n \, d\vec{r} \right) \cdot \nabla_I \Theta_n + \sum_n \left( \int \Phi_k^* \nabla_I^2 \Phi_n \, d\vec{r} \right) \Theta_n \right) + E_k \Theta_k\)

This final equation is profound. It represents the time-independent Schrödinger equation for the nuclear wave function \(\Theta_k\), but it is coupled to all other nuclear wave functions \(\Theta_n\). Let’s break down the key terms within the parentheses: 最后一个方程意义深远。它代表了核波函数 \(\Theta_k\) 的与时间无关的薛定谔方程,但它与所有其他核波函数 \(\Theta_n\) 耦合。让我们分解一下括号内的关键项:

  • \(\nabla_I^2 \Theta_k\): This is the standard kinetic energy term for the nuclei moving on the potential energy surface of state \(k\). This is the only term that would remain in the simple Born-Oppenheimer (adiabatic) approximation. 这是原子核在势能面 \(k\) 上运动的标准动能项。这是在简单的 Born-Oppenheimer(绝热)近似中唯一保留的项。

  • \(\left( \int \Phi_k^* \nabla_I \Phi_n \, d\vec{r} \right)\): This is the first-derivative non-adiabatic coupling term (NACT), often called the derivative coupling. This vector quantity determines the strength of the coupling between electronic states \(k\) and \(n\) due to the velocity of the nuclei. It is the primary term responsible for enabling transitions between different potential energy surfaces. 这是一阶导数非绝热耦合项 (NACT),通常称为导数耦合。该矢量决定了由于原子核速度而导致的电子态 \(k\)\(n\) 之间耦合的强度。它是实现不同势能面之间跃迁的主要项。

  • \(\left( \int \Phi_k^* \nabla_I^2 \Phi_n \, d\vec{r} \right)\): This is the second-derivative non-adiabatic coupling term, a scalar quantity. While often smaller than the first-derivative term, it is also part of the complete description of non-adiabatic effects. 是二阶导数非绝热耦合项,一个标量。虽然它通常小于一阶导数项,但它也是非绝热效应完整描述的一部分。

In summary, this derivation shows mathematically how the motion of the nuclei (via the \(\nabla_I\) operator) can induce quantum mechanical transitions between different electronic states (\(\Phi_k \leftrightarrow \Phi_n\)). The strength of these transitions is governed by the non-adiabatic coupling terms, which depend on how the electronic wave functions change as the nuclear geometry changes. 总之,该推导从数学上展示了原子核的运动(通过 \(\nabla_I\) 算符)如何诱导不同电子态之间的量子力学跃迁(\(\Phi_k \leftrightarrow \Phi_n\))。这些跃迁的强度由非绝热耦合项控制,而非绝热耦合项又取决于电子波函数如何随原子核几何结构的变化而变化。

This whiteboard concludes the derivation of the equations for non-adiabatic molecular dynamics by defining the coupling operator and then showing how different levels of approximation—specifically the Born-Huang and the more restrictive Born-Oppenheimer approximations—arise from neglecting certain coupling terms. 这块白板通过定义耦合算符,并展示不同程度的近似——特别是 Born-Huang 近似和更严格的 Born-Oppenheimer 近似——是如何通过忽略某些耦合项而产生的,从而推导出非绝热分子动力学方程的。

15. Definition of the Non-Adiabatic Coupling Operator 非绝热耦合算符的定义

The whiteboard begins by collecting all the non-adiabatic coupling terms derived previously into a single operator, \(C_{kn}\). 白板首先将之前推导的所有非绝热耦合项合并为一个算符 \(C_{kn}\)

Let \(C_{kn} = -\sum_{I} \frac{\hbar^2}{2M_I} \left( 2 \left( \int \Phi_k^* \nabla_I \Phi_n \, d\vec{r} \right) \cdot \nabla_I + \left( \int \Phi_k^* \nabla_I^2 \Phi_n \, d\vec{r} \right) \right)\)

  • This operator, \(C_{kn}\), represents the total effect of the coupling between electronic state \(k\) and electronic state \(n\), which is induced by the kinetic energy of the nuclei. 此算符 \(C_{kn}\) 表示由原子核动能引起的电子态 \(k\) 和电子态 \(n\) 之间耦合的总效应。
  • The operator acts on the nuclear wave function that follows it in the full equation. The \(\nabla_I\) term acts as a derivative on that wave function. 该算符作用于完整方程中跟随它的核波函数。\(\nabla_I\) 项充当该波函数的导数。

16. The Coupled Equations of Motion 耦合运动方程

Using this compact definition, the full set of coupled time-dependent Schrödinger equations for the nuclear wave functions can be written as: 基于此简洁定义,核波函数的完整耦合含时薛定谔方程组可以写成:

\(i\hbar \frac{\partial}{\partial t} \Theta_k = \left( -\sum_{I} \frac{\hbar^2}{2M_I}\nabla_I^2 + E_k \right) \Theta_k + \sum_n C_{kn} \Theta_n\)

This is the central result. It shows that the time evolution of the nuclear wave function on a given potential energy surface \(k\) (described by \(\Theta_k\)) depends on two things: 这是核心结论。它表明,核波函数在给定势能面 \(k\)(用 \(\Theta_k\) 描述)上的时间演化取决于两个因素: 1. The motion on its own surface, governed by its kinetic energy and the potential \(E_k\). 其自身表面上的运动,由其动能和势能 \(E_k\) 控制。 2. The influence of the nuclear wave functions on all other electronic surfaces (\(\Theta_n\)), mediated by the coupling operators \(C_{kn}\). 核波函数对所有其他电子表面(\(\Theta_n\))的影响,由耦合算符 \(C_{kn}\) 介导。

17. The Born-Huang Approximation 玻恩-黄近似

The first and most crucial approximation is introduced to simplify this complex set of coupled equations. 为了简化这组复杂的耦合方程,引入了第一个也是最重要的近似。

If \(C_{kn} = 0\) for \(k \neq n\) (Born-Huang approximation)

This approximation assumes that the off-diagonal coupling terms, which are responsible for transitions between different electronic states, are negligible. However, it retains the diagonal coupling term (\(C_{kk}\)). This leads to a simplified, uncoupled equation: 该近似假设导致不同电子态之间跃迁的非对角耦合项可以忽略不计。然而,它保留了对角耦合项(\(C_{kk}\))。这可以得到一个简化的非耦合方程:

\(i\hbar \frac{\partial}{\partial t} \Theta_k = \left( -\sum_{I} \frac{\hbar^2}{2M_I}\nabla_I^2 + E_k + C_{kk} \right) \Theta_k\)

Substituting the definition of \(C_{kk}\): 代入 \(C_{kk}\) 的定义:

\(i\hbar \frac{\partial}{\partial t} \Theta_k = \left( -\sum_{I} \frac{\hbar^2}{2M_I}\nabla_I^2 + E_k - \sum_I \frac{\hbar^2}{2M_I} \left( 2 \left( \int \Phi_k^* \nabla_I \Phi_k \, d\vec{r} \right) \cdot \nabla_I + \int \Phi_k^* \nabla_I^2 \Phi_k \, d\vec{r} \right) \right) \Theta_k\)

The term \(C_{kk}\) is known as the diagonal Born-Oppenheimer correction (DBOC). It represents a small correction to the potential energy surface \(E_k\) that arises from the fact that the electrons do not adjust perfectly and instantaneously to the nuclear motion, even within the same electronic state. \(C_{kk}\) 项被称为对角玻恩-奥本海默修正 (DBOC)。它表示对势能面 \(E_k\) 的微小修正,其原因是即使在相同的电子态下,电子也无法完美且即时地适应核运动。

  • Note on Real Wavefunctions 关于实波函数的注释: The board shows that for real wavefunctions, the first-derivative part of the diagonal correction vanishes: \(\int \Phi_k \nabla_I \Phi_k \, d\vec{r} = 0\). This is because the integral is related to the gradient of the normalization condition, \(\nabla_I \int \Phi_k^2 \, d\vec{r} = \nabla_I(1) = 0\), which expands to \(2\int \Phi_k \nabla_I \Phi_k \, d\vec{r} = 0\). 黑板显示,对于实波函数,对角修正的一阶导数部分为零:\(\int \Phi_k \nabla_I \Phi_k \, d\vec{r} = 0\)。这是因为积分与归一化条件的梯度有关,\(\nabla_I \int \Phi_k^2 \, d\vec{r} = \nabla_I(1) = 0\),其展开为 \(2\int \Phi_k \nabla_I \Phi_k \, d\vec{r} = 0\)

18. The Born-Oppenheimer Approximation 玻恩-奥本海默近似

The final and most widely used approximation is the Born-Oppenheimer approximation. It is more restrictive than the Born-Huang approximation. 最后一种也是最广泛使用的近似方法是玻恩-奥本海默近似。它比玻恩-黄近似更具限制性。

If \(C_{kk} = 0\) (Born-Oppenheimer approximation) 若\(C_{kk} = 0\)(玻恩-奥本海默近似)

This assumes that the diagonal correction term is also negligible. By setting all \(C_{kn}=0\) (both diagonal and off-diagonal), the equations become completely decoupled, and the nuclear motion evolves independently on each potential energy surface. 这假设对角修正项也可忽略不计。通过令所有\(C_{kn}=0\)(包括对角和非对角),方程组完全解耦,原子核运动在每个势能面上独立演化。

The result is the standard time-dependent Schrödinger equation for the nuclei: 由此可得标准的原子核的含时薛定谔方程

\(i\hbar \frac{\partial}{\partial t} \Theta_k = \left( -\sum_{I} \frac{\hbar^2}{2M_I}\nabla_I^2 + E_k \right) \Theta_k\)

This equation is the foundation of most of quantum chemistry. It states that the nuclei move on a static potential energy surface \(E_k(\vec{R})\) provided by the electrons, without any possibility of transitioning to other electronic states or having the surface be corrected by their own motion.

该方程是大多数量子化学的基础。原子核在由电子提供的静态势能面 \(E_k(\vec{R})\) 上运动,不存在跃迁到其他电子态或因自身运动而修正势能面的可能性。

[Problem] Mainly about images not displaying on the blog

Step 1: Edit the configuration file in the site's root directory.

Step 2: Replace the Markdown image line with HTML code.

Step 3: Add the root entry below the url setting.

Step 4: If the hexo-asset-image plugin is not needed, run the following command in the terminal to uninstall it:

# URL
## Set your site url here. For example, if you use GitHub Page, set url as 'https://username.github.io/project'
url: https://TianyaoBlogs.github.io/

root: /

permalink: :year/:month/:day/:title/

<img src="/imgs/5054C3/General_linear_regression_model.png" alt="A diagram of the general linear regression model">
$ npm uninstall hexo-asset-image

PHYS 5120 - Computational Energy Materials and Electronic Structure Simulation, Lecture 3

Lecturer: Prof. PAN Ding

1 radial distribution function:

  • Content:

This whiteboard explains the process of calculating the radial distribution function, often denoted as \(g(r)\), to analyze the atomic structure of a material, which is referred to here as a “film”. 本白板解释了计算径向分布函数(通常表示为 \(g(r)\))的过程,用于分析材料(本文中称为“薄膜”)的原子结构。

In simple terms, the radial distribution function tells you the probability of finding an atom at a certain distance from another reference atom. It’s a powerful way to see the local structure in a disordered system like a liquid or an amorphous solid.

简单来说,径向分布函数表示在距离另一个参考原子一定距离处找到一个原子的概率。它是观察无序系统(例如液体或非晶态固体)局部结构的有效方法。

## Core Concept: Radial Distribution Function 径向分布函数

The main goal is to compute the radial distribution function, \(g(r)\), which is defined as the ratio of the actual number of atoms found in a thin shell at a distance \(r\) to the number of atoms you’d expect to find if the material were an ideal gas (completely random). 主要目标是计算径向分布函数 \(g(r)\),其定义为在距离 \(r\) 的薄壳层中实际发现的原子数与材料为理想气体(完全随机)时预期发现的原子数之比。

The formula is expressed as the ratio of the actual count in a thin shell to the ideal-gas expectation for that same shell: \[g(r) = \frac{n(r)}{n_{\text{ideal gas}}(r)}, \qquad n_{\text{ideal gas}}(r) = \rho \, 4\pi r^2 \, dr\]

  • \(n(r)\): Represents the average number of atoms found in a thin spherical shell between a distance \(r\) and \(r+dr\) from a central atom. 表示距离中心原子 \(r\)\(r+dr\) 之间的薄球壳中原子的平均数量。
  • ideal gas: Represents the number of atoms you would expect in that same shell if the atoms were distributed completely randomly with the same average density (\(\rho\)). The volume of this shell is approximately \(4\pi r^2 dr\).表示如果原子完全随机分布且平均密度 (\(\rho\)) 相同,则该球壳中原子的数量。该球壳的体积约为 \(4\pi r^2 dr\)

A peak in the \(g(r)\) plot indicates a high probability of finding neighboring atoms at that specific distance, revealing the material’s structural shells (e.g., nearest neighbors, second-nearest neighbors, etc.).\(g(r)\) 图中的峰值表示在该特定距离处找到相邻原子的概率很高,从而揭示了材料的结构壳(例如,最近邻、次近邻等)。

## Calculation Method

The board outlines a two-step averaging process to get a statistically meaningful result from simulation data (a “film” at 20 frames per second).

  1. Average over atoms: In a single frame (a snapshot in time), you pick one atom as the center. Then, you count how many other atoms (\(n(r)\)) are in concentric spherical shells around it. This process is repeated, treating each atom in the frame as the center, and the results are averaged.

  2. Average over frames: The entire process described above is repeated for multiple frames from the simulation or video. This time-averaging ensures that the final result represents the typical structure of the material over time, smoothing out random fluctuations.

The board notes “dx = bin width 0.01Å”, which is a practical detail for the calculation. To create a histogram, the distance r is divided into small segments (bins) of 0.01 angstroms.
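A bare-bones version of this procedure for a single frame, assuming an (N, 3) coordinate array in a cubic periodic box (box size, particle count, and bin width here are illustrative, not the lecture's values); averaging over frames then just means repeating this per frame and averaging the histograms.

```python
import numpy as np

def radial_distribution(coords, box, dr=0.1):
    """g(r) for one frame: coords is an (N, 3) array inside a cubic box of side `box`."""
    n = len(coords)
    rho = n / box**3                                   # average number density
    bins = np.arange(0.0, box / 2 + dr, dr)
    hist = np.zeros(len(bins) - 1)
    for i in range(n - 1):
        d = coords[i + 1:] - coords[i]
        d -= box * np.round(d / box)                   # minimum-image convention
        r = np.linalg.norm(d, axis=1)
        hist += 2 * np.histogram(r, bins=bins)[0]      # count each pair from both atoms
    shell_vol = 4.0 * np.pi * bins[1:] ** 2 * dr       # volume of each thin shell
    ideal = rho * shell_vol * n                        # ideal-gas expectation for n centers
    r_mid = 0.5 * (bins[:-1] + bins[1:])
    return r_mid, hist / ideal

# Sanity check on random (ideal-gas-like) coordinates: g(r) should hover around 1.
coords = np.random.default_rng(0).uniform(0.0, 10.0, size=(500, 3))
r, g = radial_distribution(coords, box=10.0)
```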

## Connection to Experiments

Finally, the whiteboard mentions “frame X-ray scattering”. This is a crucial point because it connects this computational analysis to real-world experiments. Experimental techniques like X-ray or neutron scattering can be used to measure a quantity called the structure factor, \(S(q)\), which is directly related to the radial distribution function \(g(r)\) through a mathematical operation called a Fourier transform. This allows scientists to directly compare the structure produced in their simulations with the structure of a real material measured in a lab. 最后,白板上提到了“帧 X 射线散射”。这一点至关重要,因为它将计算分析与实际实验联系起来。X射线或中子散射等实验技术可以用来测量一个称为结构因子\(S(q)\)的量,该量通过傅里叶变换的数学运算与径向分布函数\(g(r)\)直接相关。这使得科学家能够直接将模拟中产生的结构与实验室测量的真实材料结构进行比较。

The board correctly links \(g(r)\) to X-ray scattering experiments. The quantity measured in these experiments is the static structure factor, \(S(q)\), which describes how the material scatters radiation. The relationship between the two is a Fourier transform: 该板正确地将\(g(r)\)与X射线散射实验联系起来。这些实验中测量的量是静态结构因子\(S(q)\),它描述了材料如何散射辐射。两者之间的关系是傅里叶变换: \[S(q) = 1 + 4 \pi \rho \int_0^\infty [g(r) - 1] r^2 \frac{\sin(qr)}{qr} dr\] This equation is crucial because it bridges the gap between computer simulations (which calculate \(g(r)\)) and physical experiments (which measure \(S(q)\)). 这个方程至关重要,因为它弥合了计算机模拟(计算 \(g(r)\))和物理实验(测量 \(S(q)\))之间的差距。

## 2. The Gaussian Distribution: Probability of Particle Position 高斯分布:粒子位置的概率

The board starts with the formula for a one-dimensional Gaussian (or normal) distribution: 白板首先展示的是一维高斯(或正态)分布的公式:

\[f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]

This equation describes the probability of finding a particle at a specific position x after a certain amount of time has passed. * \(\mu\) (mu) is the mean or average position. For a simple diffusion process starting at the origin, the particles spread out symmetrically, so the average position remains at the origin (\(\mu = 0\)). * \(\sigma^2\) (sigma squared) is the variance, which measures how spread out the particles are from the mean position. A larger variance means the particles have, on average, traveled farther from the starting point. 这个方程描述了经过一定时间后,在特定位置“x”找到粒子的概率。 * \(\mu\) (mu)平均值或平均位置。对于从原点开始的简单扩散过程,粒子对称扩散,因此平均位置保持在原点(\(\mu = 0\))。 * \(\sigma^2\)(sigma 平方)方差,用​​于衡量粒子与平均位置的扩散程度。方差越大,意味着粒子平均距离起点越远。

The note “Black-Scholes” is a side reference. The Black-Scholes model, famous in financial mathematics for pricing options, uses similar mathematical principles based on Brownian motion to model the random fluctuations of stock prices. “Black-Scholes”注释仅供参考。Black-Scholes 模型在金融数学中以期权定价而闻名,它使用基于布朗运动的类似数学原理来模拟股票价格的随机波动。

## 3. Mean Squared Displacement (MSD): Quantifying the Spread 均方位移 (MSD):量化扩散

The core of the board is dedicated to the Mean Squared Displacement (MSD). This is the primary tool used to measure how far, on average, particles have moved over a time interval t. 本版块的核心内容是均方位移 (MSD)。这是用于测量粒子在时间间隔“t”内平均移动距离的主要工具。

The variance \(\sigma^2\) is formally defined as the average of the squared deviations from the mean: \[\sigma^2 = \langle x^2(t) \rangle - \langle x(t) \rangle^2\] * \(\langle x(t) \rangle\) is the average displacement. As mentioned, for simple diffusion, \(\langle x(t) \rangle = 0\). * \(\langle x^2(t) \rangle\) is the average of the square of the displacement. 方差\(\sigma^2\)的正式定义为与平均值偏差平方的平均值: \[\sigma^2 = \langle x^2(t) \rangle - \langle x(t) \rangle^2\] * \(\langle x(t) \rangle\)是平均位移。如上所述,对于简单扩散,\(\langle x(t) \rangle = 0\)。 * \(\langle x^2(t) \rangle\)是位移平方的平均值。

Since \(\langle x(t) \rangle = 0\), the variance is simply equal to the MSD: \[\sigma^2 = \langle x^2(t) \rangle\] 由于 \(\langle x(t) \rangle = 0\),方差等于均方差 (MSD): \[\sigma^2 = \langle x^2(t) \rangle\]

The crucial insight for a diffusive process is that the MSD grows linearly with time. The rate of this growth is determined by the diffusion coefficient, D. The board shows this relationship for different dimensions: 扩散过程的关键在于MSD 随时间线性增长。其增长率由扩散系数 D决定。棋盘显示了不同维度下的这种关系:

  • 1D: \(\langle x^2(t) \rangle = 2Dt\) (Movement along a line) (沿直线运动)
  • 2D: The board has a slight typo or ambiguity with \(\langle z^2(t) \rangle = 2Dt\). For 2D motion in the x-y plane, the total MSD would be \(\langle r^2(t) \rangle = \langle x^2(t) \rangle + \langle y^2(t) \rangle = 4Dt\). The note on the board might be referring to just one component of motion. **棋盘上的 \(\langle z^2(t) \rangle = 2Dt\) 存在轻微拼写错误或歧义。对于 x-y 平面上的二维运动,总平均散射差 (MSD) 为 \(\langle r^2(t) \rangle = \langle x^2(t) \rangle + \langle y^2(t) \rangle = 4Dt\)。黑板上的注释可能仅指运动的一个分量。
  • 3D: \(\langle r^2(t) \rangle = \langle |\vec{r}(t) - \vec{r}(0)|^2 \rangle = 6Dt\) (Movement in 3D space, which is the most common case in molecular simulations) (三维空间中的运动,这是分子模拟中最常见的情况) Here, \(\vec{r}(t)\) is the position vector of a particle at time t. The quantity \(\langle |\vec{r}(t) - \vec{r}(0)|^2 \rangle\) is the average of the squared distance a particle has traveled from its initial position \(\vec{r}(0)\). 这里,\(\vec{r}(t)\) 是粒子在时间 t 的位置矢量。 \(\langle |\vec{r}(t) - \vec{r}(0)|^2 \rangle\) 是粒子从其初始位置 \(\vec{r}(0)\) 行进距离的平方平均值。

## 4. The Einstein Relation: Connecting Microscopic Motion to a Macroscopic Property 爱因斯坦关系:将微观运动与宏观特性联系起来

Finally, the board presents the famous Einstein relation, which rearranges the 3D MSD equation to solve for the diffusion coefficient D:

\[D = \lim_{t \to \infty} \frac{\langle |\vec{r}(t) - \vec{r}(0)|^2 \rangle}{6t}\]

This is a cornerstone equation in statistical mechanics. It provides a practical way to calculate a macroscopic property—the diffusion coefficient D—from the microscopic movements of individual particles observed in a computer simulation. 这是统计力学中的一个基石方程。它提供了一种实用的方法,可以通过计算机模拟中观察到的单个粒子的微观运动来计算宏观属性——扩散系数“D”。

In practice, one would: 1. Run a simulation of particles. 运行粒子模拟。 2. Track the position of each particle over time. 跟踪每个粒子随时间的位置。 3. Calculate the squared displacement \(|\vec{r}(t) - \vec{r}(0)|^2\) for each particle at various time intervals t. 计算每个粒子在不同时间间隔“t”的位移平方\(|\vec{r}(t) - \vec{r}(0)|^2\)。 4. Average this value over all particles to get the MSD, \(\langle |\vec{r}(t) - \vec{r}(0)|^2 \rangle\). 对所有粒子取平均值,得到均方差(MSD),即\(\langle |\vec{r}(t) - \vec{r}(0)|^2 \rangle\)。 5. Plot the MSD as a function of time. 将MSD绘制成时间函数。 6. The slope of this line, divided by 6, gives the diffusion coefficient D. The lim t→∞ indicates that this linear relationship is most accurate for long time scales, after initial transient effects have died down. 这条直线的斜率除以6,即扩散系数“D”。“lim t→∞”表明,在初始瞬态效应消退后,这种线性关系在长时间尺度上最为准确。
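A sketch of steps 1-6 on a synthetic Brownian trajectory, so that the recovered D can be checked against the value used to generate the data (the array shapes, dt, and D_true are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_atoms, dt, D_true = 2000, 200, 1.0, 0.5

# Synthetic unwrapped trajectory: independent 3D random walks with <dx^2> = 2*D*dt per axis.
steps = rng.normal(scale=np.sqrt(2 * D_true * dt), size=(n_frames, n_atoms, 3))
traj = np.cumsum(steps, axis=0)                        # r(t) for every particle

disp = traj - traj[0]                                  # r(t) - r(0)
msd = np.mean(np.sum(disp**2, axis=2), axis=1)         # average over all particles

t = np.arange(n_frames) * dt
slope = np.polyfit(t[n_frames // 2:], msd[n_frames // 2:], 1)[0]    # long-time slope
print("estimated D =", slope / 6, " (true value:", D_true, ")")     # Einstein: MSD = 6 D t
```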

## 5. Right Board: Green-Kubo Relations

This board introduces a more advanced and powerful method to calculate transport coefficients like the diffusion coefficient, known as the Green-Kubo relations. 本面板介绍了一种更先进、更强大的方法来计算扩散系数等传输系数,即Green-Kubo 关系

### Velocity Autocorrelation Function (VACF) 速度自相关函数 (VACF)

The key idea is to look at how a particle’s velocity at one point in time is related to its velocity at a later time. This is measured by the Velocity Autocorrelation Function (VACF): \[C_{vv}(t) = \langle \vec{v}(t') \cdot \vec{v}(t' + t) \rangle\] This function tells us how long a particle “remembers” its velocity. For a typical liquid, the velocity is quickly randomized by collisions, so the VACF decays to zero rapidly. 其核心思想是考察粒子在某一时间点的速度与其在之后时间点的速度之间的关系。这可以通过速度自相关函数 (VACF)来测量: \[C_{vv}(t) = \langle \vec{v}(t') \cdot \vec{v}(t' + t) \rangle\] 此函数告诉我们粒子“记住”其速度的时间。对于典型的液体,速度会因碰撞而迅速随机化,因此 VACF 会迅速衰减为零。

### Connecting MSD and VACF

The board shows the mathematical link between the MSD and the VACF. Starting with the definition of position as the integral of velocity, \(\vec{r}(t) = \int_0^t \vec{v}(t') dt'\), one can show that the MSD is a double integral of the VACF. The board writes this as: \[\langle x^2(t) \rangle = \left\langle \left( \int_0^t v(t') dt' \right) \left( \int_0^t v(t'') dt'' \right) \right\rangle = \int_0^t dt' \int_0^t dt'' \langle v(t') v(t'') \rangle\] This shows that the two pictures of motion—the particle’s displacement (MSD) and its velocity fluctuations (VACF)—are deeply connected. 该面板展示了 MSD 和 VACF 之间的数学联系。从位置定义为速度的积分开始,\(\vec{r}(t) = \int_0^t \vec{v}(t') dt'\),可以证明 MSD 是 VACF 的二重积分。黑板上写着: \[\langle x^2(t) \rangle = \left\langle \left( \int_0^t v(t') dt' \right) \left( \int_0^t v(t'') dt'' \right) \right\rangle = \int_0^t dt' \int_0^t dt'' \langle v(t') v(t'') \rangle\] 这表明,粒子运动的两幅图像——粒子的位移(MSD)和速度涨落(VACF)——之间存在着深刻的联系。

### The Green-Kubo Formula for Diffusion 扩散的格林-久保公式

By combining the Einstein relation with the integral of the VACF, one arrives at the Green-Kubo formula for the diffusion coefficient: \[D = \frac{1}{3} \int_0^\infty \langle \vec{v}(0) \cdot \vec{v}(t) \rangle dt\] This incredible result states that the macroscopic property of diffusion (\(D\)) is determined by the integral of the microscopic velocity correlations. It’s often a more efficient way to compute \(D\) in simulations than calculating the long-time limit of the MSD. 将爱因斯坦关系与VACF积分相结合,可以得到扩散系数的格林-久保公式: \[D = \frac{1}{3} \int_0^\infty \langle \vec{v}(0) \cdot \vec{v}(t) \rangle dt\] 这个令人难以置信的结果表明,扩散的宏观特性(\(D\))由微观速度关联的积分决定。在模拟中,这通常是计算\(D\)比计算MSD的长期极限更有效的方法。
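The same coefficient can be estimated with the Green-Kubo route; here the velocities come from a toy Langevin (Ornstein-Uhlenbeck) process whose exact diffusion constant is \(k_B T/(m\gamma)\), so the result can be checked (all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, dt, gamma, kT, m = 50_000, 0.01, 1.0, 1.0, 1.0

# Generate a 3D velocity time series from the Langevin equation (Euler-Maruyama step).
v = np.zeros(3)
vs = np.empty((n_steps, 3))
for i in range(n_steps):
    v += -gamma * v * dt + np.sqrt(2 * gamma * kT / m * dt) * rng.normal(size=3)
    vs[i] = v

# Velocity autocorrelation function and its (Riemann-sum) time integral.
max_lag = 1000
vacf = np.array([np.mean(np.sum(vs[: n_steps - lag] * vs[lag:], axis=1))
                 for lag in range(max_lag)])
D = vacf.sum() * dt / 3.0                              # D = (1/3) * integral of <v(0).v(t)> dt
print("Green-Kubo D =", D, " (expected kT/(m*gamma) =", kT / (m * gamma), ")")
```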

## 6. The Grand Narrative: From Micro to Macro 宏大叙事:从微观到宏观

The previous whiteboards gave us two ways to calculate the diffusion constant, D, from the microscopic random walk of individual atoms: 之前的白板提供了两种从单个原子的微观随机游动计算扩散常数 D的方法: 1. Einstein Relation: From the long-term slope of the Mean Squared Displacement (MSD). 根据均方位移 (MSD) 的长期斜率。 2. Green-Kubo Relation: From the integral of the Velocity Autocorrelation Function (VACF). 根据速度自相关函数 (VACF) 的积分。

This new whiteboard shows how that single microscopic parameter, \(D\), governs the large-scale, observable process of diffusion described by Fick’s Laws and the Diffusion Equation.

## 1. The Starting Point: A Liquid’s Structure

The plot on the top left is the Radial Distribution Function, \(g(r)\), which we discussed in detail on the first whiteboard.

  • The Plot: It shows the characteristic structure of a liquid. The peaks are labeled “1st”, “2nd”, and “3rd”, corresponding to the first, second, and third solvation shells (layers of neighboring atoms).
  • The Limit: The note \(\lim_{r\to\infty} g(r) = 1\) confirms that at large distances the liquid has no long-range order, as expected.
  • System Parameters: The values \(T = 0.71\) and \(\rho = 0.844\) are the temperature and density of the simulated system (likely in reduced, “Lennard-Jones” units) for which this \(g(r)\) was calculated.

This section sets the stage: we are looking at the dynamics within a system that has this specific liquid-like structure.

## 2. The Macroscopic Laws of Diffusion

The bottom-left and top-right sections introduce the continuum equations that describe how concentration changes in space and time.

### Fick’s First Law

\[\vec{J} = -D \nabla C\] This is Fick’s first law of diffusion. It states that there is a flux of particles (\(\vec{J}\)), meaning a net flow. This flow is directed from high concentration to low concentration (hence the minus sign) and its magnitude is proportional to the concentration gradient (\(\nabla C\)).

The Crucial Link: The proportionality constant is \(D\), the very same diffusion constant we calculated from the microscopic random walk (MSD/VACF). This is the key connection: the collective result of countless individual random walks is a predictable net flow of particles.

### The Diffusion Equation (Fick’s Second Law)

\[\frac{\partial C(\vec{r},t)}{\partial t} = D \nabla^2 C(\vec{r},t)\] This is the diffusion equation, one of the most important equations in physics and chemistry (also called the heat equation, as noted). It is derived from Fick’s first law and the principle of mass conservation (\(\frac{\partial C}{\partial t} + \nabla \cdot \vec{J} = 0\)). It is a differential equation that tells you exactly how the concentration at any point, \(C(\vec{r},t)\), will change over time.
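For completeness, the derivation is a single line: substitute Fick’s first law into the mass-conservation (continuity) equation just quoted, assuming \(D\) does not vary in space: \[\frac{\partial C}{\partial t} = -\nabla \cdot \vec{J} = -\nabla \cdot \left(-D \nabla C\right) = D \nabla^2 C\]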

## 3. The Solution: Connecting Back to the Random Walk

This is the most beautiful part. The board shows the solution to the diffusion equation for a very specific scenario, linking the macroscopic equation directly back to the microscopic random walk.

### The Initial Condition

The problem is set up by assuming all particles start at a single point at time zero: \[C(\vec{r}, 0) = \delta(\vec{r})\] This is a Dirac delta function, representing an infinitely concentrated point source at the origin.

### The Fundamental Solution (Green’s Function)

The solution to the diffusion equation with this starting condition is called the fundamental solution or Green’s function. For one dimension, it is: \[C(x,t) = \frac{1}{\sqrt{4\pi Dt}} \exp\left(-\frac{x^2}{4Dt}\right)\]

The “Aha!” Moment: This is a Gaussian distribution. Let’s compare it to the formula from the second whiteboard:

  • The mean is \(\mu = 0\).
  • The variance is \(\sigma^2 = 2Dt\).

This is an incredible result. The macroscopic diffusion equation predicts that a concentration pulse will spread out over time, and the shape of the concentration profile will be a Gaussian curve. The width of this curve, measured by its variance \(\sigma^2\), is exactly the Mean Squared Displacement, \(\langle x^2(t) \rangle\), of the individual random-walking particles.
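As a quick consistency check (a standard Gaussian integral, not shown on the board), the second moment of this profile is indeed \(2Dt\): \[\langle x^2(t) \rangle = \int_{-\infty}^{\infty} x^2\, C(x,t)\, dx = \frac{1}{\sqrt{4\pi D t}} \int_{-\infty}^{\infty} x^2\, e^{-x^2/(4Dt)}\, dx = 2Dt\] using the Gaussian moment \(\int_{-\infty}^{\infty} x^2 e^{-x^2/(2\sigma^2)}\, dx = \sigma^2 \sqrt{2\pi\sigma^2}\) with \(\sigma^2 = 2Dt\).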

This perfectly unites the two perspectives:

  • Microscopic (Board 2): Particles undergo a random walk, and their average squared displacement from the origin grows as \(\langle x^2(t) \rangle = 2Dt\).
  • Macroscopic (This Board): A collection of these particles, described by a continuum concentration \(C\), spreads out in a Gaussian profile whose variance is \(\sigma^2 = 2Dt\).

The two pictures are mathematically identical.

PHYS 5120 - Computational Energy Materials and Electronic Structure Simulation, Lecture 3

Lecturer: Prof. PAN DING

1. Radial distribution function (RDF): static structure

  • Content: This whiteboard serves as an excellent summary, pulling together all the key concepts we’ve discussed into a single, cohesive picture. Let’s connect everything on this slide to our detailed conversation.

## 1. RDF: The Static Structure

On the top left, you see RDF (Radial Distribution Function).

  • The Plots: The board shows the familiar \(g(r)\) plot with its characteristic peaks for a liquid. Below it is a plot of the interatomic potential energy, \(V(r)\). This addition is very insightful: it shows why the first peak in \(g(r)\) exists, since it corresponds to the minimum-energy separation (labeled \(\sigma\) on the board) where particles are most stable and most likely to be found.
  • Connection: This section summarizes our first discussion. It’s the starting point for our analysis: a static snapshot of the material’s average atomic arrangement before we consider how the atoms move.

## 2. MSD and the Einstein Relation: The Displacement Picture

The board then moves to dynamics, presenting two methods to calculate the diffusion constant, \(D\). The first is the Einstein relation.

  • The Formula: It states that the Mean Squared Displacement (MSD), \(\langle r^2 \rangle\), approaches \(6Dt\) at long times in three dimensions, and then rearranges this to solve for \(D\): \[D = \lim_{t\to\infty} \frac{\langle |\vec{r}(t) - \vec{r}(0)|^2 \rangle}{6t}\]
  • The Diagram: The central diagram beautifully illustrates the concept. It shows a particle in a simulation box (with “N=108” likely being the number of particles simulated) moving from an initial position \(\vec{r}_i(0)\) to a final position \(\vec{r}_i(t_j)\). The MSD is the average of the square of this displacement over all particles and many time origins. The graph labeled “MSD” shows how you would plot this data and find the slope (“fitting”) to calculate \(D\).
  • Connection: This is a perfect summary of the “Displacement Picture” we analyzed on the second whiteboard. It’s the most intuitive way to think about diffusion: how far particles spread out over time.

## 3. The Green-Kubo Relation: The Fluctuation Picture

Finally, the board presents the more advanced but often more practical method: the Green-Kubo relation.

  • The Equations: This section displays the two key equations from our last discussion:
    1. The MSD written as the double integral of the Velocity Autocorrelation Function (VACF).
    2. The crucial derivative step: \(\frac{d\langle x^2(t)\rangle}{dt} = 2 \int_0^t dt'' \langle V_x(t) V_x(t'') \rangle\).
  • The Diagram: The small diagram of a square with axes \(t'\) and \(t''\) visually represents the two-dimensional domain of integration for the double integral.
  • Connection: This summarizes the “Fluctuation Picture.” It shows the mathematical heart of the derivation that proves the Einstein and Green-Kubo methods are equivalent. As we concluded, this method is often numerically superior because it involves integrating a rapidly decaying function (the VACF) rather than finding the slope of a noisy, unbounded function (the MSD).

In essence, this single whiteboard is a complete roadmap for analyzing diffusion in a molecular simulation. It shows how to first characterize the material’s structure (\(g(r)\)) and then how to compute its key dynamic property, the diffusion constant \(D\), using two powerful, interconnected methods.

This whiteboard beautifully concludes the derivation of the Green-Kubo relation, showing the final formulas and how they are used in practice. It provides the punchline to the mathematical story we’ve been following.

Let’s break down the details.

## 4. Finalizing the Derivation

The top lines of the board show the final step in connecting the Mean Squared Displacement (MSD) to the Velocity Autocorrelation Function (VACF).

\[\lim_{t\to\infty} \frac{d\langle x^2 \rangle}{dt} = 2 \int_0^\infty d\tau \langle V_x(0) V_x(\tau) \rangle\]

  • The Left Side: As we know from the Einstein relation, the long-time limit of the derivative of the 1D MSD, \(\lim_{t\to\infty} \frac{d\langle x^2 \rangle}{dt}\), is simply equal to \(2D\).
  • The Right Side: This is the result of the mathematical derivation from the previous slide. It shows that this same quantity is also equal to twice the total integral of the VACF.

By equating these two, we can solve for the diffusion coefficient, D.
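Writing the step out explicitly (a routine manipulation using stationarity, \(\langle V_x(t) V_x(t'') \rangle = \Phi(t - t'')\), and the substitution \(\tau = t - t''\)): \[\lim_{t\to\infty} \frac{d\langle x^2 \rangle}{dt} = \lim_{t\to\infty} 2 \int_0^t dt''\, \langle V_x(t) V_x(t'') \rangle = \lim_{t\to\infty} 2 \int_0^t d\tau\, \Phi(\tau) = 2 \int_0^\infty d\tau\, \Phi(\tau)\] Since the left-hand side equals \(2D\), the diffusion coefficient in one dimension is \(D = \int_0^\infty \Phi(\tau)\, d\tau\).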

## 5. The Velocity Autocorrelation Function (VACF)

The board explicitly names the key quantity here:

\[\Phi(\tau) = \langle V_x(0) V_x(\tau) \rangle\]

This is the “Velocity autocorrelation function” (abbreviated as VAF on the board), which we’ve denoted as VACF. The variable has been changed from t to τ (tau) to represent a “time lag” or interval, which is common notation.

  • The Plot: The graph on the board shows a typical plot of the VACF, \(\Phi(\tau)\), versus the time lag \(\tau\).
    • It starts at a maximum positive value at \(\tau=0\) (when the velocity is perfectly correlated with itself).
    • It rapidly decays towards zero as the particle undergoes collisions that randomize its velocity.
  • The Integral: The shaded area under this curve represents the value of the integral \(\int_0^\infty \Phi(\tau) d\tau\). The Green-Kubo formula states that the diffusion coefficient is directly proportional to this area.

## 6. The Green-Kubo Formulas for the Diffusion Coefficient

After canceling the factor of 2, the board presents the final, practical formulas for D.

  • In 1 Dimension: \[D = \int_0^\infty d\tau \langle V_x(0) V_x(\tau) \rangle\]
  • In 3 Dimensions: This is the more general and useful formula. \[D = \frac{1}{3} \int_0^\infty d\tau \langle \vec{v}(0) \cdot \vec{v}(\tau) \rangle\] There are two important changes for 3D:
    1. We use the full velocity vectors and their dot product, \(\vec{v}(0) \cdot \vec{v}(\tau)\), to capture motion in all directions.
    2. We divide by 3 to get the average contribution to diffusion in any one direction (x, y, or z), as the one-line identity below makes explicit.
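The factor of \(\frac{1}{3}\) follows from isotropy: in an isotropic liquid each Cartesian component contributes equally to the dot product, \[\langle \vec{v}(0) \cdot \vec{v}(\tau) \rangle = \langle v_x(0) v_x(\tau) \rangle + \langle v_y(0) v_y(\tau) \rangle + \langle v_z(0) v_z(\tau) \rangle = 3\, \langle v_x(0) v_x(\tau) \rangle\] so dividing the 3D integral by 3 recovers the one-dimensional formula for any single direction.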

## 7. Practical Calculation in a Simulation

The last formula on the board shows how this is implemented in a computer simulation with a finite number of atoms.

\[D = \frac{1}{3N} \int_0^\infty d\tau \sum_{i=1}^{N} \langle \vec{v}_i(0) \cdot \vec{v}_i(\tau) \rangle\]

  • \(\sum_{i=1}^{N}\): This summation symbol indicates that you must compute the VACF for each individual atom (from atom i=1 to atom N).
  • \(\frac{1}{N}\): You then average the results over all N atoms in your simulation box.
  • \(\langle \dots \rangle\): The angle brackets here still imply an additional average over multiple different starting times (t=0) to get good statistics.

This formula is the practical recipe: to get the diffusion coefficient, you track the velocity of every atom, calculate each one’s VACF, average them together, and then integrate the result over time.
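Putting that recipe into code, here is a minimal NumPy sketch (the names `velocities`, `dt`, and `max_lag` are assumptions for this example, not from the lecture). It averages the per-atom VACF over atoms and time origins and returns the running Green-Kubo integral, whose plateau value is the estimate of \(D\):

```python
import numpy as np

def diffusion_coefficient_gk(velocities, dt, max_lag):
    """Green-Kubo diffusion coefficient from an MD velocity trajectory.

    velocities: array (n_frames, N, 3); dt: time between frames;
    max_lag: number of lag steps over which the VACF is accumulated.
    Returns (lag times, atom- and origin-averaged VACF, running integral D(t)).
    """
    n_frames = velocities.shape[0]
    vacf = np.empty(max_lag)
    for lag in range(max_lag):
        # v_i(t') . v_i(t'+lag), averaged over atoms i and time origins t'
        dots = np.sum(velocities[: n_frames - lag] * velocities[lag:], axis=-1)
        vacf[lag] = dots.mean()
    # running trapezoidal integral of the VACF, divided by 3 (isotropic average)
    running = np.concatenate(([0.0], np.cumsum(0.5 * (vacf[1:] + vacf[:-1]) * dt)))
    return np.arange(max_lag) * dt, vacf, running / 3.0
```

In practice one checks that the running integral reaches a clear plateau before the VACF becomes too noisy; that plateau value is reported as \(D\).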