
第 1 部分:基础数学工具与物理背景

最近在尝试理解扩散模型所依赖的核心数学概念:高斯分布的性质、朗之万动力学(SDE)及其离散化。

1. 核心数学工具:高斯分布 (Gaussian Distribution)

扩散模型(尤其是 DDPM)的全部推导都建立在高斯分布的美妙性质之上。

1.1 定义

一个一维高斯分布(正态分布)由均值 \(\mu\) 和方差 \(\sigma^2\) 定义,其概率密度函数 (PDF) 为: \[ \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \]

对于 \(D\) 维向量 \(\mathbf{x}\)(例如视频中 \(32 \times 32 = 1024\) 维的图片向量),多维高斯分布由均值向量 \(\boldsymbol{\mu}\) 和协方差矩阵 \(\boldsymbol{\Sigma}\) 定义: \[ \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^D \det(\boldsymbol{\Sigma})}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right) \]

> 关键简化: 在 DDPM 中,我们通常假设协方差矩阵是对角矩阵 \(\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}\),其中 \(\mathbf{I}\) 是单位矩阵。这意味着向量的每个维度(每个像素)是独立同分布的(IID)。

1.2 关键性质 1:重参数化技巧 (Reparameterization Trick)

问题: 我们如何从 \(\mathcal{N}(\mu, \sigma^2)\) 中采样?直接对这个分布采样是困难的,且在神经网络中无法传递梯度。

技巧: 我们可以从一个标准高斯分布 \(\epsilon \sim \mathcal{N}(0, 1)\) 中采样,然后通过线性变换得到 \(x\): \[ x = \mu + \sigma \cdot \epsilon \]

推导:

  1. 已知 \(\epsilon \sim \mathcal{N}(0, 1)\),即 \(E[\epsilon] = 0, \text{Var}(\epsilon) = 1\)。
  2. 我们构造 \(x = \mu + \sigma \epsilon\),计算 \(x\) 的均值:\(E[x] = E[\mu + \sigma \epsilon] = E[\mu] + \sigma E[\epsilon] = \mu + \sigma \cdot 0 = \mu\)。
  3. 计算 \(x\) 的方差:\(\text{Var}(x) = \text{Var}(\mu + \sigma \epsilon) = \text{Var}(\sigma \epsilon) = \sigma^2 \text{Var}(\epsilon) = \sigma^2 \cdot 1 = \sigma^2\)。
  4. 由于高斯分布的线性变换仍然是高斯分布,因此 \(x \sim \mathcal{N}(\mu, \sigma^2)\)。

意义: 这使得采样过程可微。\(\mu\)\(\sigma\) 可以是神经网络的输出,\(\epsilon\) 作为外部噪声输入,梯度可以反向传播回 \(\mu\)\(\sigma\)
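下面用一小段 NumPy 草图数值验证重参数化技巧(\(\mu\)、\(\sigma\) 的取值只是示意性假设):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 0.5                 # 示意性的均值与标准差(假设值)
eps = rng.standard_normal(100_000)   # epsilon ~ N(0, 1)
x = mu + sigma * eps                 # 重参数化:x = mu + sigma * epsilon

print(x.mean(), x.var())             # 分别接近 2.0 和 0.25,即 N(mu, sigma^2)
```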

1.3 关键性质 2:两个独立高斯分布之和 (Sum of Independent Gaussians)

问题: 两个独立高斯分布相加会怎样?这是前向加噪过程(Forward Process)的核心。

性质: 如果 \(X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)\) 和 \(X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)\) 相互独立,那么它们的和 \(Y = X_1 + X_2\) 仍然是高斯分布: \[ Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2) \]

推导:

  • 均值: \(E[Y] = E[X_1 + X_2] = E[X_1] + E[X_2] = \mu_1 + \mu_2\)。
  • 方差: 由于 \(X_1\) 和 \(X_2\) 相互独立,\(\text{Cov}(X_1, X_2) = 0\),\(\text{Var}(Y) = \text{Var}(X_1 + X_2) = \text{Var}(X_1) + \text{Var}(X_2) + 2 \text{Cov}(X_1, X_2) = \sigma_1^2 + \sigma_2^2\)。
  • (两个独立高斯变量之和仍为高斯分布,这可以通过矩生成函数或特征函数严格证明,这里我们接受这个结论。)

意义: 这使得我们可以计算前向过程中任意 \(t\) 时刻 \(x_t\)\(x_0\) 开始累积加噪的结果。

2. 物理与SDE:朗之万动力学 (Langevin Dynamics)

这是扩散模型的物理原型。

2.1 随机微分方程 (SDE)

朗之万动力学描述了一个粒子(在我们的例子中是图片向量 \(\mathbf{x}\))在势场 \(U(\mathbf{x})\) 中运动,同时受到随机力(如布朗运动)的影响。其 SDE 形式为: \[ d\mathbf{x}_t = \mathbf{f}(\mathbf{x}_t, t)dt + g(t)d\mathbf{w}_t \]

  • \(\mathbf{x}_t\):\(t\) 时刻的粒子位置(图片向量)。
  • \(\mathbf{f}(\mathbf{x}_t, t)\):漂移项 (Drift)。代表确定性的力,如视频中提到的“吸引回原点的线性运动”(例如 \(\mathbf{f}(\mathbf{x}, t) = -\beta \mathbf{x}\),使分布趋向原点)。它对应能量函数的负梯度 \(-\nabla U(\mathbf{x})\)。
  • \(g(t)\):扩散项 (Diffusion)。控制随机噪声的强度。
  • \(d\mathbf{w}_t\):维纳过程 (Wiener Process) 或布朗运动。它是一个随机项,其增量 \(d\mathbf{w}_t\) 在 \(dt\) 时间内服从高斯分布 \(d\mathbf{w}_t \sim \mathcal{N}(0, \mathbf{I} dt)\)。

2.2 SDE 的离散化:欧拉-丸山法 (Euler-Maruyama)

问题: 计算机无法处理连续时间 \(dt\)。我们如何模拟这个 SDE?

方法: 我们使用欧拉近似法(在 SDE 中称为 Euler-Maruyama)将其离散化为小的时间步 \(\Delta t\): \[ \mathbf{x}_{t+\Delta t} - \mathbf{x}_t \approx \mathbf{f}(\mathbf{x}_t, t)\Delta t + g(t) (\mathbf{w}_{t+\Delta t} - \mathbf{w}_t) \] 根据维纳过程的性质,在 \(\Delta t\) 时间内的增量 \((\mathbf{w}_{t+\Delta t} - \mathbf{w}_t)\) 服从 \(\mathcal{N}(0, \mathbf{I} \Delta t)\)。根据性质 1.2(重参数化),从 \(\mathcal{N}(0, \mathbf{I} \Delta t)\) 中采样的噪声可以写成 \(\sqrt{\Delta t} \cdot \mathbf{z}\),其中 \(\mathbf{z} \sim \mathcal{N}(0, \mathbf{I})\)。

离散迭代公式: \[ \mathbf{x}_{t+\Delta t} \approx \mathbf{x}_t + \mathbf{f}(\mathbf{x}_t, t)\Delta t + g(t) \sqrt{\Delta t} \mathbf{z}_t \] 其中 \(\mathbf{z}_t \sim \mathcal{N}(0, \mathbf{I})\) 是在 \(t\) 时刻采样的标准高斯噪声。

这就是 DDPM 前向加噪过程 (Forward Process) 的数学原型。
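作为示意,下面用欧拉-丸山法模拟上面提到的“吸引回原点”的朗之万 SDE(取 \(\mathbf{f}(x, t) = -\beta x\)、\(g(t) = \sigma\) 为常数;\(\beta\)、\(\sigma\)、步长等数值均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)

beta, sigma = 1.0, 0.5        # 漂移强度与噪声强度(假设值)
dt, n_steps = 0.01, 1000      # 时间步长与步数(假设值)

x = 5.0                       # 初始位置(假设值)
for _ in range(n_steps):
    z = rng.standard_normal()                              # z ~ N(0, 1)
    x = x + (-beta * x) * dt + sigma * np.sqrt(dt) * z     # x_{t+dt} = x_t + f*dt + g*sqrt(dt)*z

print(x)   # 长时间后 x 在原点附近随机波动(平稳分布约为 N(0, sigma^2 / (2*beta)))
```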

3. 核心数学工具:贝叶斯公式 (Bayes’ Theorem)

贝叶斯公式是连接前向过程(加噪)和反向过程(去噪)的桥梁。

对于连续变量(概率密度函数): \[ p(x|y) = \frac{p(y|x) p(x)}{p(y)} \] 其中 \(p(y) = \int p(y|x) p(x) dx\)

在扩散模型中的应用:

  1. 我们定义了一个简单的前向加噪过程 \(q(x_t | x_{t-1})\)(易于计算)。
  2. 我们想要的是反向去噪过程 \(p(x_{t-1} | x_t)\)(难以计算)。
  3. 贝叶斯公式告诉我们:\(p(x_{t-1} | x_t) \propto p(x_t | x_{t-1}) p(x_{t-1})\)。
  4. 在 DDPM 中,我们会看到一个更复杂的形式,它利用了 \(x_0\): \[ q(x_{t-1} | x_t, x_0) = \frac{q(x_t | x_{t-1}, x_0) q(x_{t-1} | x_0)}{q(x_t | x_0)} \]

小结

  1. 高斯分布的性质(重参数化、加法),能精确计算加噪后的分布。
  2. 朗之万动力学与欧拉近似,为 “逐步加噪” 提供了物理和数学模型。
  3. 贝叶斯公式,指明了如何从 “加噪” 倒推出 “去噪”。

2. DDPM 前向过程:从图像到噪声 (The Forward Process)

前向过程的目标是模拟大纲中描述的“图片在向量空间中逐步噪声化的轨迹”。我们定义一个马尔可夫过程,在该过程中,我们从原始数据 \(\mathbf{x}_0 \sim q(\mathbf{x}_0)\)(即真实图片分布)开始,在 \(T\) 个离散的时间步中逐步向其添加高斯噪声。

2.1 单步加噪过程 \(q(\mathbf{x}_t | \mathbf{x}_{t-1})\)

在每个 \(t\) 步,我们向 \(\mathbf{x}_{t-1}\) 添加少量噪声以生成 \(\mathbf{x}_t\)。这个过程被定义为一个高斯转变(这源于我们第 1 部分中对朗之万动力学的离散化):

\[ q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}) \]

  • \(\{\beta_t\}_{t=1}^T\) 是一个预先设定的方差表 (variance schedule)。它们是 \(T\) 个很小的正常数(例如,\(\beta_1 = 10^{-4}, \beta_T = 0.02\))。
  • \(\sqrt{1 - \beta_t} \mathbf{x}_{t-1}\):这是缩放项。我们在添加噪声之前先将前一步的图片向量“缩小”一点。
  • \(\beta_t \mathbf{I}\):这是噪声项\(\beta_t\) 是添加的噪声的方差,\(\mathbf{I}\) 是单位矩阵,表示噪声在所有维度(像素)上是独立同分布的。

重参数化技巧的应用: 我们可以使用第 1 部分中的重参数化技巧来显式地写出这个采样过程: \[ \mathbf{x}_t = \sqrt{1 - \beta_t} \mathbf{x}_{t-1} + \sqrt{\beta_t} \boldsymbol{\epsilon}_{t-1} \] 其中 \(\boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(0, \mathbf{I})\) 是在 \(t-1\) 时刻采样的一个标准高斯噪声。

2.2 累积加噪过程 \(q(\mathbf{x}_t | \mathbf{x}_0)\) (核心推导)

问题: 在训练期间(如大纲所述),我们希望随机跳到任意 \(t\) 步并生成 \(\mathbf{x}_t\)。如果我们必须从 \(\mathbf{x}_0\) 迭代 \(t\) 次,这将非常缓慢。

目标: 我们需要一个公式,能让我们从 \(\mathbf{x}_0\) 一次性得到 \(\mathbf{x}_t\) 的分布 \(q(\mathbf{x}_t | \mathbf{x}_0)\)。这就是大纲中提到的“构造出高斯分布的累计变换”。

推导:

  1. 定义新变量: 为了简化推导,我们定义 \(\alpha_t = 1 - \beta_t\) 和 \(\bar{\alpha}_t = \prod_{i=1}^t \alpha_i\)。
    • \(\alpha_t\) 是每一步的缩放因子。
    • \(\bar{\alpha}_t\) 是从第 1 步到第 \(t\) 步的累积缩放因子。

  2. 展开迭代 (Step-by-step Expansion): 让我们从 \(\mathbf{x}_t\) 开始,逐步代入 \(\mathbf{x}_{t-1}\): \[ \mathbf{x}_t = \sqrt{\alpha_t} \mathbf{x}_{t-1} + \sqrt{1 - \alpha_t} \boldsymbol{\epsilon}_{t-1} \] 现在,我们代入 \(\mathbf{x}_{t-1} = \sqrt{\alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}} \boldsymbol{\epsilon}_{t-2}\): \[ \begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t} (\sqrt{\alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}} \boldsymbol{\epsilon}_{t-2}) + \sqrt{1 - \alpha_t} \boldsymbol{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{\alpha_t(1 - \alpha_{t-1})} \boldsymbol{\epsilon}_{t-2} + \sqrt{1 - \alpha_t} \boldsymbol{\epsilon}_{t-1} \end{aligned} \]

  3. 合并高斯噪声 (Merging Gaussians): 注意上式的后两项:\(\sqrt{\alpha_t(1 - \alpha_{t-1})} \boldsymbol{\epsilon}_{t-2}\) 和 \(\sqrt{1 - \alpha_t} \boldsymbol{\epsilon}_{t-1}\)。

    • \(\boldsymbol{\epsilon}_{t-2}\)\(\boldsymbol{\epsilon}_{t-1}\) 是两个独立的标准高斯分布。
    • 我们正在对两个独立的、均值为 0 的高斯分布进行线性组合。
    • 根据第 1 部分的性质 1.3 (两个独立高斯分布之和),它们的和仍然是一个均值为 0 的高斯分布。
    • 这个新的高斯分布的方差是多少? \[ \begin{aligned} \text{Var}(\text{new\_noise}) &= \text{Var}(\sqrt{\alpha_t(1 - \alpha_{t-1})} \boldsymbol{\epsilon}_{t-2}) + \text{Var}(\sqrt{1 - \alpha_t} \boldsymbol{\epsilon}_{t-1}) \\ &= (\alpha_t(1 - \alpha_{t-1})) \mathbf{I} + (1 - \alpha_t) \mathbf{I} \\ &= (\alpha_t - \alpha_t\alpha_{t-1} + 1 - \alpha_t) \mathbf{I} \\ &= (1 - \alpha_t\alpha_{t-1}) \mathbf{I} \end{aligned} \]
    • 根据重参数化技巧,一个方差为 \((1 - \alpha_t\alpha_{t-1}) \mathbf{I}\) 的高斯分布,可以写成 \(\sqrt{1 - \alpha_t\alpha_{t-1}} \cdot \bar{\boldsymbol{\epsilon}}_{t-2}\),其中 \(\bar{\boldsymbol{\epsilon}}_{t-2} \sim \mathcal{N}(0, \mathbf{I})\) 是一个新的标准高斯噪声。
  4. 递归与通项公式: 我们将合并后的噪声代入第 2 步的展开式: \[ \mathbf{x}_t = \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\boldsymbol{\epsilon}}_{t-2} \]

    • \(\mathbf{x}_t\)\(\mathbf{x}_{t-2}\) 的关系,与 \(\mathbf{x}_t\)\(\mathbf{x}_{t-1}\) 的关系(\(\mathbf{x}_t = \sqrt{\alpha_t} \mathbf{x}_{t-1} + \sqrt{1 - \alpha_t} \boldsymbol{\epsilon}_{t-1}\))在形式上是完全一致的!只是 \(\alpha_t\) 变成了 \(\alpha_t \alpha_{t-1}\)
    • 我们可以将这个模式递归地应用 \(t\) 次: \[ \begin{aligned} \mathbf{x}_t &= \sqrt{(\alpha_t \alpha_{t-1} \cdots \alpha_1)} \mathbf{x}_0 + \sqrt{1 - (\alpha_t \alpha_{t-1} \cdots \alpha_1)} \boldsymbol{\epsilon} \\ &= \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon} \end{aligned} \]
    • 其中 \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\) 是一个(合并了 \(t\) 次的)标准高斯噪声。

2.3 前向过程的最终公式

我们得到了前向过程中最关键的累积加噪公式: \[ q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I}) \] 这个公式的重参数化形式为: \[ \mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon} \]

意义:

  • 训练效率: 这个公式是 DDPM 训练效率的关键。我们不需要迭代 \(t\) 次来生成 \(\mathbf{x}_t\)。
  • 随机训练: 在训练神经网络时,我们可以(如下面的代码草图所示):
    1. 从数据集中拿一张清晰图片 \(\mathbf{x}_0\)。
    2. 随机选择一个时间步 \(t\)(例如 \(t=150\))。
    3. 从 \(\mathcal{N}(0, \mathbf{I})\) 中采样一个噪声 \(\boldsymbol{\epsilon}\)。
    4. 使用上述公式一步计算出 \(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\)。
    5. 将 \((\mathbf{x}_t, t, \boldsymbol{\epsilon})\) 喂给神经网络进行训练。
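下面是该流程的一个 NumPy 草图(线性方差表、图片尺寸与时间步取值均为示意性假设):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # 方差表 beta_1 ... beta_T(示意)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # alpha_bar_t = prod_{i<=t} alpha_i

x0 = rng.uniform(-1, 1, size=(32, 32))  # 假设的一张“图片”(已归一化到 [-1, 1])
t = 150                                 # 随机选择的时间步(示意,按 1 开始计数)
eps = rng.standard_normal(x0.shape)     # eps ~ N(0, I)

# 一步加噪:x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
```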

\(t \to T\) 时(例如 \(T=1000\)),\(\bar{\alpha}_T = \prod_{i=1}^T (1 - \beta_i)\)。由于所有的 \(\beta_i > 0\)\(\bar{\alpha}_T\) 会非常接近 0。 此时: \[ \mathbf{x}_T \approx \sqrt{0} \mathbf{x}_0 + \sqrt{1 - 0} \boldsymbol{\epsilon} = \boldsymbol{\epsilon} \] 这意味着,在 \(T\) 步之后,\(\mathbf{x}_T\) 的分布 \(q(\mathbf{x}_T | \mathbf{x}_0) \approx \mathcal{N}(0, \mathbf{I})\),它几乎完全变成了标准高斯噪声,并且与 \(\mathbf{x}_0\) 无关。

成功地将复杂的图片分布 \(q(\mathbf{x}_0)\) 转化为了简单的标准高斯分布 \(q(\mathbf{x}_T)\)

小结

核心问题:如何逆转这个过程?如何从一张纯噪声图片 \(\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})\) 出发,一步步去噪,最终得到一张清晰的图片 \(\mathbf{x}_0\)

这需要推导反向去噪过程 (Reverse Process) \(p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)\)

3. 反向过程:从噪声到图像 (The Reverse Process)

我们的目标是学习反向的马尔可夫链,即 \(p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)\)

3.1 棘手的目标 \(p(\mathbf{x}_{t-1} | \mathbf{x}_t)\)

我们想从 \(\mathbf{x}_t\) 推导出 \(\mathbf{x}_{t-1}\)。根据贝叶斯公式: \[ p(\mathbf{x}_{t-1} | \mathbf{x}_t) = \frac{p(\mathbf{x}_t | \mathbf{x}_{t-1}) p(\mathbf{x}_{t-1})}{p(\mathbf{x}_t)} \] * \(p(\mathbf{x}_t | \mathbf{x}_{t-1})\) 就是前向过程 \(q(\mathbf{x}_t | \mathbf{x}_{t-1})\),我们已知。 * \(p(\mathbf{x}_{t-1})\)\(t-1\) 时刻的边缘分布,需要对所有 \(\mathbf{x}_0\) 积分 \(p(\mathbf{x}_{t-1}) = \int q(\mathbf{x}_{t-1} | \mathbf{x}_0) q(\mathbf{x}_0) d\mathbf{x}_0\),这依赖于 \(q(\mathbf{x}_0)\)(真实数据分布),极其困难 (intractable)。 * \(p(\mathbf{x}_t)\) 同样难以计算。

3.2 DDPM 的核心创见:利用 \(\mathbf{x}_0\)

关键洞察: 虽然 \(p(\mathbf{x}_{t-1} | \mathbf{x}_t)\) 难以计算,但如果我们额外知道 \(\mathbf{x}_0\),这个后验分布 \(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\) 是可计算的

为什么?因为我们定义了所有 \(q\) 的前向步骤。我们再次使用贝叶斯公式: \[ q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) \cdot q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_t | \mathbf{x}_0)} \] 利用马尔可夫性质 \(q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) = q(\mathbf{x}_t | \mathbf{x}_{t-1})\),我们得到: \[ q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \propto q(\mathbf{x}_t | \mathbf{x}_{t-1}) \cdot q(\mathbf{x}_{t-1} | \mathbf{x}_0) \] 我们已知这三个分布都是高斯分布(在第 2 部分已推导):

  1. \(q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{\alpha_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I})\)
  2. \(q(\mathbf{x}_{t-1} | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0, (1 - \bar{\alpha}_{t-1}) \mathbf{I})\)
  3. \(q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})\)

我们正在用 \(\mathbf{x}_{t-1}\) 作为变量,乘以两个高斯分布的概率密度函数 (PDF)。高斯 PDF 的形式是 \(C \cdot \exp(-\frac{(x - \mu)^2}{2\sigma^2})\)。两个高斯 PDF 相乘的结果仍然是一个高斯分布

通过匹配 \(\mathbf{x}_{t-1}\) 的一次项和二次项系数(一个繁琐但直接的代数过程),我们可以解出这个新高斯分布的均值 \(\tilde{\boldsymbol{\mu}}_t\) 和方差 \(\tilde{\beta}_t\)

\[ q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I}) \] 其中:

  • 方差 \(\tilde{\beta}_t\):不依赖于 \(\mathbf{x}_t\) 或 \(\mathbf{x}_0\),它是一个固定的超参数。 \[ \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \]
  • 均值 \(\tilde{\boldsymbol{\mu}}_t\):依赖于 \(\mathbf{x}_t\) 和 \(\mathbf{x}_0\)。 \[ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t \]

4. 训练:学习反向过程 (Training)

4.1 神经网络的目标

我们有了一个完美的目标分布 \(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\)。但它有个问题:在推理 (Inference) 时,我们从 \(\mathbf{x}_T\) 开始,并不知道 \(\mathbf{x}_0\)

因此,我们训练一个神经网络 \(p_\theta\)近似这个分布: \[ p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)) \] 我们的目标是让 \(p_\theta\) 尽可能接近 \(q\)。 * 简化1 (固定方差):DDPM 论文发现,将神经网络的方差 \(\boldsymbol{\Sigma}_\theta\) 固定为 \(\tilde{\beta}_t \mathbf{I}\)\(\beta_t \mathbf{I}\) 效果最好。这极大地简化了问题:神经网络只需要学习均值 \(\boldsymbol{\mu}_\theta\)。 * 简化2 (学习目标):我们训练 \(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\) 来预测真实均值 \(\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)\)

4.2 DDPM 的关键改进:预测噪声

\(\tilde{\boldsymbol{\mu}}_t\) 的公式 \(\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t\) 仍然很复杂。

DDPM 论文提出了一个重要的重参数化: 我们回顾第 2 部分的前向公式:\(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\)。 我们可以用它来反解 \(\mathbf{x}_0\)(在 \(\mathbf{x}_t\) 和 \(\boldsymbol{\epsilon}\) 已知的情况下): \[ \mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} (\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}) \] 现在,我们将这个 \(\mathbf{x}_0\) 的表达式代入上面 \(\tilde{\boldsymbol{\mu}}_t\) 的复杂公式中: \[ \begin{aligned} \tilde{\boldsymbol{\mu}}_t &= \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \left( \frac{1}{\sqrt{\bar{\alpha}_t}} (\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}) \right) + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t \\ &= \left( \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{(1 - \bar{\alpha}_t)\sqrt{\bar{\alpha}_t}} + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \right) \mathbf{x}_t - \left( \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t \sqrt{1 - \bar{\alpha}_t}}{(1 - \bar{\alpha}_t)\sqrt{\bar{\alpha}_t}} \right) \boldsymbol{\epsilon} \end{aligned} \] (经过一系列基于 \(\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}\) 和 \(\beta_t = 1 - \alpha_t\) 的代数化简) \[ \tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon} \right) \]

分析这个优美的公式:

  • \(\alpha_t\), \(\beta_t\), \(\bar{\alpha}_t\) 都是预先设定的超参数。
  • \(\mathbf{x}_t\) 是神经网络的输入。
  • 唯一未知的就是 \(\boldsymbol{\epsilon}\) —— 那个在第 2 部分用于从 \(\mathbf{x}_0\) 生成 \(\mathbf{x}_t\) 的原始噪声!

结论(DDPM 核心思想): 与其让神经网络 \(\boldsymbol{\mu}_\theta\) 预测那个复杂的均值 \(\tilde{\boldsymbol{\mu}}_t\),我们可以让它转而去预测这个噪声 \(\boldsymbol{\epsilon}\)。 我们定义一个神经网络(通常是 U-Net 结构)\(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\),它的目标就是预测 \(\boldsymbol{\epsilon}\)。

4.3 训练流程与损失函数

  1. 从数据集中随机抽取一张清晰图像 \(\mathbf{x}_0\)
  2. 随机选择一个时间步 \(t\)(从 1 到 \(T\))。
  3. 随机采样一个标准高斯噪声 \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\)。
  4. 使用前向公式一步生成加噪图像:\(\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\)
  5. \((\mathbf{x}_t, t)\) 作为输入,喂给神经网络 \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\),得到预测噪声 \(\boldsymbol{\epsilon}_\theta\)
  6. 计算损失函数(均方误差 MSE): \[ L(\theta) = E_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ || \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) ||^2 \right] \]
  7. 使用梯度下降更新网络参数 \(\theta\)
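对应上述训练流程的一个最小化 PyTorch 草图如下;其中 `eps_model` 是一个假设的噪声预测网络(例如 U-Net),接口 `eps_model(x_t, t)` 仅为示意:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(eps_model, x0):
    """单个 batch 的 DDPM 训练损失:对预测噪声做 MSE。"""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # 随机时间步(索引 0..T-1 对应 t=1..T)
    eps = torch.randn_like(x0)                           # eps ~ N(0, I)
    ab = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))   # 广播到图片的各个维度
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps       # 一步加噪得到 x_t
    eps_pred = eps_model(x_t, t)                         # 网络预测噪声 eps_theta(x_t, t)
    return torch.mean((eps - eps_pred) ** 2)             # L = ||eps - eps_theta||^2
```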

5. 推理:逐步去噪生成图像 (Inference)

当训练好 \(\boldsymbol{\epsilon}_\theta\) 后,就可以从纯噪声生成图像了:

  1. 起始: 从标准高斯分布中采样一张纯噪声图像 \(\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})\)
  2. 迭代: \(t = T\) 循环到 \(t = 1\)
    1. 将当前的 \(\mathbf{x}_t\) 和时间步 \(t\) 输入网络,得到噪声预测:\(\boldsymbol{\epsilon}_\theta = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\)。
    2. 使用 \(\boldsymbol{\epsilon}_\theta\) 作为我们对 \(\boldsymbol{\epsilon}\) 的最佳估计,代入 4.2 中的均值公式,计算 \(t-1\) 步的均值 \(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\): \[ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta \right) \]
    3. 计算 \(t-1\) 步的方差。我们使用固定的方差 \(\sigma_t^2 \mathbf{I} = \tilde{\beta}_t \mathbf{I} = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t \mathbf{I}\)
    4. 采样 (Sampling):从这个高斯分布中采样 \(\mathbf{x}_{t-1}\): \[ \mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z} \] 其中 \(\mathbf{z} \sim \mathcal{N}(0, \mathbf{I})\) 是一个新采样的随机噪声。 (注意:当 \(t=1\) 时,\(\mathbf{z}\) 设为 0,因为 \(\mathbf{x}_0\) 应该是一个确定性的输出,不再添加噪声)。
  3. 结束: 当循环结束时,\(\mathbf{x}_0\) 就是生成的清晰图像。
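与上述推理循环对应的 PyTorch 草图如下(`eps_model` 仍是假设的噪声预测网络,方差表与训练时一致):

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, betas):
    """DDPM 随机采样:从 x_T ~ N(0, I) 逐步去噪到 x_0。"""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    T = betas.shape[0]

    x = torch.randn(shape)                               # x_T ~ N(0, I)
    for i in reversed(range(T)):                         # i = T-1 ... 0,对应 t = T ... 1
        t = torch.full((shape[0],), i, dtype=torch.long)
        eps = eps_model(x, t)
        # mu_theta = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        mean = (x - betas[i] / torch.sqrt(1 - alpha_bar[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            # 固定方差 tilde_beta_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
            var = (1 - alpha_bar[i - 1]) / (1 - alpha_bar[i]) * betas[i]
            x = mean + torch.sqrt(var) * torch.randn_like(x)
        else:
            x = mean                                     # t = 1 时不再加噪声
    return x
```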

小结

完整地推导了 DDPM 的核心数学原理: 1. 前向过程 \(q\):使用 \(q(\mathbf{x}_t | \mathbf{x}_0)\) 高效加噪。 2. 反向过程 \(p_\theta\):通过贝叶斯公式推导出 \(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\) 作为理想目标。 3. 训练 \(p_\theta\):通过让 \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) 预测真实噪声 \(\boldsymbol{\epsilon}\) 来简化训练目标 (MSE Loss)。 4. 推理 \(p_\theta\):从 \(\mathbf{x}_T\) 开始,利用 \(\boldsymbol{\epsilon}_\theta\) 预测的均值,逐步采样 \(\mathbf{x}_{t-1}\)

DDPM 的一个主要缺点是推理速度慢(需要 \(T\) 步,例如 1000 步)。

DDIM (Denoising Diffusion Implicit Models)

DDPM 的效果很好,但它有两个主要缺点: 1. 推理速度慢: 它是一个马尔可夫过程,从 \(\mathbf{x}_T\) 生成 \(\mathbf{x}_0\) 必须执行 \(T\) 步(例如 1000 步)采样,非常耗时。 2. 推理是随机的: (Stochastic) 在每一步采样 \(\mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z}\) 时,都需要加入一个新的随机噪声 \(\mathbf{z}\)。这意味着即使从同一个 \(\mathbf{x}_T\) 出发,两次推理也会得到不同的 \(\mathbf{x}_0\)。这对于需要一致性的任务(如图像编辑)来说是个问题。

DDIM(2020年提出)巧妙地解决了这两个问题,并且无需重新训练在 DDPM 上训练好的模型。


6. DDIM:推理升级

6.1 DDIM 的核心洞察:重新审视反向过程

DDIM 的出发点是重新审视我们推导出的反向过程。DDPM 假设反向过程是 \(p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)\),并用它来近似 \(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\)

DDIM 注意到,我们训练的神经网络 \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) 实际上是在预测噪声 \(\boldsymbol{\epsilon}\)。 回顾我们的前向公式: \[ \mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon} \] 既然我们有了 \(\mathbf{x}_t\)(当前输入)和 \(\boldsymbol{\epsilon}_\theta\)(网络预测的 \(\boldsymbol{\epsilon}\)),我们可以直接反解出对清晰图像 \(\mathbf{x}_0\) 的预测,我们称之为 \(\hat{\mathbf{x}}_0\)

\[ \hat{\mathbf{x}}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right) \] 这个 \(\hat{\mathbf{x}}_0\) 是给定 \(\mathbf{x}_t\) 时,模型对最终结果 \(\mathbf{x}_0\) 的“最佳猜测”。

6.2 升级 1:确定性推理 (Deterministic Inference)

DDPM 的采样公式为: \[ \mathbf{x}_{t-1} = \underbrace{\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)}_{\text{均值}} + \underbrace{\sigma_t \mathbf{z}}_{\text{随机噪声}} \] DDIM 提出,这个过程不一定是随机的。DDIM 引入了一个新的参数 \(\eta\) (eta) 来控制随机性。 * 当 \(\eta=1\) 时,DDIM 的采样过程与 DDPM 完全相同(随机)。 * 当 \(\eta=0\) 时,采样过程中的方差 \(\sigma_t\) 被设为 0。

\(\eta=0\)(方差 \(\sigma_t=0\))时,采样步骤变为: \[ \mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) \] 这是确定性的! 没有随机噪声 \(\mathbf{z}\) 的介入。

这有什么用? 这意味着从一个固定的 \(\mathbf{x}_T\) 出发,无论运行多少次,总会生成完全相同的 \(\mathbf{x}_0\)。这使得扩散模型可用于图像编辑、风格转换等需要保持一致性的任务。

DDIM 论文推导出了一个更通用的采样公式,它不依赖于 \(\boldsymbol{\mu}_\theta\) 而是直接使用 \(\hat{\mathbf{x}}_0\)\(\boldsymbol{\epsilon}_\theta\)。 当 \(\eta=0\) (即 \(\sigma_t = 0\)) 时,从 \(\mathbf{x}_t\)\(\mathbf{x}_{t-1}\)确定性采样公式为:

\[ \mathbf{x}_{t-1} = \underbrace{\sqrt{\bar{\alpha}_{t-1}} \hat{\mathbf{x}}_0}_{\text{指向“预测的” } \mathbf{x}_0} + \underbrace{\sqrt{1 - \bar{\alpha}_{t-1}} \cdot \left( \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \hat{\mathbf{x}}_0}{\sqrt{1 - \bar{\alpha}_t}} \right)}_{\text{指向“当前的” } \mathbf{x}_t \text{ 的方向}} \] (注意:\(\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \hat{\mathbf{x}}_0}{\sqrt{1 - \bar{\alpha}_t}}\) 正好等于 \(\boldsymbol{\epsilon}_\theta\) ) 所以,确定性(\(\eta=0\))的 DDIM 采样步骤也可以写为: \[ \mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \hat{\mathbf{x}}_0 + \sqrt{1 - \bar{\alpha}_{t-1}} \cdot \boldsymbol{\epsilon}_\theta \]

6.3 升级 2:跳步采样 (Skip Sampling)

DDPM 必须一步一步 \(t \to t-1 \to t-2 \dots\) 地采样,因为它是马尔可夫过程。

DDIM 的采样公式(如上所示)是非马尔可夫的。它不依赖于 \(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\),而是直接使用在 \(t\) 时刻预测的 \(\hat{\mathbf{x}}_0\) 来计算 \(\mathbf{x}_{t-1}\)

关键洞察: 既然我们能从 \(\mathbf{x}_t\) 预测出 \(\hat{\mathbf{x}}_0\),我们不仅能计算 \(\mathbf{x}_{t-1}\),我们能计算任意 \(\mathbf{x}_{\tau}\) (其中 \(\tau < t\))。

这使得跳步采样成为可能。我们不再需要完整的 \(T=1000\) 步,我们可以定义一个更短的子序列,例如 \(S=20\) 步: \((\tau_1, \tau_2, \dots, \tau_S) = (1, 51, 101, \dots, 951)\)

我们的推理循环不再是 for t in (T...1),而是 for i in (S...1): * 当前步\(\tau_i\) (例如 \(\tau_{20} = 951\)) * 目标步\(\tau_{i-1}\) (例如 \(\tau_{19} = 901\))

DDIM 跳步采样(确定性)公式:

  1. 输入: 当前噪声图像 \(\mathbf{x}_{\tau_i}\) 和时间步 \(\tau_i\)
  2. 预测 \(\hat{\mathbf{x}}_0\) (与之前相同) \[ \hat{\mathbf{x}}_0 = \frac{1}{\sqrt{\bar{\alpha}_{\tau_i}}} \left( \mathbf{x}_{\tau_i} - \sqrt{1 - \bar{\alpha}_{\tau_i}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_{\tau_i}, \tau_i) \right) \]
  3. 计算 \(\mathbf{x}_{\tau_{i-1}}\) (使用确定性公式,将 \(t-1\) 替换为 \(\tau_{i-1}\)) \[ \mathbf{x}_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}} \hat{\mathbf{x}}_0 + \sqrt{1 - \bar{\alpha}_{\tau_{i-1}}} \cdot \boldsymbol{\epsilon}_\theta(\mathbf{x}_{\tau_i}, \tau_i) \]

结果: 我们不再需要 1000 步计算,而是通过跳步,仅用 20 步就完成了从 \(\mathbf{x}_T\)\(\mathbf{x}_0\) 的生成。这极大地(例如 50 倍)提升了推理速度。
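下面给出确定性(\(\eta=0\))DDIM 跳步采样的一个草图;`eps_model`、子序列长度 `n_steps` 等都是示意性假设,`alpha_bar` 与 DDPM 训练时使用的累积缩放因子相同:

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, shape, alpha_bar, n_steps=20):
    """确定性 DDIM 采样(eta = 0),在 T 步中均匀取 n_steps 个子步。"""
    T = alpha_bar.shape[0]
    taus = torch.linspace(T - 1, 0, n_steps).long()      # 递减的子序列 tau_S, ..., tau_1

    x = torch.randn(shape)                               # x_T ~ N(0, I)
    for k in range(n_steps):
        t = taus[k]
        t_prev = taus[k + 1] if k + 1 < n_steps else None
        eps = eps_model(x, torch.full((shape[0],), int(t)))
        # x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)
        x0_hat = (x - torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha_bar[t])
        if t_prev is None:
            x = x0_hat                                   # 最后一步直接输出预测的 x0
        else:
            # x_{tau_{i-1}} = sqrt(alpha_bar_prev) * x0_hat + sqrt(1 - alpha_bar_prev) * eps
            x = torch.sqrt(alpha_bar[t_prev]) * x0_hat + torch.sqrt(1 - alpha_bar[t_prev]) * eps
    return x
```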

小结

DDIM 是对 DDPM 的一次重大升级,它通过引入 \(\hat{\mathbf{x}}_0\) 预测和非马尔可夫采样,实现了: 1. 确定性推理\(\eta=0\)),增强了模型的可控性。 2. 跳步采样,极大缩短了推理时间。 3. 最重要的是,它复用了 DDPM 训练好的 \(\boldsymbol{\epsilon}_\theta\) 模型,无需额外训练。

好的,我们来探讨扩散模型演进的下一个重要阶段:流匹配 (Flow Matching)

在 DDPM 和 DDIM 中,我们都依赖于一个离散时间的 SDE(随机微分方程)或其确定性版本。我们模拟了 \(T\) 个离散步骤(例如 \(T=1000\)),这在数学上是有效的,但在概念上有些繁琐,并且依赖于 \(\beta_t\) (或 \(\bar{\alpha}_t\)) 这个人工设计的“噪声表”。

流匹配 (Flow Matching, FM) 模型(2022年及后续工作)提出了一种更简洁、更根本的视角:连续时间常微分方程 (ODE)


7. 流匹配:连续轨迹与速度预测

7.1 核心思想:从 SDE 到 ODE

  • DDPM (SDE):将图像 \(\mathbf{x}_0\) 变为噪声 \(\mathbf{x}_T\) 的过程是随机的(\(\mathbf{x}_t = \sqrt{\alpha_t} \mathbf{x}_{t-1} + \sqrt{\beta_t} \boldsymbol{\epsilon}\))。
  • 流匹配 (ODE):我们构建一个确定性的连续的“流”,将纯噪声 \(\mathbf{z}\)(我们这里称为 \(\mathbf{x}_0\))平滑地转变为清晰图像 \(\mathbf{x}_1\)

我们不再考虑离散步骤 \(t=1, 2, \dots, T\),而是考虑一个连续时间 \(t \in [0, 1]\)

  • \(t=0\)\(\mathbf{x}_0\) 是从 \(\mathcal{N}(0, \mathbf{I})\) 采样的纯噪声。
  • \(t=1\)\(\mathbf{x}_1\) 是我们想要生成的清晰图像。

这个从 \(\mathbf{x}_0\)\(\mathbf{x}_1\) 的连续演变路径 \(\mathbf{x}_t\) 由一个常微分方程 (ODE) 描述:

\[ \frac{d\mathbf{x}_t}{dt} = \mathbf{v}(\mathbf{x}_t, t) \] * \(\mathbf{v}(\mathbf{x}_t, t)\) 是一个速度向量场 (velocity vector field)。 * 它告诉我们:当一个点位于位置 \(\mathbf{x}_t\) 和时间 \(t\) 时,它应该往哪个方向(向量)以多快的速度(模长)移动。 * 训练目标: 我们的神经网络 \(\mathbf{v}_\theta(\mathbf{x}_t, t)\) 的目标就是学习这个速度场 \(\mathbf{v}\),而不是像 DDPM 那样学习噪声 \(\boldsymbol{\epsilon}\)

7.2 训练:学习速度场

问题: 理论上存在一个理想的速度场 \(\mathbf{v}\) 可以将噪声分布“推向”图像分布,但这个理想的 \(\mathbf{v}\) 非常复杂,我们无法知道。

流匹配的创见: 我们不需要知道那个复杂的理想场。我们可以自己定义无数条简单的路径,然后训练网络来学习这些简单路径的平均速度。

1. 定义简单路径(直线模型): 给定一个噪声 \(\mathbf{x}_0 \sim \mathcal{N}(0, \mathbf{I})\) 和一张真实图像 \(\mathbf{x}_1 \sim q(\text{data})\),连接它们的最简单路径是什么?一条直线

\[ \mathbf{x}_t = (1 - t) \mathbf{x}_0 + t \mathbf{x}_1 \]

  • 当 \(t=0\) 时,\(\mathbf{x}_t = \mathbf{x}_0\) (噪声)。

  • \(t=1\) 时,\(\mathbf{x}_t = \mathbf{x}_1\) (图像)。

2. 计算目标速度: 如果我们的“粒子” \(\mathbf{x}_t\) 沿着这条直线路径运动,它在 \(t\) 时刻的速度 \(\mathbf{v}_t\) 是多少?我们对 \(t\) 求导:

\[ \mathbf{v}_t = \frac{d\mathbf{x}_t}{dt} = \frac{d}{dt} \left( (1 - t) \mathbf{x}_0 + t \mathbf{x}_1 \right) = \mathbf{x}_1 - \mathbf{x}_0 \]

这就是流匹配的训练目标! 沿着这条直线路径,在任何时间 \(t\),目标速度都是恒定的 \(\mathbf{x}_1 - \mathbf{x}_0\)。

3. 训练流程:

  1. 从数据集中随机抽取一张清晰图像 \(\mathbf{x}_1\)
  2. 随机采样一个标准高斯噪声 \(\mathbf{x}_0 \sim \mathcal{N}(0, \mathbf{I})\)
  3. 随机选择一个时间 \(t\)(从 \(U(0, 1)\) 均匀采样)。
  4. 使用直线公式一步计算出路径上的点:\(\mathbf{x}_t = (1 - t) \mathbf{x}_0 + t \mathbf{x}_1\)
  5. \((\mathbf{x}_t, t)\) 作为输入,喂给神经网络 \(\mathbf{v}_\theta(\mathbf{x}_t, t)\),得到预测速度 \(\mathbf{v}_\theta\)
  6. 计算损失函数(均方误差 MSE): \[ L(\theta) = E_{t, \mathbf{x}_0, \mathbf{x}_1} \left[ || (\mathbf{x}_1 - \mathbf{x}_0) - \mathbf{v}_\theta(\mathbf{x}_t, t) ||^2 \right] \]
  7. 使用梯度下降更新网络参数 \(\theta\)。
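与上述训练流程对应的 PyTorch 草图(`v_model` 是假设的速度预测网络,接口 `v_model(x_t, t)` 仅为示意):

```python
import torch

def flow_matching_loss(v_model, x1):
    """单个 batch 的流匹配训练损失:对预测速度 (x1 - x0) 做 MSE。"""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                            # 噪声端点 x0 ~ N(0, I)
    t = torch.rand(b).view(b, *([1] * (x1.dim() - 1)))   # t ~ U(0, 1),并广播到数据维度
    x_t = (1 - t) * x0 + t * x1                          # 直线路径上的点
    v_target = x1 - x0                                   # 目标速度(沿直线恒定)
    v_pred = v_model(x_t, t.flatten())                   # 网络预测速度
    return torch.mean((v_target - v_pred) ** 2)
```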

7.3 推理:求解 ODE

当我们训练好 \(\mathbf{v}_\theta\) 后,我们就有了一个完整的速度场,它知道在时空中的任何点 \((\mathbf{x}, t)\) 应该如何移动。

问题: 如何从 \(\mathbf{x}_0\) 积分到 \(\mathbf{x}_1\)? 我们需要求解 ODE \(\frac{d\mathbf{x}_t}{dt} = \mathbf{v}_\theta(\mathbf{x}_t, t)\),从 \(t=0\) 求解到 \(t=1\)

方法: 我们使用数值积分方法,最简单的就是欧拉近似法(我们在第 1 部分提到过)。

  1. 起始: 随机采样一张纯噪声图像 \(\mathbf{x}_0 \sim \mathcal{N}(0, \mathbf{I})\)
  2. 离散化: 将时间 \([0, 1]\) 分为 \(N\) 步(例如 \(N=20\)),每一步 \(\Delta t = 1/N\)
  3. 迭代: \(t = 0\) 循环到 \(t = 1 - \Delta t\)
    1. 获取当前位置 \(\mathbf{x}_t\) 和时间 \(t\)
    2. 输入网络,得到当前速度:\(\mathbf{v}_\theta = \mathbf{v}_\theta(\mathbf{x}_t, t)\)
    3. 欧拉法更新: \[ \mathbf{x}_{t + \Delta t} = \mathbf{x}_t + \mathbf{v}_\theta \cdot \Delta t \] (新位置 = 旧位置 + 速度 × 时间)
  4. 结束: 当循环结束时,\(\mathbf{x}_1\) 就是生成的清晰图像。
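与上述推理流程对应的欧拉积分草图(`v_model` 仍是假设的速度场网络,步数 `n_steps` 为示意):

```python
import torch

@torch.no_grad()
def flow_matching_sample(v_model, shape, n_steps=20):
    """从 x_0 ~ N(0, I) 沿学到的速度场用欧拉法积分到 t = 1。"""
    dt = 1.0 / n_steps
    x = torch.randn(shape)                               # t = 0:纯噪声
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt)              # 当前时间(广播为 batch 大小)
        v = v_model(x, t)                                # 当前速度 v_theta(x_t, t)
        x = x + v * dt                                   # 欧拉更新:新位置 = 旧位置 + 速度 * dt
    return x                                             # t = 1:生成的样本
```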

优势:

  • 更简单: 训练目标 \(\mathbf{x}_1 - \mathbf{x}_0\) 非常直观,摆脱了 DDPM 中复杂的 \(\bar{\alpha}_t, \beta_t\) 系数。
  • 更高效: ODE 路径通常比 SDE 路径“更直”,因此流匹配通常可以用更少的推理步骤(例如 10-50 步)生成高质量图像。
  • 更灵活: 我们可以使用比欧拉法更高级的 ODE 求解器(如 Runge-Kutta)来进一步提高精度和速度。

总结

扩散模型从基础到前沿的全部核心数学:

  1. 基础 (Part 1):高斯分布、朗之万动力学 (SDE) 和贝叶斯公式。
  2. DDPM (Part 2-5)
    • 前向 (q)\(q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})\)
    • 反向 (p)\(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)\) 的推导。
    • 训练:预测噪声 \(L = || \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) ||^2\)
    • 推理\(\mathbf{x}_{t-1} = \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z}\) (随机,T 步)。
  3. DDIM (Part 6)
    • 核心:预测 \(\hat{\mathbf{x}}_0\)
    • 推理\(\mathbf{x}_{\tau_{i-1}} = \dots\) (确定性,可跳步)。
  4. 流匹配 (Part 7)
    • 核心:ODE 连续流 \(\frac{d\mathbf{x}_t}{dt} = \mathbf{v}(\mathbf{x}_t, t)\)
    • 训练:预测速度 \(L = || (\mathbf{x}_1 - \mathbf{x}_0) - \mathbf{v}_\theta(\mathbf{x}_t, t) ||^2\)
    • 推理\(\mathbf{x}_{t + \Delta t} = \mathbf{x}_t + \mathbf{v}_\theta \cdot \Delta t\) (ODE 求解)。

PHYS 5120 - 计算能源材料和电子结构模拟 Lecture

1. Møller-Plesset (MP) 微扰理论

  • 内容:

量子化学中关于 Møller-Plesset (MP) 微扰理论 的推导过程。这是一种在 Hartree-Fock (HF) 方法基础上引入电子相关效应(electron correlation)的后-HF方法。

标准微扰理论公式

这部分列出的是通用的 Rayleigh-Schrödinger 微扰理论 (RSPT) 的基本公式:

  1. 哈密顿算符 (Hamiltonian) 拆分:
    • \(\hat{H} = \hat{H}_0 + \hat{H}'\)
    • 将体系的总哈密顿算符 \(\hat{H}\) 拆分为一个可以精确求解的零阶哈密顿算符 \(\hat{H}_0\) 和一个微扰项 \(\hat{H}'\)
  2. 能量和波函数的展开:
    • \(E_n = E_n^{(0)} + E_n^{(1)} + E_n^{(2)} + \dots\)
    • \(|n\rangle = |n^{(0)}\rangle + |n^{(1)}\rangle + |n^{(2)}\rangle + \dots\)
    • 体系的真实能量 \(E_n\) 和波函数 \(|n\rangle\) 可以展开为零阶、一阶、二阶等各级修正项的总和。
  3. 一阶能量修正 (First-order energy correction):
    • \(E_n^{(1)} = \langle n^{(0)} | \hat{H}' | n^{(0)} \rangle\)
    • 一阶能量修正是微扰算符 \(\hat{H}'\) 在零阶波函数 \(|n^{(0)}\rangle\) 下的期望值。
  4. 二阶能量修正 (Second-order energy correction):
    • \(E_n^{(2)} = \sum_{k \neq n} \frac{|\langle k^{(0)} | \hat{H}' | n^{(0)} \rangle|^2}{E_n^{(0)} - E_k^{(0)}}\)
    • 二阶能量修正涉及所有其他零阶本征态 \(|k^{(0)}\rangle\)

Møller-Plesset 理论的定义

这部分将上述通用公式应用到 MP 理论中,定义了 \(\hat{H}_0\)\(\hat{H}'\)

  1. MP 理论的哈密顿算符定义:
    • 零阶哈密顿 \(\hat{H}_0\):
      • \(\hat{H}_0 = \hat{F} = \sum_{i=1}^N \hat{f}(i)\) (注:\(\hat{F}\) 是 Fock 算符,\(\hat{H}_0\) 被定义为 \(N\) 个电子的 Fock 算符之和)
    • 微扰项 \(\hat{H}'\):
      • \(\hat{H}' = \hat{H} - \hat{F}\) (即 \(\hat{H} - \hat{H}_0\))
      • 微扰项是真实的哈密顿算符 \(\hat{H}\) (包含完整的电子-电子排斥) 与 Fock 算符 \(\hat{F}\) (只包含平均化的电子排斥) 之间的差值。这个差值就是 HF 理论所忽略的“电子相关”。
  2. 零阶波函数和能量:
    • \(|\Phi_0\rangle\) = HF Slater determinant
    • 零阶波函数(即 \(\hat{H}_0\) 的本征函数)被选为 Hartree-Fock (HF) 斯莱特行列式
    • \(\hat{H}_0 |\Phi_0\rangle = (\sum_{i=1}^N \varepsilon_i) |\Phi_0\rangle\)
    • 零阶能量 \(E^{(0)}\) (即 \(E_{MP0}\)) 是所有被占据分子轨道 \(\varepsilon_i\) 的能量总和。
    • \(E_{MP0} = \langle \Phi_0 | \hat{H}_0 | \Phi_0 \rangle = \sum_{i=1}^N \varepsilon_i\)
  3. MP1 能量 (一阶能量修正):
    • \(E_{MP1} = \langle \Phi_0 | \hat{H}' | \Phi_0 \rangle = \langle \Phi_0 | \hat{H} - \hat{F} | \Phi_0 \rangle\)
    • \(E_{MP1} = \langle \Phi_0 | \hat{H} | \Phi_0 \rangle - \langle \Phi_0 | \hat{F} | \Phi_0 \rangle\)
    • 由于 \(\langle \Phi_0 | \hat{H} | \Phi_0 \rangle\) 正是 Hartree-Fock 能量 \(E_{HF}\),而 \(\langle \Phi_0 | \hat{F} | \Phi_0 \rangle = E_{MP0} = \sum \varepsilon_i\)
    • 因此:\(E_{MP1} = E_{HF} - \sum_{i=1}^N \varepsilon_i\)

重要结论: 零阶能量 (\(E_{MP0}\)) 和一阶能量 (\(E_{MP1}\)) 的总和恰好等于 Hartree-Fock 能量: \(E_{MP0} + E_{MP1} = (\sum \varepsilon_i) + (E_{HF} - \sum \varepsilon_i) = E_{HF}\) 这意味着 MP1 理论得到的总能量就是 HF 能量。要获得对 HF 能量的第一个修正(即电子相关能),我们必须计算到二阶,即 MP2

2. MP2 能量推导

这部分是 MP 理论的核心,推导了 MP2 能量(二阶能量修正)

  1. MP2 通用公式:
    • \(E_{MP2} = \sum_{k \neq 0} \frac{|\langle k | \hat{H}' | 0 \rangle|^2}{E_0 - E_k}\)
    • 这是将左侧的 \(E_n^{(2)}\) 公式应用于基态 (\(n=0\))。
  2. 关键简化 (Brillouin 定理):
    • 根据 Brillouin 定理,对于所有单激发行列式 \(|\Phi_i^a\rangle\) (即一个电子从占据轨道 \(i\) 跃迁到空轨道 \(a\)),矩阵元 \(\langle \Phi_i^a | \hat{H}' | \Phi_0 \rangle = 0\)
    • 这意味着在求和 \(\sum_{k \neq 0}\) 时,所有单激发的项都为零。
    • 因此,对 MP2 能量有贡献的第一类非零项来自双激发行列式 (Doubly-excited determinants),记为 \(|\Phi_{ij}^{ab}\rangle\)(即电子 \(i, j\) 激发到空轨道 \(a, b\))。
    • 白板中间的图示 || -> # (两条线变到两条更高阶的线) 正是形象地表示了这种双激发。
  3. MP2 最终公式:
    • \(E_{MP2} = \sum_{i<j}^{occ} \sum_{a<b}^{vir} \frac{|\langle \Phi_{ij}^{ab} | \hat{H}' | \Phi_0 \rangle|^2}{\varepsilon_i + \varepsilon_j - \varepsilon_a - \varepsilon_b}\)
    • 求和: 遍历所有占据轨道对 (\(i, j\)) 和所有空轨道(虚拟轨道)对 (\(a, b\))。
    • 分母: \(E_0 - E_k = (\sum \varepsilon) - (\sum \varepsilon - \varepsilon_i - \varepsilon_j + \varepsilon_a + \varepsilon_b) = \varepsilon_i + \varepsilon_j - \varepsilon_a - \varepsilon_b\)。这是零阶能量差,即激发所消耗的轨道能量。
    • 分子: \(\langle \Phi_{ij}^{ab} | \hat{H}' | \Phi_0 \rangle\) 可以被简化为一个关于分子轨道的双电子积分 \(\langle ij || ab \rangle\)

总结: Møller-Plesset 微扰理论的标准推导,核心思想是将 HF 解作为零阶近似,并将电子相关的“剩余部分”作为微扰。推导表明,MP1 能量只是重现了 HF 能量,而 MP2 能量是第一个真正包含电子相关效应的修正项,它通过计算所有可能的双激发对总能量的贡献来实现。
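作为示意,下面把 MP2 最终公式直接翻译成 NumPy 代码;其中 `eri[i, j, a, b]`(自旋轨道基下的反对称化双电子积分 \(\langle ij \| ab \rangle\))以及占据/空轨道能量 `eps_occ`、`eps_vir` 都是假设已由一次 HF 计算得到的输入:

```python
import numpy as np

def mp2_energy(eri, eps_occ, eps_vir):
    """E_MP2 = sum_{i<j} sum_{a<b} |<ij||ab>|^2 / (e_i + e_j - e_a - e_b)。

    eri[i, j, a, b]:反对称化积分 <ij||ab>(自旋轨道基,假设的输入)。
    eps_occ / eps_vir:占据 / 空轨道能量(假设的输入)。
    """
    n_occ, n_vir = len(eps_occ), len(eps_vir)
    e_mp2 = 0.0
    for i in range(n_occ):
        for j in range(i + 1, n_occ):
            for a in range(n_vir):
                for b in range(a + 1, n_vir):
                    denom = eps_occ[i] + eps_occ[j] - eps_vir[a] - eps_vir[b]
                    e_mp2 += abs(eri[i, j, a, b]) ** 2 / denom   # 分母为负,每项贡献 <= 0
    return e_mp2   # 恒为负值:MP2 总是在 HF 能量基础上降低总能量
```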

从 Møller-Plesset (MP) 理论过渡到了另一种更高级的量子化学方法:耦合簇理论 (Coupled Cluster Theory)

MP2 能量的深入探讨

这部分继续讨论 MP2 能量(二阶能量修正):

  1. MP2 能量公式 (重复):
    • \(E_{MP2} = \sum_{k \neq 0} \frac{|\langle \Phi_0 | \hat{H}' | \Phi_k \rangle|^2}{E_{MP0} - E_k^{(0)}}\)
    • 这是 MP2 能量的通用形式。
  2. 微扰的来源:
    • \(\frac{e^2}{|r_i - r_j|}\) (被圈出)
    • 这是电子-电子间的库仑排斥算符。这个项是 \(\hat{H}\)(真实哈密顿算符)和 \(\hat{H}_0\)(Fock 算符,仅包含平均场排斥)之间差异的核心。正是这个“瞬时相关”的相互作用导致了 \(\hat{H}'\)(微扰)的存在,也是 MP 理论试图修正的能量来源。
  3. Brillouin 定理的应用:
    • \(\langle \Phi_0 | \hat{H}' | \Phi_i^a \rangle = 0\)
    • 这再次强调了上一张白板的结论:微扰算符 \(\hat{H}'\) 在基态 \(\Phi_0\) 和任何单激发行列式 \(\Phi_i^a\) 之间的矩阵元为零。
    • \(\langle \Phi_0 | \hat{H} - \hat{F} | \Phi_i^a \rangle = 0\) (这是 \(\hat{H}'\) 的展开)
    • \(\langle \Phi_a | \hat{f} | \Phi_i \rangle = 0\) (注:\(\Phi_i\)\(\Phi_a\) 应为轨道 \(\phi_i\)\(\phi_a\))
    • 这一行解释了为什么 Brillouin 定理在 MP 理论中成立:因为在标准的 Hartree-Fock (HF) 方法中,占据轨道 (\(\phi_i\)) 和空轨道 (\(\phi_a\)) 之间的 Fock 矩阵元(即 \(\langle \phi_a | \hat{f} | \phi_i \rangle\))被设为零。
  4. MP2 最终实用公式:
    • \(E_{MP2} = \sum_{i<j}^{occ} \sum_{a<b}^{vir} \frac{|\langle \Phi_0 | \hat{H}' | \Phi_{ij}^{ab} \rangle|^2}{E_{MP0} - E_{ij}^{ab}}\)
    • 由于单激发项为零,求和 \(k \neq 0\) 中幸存下来的第一类项就是双激发\(\Phi_{ij}^{ab}\)(电子从 \(i,j\) 轨道激发到 \(a,b\) 轨道)。
    • \(E_{ij}^{ab}\) 是双激发组态的零阶能量。
  5. MP2 能量的性质:
    • \(< 0\): MP2 修正能量总是负值。这意味着 MP2 能量总是在 HF 能量(\(E_{MP0} + E_{MP1}\))的基础上进一步降低总能量。这是符合物理直觉的,因为电子相关效应允许电子更好地相互“躲避”,从而降低体系的总能量。
    • “80%” (大约): 这是一个常见的经验之谈,即对于许多小分子,MP2 方法大约能“恢复”80% 到 90% 的电子相关能。

3. 耦合簇理论 (Coupled Cluster Theory)

这部分介绍了一种更强大、更准确(也更昂贵)的后-HF方法。

  1. 标题: Coupled Cluster Theory

  2. CC 波函数拟设 (Ansatz):

    • \(\Psi_{CC} = e^{\hat{T}} \Phi_{HF}\) (这里 \(\Phi_{HF}\)\(\Phi_0\))
    • 这是 CC 理论的核心!它假设真实的波函数 \(\Psi_{CC}\) 可以通过一个“指数算符” \(e^{\hat{T}}\) 作用在 HF 参考波函数 \(\Phi_0\) 上得到。
  3. 指数算符 \(e^{\hat{T}}\):

    • \(e^{\hat{T}} = \hat{1} + \hat{T} + \frac{1}{2!} \hat{T}^2 + \frac{1}{3!} \hat{T}^3 + \dots\)
    • 指数算符通过泰勒级数展开。这种指数形式具有非常重要的特性,即它能自动包含“非关联”的高阶激发(例如,两次独立的双激发 \(\hat{T}_2^2\) 会产生四激发),这使得 CC 方法具有大小一致性 (size-consistency),这是 MP 理论(在某些阶数上)所缺乏的重要特性。
  4. 簇算符 (Cluster Operator) \(\hat{T}\):

    • \(\hat{T} = \hat{T}_1 + \hat{T}_2 + \hat{T}_3 + \dots\)
    • 簇算符 \(\hat{T}\) 本身是所有可能的激发算符的总和。
  5. 激发算符的定义:

    • \(\hat{T}_1\) (单激发):
      • \(\hat{T}_1 \Phi_0 = \sum_{i}^{occ} \sum_{a}^{vir} t_i^a \Phi_i^a\)
      • \(\hat{T}_1\) 产生所有可能的单激发态的线性组合。\(t_i^a\) 是未知的“振幅 (amplitudes)”,即激发的重要性权重,需要通过求解 CC 方程得到。
    • \(\hat{T}_2\) (双激发):
      • \(\hat{T}_2 \Phi_0 = \sum_{i<j}^{occ} \sum_{a<b}^{vir} t_{ij}^{ab} \Phi_{ij}^{ab}\)
      • \(\hat{T}_2\) 产生所有可能的双激发态的线性组合。\(t_{ij}^{ab}\) 是双激发的振幅。

总结: 这张白板从 MP2 理论(一种基于微扰的方法)过渡到了耦合簇理论(一种基于指数拟设的非微扰方法)。

  • MP2 通过二阶微扰,只显式地考虑了双激发对能量的贡献。
  • Coupled Cluster (例如 CCSD,即 \(\hat{T} = \hat{T}_1 + \hat{T}_2\)) 则系统地包含了 \(\hat{T}_1\) (单激发) 和 \(\hat{T}_2\) (双激发) 及其所有乘积(如 \(\hat{T}_1^2, \hat{T}_1 \hat{T}_2, \hat{T}_2^2\) 等),因此它隐式地包含了某些更高阶的激发(如四激发),使其成为比 MP2 更准确、更鲁棒的方法。

当然可以。这是一个非常核心的量子化学概念。

简单来说:双激发 (Double Excitations) 是描述电子“躲避”彼此这一行为的最主要最基本的数学方式。

MP2 和 CCSD 都高度关注双激发,因为它们是捕获“电子相关能” (Electron Correlation Energy) 的关键。

问题的根源:HF 理论错过了什么?

在白板的 MP 理论推导中,我们从 Hartree-Fock (HF) 波函数 (\(\Phi_0\)) 开始。

  • HF 的问题: HF 是一种“平均场”理论。它假设一个电子(例如电子 \(i\))只感受到所有其他电子(例如电子 \(j\))的平均电荷分布,而不是它们在某一瞬时的真实位置。
  • 物理现实: 电子是带负电的粒子,它们会瞬时地相互排斥。如果电子 \(i\) 在分子的 A 点,电子 \(j\) 会倾向于避开 A 点,跑到 B 点去。这种为了“躲避”对方而调整自己行为的现象,就叫做 电子相关 (Electron Correlation)

HF 理论忽略了这种瞬时躲避,因此 HF 能量总是高于真实的基态能量。我们所说的“电子相关能”,就是这个能量差。

解决方案:双激发(\(\Phi_{ij}^{ab}\)

我们如何用数学来描述“电子 \(i\)\(j\) 相互躲避”?

想象一下,在 HF 基态 (\(\Phi_0\)) 中,电子 \(i\)\(j\) 在它们各自的轨道里。

  • 为了描述它们“躲开”彼此,我们需要在波函数中“混入”一个新的组态 (configuration)。
  • 在这个新组态中,电子 \(i\)\(j\) 同时从它们原来的轨道(占据轨道 \(i, j\))“跳”到了两个新的、能量更高的空轨道(虚拟轨道 \(a, b\))。
  • 这个“双重跳跃”的态,就是白板上的 双激发态 \(\Phi_{ij}^{ab}\)

通过将这个双激发态 \(\Phi_{ij}^{ab}\) 线性叠加到基态 \(\Phi_0\) 中,我们的总波函数就能描述这样一种情形:电子 \(i\)\(j\) 有一定概率不在它们“应该”在的地方,而是跑到了别处——这就是它们相互躲避的数学表达。

为什么单激发 (\(\Phi_i^a\)) 不行? 如白板所示,Brillouin 定理(\(\langle \Phi_0 | \hat{H}' | \Phi_i^a \rangle = 0\))告诉我们,在 MP 理论框架下,HF 基态和单激发态之间没有直接的相互作用。单激发主要描述的是轨道本身的形状调整(轨道弛豫),而不是电子之间的相关。

MP2 和 CCSD 的联系与区别

MP2 和 CCSD 都是基于这个核心思想,但实现方式的复杂程度和准确性完全不同。

🔹 MP2 (Møller-Plesset 2nd Order)

  • 它做了什么: MP2 是一种微扰方法。它把双激发当作一种“微小的扰动”。
  • 如何工作: 它使用二阶微扰公式(白板上的 \(E_{MP2}\))来计算:“如果我允许体系中的每一对电子 (i, j) 发生一次双激发 (到 a, b),这会使体系的总能量降低多少?”
  • 局限性:
    1. 它只计算一次激发的影响。
    2. 它只考虑了双激发。
    3. 它假设这种扰动很“小”。

MP2 是最简单、计算最便宜的包含电子相关的方法,它抓住了相关能的“大头”( ~80%)。

🔹 CCSD (Coupled Cluster Singles and Doubles)

  • 它做了什么: CCSD 是一种非微扰方法。它不认为相关是“小扰动”,而是从根本上重新构建波函数。
  • 如何工作: 它使用指数拟设 \(\Psi_{CC} = e^{\hat{T}_1 + \hat{T}_2} \Phi_0\)
    • \(\hat{T}_2\) 算符(白板上有写)就是用来产生所有双激发的。
    • \(\hat{T}_1\) 算符(白板上有写)用来产生所有单激发(用于轨道弛豫)。
  • 关键区别(指数的魔力):
    • 当指数 \(e^{\hat{T}}\) 展开时 (\(e^{\hat{T}} = 1 + \hat{T} + \frac{1}{2}\hat{T}^2 + \dots\)),你会得到像 \(\frac{1}{2}(\hat{T}_2)^2\) 这样的项。
    • \(\frac{1}{2}(\hat{T}_2)^2\) 意味着两次独立的双激发。这在物理上代表了四激发(例如,分子中两对互不相干的电子同时在各自“躲避”)。
    • 这就是 CCSD 远比 MP2 准确的原因:它不仅包含了 \(\hat{T}_2\)(双激发),还通过指数形式自动包含了由 \(\hat{T}_2\) 组合产生的高阶激发(如四激发、六激发等)。它正确地描述了多个电子对同时发生相关行为的情况。

总结对比

| 特性 | MP2 | CCSD (\(\hat{T} = \hat{T}_1 + \hat{T}_2\)) |
|---|---|---|
| 方法类型 | 微扰理论 (Perturbative) | 非微扰 (Non-perturbative) |
| 核心思想 | 计算双激发对能量的二阶修正。 | 用指数算符包含所有单、双激发。 |
| 包含的激发 | 仅显式包含双激发。 | 显式包含单、双激发;隐式包含高阶激发(四、六等)。 |
| 准确性 | 良好(约 80-90% 相关能) | 优秀(“黄金标准”之一) |
| 数学联系 | MP2 的能量公式可以被证明是 CCSD 方程的最低阶近似。 | 更完整、更高级的理论。 |

总结:

双激发是描述电子相关(相互躲避)的物理核心。MP2 用最简单的方式估算了它的能量贡献,而 CCSD 则用一种更完备、更强大的数学(指数)形式将其及组合效应系统地包含了进来。

4. 量子蒙特卡洛 (Quantum Monte Carlo)

耦合簇理论 (Coupled Cluster)

1. CCSD(T):“(T)” 的含义

  • 回顾 CCSD 正如标题 Coupled Cluster Singles and Doubles (CCSD) 所示,CCSD 方法只完整地包含了单激发算符 (\(\hat{T}_1\)) 和双激发算符 (\(\hat{T}_2\))。它通过指数形式 \(e^{(\hat{T}_1 + \hat{T}_2)}\),已经能间接地包含一些高阶激发(如四激发 \(\hat{T}_2^2\))。
  • 缺失的项: CCSD 没有 显式地包含 “连通”(connected) 三激发算符 \(\hat{T}_3\)\(\hat{T}_3\) 描述的是三个电子同时相互关联并激发到三个空轨道。
  • (T) 的含义: (T) 代表 “微扰三激发” (perturbative Triples)
    • 为什么不直接用 CCSDT? 完整地求解包含 \(\hat{T}_3\) 的方程(即 CCSDT 方法)在计算上极其昂贵(计算量随体系大小的 \(N^8\) 增长)。
    • (T) 的解决之道: CCSD(T) 是一种巧妙的折中。它首先执行一个完整的、较便宜的 CCSD 计算(求解 \(\hat{T}_1\)\(\hat{T}_2\))。然后,它使用这些已知的 \(\hat{T}_1\)\(\hat{T}_2\) 振幅,通过微扰理论(就像第一张白板上的 MP 理论一样!)来估算 \(\hat{T}_3\) 会对总能量产生的修正
  • “黄金标准” (Gold Standard): CCSD(T) 方法被广泛认为是量子化学的“黄金标准”。它以可承受的计算成本(\(N^7\) 增长)提供了接近“完美”的能量,是绝大多数高精度计算的基准。
  • “95%” 注释: 白板上的 95% 注释很可能是讲师的一个经验之谈,即 CCSD 大约能恢复 95% 的电子相关能(相比 MP2 的 ~80%),而 (T) 修正则能将这个数字推向 99% 甚至更高。

2. 耦合簇方程 (CC Equations)

白板的中间部分展示了如何求解 CCSD 方程以获得能量 \(E\) 和振幅 \(t_i^a, t_{ij}^{ab}\)

  • 总薛定谔方程:

    • \(H e^{\hat{T}} | \Phi_0 \rangle = E e^{\hat{T}} | \Phi_0 \rangle\)
    • 这是 CC 理论要解的薛定谔方程 (\(\hat{H} |\Psi_{CC}\rangle = E |\Psi_{CC}\rangle\))。
  • 求解方法(投影法): 为了解出未知的 \(E\), \(\hat{T}_1\), \(\hat{T}_2\),我们将这个总方程“投影”到不同的激发空间上:

    1. 能量 \(E\)
      • \(\langle \Phi_0 | e^{-(\hat{T}_1+\hat{T}_2)} \hat{H} e^{(\hat{T}_1+\hat{T}_2)} | \Phi_0 \rangle = E\)
      • 将总方程左乘 \(\langle \Phi_0 |\)(HF 基态)并积分,可以直接得到总能量 \(E\)
    2. 求解 \(\hat{T}_1\)(方程 ①):
      • \(\langle \Phi_i^a | e^{-(\hat{T}_1+\hat{T}_2)} \hat{H} e^{(\hat{T}_1+\hat{T}_2)} | \Phi_0 \rangle = 0\)
      • 将总方程左乘 \(\langle \Phi_i^a |\)(所有单激发态)。这会产生一系列方程,求解它们可以得到所有 \(\hat{T}_1\) 的振幅 \(t_i^a\)
    3. 求解 \(\hat{T}_2\)(方程 ②):
      • \(\langle \Phi_{ij}^{ab} | e^{-(\hat{T}_1+\hat{T}_2)} \hat{H} e^{(\hat{T}_1+\hat{T}_2)} | \Phi_0 \rangle = 0\)
      • 将总方程左乘 \(\langle \Phi_{ij}^{ab} |\)(所有双激发态)。这会产生另一系列方程,求解它们可以得到所有 \(\hat{T}_2\) 的振幅 \(t_{ij}^{ab}\)

总结: CCSD 是一个复杂的非线性方程组。我们通过求解方程 ① 和 ② 得到所有激发振幅,然后将这些振幅代入能量方程,得到最终的 CCSD 能量。

量子蒙特卡洛 (Quantum Monte Carlo, QMC)

这部分介绍了一种完全不同的、不依赖于轨道和激发的方法。

  • 标题: Quantum Monte Carlo
  • 子标题: Variational / Diffusion MC (变分蒙特卡洛 / 扩散蒙特卡洛)

变分蒙特卡洛 (VMC)

VMC 的核心思想,基于变分原理

  1. 能量期望值:
    • \(E(\theta) = \frac{\langle \Psi(\theta) | \hat{H} | \Psi(\theta) \rangle}{\langle \Psi(\theta) | \Psi(\theta) \rangle}\)
    • 任何一个“试验波函数” \(\Psi(\theta)\)\(\theta\) 是函数中的可调参数)所计算出的能量 \(E(\theta)\) 永远大于或等于真实的基态能量 \(E_{\text{ground}}\)
    • 目标: \(\min_{\theta} E(\theta) = E_{\text{ground}}\) (更准确地说是 \(E_{\text{approx}}\))
    • 通过调整参数 \(\theta\) 来最小化 \(E(\theta)\),我们可以获得对基态能量的最佳近似。
  2. 蒙特卡洛方法如何计算?
    • 上述能量公式是一个极其复杂的高维积分(积分维度 = 3 \(\times\) 电子数)。
    • QMC 它不直接计算这个积分,而是使用随机抽样(“蒙特卡洛”方法)来估算它。
  3. VMC 计算步骤(如白板所示):
    • a. 重写积分: \(E = \frac{\int |\Psi(\vec{x}, \theta)|^2 \left[ \frac{\hat{H} \Psi(\vec{x}, \theta)}{\Psi(\vec{x}, \theta)} \right] d\vec{x}}{\int |\Psi(\vec{x}, \theta)|^2 d\vec{x}}\)
      • \(\vec{x}\) 代表所有电子的坐标 (\(r_1, r_2, \dots\))。
      • \(\frac{|\Psi(\vec{x})|^2}{\int |\Psi(\vec{x})|^2 d\vec{x}}\) 是电子在 \(\vec{x}\) 处被发现的概率密度 \(P(\vec{x})\)
      • \(E_L(\vec{x}) = \frac{\hat{H} \Psi(\vec{x}, \theta)}{\Psi(\vec{x}, \theta)}\) 被称为 “局域能量” (Local Energy)
    • b. 抽样:
      • 整个积分就变成了在 \(P(\vec{x})\) 概率分布下,对 \(E_L(\vec{x})\) 求平均值。
      • VMC 方法通过一种算法(如 Metropolis 算法)产生大量的、符合 \(|\Psi|^2\) 分布的随机电子构型(“walkers”)。
    • c. 求平均:
      • \(E \approx \frac{1}{N} \sum_{\text{samples } \vec{x}} E_L(\vec{x})\)
      • 最终的能量就是所有采样点上“局域能量”的简单平均值。

总结: VMC 通过在“真实空间”中随机移动电子来直接估算能量,完全绕过了 MP 或 CC 理论中复杂的轨道和激发概念。Diffusion MC (DMC) 是其更高级的变体,原则上可以找到精确的基态能量。
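下面用一个最小的一维例子示意 VMC 的完整流程:取谐振子 \(\hat{H} = -\tfrac{1}{2}\frac{d^2}{dx^2} + \tfrac{1}{2}x^2\)(原子单位),试验波函数 \(\Psi(x;\alpha) = e^{-\alpha x^2}\),其局域能量可解析地写为 \(E_L(x) = \alpha + x^2(\tfrac{1}{2} - 2\alpha^2)\);Metropolis 步长、采样数等数值均为假设:

```python
import numpy as np

rng = np.random.default_rng(0)

def vmc_energy(alpha, n_samples=50_000, step=1.0):
    """一维谐振子的 VMC:按 |Psi|^2 做 Metropolis 采样并平均局域能量。"""
    x = 0.0
    energies = []
    for _ in range(n_samples):
        x_new = x + step * rng.uniform(-1, 1)            # 随机游走提议
        # 接受概率 = |Psi(x_new)|^2 / |Psi(x)|^2 = exp(-2*alpha*(x_new^2 - x^2))
        if rng.random() < np.exp(-2 * alpha * (x_new**2 - x**2)):
            x = x_new
        energies.append(alpha + x**2 * (0.5 - 2 * alpha**2))   # 局域能量 E_L(x)
    return np.mean(energies)

print(vmc_energy(0.4), vmc_energy(0.5), vmc_energy(0.6))
# 在 alpha = 0.5 附近能量最低(精确基态能量 0.5),体现了变分原理
```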

这张白板标志着一个重大的理论转变:从前面讨论的“波函数方法”(Wavefunction Methods, 如 MP2, CCSD)转向了一种完全不同的、在计算化学和物理中占主导地位的方法——密度泛函理论 (Density Functional Theory, DFT)

5. 波函数方法的“维度灾难”

一个生动的例子,说明了为什么基于波函数的方法(如 CCSD)在计算上极其昂贵,甚至是不可能的。

  1. “wavefunction method” (波函数方法):
    • 这类方法(如 HF, MP2, CCSD)的中心目标是求解体系的 \(N\) 电子波函数 \(\Psi(\vec{r}_1, \vec{r}_2, \dots, \vec{r}_N)\)
    • 这个波函数 \(\Psi\) 是一个极其复杂的对象,它是一个 \(3N\) 维度的函数(每个电子有 3 个空间坐标 \(x, y, z\))。
  2. 计算成本的“立方体”比喻:
    • 假设我们想在一个 3D 空间中存储一个电子的波函数。
    • 如果我们在每个维度(x, y, z)上只使用 10 个网格点,我们就需要 \(10 \times 10 \times 10 = 10^3\) 个点来描述这一个电子。
    • 现在,考虑一个有 30 个电子的体系(一个中等大小的分子)。
    • 由于总波函数 \(\Psi\)\(3 \times 30 = 90\) 维的,我们需要 \((10^3)^{30} = 10^{90}\) 个网格点来存储这个波函数。
    • \(10^{90}\) 是一个天文数字! 白板上的 \(10^{77}\)\(10^{80}\) 可能是用来比较的数字(例如,可观测宇宙中的原子总数约 \(10^{80}\))。这个数字 (\(10^{90}\)) 意味着直接存储或计算 N 电子波函数在计算上是绝对不可能的。
    • 这就是所谓的“维度灾难 (Curse of Dimensionality)”。
  3. DFT:解决方案登场
    • 在指出了波函数方法的根本困难后,白板上写下了 DFT,预示着它是一种解决方案。
  4. 体系的哈密顿算符 (\(\hat{H}\)):
    • \(\hat{H} = \underbrace{-\sum_i \frac{\hbar^2}{2m} \nabla_i^2}_{\text{电子动能}} + \underbrace{\sum_i V_{\text{ext}}(\vec{r}_i)}_{\text{电子-原子核吸引}} + \underbrace{\frac{1}{2} \sum_{i \neq j} \frac{e^2}{|\vec{r}_i - \vec{r}_j|}}_{\text{电子-电子排斥}}\)
    • 这是任何分子体系的完整(非相对论)哈密顿算符。
  5. 关键的简化:
    • \(\langle \Psi | \sum_i V_{\text{ext}}(\vec{r}_i) | \Psi \rangle = \int V_{\text{ext}}(\vec{r}) \rho(\vec{r}) d\vec{r}\)
    • 这一行展示了一个至关重要的简化:\(N\) 电子的“电子-原子核吸引能”的期望值(一个 \(3N\) 维积分),可以被精确地重写为一个只涉及电子密度 \(\rho(\vec{r})\) 的** 3 维积分**。
    • 这就引出了一个问题:我们是否能用这个简单的 \(\rho(\vec{r})\) 来代替 \(\Psi\) 呢?

DFT 的理论基石

这部分介绍了 DFT 的核心定理

  1. 电子密度 (Electron Density, \(\rho(\vec{r})\)):
    • \(\int \rho(\vec{r}) d\vec{r} = N\)
    • 电子密度 \(\rho(\vec{r})\) 是一个在 3D 空间中的函数。它描述了在任意空间点 \(\vec{r}\) 处找到一个电子的概率。
    • 无论体系有多少电子(\(N=30\)\(N=1000\)),\(\rho(\vec{r})\) 始终是一个简单的 3 维函数。这与 \(3N\) 维的波函数 \(\Psi\) 形成了鲜明对比。
  2. Hohenberg-Kohn (H-K) 定理:
    • 这是 DFT 的全部理论基础。白板上的图示完美地总结了第一个 H-K 定理:
    • 标准路径(上 \(\rightarrow\) 下): \(V_{\text{ext}}(\vec{r})\)(外势,即原子核的位置和电荷)\(\implies \Psi(\vec{r})\)(基态波函数)\(\implies \rho(\vec{r})\)(基态密度)。
      • 解释: 原子核的位置决定了 \(\hat{H}\),求解 \(\hat{H}\) 得到 \(\Psi\),由 \(\Psi\) 可以计算出 \(\rho\)
    • H-K 革命性路径(下 \(\leftrightarrow\) 上): \(V_{\text{ext}}(\vec{r}) \Longleftrightarrow \rho(\vec{r})\)
      • 第一 H-K 定理证明:体系的基态电子密度 \(\rho(\vec{r})\) 唯一地决定了外势 \(V_{\text{ext}}(\vec{r})\)(因此也唯一决定了 \(\hat{H}\)\(\Psi\) 和总能量 \(E\))。
  3. H-K 定理的重大意义:
    • 所有信息都在 \(\rho(\vec{r})\) 中! 体系的基态总能量 \(E\) 是基态密度 \(\rho\) 的一个泛函 (Functional),记为 \(E[\rho]\)
    • 目标转变: 我们不再需要去求解那个 \(10^{90}\) 维的波函数 \(\Psi\)!我们只需要找到那个能使总能量 \(E[\rho]\) 最小的 3 维密度 \(\rho(\vec{r})\)

总结: 首先论证了“波函数方法”在计算上的不可能性(维度灾难),然后引入了 DFT 作为解决方案,其理论依据是 H-K 定理——即体系的所有信息都包含在简单的 3 维电子密度 \(\rho(\vec{r})\) 中。

6. 证明:第一 Hohenberg-Kohn (H-K) 定理的经典证明。这个定理是密度泛函理论 (DFT) 的基石。

这个证明使用的是 “Proof by contradiction” (反证法)

定理内容: 体系的基态电子密度 \(\rho_0(\vec{r})\) 唯一地决定了其外势 \(V_{ext}(\vec{r})\)(即原子核的位置和电荷),并因此唯一地决定了体系的哈密顿算符 \(\hat{H}\) 和波函数 \(\Psi\)

证明步骤详解

1. 假设 (为反证法设置)

我们假设 H-K 定理是错的。

这意味着我们假设存在两个不同 (different) 的外势 \(V_{ext}^{(1)}\)\(V_{ext}^{(2)}\),它们分别对应各自的哈密顿算符 \(\hat{H}^{(1)}\)\(\hat{H}^{(2)}\) 和基态波函数 \(\Psi^{(1)}\)\(\Psi^{(2)}\)

但我们假设,这两个完全不同的体系却碰巧产生了完全相同 (same) 的基态电子密度 \(\rho_0(\vec{r})\)

假设总结: * \(V_{ext}^{(1)} \neq V_{ext}^{(2)}\) (因此 \(\hat{H}^{(1)} \neq \hat{H}^{(2)}\)\(\Psi^{(1)} \neq \Psi^{(2)}\)) * 但是 \(\rho^{(1)}(\vec{r}) = \rho^{(2)}(\vec{r}) = \rho_0(\vec{r})\) * \(E^{(1)}\) 是体系1的基态能量。 * \(E^{(2)}\) 是体系2的基态能量。

2. 应用变分原理 (第 1 步)

变分原理指出:任何一个“试验波函数” \(\Psi_{\text{trial}}\) 对某个哈密顿算符 \(\hat{H}\) 的能量期望值,永远高于或等于该 \(\hat{H}\) 的真实基态能量 \(E_{GS}\)

  • 我们将 \(\Psi^{(2)}\) (体系2的基态波函数) 作为体系1的试验波函数
  • 根据变分原理,\(\Psi^{(2)}\) 计算出的 \(\hat{H}^{(1)}\) 的能量必定高于 \(\hat{H}^{(1)}\) 的真实基态能量 \(E^{(1)}\)(因为我们假设了 \(\Psi^{(1)} \neq \Psi^{(2)}\))。
  • \(E^{(1)} < \langle \Psi^{(2)} | \hat{H}^{(1)} | \Psi^{(2)} \rangle\) (白板上的第3行)

3. 展开能量 (第 2 步)

我们来展开 \(\langle \Psi^{(2)} | \hat{H}^{(1)} | \Psi^{(2)} \rangle\) 这一项。

  • 我们知道 \(\hat{H}^{(1)} = \hat{H}^{(2)} + (V_{ext}^{(1)} - V_{ext}^{(2)})\)

  • 代入上式: \(\langle \Psi^{(2)} | \hat{H}^{(1)} | \Psi^{(2)} \rangle = \langle \Psi^{(2)} | \hat{H}^{(2)} + V_{ext}^{(1)} - V_{ext}^{(2)} | \Psi^{(2)} \rangle\)

  • 拆分它: \(= \langle \Psi^{(2)} | \hat{H}^{(2)} | \Psi^{(2)} \rangle + \langle \Psi^{(2)} | V_{ext}^{(1)} - V_{ext}^{(2)} | \Psi^{(2)} \rangle\)

  • 第一项 \(\langle \Psi^{(2)} | \hat{H}^{(2)} | \Psi^{(2)} \rangle\) 正是体系2的基态能量 \(E^{(2)}\)

  • 第二项(如上一张白板所示)可以写成关于密度的积分: \(\int (V_{ext}^{(1)} - V_{ext}^{(2)}) \rho^{(2)}(\vec{r}) d\vec{r}\)

  • 根据我们的初始假设,\(\rho^{(2)}(\vec{r}) = \rho_0(\vec{r})\)

  • 将这些组合起来,第1步的变分不等式 \(E^{(1)} < \dots\) 就变成了: \(E^{(1)} < E^{(2)} + \int (V_{ext}^{(1)} - V_{ext}^{(2)}) \rho_0(\vec{r}) d\vec{r}\) (白板上的第5行)

4. 对称地应用变分原理 (第 3 步)

现在我们反过来,将 \(\Psi^{(1)}\) 作为体系2的试验波函数

  • 根据变分原理: \(E^{(2)} < \langle \Psi^{(1)} | \hat{H}^{(2)} | \Psi^{(1)} \rangle\)

  • 我们用同样的方法展开 \(\hat{H}^{(2)} = \hat{H}^{(1)} + (V_{ext}^{(2)} - V_{ext}^{(1)})\)\(\langle \Psi^{(1)} | \hat{H}^{(2)} | \Psi^{(1)} \rangle = \langle \Psi^{(1)} | \hat{H}^{(1)} | \Psi^{(1)} \rangle + \langle \Psi^{(1)} | V_{ext}^{(2)} - V_{ext}^{(1)} | \Psi^{(1)} \rangle\)

  • 这等于: \(= E^{(1)} + \int (V_{ext}^{(2)} - V_{ext}^{(1)}) \rho^{(1)}(\vec{r}) d\vec{r}\)

  • 再次使用我们的假设 \(\rho^{(1)}(\vec{r}) = \rho_0(\vec{r})\)

  • 因此,我们得到了第二个不等式: \(E^{(2)} < E^{(1)} + \int (V_{ext}^{(2)} - V_{ext}^{(1)}) \rho_0(\vec{r}) d\vec{r}\) (白板上的第6行)

5. 导出矛盾 (第 4 步)

现在我们把两个不等式(第5行和第6行)相加:

\(E^{(1)} + E^{(2)} < \left[ E^{(2)} + \int (V_{ext}^{(1)} - V_{ext}^{(2)}) \rho_0 d\vec{r} \right] + \left[ E^{(1)} + \int (V_{ext}^{(2)} - V_{ext}^{(1)}) \rho_0 d\vec{r} \right]\)

我们来合并右侧的项:

\(E^{(1)} + E^{(2)} < (E^{(1)} + E^{(2)}) + \int \underbrace{[(V_{ext}^{(1)} - V_{ext}^{(2)}) + (V_{ext}^{(2)} - V_{ext}^{(1)})]}_{= 0} \rho_0 d\vec{r}\)

右侧的两个积分项完全抵消,变成了 0。

  • 于是我们得到了最终的荒谬结论: \(E^{(1)} + E^{(2)} < E^{(1)} + E^{(2)}\) (白板上的最后一行)

6. 结论

“一个数严格小于它自己” (\(A < A\)) 是一个数学上不可能的悖论。

这个悖论证明了我们的初始假设一定是错误的

因此,两个不同的外势 \(V_{ext}^{(1)}\)\(V_{ext}^{(2)}\) 不可能 产生相同的基态密度 \(\rho_0\)

证明完毕: 体系的基态电子密度 \(\rho_0(\vec{r})\) 唯一地决定了其外势 \(V_{ext}(\vec{r})\),并因此决定了体系的所有基态性质。

7. 更多

Hohenberg-Kohn (H-K) 定理证明了 \(E[\rho]\)(能量是密度的泛函)的存在性,但它没有告诉我们这个泛函到底长什么样。

问题在于: 我们不知道电子动能 \(T[\rho]\) 和电子-电子排斥能 \(V_{ee}[\rho]\) 的精确泛函形式。

Kohn-Sham (KS) 理论 就是为了解决这个问题而提出的。

这个理论的核心思想是:“我们假装在解一个简单问题,然后把所有的复杂性都藏在一个我们去近似的项里。”

以下是这个实践过程的分解:

1. 引入一个“虚拟”的无相互作用体系

Kohn 和 Sham 假设存在一个虚拟的无相互作用(non-interacting)的电子体系。

这个虚拟体系有一个关键的约束:它被设计为与真实的、有相互作用的体系具有完全相同的基态电子密度 \(\rho(\vec{r})\)

这为什么有帮助? 因为对于一个无相互作用的体系,我们精确地知道它的动能泛函 \(T_s[\rho]\)!它就是所有单电子轨道 \(\phi_i\) 的动能之和。(\(s\) 代表 “single-particle” 或 “non-interacting”)。

2. 重写总能量泛函 \(E[\rho]\)

现在,Kohn-Sham 将真实体系的总能量 \(E[\rho]\) 重新组织为以下四项:

\(E[\rho] = T_s[\rho] + \int V_{ext}(\vec{r}) \rho(\vec{r}) d\vec{r} + E_H[\rho] + E_{xc}[\rho]\)

我们来逐项分析:

  1. \(T_s[\rho]\):无相互作用动能
    • 这是我们刚刚引入的虚拟体系的动能。我们可以精确计算它(通过求解轨道 \(\phi_i\))。
    • (注:这不是真实体系的动能 \(T[\rho]\),但它通常是 \(T[\rho]\) 的一个很好的近似。)
  2. \(\int V_{ext}(\vec{r}) \rho(\vec{r}) d\vec{r}\):外势能
    • 这是电子-原子核的吸引能。这个泛函是精确已知的(就像上一张白板展示的)。
  3. \(E_H[\rho]\):哈特里 (Hartree) 能量
    • \(E_H[\rho] = \frac{1}{2} \iint \frac{\rho(\vec{r})\rho(\vec{r}')}{|\vec{r}-\vec{r}'|} d\vec{r}d\vec{r}'\)
    • 这是电子密度与其自身相互作用的经典库仑排斥能。这个泛函也是精确已知的。
  4. \(E_{xc}[\rho]\):交换-相关 (Exchange-Correlation) 泛函
    • 这是 DFT 实践的核心!
    • \(E_{xc}[\rho]\) 被定义为一个“垃圾桶”,它包含了所有我们不知道的、以及我们故意用近似替换掉的所有复杂物理:
      • (\(T[\rho] - T_s[\rho]\)):真实动能与无相互作用动能之间的差值(即动能的相关部分)。
      • (\(V_{ee}[\rho] - E_H[\rho]\)):总电子排斥能与经典库仑排斥能之间的差值(即所有非经典的交换效应和相关效应)。

3. Kohn-Sham 方程:实践的工具

现在我们有了能量表达式。根据变分原理(Hohenberg-Kohn 第二定理),我们通过最小化 \(E[\rho]\) 来寻找基态密度 \(\rho(\vec{r})\)

对这个能量泛函 \(E[\rho]\) 应用变分法,最终会得到一组类似于薛定谔方程的单电子方程,这就是著名的 Kohn-Sham (KS) 方程

\(\left( -\frac{\hbar^2}{2m} \nabla^2 + V_{eff}(\vec{r}) \right) \phi_i(\vec{r}) = \varepsilon_i \phi_i(\vec{r})\)

  • \(\phi_i(\vec{r})\) 就是“Kohn-Sham 轨道”,电子密度由它们构成:\(\rho(\vec{r}) = \sum_i |\phi_i(\vec{r})|^2\)
  • \(V_{eff}(\vec{r})\) 是一个“有效势”,无相互作用的电子在这个势场中运动: \(V_{eff}(\vec{r}) = V_{ext}(\vec{r}) + V_H(\vec{r}) + V_{xc}(\vec{r})\)
    • \(V_{ext}\):原子核的势。
    • \(V_H\):电子间的经典库仑势(来自 \(E_H\))。
    • \(V_{xc}\)交换-相关势(来自 \(E_{xc}\))。

4. 唯一的近似:“泛函动物园”

Kohn-Sham 理论在形式上是精确的。 如果我们知道了精确的 \(E_{xc}[\rho]\) 泛函,我们将得到体系的精确基态能量和密度。

但在实践中,我们不知道精确的 \(E_{xc}[\rho]\)

因此,所有实用的 DFT 计算都变成了对 \(E_{xc}[\rho]\) 的近似。

这就是您可能听说过的所有“泛函”(functionals)的来源,它们是对 \(E_{xc}[\rho]\) 的不同近似,构成了所谓的“泛函动物园”(Functional Zoo):

  • LDA (局域密度近似): 最简单的近似,只依赖于 \(\rho\)
  • GGA (广义梯度近似): 依赖于 \(\rho\) 和它的梯度 \(\nabla\rho\) (例如 PBE, BLYP)。
  • Hybrid (杂化) 泛函: 混合了一部分 Hartree-Fock 的精确交换(例如 B3LYP, PBE0)。

总结:实践中的 DFT

  1. 选择一个近似的 \(E_{xc}[\rho]\) 泛函(例如 B3LYP)。
  2. 猜测一个初始的电子密度 \(\rho_{guess}\)
  3. 根据 \(\rho_{guess}\) 计算出有效势 \(V_{eff}\)
  4. 求解 Kohn-Sham 方程,得到一组新的轨道 \(\phi_i\)
  5. 根据新的 \(\phi_i\) 计算出一个新的电子密度 \(\rho_{new}\)
  6. 比较 \(\rho_{new}\)\(\rho_{guess}\)。如果它们足够接近,计算完成(“自洽”)。
  7. 如果不同,则混合新旧密度,返回第 3 步,循环迭代直到收敛。

Kohn-Sham 理论的伟大之处在于,它将一个极其复杂的 \(N\) 电子问题(如 \(10^{90}\) 维度),在数学上等价地转化为了一个(原则上精确的)求解 \(N\) 个电子在有效势场中运动的 3 维问题。
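下面用一段 Python 风格的结构草图勾勒上述自洽循环;`build_veff`、`solve_ks` 是假设的回调函数(分别代表“由密度构造有效势”和“解 KS 方程并重建密度”),仅用来展示循环骨架,并非任何真实程序包的接口:

```python
import numpy as np

def kohn_sham_scf(rho_guess, build_veff, solve_ks, mix=0.3, tol=1e-6, max_iter=100):
    """KS 自洽场循环的结构草图(回调函数均为假设)。

    build_veff(rho) -> V_eff = V_ext + V_H[rho] + V_xc[rho]
    solve_ks(veff)  -> 解 KS 方程,返回由新轨道构成的密度 rho_new
    """
    rho = rho_guess
    for _ in range(max_iter):
        veff = build_veff(rho)                    # 第 3 步:由当前密度构造有效势
        rho_new = solve_ks(veff)                  # 第 4-5 步:解 KS 方程并重建密度
        if np.max(np.abs(rho_new - rho)) < tol:
            return rho_new                        # 第 6 步:密度收敛,自洽完成
        rho = (1 - mix) * rho + mix * rho_new     # 第 7 步:线性混合新旧密度再迭代
    raise RuntimeError("SCF not converged")
```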

PHYS 5120 - 计算能源材料和电子结构模拟 Lecture

Lecturer: Prof. PAN DING

I:

1. 基本概念与定义

“HF theory” (Hartree-Fock 理论)

这是一种在量子化学中用于近似求解多电子体系(如分子)薛定谔方程的从头计算法 (ab initio method)

  • 核心思想: 它将复杂的多电子问题简化为一系列独立的单电子问题。它假设每个电子都在一个由原子核和其他所有电子共同产生的平均场中运动。
  • 波函数: 它使用一个斯莱特行列式 (Slater determinant) 来描述 N 电子体系的波函数,这自动满足了泡利不相容原理(即波函数在交换任意两个电子时反号)。

\(E_{HF} \neq \sum_{i} \epsilon_i\)

  • \(E_{HF}\): HF 总能量。这是整个 N 电子体系的总能量,包含了所有电子的动能、电子与原子核的吸引能、以及电子与电子之间的排斥能。
  • \(\epsilon_i\): 轨道能量 (Orbital Energy)。这是第 \(i\) 个电子在所有其他 \(N-1\) 个电子的平均场中所具有的能量。
  • \(\sum_{i} \epsilon_i\): 所有被占据轨道能量的总和。
  • 为什么不相等?:
    • 在计算总能量 \(E_{HF}\) 时,电子-电子排斥能 \(V_{ee}\) 只计算一次。
    • 在计算每个轨道能量 \(\epsilon_i\) 时,它包含了第 \(i\) 个电子与所有 \(j\) 电子(\(j \neq i\))的排斥。
    • 当你把所有的 \(\epsilon_i\) 相加时 (\(\sum_i \epsilon_i\)),每一对电子 (\(i\)\(j\)) 之间的排斥能被计算了两次(一次在 \(\epsilon_i\) 中,一次在 \(\epsilon_j\) 中)。
    • 正确的公式是: \(E_{HF} = \sum_{i} \epsilon_i - V_{ee}\),或者更准确地写为: \[E_{HF} = \sum_{i} \epsilon_i - \frac{1}{2} \sum_{i,j} (J_{ij} - K_{ij})\] (这里的 \(J_{ij}\)\(K_{ij}\) 是下面会讲到的库仑积分和交换积分)。这个 \(\frac{1}{2}\) 就是为了修正重复计算。

“band gap” (带隙)

这个图显示了两个关键的分子轨道:

  • HOMO: Highest Occupied Molecular Orbital (最高占据分子轨道)。这是在体系基态下,能量最高的、包含电子的轨道。
  • LUMO: Lowest Unoccupied Molecular Orbital (最低未占分子轨道)。这是在体系基态下,能量最低的、没有电子的轨道。
  • HOMO-LUMO 隙: \(E_{LUMO} - E_{HOMO}\)。在分子中,这通常被粗略地称为“带隙”,它大致对应于将电子从基态激发到第一激发态所需的最低能量(光学带隙)。

“fundamental band gap = I - A”

  • 解释: 这是基本带隙(或称准粒子带隙)的严格定义。它和 HOMO-LUMO 隙是不同的概念。
  • 它代表了在不考虑电子-空穴相互作用(激子效应)的情况下,产生一个分离的电子和一个空穴所需的能量。
  • \(I - A = [E(N-1) - E(N)] - [E(N) - E(N+1)] = E(N-1) + E(N+1) - 2E(N)\)

“ionization energy: I = E(N-1) - E(N) > 0”

  • 电离能 (I):
    • \(E(N)\): 具有 N 个电子的中性体系的基态总能量。
    • \(E(N-1)\): 移走一个电子后,具有 (N-1) 个电子的阳离子的基态总能量。
    • 定义: 从一个体系中移走一个电子所需的最小能量
    • 因为电子被原子核束缚,移走它需要外界提供能量,所以 \(I\) 总是大于 0。

“electron affinity: A = E(N) - E(N+1) > 0”

  • 电子亲和能 (A):
    • \(E(N+1)\): 增加一个电子后,具有 (N+1) 个电子的阴离子的基态总能量。
    • 定义: 一个中性体系获得一个电子并形成稳定阴离子时所释放的能量。
    • 注意: 如果 \(E(N+1) < E(N)\)(即阴离子更稳定),则 \(A > 0\),体系释放能量。如果阴离子不稳定(\(E(N+1) > E(N)\)),则 \(A < 0\)。白板上的 \(> 0\) 假设了形成稳定阴离子的情况。

2. Hartree-Fock 总能量公式

\(E_{HF}\)(HF 总能量)的完整表达式。它由三部分组成:

第一行:单电子能量 (One-Electron Energy)

\[\sum_{i\sigma} \int d\vec{r} \phi_i^*(\vec{r}) \left( -\frac{\hbar^2}{2m} \nabla^2 + V_{ext} \right) \phi_i(\vec{r})\]

  • \(\sum_{i\sigma}\): 对所有被占据的自旋轨道 (spin-orbital) \(i\) (自旋为 \(\sigma\)) 求和。
  • \(\int d\vec{r}\): 对空间坐标 \(\vec{r}\) 积分。
  • \(\phi_i(\vec{r})\): 第 \(i\) 个自旋轨道的波函数。\(\phi_i^*\) 是它的复共轭。
  • \(-\frac{\hbar^2}{2m} \nabla^2\): 动能算符\(\hbar\) 是约化普朗克常数,\(m\) 是电子质量,\(\nabla^2\) 是拉普拉斯算符。
  • \(V_{ext}\): 外势场。这是来自所有原子核对电子的库仑吸引势
  • 含义: 这一整行代表了所有电子的动能与它们受到原子核吸引势能的总和。

第二行:库仑项 (Coulomb Term, J)

\[+ \frac{e^2}{2} \sum_{i\sigma, j\sigma'} \iint d\vec{r} d\vec{r}' \frac{|\phi_i(\vec{r})|^2 |\phi_j(\vec{r}')|^2}{|\vec{r} - \vec{r}'|}\]

  • \(\frac{e^2}{2} \sum_{i\sigma, j\sigma'}\): 对所有电子对 (\(i\)\(j\),无论自旋 \(\sigma\)\(\sigma'\) 是否相同) 求和。\(e\) 是电子电荷。
  • \(\frac{1}{2}\): 修正因子,因为求和时 (\(i,j\)) 和 (\(j,i\)) 被计算了两次,而它们是同一种相互作用。
  • \(|\phi_i(\vec{r})|^2\): 电子 \(i\)\(\vec{r}\) 处出现的概率密度(即电荷密度)。
  • \(\frac{...}{|\vec{r} - \vec{r}'|}\): 两个点电荷之间的库仑排斥。
  • 含义: 这是电子 \(i\) 的电荷云和电子 \(j\) 的电荷云之间的经典静电排斥能。这是一个纯粹的经典概念。

第三行:交换项 (Exchange Term, K)

\[- \frac{e^2}{2} \sum_{i\sigma, j\sigma} \iint d\vec{r} d\vec{r}' \frac{\phi_i^*(\vec{r}) \phi_j(\vec{r}) \phi_j^*(\vec{r}') \phi_i(\vec{r}')}{|\vec{r} - \vec{r}'|}\]

(注意:白板上的公式在 \(\phi\) 的变量上似乎有些笔误,这里写的是标准形式)

  • \(\sum_{i\sigma, j\sigma}\): 关键区别! 这里的求和只对自旋相同 (\(\sigma\)) 的电子对进行。
  • 含义: 这是一个纯粹的量子力学效应,没有经典对应。它源于波函数必须满足泡利不相容原理(反对称性)。
  • 它修正了库仑项。由于泡利不相容,自旋相同的电子有“避开”彼此的倾向(称为费米空穴 (Fermi hole)),这导致它们的实际排斥能小于经典的库仑排斥能。交换项 \(K\) 是一个正值,因此在总能量中它是一个负的贡献(使能量更低,体系更稳定)。

3. 核心对比:\(\Delta SCF\) 与 Koopmans 定理

核心是比较计算 \(I\)\(A\) 的两种方法。

\(\Delta SCF\)” (Delta SCF)

  • 含义: “\(\Delta\)” (Delta) 指的是差值,“SCF” (Self-Consistent Field, 自洽场) 是求解 HF 方程的计算过程。
  • 方法:
    1. 对 N 电子体系进行一次完整的 SCF 计算,得到总能量 \(E(N)\)
    2. 对 (N-1) 电子体系(阳离子)进行另一次完整的 SCF 计算,得到总能量 \(E(N-1)\)
    3. \(I = E(N-1) - E(N)\)
  • 特点:
    • 准确: 这是在 HF 理论框架内计算电离能的最准确方法。
    • 计算量大: 需要进行两次(或多次)独立的、昂贵的 SCF 计算。

“Koopmans’ theorem” (Koopmans 定理)

  • 含义: 这是一个近似方法,它将电离能/电子亲和能与单次 HF 计算得到的轨道能量直接关联起来。
  • 定理内容:
    • 电离能: \(I \approx -\epsilon_{HOMO}\) (电离能约等于 HOMO 轨道能量的负值)
    • 电子亲和能: \(A \approx -\epsilon_{LUMO}\) (电子亲和能约等于 LUMO 轨道能量的负值)
  • 特点:
    • 近似: 结果不如 \(\Delta SCF\) 准确。
    • 计算量小: 只需要对 N 电子体系进行一次 SCF 计算,就能“免费”得到所有轨道能量,从而估算出 \(I\)\(A\)

“frozen orbital approximation” (冻结轨道近似)

  • 含义: 这是 Koopmans 定理背后的核心假设
  • 内容: 它假设当你从 N 电子体系中移出一个电子(例如从 HOMO 移出)后,剩下的 (N-1) 个电子的轨道(\(\phi\)完全不发生任何改变(即被“冻结”了)。
  • 推导: 白板上的 \(-I = E(N) - E(N-1)\) 以及指向 \(E_{HF}\) 公式的箭头,就是在演示这个推导。如果你假设 \(E(N-1)\) 只是 \(E(N)\) 的公式中去掉了所有与 HOMO (设为 \(k\) 轨道) 相关的项,那么 \(E(N) - E(N-1)\) 经过数学推导后恰好等于 \(\epsilon_k\)\(k\) 轨道的轨道能量)。因此 \(I = -\epsilon_k\)

“orbital relaxation” (轨道弛豫)

  • 含义: 这是 Koopmans 定理出错的原因,也是“冻结轨道近似”所忽略的物理现实。
  • 内容:
    • 真实情况是:当你移出一个电子后,电子-电子排斥减少了。
    • 剩下的 (N-1) 个电子会“感觉”到更强的来自原子核的净吸引力,导致它们的轨道会收缩重新排布(即“弛豫”),以达到一个新的、能量更低的稳定状态。
    • 因此,通过 \(\Delta SCF\) 计算得到的真实 \(E(N-1)\)(弛豫后的能量)总是低于 Koopmans 定理所假设的 \(E(N-1)_{frozen}\)(冻结轨道的能量)。
  • 结论:
    • \(I_{\Delta SCF} = E(N-1)_{relaxed} - E(N)\)
    • \(I_{Koopmans} = E(N-1)_{frozen} - E(N)\)
    • 因为 \(E(N-1)_{relaxed} < E(N-1)_{frozen}\),所以 \(I_{\Delta SCF} < I_{Koopmans}\)
    • 换句话说,Koopmans 定理总是高估(overestimate)电离能

总结

  • \(\Delta SCF\): 准确(在 HF 理论内)、计算量大。它通过计算两个不同体系的总能量之差来求 \(I\),并包含了轨道弛豫效应。
  • Koopmans 定理: 近似、计算量小。它通过一次计算的轨道能量来估算 \(I\),它基于冻结轨道近似忽略了轨道弛豫

Koopmans 定理的数学推导,并将其与带隙 (band gap) 的概念联系起来。

II:

1. Koopmans 定理的数学推导(电离能 I)

在数学上证明为什么 \(I \approx -\epsilon_{HOMO}\)

  • 起点: \(-I = E(N) - E(N-1)\)
    • 这是电离能 (I) 定义的变形。我们现在要计算的是,在冻结轨道近似下,N 电子体系的总能量 \(E(N)\) 与 (N-1) 电子体系的总能量 \(E(N-1)_{frozen}\) 之差。
    • 我们假设被移走的电子来自 HOMO 轨道,我们称之为轨道 \(k\)(或白板上的 \(N\))。
  • 中间的大型公式:
    • \(E(N) - E(N-1)_{frozen}\) 的结果就是白板上写的这些项。当你从总能量 \(E_{HF}\)(在第一张图中有)中减去一个“冻结”的 (N-1) 体系能量时,你剩下的恰好是与被移走的那个电子(来自轨道 \(k\))相关的所有能量项。
    • 这些项是:
      1. 单电子能量: \(\langle \phi_k | \hat{h} | \phi_k \rangle\) (白板上用 \(\hat{f}(\vec{r})\) 等符号表示)。这是 \(k\) 电子自身的动能和它受到的所有原子核的吸引能。
      2. 库仑项 (J): \(+ e^2 \sum_{j} \iint ...\)。这是 \(k\) 电子与所有其他 \(j\) 电子之间的经典库仑排斥能。
      3. 交换项 (K): \(- e^2 \sum_{j} \iint ...\)。这是 \(k\) 电子与所有同自旋\(j\) 电子之间的量子交换能。
  • 关键等式: \(\hat{f}_i \phi_i = \epsilon_i \phi_i\)
    • 这是 Hartree-Fock 方程。它定义了轨道能量 \(\epsilon_i\)
    • \(\hat{f}_i\)Fock 算符,它本身就包含了上述的三部分能量:单电子能量算符 (\(\hat{h}\))、所有其他电子的库仑算符 (\(\hat{J}\)) 和交换算符 (\(\hat{K}\))。
    • 因此,上面那一大堆积分(\(k\) 电子的单电子能量 + 它与所有其他电子的 J 和 K 相互作用)根据定义,就等于 \(\epsilon_k\)
  • 推导结论: \(= \epsilon_k = \epsilon_{HOMO}\)
    • 因为 \(E(N) - E(N-1)_{frozen} = \epsilon_k\)\(k\) 是 HOMO 轨道)。
    • 所以 \(-I = \epsilon_{HOMO}\)
    • 最终得到: \(I = -\epsilon_{HOMO}\)
    • 总结: Koopmans 定理的推导,其核心就是冻结轨道近似。这个近似使得 \(E(N)\)\(E(N-1)\) 之间的能量差,恰好等于被移走的那个电子的轨道能量

2. Koopmans 定理(电子亲和能 A)

  • \(-A = E(N+1) - E(N)\):
    • 这是电子亲和能 \(A = E(N) - E(N+1)\) 的变形。
    • 我们现在考虑在冻结轨道近似下,向 N 电子体系的 LUMO 轨道加入一个电子。
  • \(= \epsilon_{LUMO}\):
    • 逻辑: 同样使用冻结轨道近似,我们假设 N 电子体系中所有原来的轨道在加入新电子后不发生改变(不弛豫)。
    • 那么,(N+1) 电子体系的总能量 \(E(N+1)_{frozen}\) 与 N 电子体系的总能量 \(E(N)\) 之差,就恰好等于这个新电子被加入的那个轨道(即 LUMO)的轨道能量
    • 因此,\(-A = \epsilon_{LUMO}\)
    • 最终得到: \(A = -\epsilon_{LUMO}\)

3. 带隙 (Band Gap)

  • \(E_{gap} = I - A\):
    • 这是基本带隙 (Fundamental Gap)严格定义
    • 它代表了从体系中移出一个电子并将其放置到远离体系的无穷远处,然后再从无穷远处拿一个电子放回体系中(假设的),这两个过程的能量差。
    • 它是一个总能量 (Total Energy) 的差值,需要三次 \(\Delta SCF\) 计算(\(E(N)\), \(E(N-1)\), \(E(N+1)\))才能精确得到。
  • \(= \epsilon_{LUMO} - \epsilon_{HOMO}\):
    • 这是Koopmans 定理对基本带隙的近似
    • 通过代入我们刚刚推导出的近似值:
      • \(I \approx -\epsilon_{HOMO}\)
      • \(A \approx -\epsilon_{LUMO}\)
    • 我们得到:\(E_{gap} = I - A \approx (-\epsilon_{HOMO}) - (-\epsilon_{LUMO}) = \epsilon_{LUMO} - \epsilon_{HOMO}\)
    • 关键结论: \(\epsilon_{LUMO} - \epsilon_{HOMO}\)(即 HOMO-LUMO 隙)是基本带隙 \(I - A\) 的一个近似值。
    • 为什么是近似? 因为它完全忽略了计算 \(I\)\(A\) 时的轨道弛豫效应。

4. 右下角的图示

  • HOMO / LUMO: 显示了最高占据和最低未占两个轨道能级。
  • ~1.1 - 1.3 eV: 这是一个示例数值,用来表示 HOMO-LUMO 隙的典型大小(这个数值接近于硅的带隙,可能是一个具体的例子)。
  • 上面的竖线: 代表了 LUMO 之上的一系列能量更高、未被占据的虚拟轨道 (virtual orbitals)。在固体物理中,这对应于导带 (Conduction Band)
  • HOMO 下的能级 (未画出): 代表了所有能量更低、已被占据的轨道。在固体物理中,这对应于价带 (Valence Band)

总结

以上是Koopmans 定理的数学论证,并得出了一个在计算化学和材料科学中非常重要(虽然是近似的)的结论:

基本带隙 (\(I - A\)) 可以通过单次 HF 计算得到的 HOMO-LUMO 隙 (\(\epsilon_{LUMO} - \epsilon_{HOMO}\)) 来估算。

这个近似的准确性取决于 “冻结轨道近似” 带来的误差(忽略轨道弛豫)与 HF 理论本身带来的误差(忽略电子相关能)在多大程度上相互抵消。

III:

“后-HF” (post-Hartree-Fock) 方法

在前面的讨论中,我们确定了 HF 理论的两个主要缺陷: 1. 忽略了轨道弛豫(导致 Koopmans 定理不准)。 2. 忽略了电子相关能(HF 是一个平均场理论,没有考虑电子的瞬时“躲避”行为)。

介绍如何修正第二个(也是更根本的)缺陷。这个方法叫做 “Configuration Interaction (CI)” (组态相互作用)

1. 📖 \(\Phi_{HF} = \frac{1}{\sqrt{N!}} | \dots |\) (Slater Determinant)

  • 含义: 这是 Hartree-Fock (HF) 波函数 (\(\Phi_{HF}\)) 的数学定义
  • \(\frac{1}{\sqrt{N!}}\): 归一化常数。
  • \(| \dots |\): 这是一个斯莱特行列式 (Slater Determinant)
    • 例如,对于一个 2 电子体系(如氦原子),它写作: \[\Phi(1,2) = \frac{1}{\sqrt{2}} \begin{vmatrix} \phi_1(1) & \phi_2(1) \\ \phi_1(2) & \phi_2(2) \end{vmatrix} = \frac{1}{\sqrt{2}} [\phi_1(1)\phi_2(2) - \phi_1(2)\phi_2(1)]\]
    • \(\phi_1(1)\) 表示电子 1 处于 \(\phi_1\) 自旋轨道。
  • 目的: 这种行列式形式是满足泡利不相容原理(波函数在交换任意两个电子时必须反号)的最简洁的数学工具。

HF 理论假设体系的真实波函数 \(\Psi\) 可以被单个斯莱特行列式 \(\Phi_{HF}\) 很好地近似

2. ⚡ “Configuration Interaction (CI)” (组态相互作用)

  • 核心思想: HF 理论的“平均场”假设(即 \(\Psi \approx \Phi_{HF}\))是一个近似。一个更精确的波函数 \(\Psi\) 应该是一个线性组合,它不仅包含 HF 基态组态 \(\Phi_{HF}\),还包含所有可能的激发组态 (excited configurations)
  • 能级图:
    • 图显示了 HF 基态。轨道 1 和 2 被电子占据(占据轨道, occ),轨道 3, 4, 5… 是空的(虚拟轨道, virt)。
    • 这个基态组态就是 \(\Phi_{HF}\)
  • 什么是激发组态?
    • 单激发 (Singles, S): 将一个电子从一个占据轨道 (\(i\)) 激发到一个虚拟轨道 (\(a\))。记作 \(\Phi_i^a\)
    • 双激发 (Doubles, D): 将两个电子从占据轨道 (\(i, j\)) 激发到虚拟轨道 (\(a, b\))。记作 \(\Phi_{ij}^{ab}\)
    • 三激发 (Triples, T): …以此类推。
  • CI 波函数: 真正的基态波函数 \(\Psi\) 是所有这些可能组态的叠加: \[\Psi = C_0 \Phi_{HF} + \sum_{i,a} C_i^a \Phi_i^a + \sum_{i,j,a,b} C_{ij}^{ab} \Phi_{ij}^{ab} + \sum_{i,j,k,a,b,c} C_{ijk}^{abc} \Phi_{ijk}^{abc} + \dots\]
    • \(C_0\) 是 HF 基态的系数(通常接近 1)。
    • \(C_i^a\), \(C_{ij}^{ab}\)… 是各种激发组态的系数(或“权重”)。
    • CI 方法的目的就是通过求解一个巨大的矩阵本征值问题来找出所有这些 \(C\) 系数,从而获得更精确的波函数 \(\Psi\) 和能量 \(E\)

3. CI 公式

总波函数 \(\Psi\) 的各个组成部分:

  • \(\Psi^{(1)} = \sum_{i}^{\text{occ}} \sum_{a}^{\text{virt}} C_i^a \Phi_i^a\)
    • 这是波函数中所有单激发 (S) 组态的总和。
  • \(\Psi^{(2)} = \sum_{i<j}^{\text{occ}} \sum_{a<b}^{\text{virt}} C_{ij}^{ab} \Phi_{ij}^{ab}\)
    • 这是波函数中所有双激发 (D) 组态的总和。
  • \(\Psi^{(3)} = \sum_{i<j<k}^{\text{occ}} \sum_{a<b<c}^{\text{virt}} C_{ijk}^{abc} \Phi_{ijk}^{abc}\)
    • 这是波函数中所有三激发 (T) 组态的总和。

4. 🔑 关键概念

  • \(\langle \Phi_{HF} | \Phi_i^a \rangle = 0\)
    • 含义: HF 基态波函数与任意一个单激发态波函数都是正交的(即它们的重叠积分为 0)。
    • 重要推论 (Brillouin 定理): 更强的结论是哈密顿矩阵元 \(\langle \Phi_{HF} | \hat{H} | \Phi_i^a \rangle = 0\),即 HF 基态 \(\Phi_{HF}\) 不会与单激发态 \(\Phi_i^a\) 直接耦合(前提是 \(\Phi_{HF}\) 是严格收敛的 HF 波函数)。因此单激发对基态能量没有一阶贡献,只能通过与双激发的耦合间接进入 CI 波函数。
    • 结论: 对 HF 基态能量的第一个修正来自于双激发 (Doubles)。这就是为什么双激发在电子相关能中如此重要。
  • Full CI (FCI, 全组态相互作用)
    • 如果你在 CI 展开式中包含了所有可能的激发(单、双、三、…直到 N 激发),你就得到了 Full CI
    • FCI 在给定的基组下,是求解薛定谔方程的精确解
    • 问题: 计算量是天文数字(随体系大小呈指数增长),只能用于几个电子的微小体系。
  • Truncated CI (截断 CI)
    • 由于 FCI 不可行,人们在实际中会截断这个求和。
    • CISD: 只包含单激发 (S)双激发 (D)。这是最常见的 CI 方法之一。
    • CISDT: 包含单、双、三激发。

总结

从 HF 理论(一个单行列式近似)转向了 CI 理论(一个多行列式方法)。

CI 的核心目的:通过将激发态组态(\(\Phi_i^a\), \(\Phi_{ij}^{ab}\) 等)“混合”到 HF 基态组态 (\(\Phi_{HF}\)) 中,来系统地恢复 HF 理论所忽略的电子相关能
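
为了量化上面“FCI 组态数随体系大小爆炸、CISD 只保留单双激发”这一点,下面用纯组合计数给一个小示例(轨道数为假设值,且忽略自旋与空间对称性带来的约束):

```python
# 统计给定占据/虚拟自旋轨道数下各级激发组态的数目(仅为组合计数示意)
from math import comb

n_occ, n_virt = 10, 40            # 示例:10 个占据自旋轨道、40 个虚拟自旋轨道

singles = comb(n_occ, 1) * comb(n_virt, 1)     # Φ_i^a 的数目
doubles = comb(n_occ, 2) * comb(n_virt, 2)     # Φ_ij^ab 的数目
triples = comb(n_occ, 3) * comb(n_virt, 3)     # Φ_ijk^abc 的数目

# Full CI:把 n_occ 个电子在全部 n_occ + n_virt 个自旋轨道中任意分布
full_ci = comb(n_occ + n_virt, n_occ)

print(f"Singles: {singles}, Doubles: {doubles}, Triples: {triples}")
print(f"CISD 组态数 ≈ {1 + singles + doubles:,}, Full CI 组态数 = {full_ci:,}")
```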

“二次量子化” (Second Quantization)

它是一种用来处理多粒子系统(如此处的电子)的强大工具,特别擅长描述粒子的产生和湮灭。

1. 核心工具:产生与湮灭算符

在二次量子化中,我们不再纠结于写出完整的、庞大的斯莱特行列式(像 \(\Phi_{HF}\) 那样),而是定义两个基本的操作符:

  • \(a_p^\dagger\) (产生算符, Creation Operator):
    • 作用: 当它作用在一个波函数上时,它会在轨道 \(p\)产生一个电子。
    • 例如,如果 \(p\) 轨道是空的, \(a_p^\dagger\) 会把一个电子放进去。
    • 如果 \(p\) 轨道已经被占据,根据泡利不相容原理,作用结果为零,即 \(a_p^\dagger a_p^\dagger |0\rangle = 0\) (你不能在同一个自旋轨道上放两个电子)。
  • \(a_p\) (湮灭算符, Annihilation Operator):
    • 作用: 当它作用在一个波函数上时,它会从轨道 \(p\)湮灭(或移除)一个电子。
    • 例如,如果 \(p\) 轨道是占据的, \(a_p\) 会把这个电子移走。
    • 如果 \(p\) 轨道本来就是空的, \(a_p |\text{empty}\rangle = 0\) (你不能从一个空轨道中移走电子)。

2. 定义基态 \(\Phi_{HF}\)

  • 首先,我们定义一个真正的“真空态” (vacuum state),记作 \(|0\rangle\),表示一个完全没有电子的空荡荡的体系。
  • 那么,HF 基态波函数 \(\Phi_{HF}\)(即所有占据轨道 \(i, j, k, \dots\) 都被填满的状态)就可以通过从真空态开始,不断“产生”电子来构建: \[|\Phi_{HF}\rangle = a_1^\dagger a_2^\dagger a_3^\dagger \dots a_N^\dagger |0\rangle\] (这里 \(1, 2, \dots, N\) 是所有被占据的轨道)。

3. 解释

下面的公式展示了如何从已知的 HF 基态 \(|\Phi_{HF}\rangle\) 构建出各种激发态。

\(|\Phi_i^a\rangle = a_a^\dagger a_i |\Phi_{HF}\rangle\)

  • 含义: 这个公式在说:“单激发态 \(\Phi_i^a\) 是如何得到的?”
  • 操作:
    1. 从 HF 基态 \(|\Phi_{HF}\rangle\) 开始。
    2. 首先,应用湮灭算符 \(a_i\)。这会从占据轨道 \(i\)移走一个电子。
    3. 然后,应用产生算符 \(a_a^\dagger\)。这会在虚拟轨道 \(a\)产生一个电子。
  • 结果: 这个两步操作(“先湮灭 \(i\),再产生 \(a\)”)的效果,就是将一个电子从轨道 \(i\) 激发到了轨道 \(a\)。这正是单激发态 \(\Phi_i^a\) 的定义。

\(|\Phi_{ij}^{ab}\rangle = a_a^\dagger a_b^\dagger a_j a_i |\Phi_{HF}\rangle\)

(白板上这个公式被部分遮挡了,但这是它的标准形式)

  • 含义: 这个公式在说:“双激发态 \(\Phi_{ij}^{ab}\) 是如何得到的?”
  • 操作:
    1. 从 HF 基态 \(|\Phi_{HF}\rangle\) 开始。
    2. 应用 \(a_i\):从占据轨道 \(i\)移走一个电子。
    3. 应用 \(a_j\):从占据轨道 \(j\)移走另一个电子。
    4. 应用 \(a_b^\dagger\):在虚拟轨道 \(b\)产生一个电子。
    5. 应用 \(a_a^\dagger\):在虚拟轨道 \(a\)产生另一个电子。
  • 结果: 这个四步操作(“湮灭 \(i, j\);产生 \(a, b\)”)的效果,就是将两个电子从轨道 \(i, j\) 激发到了轨道 \(a, b\)。这正是双激发态 \(\Phi_{ij}^{ab}\) 的定义。

4. 为什么这个方法更好?

  • 简洁性: 你不需要再写出庞大的 \(N \times N\) 行列式。你只需要简单地写 \(a_a^\dagger a_i |\Phi_{HF}\rangle\) 就可以精确地代表那个单激发的斯莱特行列式。
  • 自动处理“交换 (exchange)”: 白板上提到了 “exchange”。这些产生/湮灭算符被定义为费米子算符,它们自动满足一个称为“反对易关系” (anti-commutation relation) 的规则。
    • 例如:\(a_i a_j = -a_j a_i\)\(a_i^\dagger a_j^\dagger = -a_j^\dagger a_i^\dagger\)(取 \(i = j\) 即得 \(a_i a_i = 0\))。
    • 这个负号自动地包含了泡利不相容原理和波函数的反对称性(即“交换”效应)。你不需要再手动去操心行列式的正负号问题。

总结: 二次量子化的语言,它是一种更强大、更数学化的方式,用来描述 CI(组态相互作用)理论中如何从 HF 基态构建出各种激发态。
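
下面是一个自己写的“玩具”实现(并非任何量化软件的 API),把占据数表示下的产生/湮灭算符规则直接翻译成代码,用来验证单激发态的构造和反对易关系:

```python
# 玩具版的产生/湮灭算符:用 0/1 占据数元组表示一个 Slater 行列式
# 符号约定:作用在轨道 p 上时,先数 p 之前有多少个已占据轨道来决定相位
def create(p, occ):
    """产生算符 a_p† 作用在占据数向量 occ 上;返回 (相位, 新态) 或 None(结果为零)。"""
    if occ[p] == 1:                       # 泡利原理:已占据则结果为 0
        return None
    sign = (-1) ** sum(occ[:p])
    new = list(occ); new[p] = 1
    return sign, tuple(new)

def annihilate(p, occ):
    """湮灭算符 a_p 作用在占据数向量 occ 上。"""
    if occ[p] == 0:                       # 空轨道上湮灭,结果为 0
        return None
    sign = (-1) ** sum(occ[:p])
    new = list(occ); new[p] = 0
    return sign, tuple(new)

# |Φ_HF> :4 个自旋轨道,前 2 个被占据
phi_hf = (1, 1, 0, 0)

# 单激发 |Φ_i^a> = a_a† a_i |Φ_HF>,取 i=1, a=2
s1, tmp = annihilate(1, phi_hf)
s2, phi_single = create(2, tmp)
print(phi_single, "相位 =", s1 * s2)      # (1, 0, 1, 0)

# 验证反对易关系 {a_i, a_j†} = δ_ij,取 i=0, j=2(应为 0)
p1, t1 = create(2, phi_hf); p2, t2 = annihilate(0, t1)       # a_0 a_2†
p3, t3 = annihilate(0, phi_hf); p4, t4 = create(2, t3)       # a_2† a_0
assert t2 == t4 and p1 * p2 + p3 * p4 == 0
```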

IV:

CI(组态相互作用)理论的总结性笔记。

1. 波函数基组

\(\{\, \Phi_{HF},\ \Phi_i^a,\ \Phi_{ij}^{ab},\ \Phi_{ijk}^{abc},\ \dots \,\}\)

  • 含义: 这是一个集合 (set)。它代表了在给定的单电子轨道基组下,所有可能构建出来的 N 电子组态 (configurations)
  • \(\Phi_{HF}\): HF 基态组态(即电子占据能量最低的 N 个轨道)。
  • \(\Phi_i^a\): 所有的单激发 (Singles) 组态。
  • \(\Phi_{ij}^{ab}\): 所有的双激发 (Doubles) 组态。
  • \(\Phi_{ijk}^{abc}\): 所有的三激发 (Triples) 组态。
  • : 一直持续到 N 激发(即所有电子都被激发)。
  • 重要性: 这个集合(在完备基组下原则上是无穷的,在有限基组下则是有限的)构成了一个完备的 N 电子基组。这意味着任何一个真实的 N 电子波函数 \(\Psi\)(即薛定谔方程的精确解),都可以被精确地表示为这个集合中所有组态的线性组合: \[\Psi = C_0 \Phi_{HF} + \sum C_i^a \Phi_i^a + \sum C_{ij}^{ab} \Phi_{ij}^{ab} + \dots\]

2. “exchange” (交换) vs “correlation” (相关)

  • “exchange” (交换能):
    • 白板上将 “exchange” 指向了 HF 基态 \(\Phi_{HF}\)
    • 含义: 这是 HF 理论已经包含在内的量子效应。
    • 它源于泡利不相容原理,即两个自旋相同的电子不能占据同一空间位置。这导致同自旋电子之间存在一个“费米空穴 (Fermi hole)”,使它们倾向于“避开”彼此。
    • 在 HF 能量公式中(见第一张白板),这表现为那个负的交换积分 (-K) 项。
  • “correlation” (相关能):
    • 白板上将 “correlation” 指向了所有激发态\(\Phi_i^a, \Phi_{ij}^{ab}, \dots\))。
    • 含义: 这是 HF 理论所忽略的效应。
    • HF 的缺陷: HF 只考虑了同自旋电子的“交换”相关,但它忽略了不同自旋电子之间的瞬时相互作用。在 HF(平均场)理论中,一个自旋向上的电子只感受到一个自旋向下电子云的平均排斥,而没有根据它俩的瞬时位置来“躲避”对方。
    • CI 的修正: 通过在波函数中混入激发态(如 \(\Phi_{ij}^{ab}\)),CI 方法允许电子(包括不同自旋的电子)的运动相互关联起来,从而更有效地“躲避”彼此,进一步降低体系的总能量。
    • 定义: 相关能 = 真实能量 - HF 能量
    • 结论: CI 方法(以及其他 post-HF 方法)的目的就是为了“恢复” HF 理论所丢失的电子相关能

3. 能量图示

能量图:

  • \(\uparrow\): 能量轴,能量向上增加。
  • HF: Hartree-Fock 能量(\(E_{HF}\)
    • 这是通过单个斯莱特行列式 \(\Phi_{HF}\) 计算得到的能量。
    • 这是一个近似能量,它高于真实的基态能量。
  • Full CI: 全组态相互作用能量(\(E_{FCI}\)
    • 这是通过将 \(\Psi\) 表示为所有可能组态(\(\Phi_{HF}, \Phi_i^a, \dots\))的线性组合,并求解得到的精确基态能量(在给定基组下的)。
    • \(E_{FCI}\) 是我们能得到的最低、最准确的能量
  • 图中的花括号(标出两者之间的能量差):
    • \(E_{FCI}\)\(E_{HF}\) 之间的能量差,正是我们前面定义的“电子相关能 (Correlation Energy)”
    • \[E_{\text{corr}} = E_{FCI} - E_{HF}\]

4. 二次量子化(复习)

  • \(|\Phi_i^a\rangle = a_a^\dagger a_i |\Phi_{HF}\rangle\) (单激发)
  • \(|\Phi_{ij}^{ab}\rangle = a_a^\dagger a_b^\dagger a_j a_i |\Phi_{HF}\rangle\) (双激发)
    • 这重申了上一张白板的概念:使用产生 (\(a^\dagger\))湮灭 (\(a\)) 算符,是从 HF 基态构建所有其他激发态的最简洁的数学方式。
  • \(\{a_i, a_j^\dagger\} = \delta_{ij}\)
    • 含义: 这是费米子算符的反对易关系 (Anti-commutation relation)
    • {A, B} 是一个简写,表示 \(AB + BA\)
    • 所以,这个公式的完整形式是:\(a_i a_j^\dagger + a_j^\dagger a_i = \delta_{ij}\)
    • \(\delta_{ij}\): 克罗内克 delta 符号 (Kronecker delta)
      • 如果 \(i = j\)\(\delta_{ij} = 1\)
      • 如果 \(i \neq j\)\(\delta_{ij} = 0\)
    • 物理意义:
      • 如果 \(i \neq j\): \(a_i a_j^\dagger = -a_j^\dagger a_i\)。这表示“先在 \(j\) 轨道产生、再在 \(i\) 轨道湮灭”与“先在 \(i\) 湮灭、再在 \(j\) 产生”这两种作用顺序给出的结果只差一个负号(波函数变号,体现反对称性)。
      • 如果 \(i = j\): \(a_i a_i^\dagger + a_i^\dagger a_i = 1\)。这个规则用于计算占据数,并确保了泡利不相容原理的正确执行。

总结

清晰地总结了: 1. HF 理论只是一种近似,它包含了交换能。 2. 为了获得精确解 (Full CI),必须将 HF 基态与所有可能的激发态相混合。 3. 这个混合过程所恢复的能量,就是 HF 理论所忽略的电子相关能。 4. 二次量子化(\(a^\dagger, a\) 算符)是实现这一目标的标准数学工具。

这一系列白板笔记非常连贯地从 HF 理论的基础,讲到了 Koopmans 定理的近似,最后引入了更高级的 CI 方法来修正 HF 的根本缺陷(即电子相关能)。

量子化学指南:从 Hartree-Fock 到组态相互作用

这份指南涵盖了量子化学中的两个核心主题: 1. Hartree-Fock (HF) 理论:一种基础的、近似求解多电子体系的方法。 2. 组态相互作用 (CI):一种 “后-HF” 方法,用于系统地修正 HF 理论的缺陷,以逼近精确解。

第一部分:Hartree-Fock (HF) 理论基础

HF 理论是求解多电子体系薛定谔方程的第一个、也是最重要的平均场 (Mean-Field) 近似。

1. 核心思想与波函数

  • 核心思想:HF 理论将复杂的多电子问题(电子的运动是相互关联的)简化为一系列独立的单电子问题。它假设每个电子都在一个由原子核和其他所有电子共同产生的平均静电场中运动。
  • 波函数 (\(\Phi_{HF}\)):HF 理论使用单个斯莱特行列式 (Slater Determinant) 来描述 N 电子体系的基态波函数。 \[\Phi_{HF} = \frac{1}{\sqrt{N!}} \begin{vmatrix} \phi_1(1) & \phi_2(1) & \dots \\ \phi_1(2) & \phi_2(2) & \dots \\ \vdots & \vdots & \ddots \end{vmatrix}\] 这种形式自动满足了泡利不相容原理(即波函数在交换任意两个电子时反号)。

2. HF 总能量 (\(E_{HF}\))

HF 体系的总能量 \(E_{HF}\)(如第一张白板上的公式所示)由三部分组成:

  1. 单电子能量:所有电子的动能,以及它们与所有原子核的吸引势能。
  2. 库仑能 (J):电子 \(i\) 的电荷云与电子 \(j\) 的电荷云之间的经典静电排斥能
  3. 交换能 (K):一个纯粹的量子效应,源自泡利不相容原理。它仅存在于自旋相同的电子之间。这个能量项是负值,它降低了体系的总能量,可以理解为同自旋电子有“避开”彼此的倾向(费米空穴, Fermi Hole)。

3. 轨道能量 (\(\epsilon_i\)) vs 总能量 (\(E_{HF}\))

  • 轨道能量 (\(\epsilon_i\)):是在求解 HF 方程(\(\hat{f}\phi_i = \epsilon_i\phi_i\))时得到的本征值。它代表了电子 \(i\)所有其他 (N-1) 个电子的平均场中的能量。
  • 关键区别:总能量 不等于 轨道能量之和。 \[E_{HF} \neq \sum_{i} \epsilon_i\] 原因:在 \(\sum \epsilon_i\) 中,每一对电子之间的排斥能(库仑和交换)都被计算了两次。正确的公式是 \(E_{HF} = \sum \epsilon_i - V_{ee}\),其中 \(V_{ee}\) 是电子-电子排斥能项,用于修正重复计算。

第二部分:物理可观测量 (I 和 A)

HF 理论计算完成后,我们希望从中提取有物理意义的数据,例如电离能 (I) 和电子亲和能 (A)。有两种主要方法:

1. \(\Delta\)SCF (Delta-SCF) 方法:总能量之差

这是在 HF 理论框架内最准确的方法。“\(\Delta\)” (Delta) 指的就是”差值”。

  • 电离能 (I):从 N 电子体系移出一个电子所需的能量:\(I = E(N-1) - E(N)\)
    • 计算:你需要分别对 N 电子体系和 (N-1) 电子体系(阳离子)进行两次独立的 SCF 计算,然后求能量差。
  • 电子亲和能 (A):N 电子体系获得一个电子所释放的能量:\(A = E(N) - E(N+1)\)
    • 计算:你需要分别对 N 电子体系和 (N+1) 电子体系(阴离子)进行两次独立的 SCF 计算。
  • 基本带隙 (Fundamental Gap):被严格定义为 \(I - A\):\(E_{gap} = I - A = E(N-1) + E(N+1) - 2E(N)\)

2. Koopmans 定理:轨道能量近似

这是一个非常有用的近似方法,而且计算成本比 \(\Delta\)SCF 低得多:它仅需一次 N 电子体系的 SCF 计算。

  • 定理内容
    • \(I \approx -\epsilon_{HOMO}\) (电离能约等于 HOMO 轨道能量的负值)
    • \(A \approx -\epsilon_{LUMO}\) (电子亲和能约等于 LUMO 轨道能量的负值)
  • 带隙近似:因此,基本带隙 \(I - A\) 可以被 HOMO-LUMO 隙所近似:\(E_{gap} = I - A \approx (-\epsilon_{HOMO}) - (-\epsilon_{LUMO}) = \epsilon_{LUMO} - \epsilon_{HOMO}\)

3. 为什么 \(\Delta\)SCF 和 Koopmans’ 定理不同?

答案在于两个关键概念:

  • “Frozen Orbital Approximation” (冻结轨道近似): Koopmans 定理的数学假设。它假设当你移走(或加入)一个电子时,所有其他 (N-1) 个电子的轨道波函数完全不发生改变
  • “Orbital Relaxation” (轨道弛豫)物理现实。当你移走一个电子,电子间排斥力减小,剩下的 (N-1) 个电子会“感觉”到更强的核吸引力,它们的轨道会收缩重新排布(即“弛豫”),以达到一个新的、能量更低的稳定状态。

结论:
  • \(\Delta\)SCF 方法考虑了轨道弛豫,因为它为 (N-1) 体系重新进行了完整的计算。
  • Koopmans’ 定理忽略了轨道弛豫。
  • 由于弛豫会使 (N-1) 体系的能量进一步降低,Koopmans’ 定理(\(I \approx -\epsilon_{HOMO}\)总是高估真实的电离能。
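
下面用一个最小草图对比 \(\Delta\)SCF 与 Koopmans 两种方式估算的电离能(假设装有 PySCF;分子与基组仅为示例):

```python
# ΔSCF 与 Koopmans 估算电离能的对比(最小示意,假设已安装 PySCF)
from pyscf import gto, scf

HARTREE_TO_EV = 27.2114
geom = "O 0 0 0; H 0 0 0.96; H 0.93 0 -0.24"      # 示例分子:水

# N 电子体系(中性分子)
mf_n = scf.RHF(gto.M(atom=geom, basis="6-31g"))
e_n = mf_n.kernel()

# (N-1) 电子体系(阳离子):重新做一次 SCF,轨道允许弛豫
e_cat = scf.UHF(gto.M(atom=geom, basis="6-31g", charge=1, spin=1)).kernel()

i_delta_scf = (e_cat - e_n) * HARTREE_TO_EV               # I = E(N-1) - E(N)
homo = mf_n.mo_energy[mf_n.mo_occ > 0].max()
i_koopmans = -homo * HARTREE_TO_EV                        # I ≈ -ε_HOMO(冻结轨道)

print(f"ΔSCF: I = {i_delta_scf:.2f} eV;Koopmans: I ≈ {i_koopmans:.2f} eV")
# 通常 Koopmans 的估计值更大,因为它忽略了 (N-1) 体系的轨道弛豫
```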

第三部分:HF 理论的根本缺陷(相关能)

HF 理论本身就是一个近似,它最大的缺陷是忽略了电子相关能 (Electron Correlation Energy)

  • “Exchange” (交换能):HF 理论已经包含。它只解决了同自旋电子(如 \(\uparrow \uparrow\))因泡利不相容原理而相互“避开”的问题(费米空穴)。
  • “Correlation” (相关能):HF 理论完全忽略。这是指所有电子(特别是不同自旋的电子,如 \(\uparrow \downarrow\))为了瞬时地“躲避”彼此而产生的运动关联(库仑空穴, Coulomb Hole)。HF 作为一个平均场理论,只让电子感受到彼此的“平均”电荷云。

定义:电子相关能 \(E_{\text{corr}}\) 是体系的精确基态能量 \(E_{\text{exact}}\)HF 基态能量 \(E_{HF}\) 之间的差值。

\(E_{\text{corr}} = E_{\text{exact}} - E_{HF}\)

  • 由于 HF 理论(平均场)高估了电子排斥,HF 能量总是高于真实能量,所以相关能 \(E_{\text{corr}}\) 永远是负值

第四部分:组态相互作用 (CI) —— 修正 HF

CI 是一种系统地“恢复” HF 所丢失的相关能的方法。

1. 核心思想

HF 理论假设基态波函数只是一个 \(\Phi_{HF}\) 行列式。 CI 理论认为,精确的基态波函数 \(\Psi\) 应该是所有可能的 N 电子组态的线性叠加

这个组态的完备基组包括:
  • \(\Phi_{HF}\) (HF 基态组态)
  • \(\Phi_i^a\) (单激发组态:电子 \(i \to a\)
  • \(\Phi_{ij}^{ab}\) (双激发组态:电子 \(i, j \to a, b\)
  • \(\Phi_{ijk}^{abc}\) (三激发组态)
  • … 一直到 N 激发

2. CI 波函数

精确的波函数 \(\Psi\) 被展开为:

\(\Psi = C_0 \Phi_{HF} + \sum_{i,a} C_i^a \Phi_i^a + \sum_{i,j,a,b} C_{ij}^{ab} \Phi_{ij}^{ab} + \sum_{i,j,k,a,b,c} C_{ijk}^{abc} \Phi_{ijk}^{abc} + \dots\)

  • \(C_0, C_i^a, C_{ij}^{ab} \dots\) 是混合系数,表示每个组态对真实波函数的“贡献”大小。
  • Full CI (FCI, 全组态相互作用):如果这个展开式包含了所有可能的激发态,那么它就是在给定基组下的精确解
  • Truncated CI (截断 CI):由于 FCI 的计算量是天文数字,实际中通常会截断,例如 CISD(只包含单激发和双激发)。

3. 能量总结

白板上的能量图清晰地展示了这一点:

  • \(E_{HF}\):HF 能量,是 \(E_{\text{exact}}\) 的一个上限(近似值)。
  • \(E_{Full CI}\):Full CI 能量,是精确值 \(E_{\text{exact}}\)
  • \(E_{HF} - E_{Full CI}\):这个能量差的大小就是电子相关能 \(E_{\text{corr}}\)(按前面的定义 \(E_{\text{corr}} = E_{FCI} - E_{HF}\),它是负值)。

4. 高级工具:二次量子化

CI 理论在数学上通常用产生算符 \(a^\dagger\)湮灭算符 \(a\) 来表述。

  • \(a_p^\dagger\):在轨道 \(p\)产生一个电子。
  • \(a_p\):在轨道 \(p\)湮灭一个电子。

使用这个工具,构建激发态变得非常简洁:
  • 单激发\(|\Phi_i^a\rangle = a_a^\dagger a_i |\Phi_{HF}\rangle\) (含义:在 HF 基态上,先湮灭轨道 \(i\) 的电子,再产生轨道 \(a\) 的电子)
  • 双激发\(|\Phi_{ij}^{ab}\rangle = a_a^\dagger a_b^\dagger a_j a_i |\Phi_{HF}\rangle\) (含义:湮灭 \(i, j\) 的电子,产生 \(a, b\) 的电子)

这些算符自动满足反对易关系(如 \(\{ a_i, a_j^\dagger \} = \delta_{ij}\)),这保证了波函数自动满足泡利不相容原理。

总结:“幸运的误差抵消”

最后,一个有趣的问题:为什么 Koopmans 定理(\(I \approx -\epsilon_{HOMO}\))这个“粗糙”的近似,在实际中经常比“更准确”的 \(\Delta SCF\) 方法(\(I = E(N-1) - E(N)\)更接近实验值

答案是 “幸运的误差抵消”

  1. 误差 1 (轨道弛豫):Koopmans’ 定理忽略了轨道弛豫,这使得它计算的 \(I\) 偏高
  2. 误差 2 (电子相关能):Koopmans’ 定理(作为 HF 理论的一部分)忽略了电子相关能。相关能对 N 电子体系的稳定化(能量降低)比对 (N-1) 体系更显著,这使得 HF 计算的 \(I\) 偏低

因此,Koopmans’ 定理(\(I \approx -\epsilon_{HOMO}\))在与实验值比较时:
  • 误差 1(高估 \(I\)误差 2(低估 \(I\) 在一定程度上相互抵消了!
  • 这使得这个“错上加错”的近似,最终给出了一个出奇“准确”的结果。

PHYS 5120 - 计算能源材料和电子结构模拟 Lecture

Lecturer: Prof.PAN DING

I:

这些内容共同构成了量子化学中用于计算分子电子结构的基石——Hartree-Fock (HF) 自洽场 (SCF) 方法。

核心概念:Hartree-Fock (HF) 近似

  • 目标:求解多电子体系(如分子)的薛定谔方程。但这个方程非常复杂,无法精确求解。
  • 核心思想 (HF 近似)
    1. 波恩-奥本海默近似:假定原子核固定不动,我们只关心电子的运动。
    2. 单电子近似这是 HF 近似最关键的一步。它假设每个电子的运动只受到一个“平均场”的影响,这个平均场是由原子核和其他所有电子共同产生的。
    3. 反对称性:电子是费米子,必须满足泡利不相容原理。HF 方法通过一个称为“斯莱特行列式 (Slater Determinant)”的数学工具来构造总波函数,以自动满足这个要求。

⬅基本算符与方程

1. Hartree-Fock 方程:f̂ |φi> = εi |φi>

它在形式上是一个本征方程。

  • |φi> (分子轨道):这是我们要求解的第 i 个单电子波函数(或称为轨道)。
  • (Fock 算符):这是一个等效的单电子哈密顿算符。它代表了一个电子在原子核和所有其他电子的“平均场”中所感受到的总能量。
  • εi (轨道能量):这是第 i 个分子轨道的能量,即电子占据该轨道时的能量。

2. Fock 算符的构成:f̂ = ĥ₁(r) + Σ[j] (2Ĵj(r) - K̂j(r))

这个公式详细定义了 Fock 算符 是由什么组成的。

  • ĥ₁(r) (核心哈密顿算符):这是“单电子”部分,与电子间的相互作用无关。它只包括:
    1. 电子的动能:白板上的 -(ħ²/2m)∇² 项。
    2. 电子与所有原子核的吸引势能:白板上的 -Σ[I] (Z_I e² / |r - R_I|) 项。
  • Σ[j] (2Ĵj(r) - K̂j(r)) (双电子相互作用):这是描述电子 i 与其他所有电子 j 之间平均相互作用的项(求和 Σ[j] 遍历所有被占据的轨道)。
    • Ĵj(r) (库仑算符)
      • 物理意义:描述了电子 i 与轨道 j 上的电子云(密度为 |φj|²)之间的经典静电排斥
      • 白板上的定义Ĵj(ψ) = <φj | e² / |r - r'| | φj> ψ。这意味着 Ĵj 作用在一个函数 ψ(r) 上时,它会计算 ψ(r)φj(r') 之间的平均排斥能。
    • K̂j(r) (交换算符)
      • 物理意义:这是一个纯粹的量子力学效应,没有经典对应。它源于波函数的反对称性要求(泡利不相容原理)。它修正了电子“自我排斥”的错误(因为 Ĵj 包含了 j=i 的情况),并降低了自旋平行电子相遇的概率,从而降低了能量。
      • 白板上的定义K̂j(ψ) = <φj | e² / |r - r'| | ψ> φj。注意 ψφj 在积分符号内的位置发生了“交换”,因此得名。
    • 系数 “2”:在闭壳层(Closed-Shell)HF 方法中,我们假设每个分子轨道 j 都被两个自旋相反的电子(一个自旋向上 α,一个自旋向下 β)占据。因此,Ĵj 的排斥作用要乘以 2。而 K̂j 的交换作用只发生在自旋相同的电子之间,因此只有一个 K̂j 被减去(例如,自旋向上的电子 i 只与自旋向上的电子 j 发生交换)。

3. HOMO / LUMO

  • HOMO (Highest Occupied Molecular Orbital):最高占据分子轨道。这是 HF 计算得到的 εi 能量中,能量最高但仍被电子占据的那个轨道。
  • LUMO (Lowest Unoccupied Molecular Orbital):最低未占分子轨道。这是 εi 能量中,能量最低但没有电子占据的那个轨道。
  • 意义:HOMO 和 LUMO 统称为“前线轨道”。它们的能量差(HOMO-LUMO Gap)和形状在化学反应中至关重要,决定了分子倾向于从哪里给出电子 (HOMO) 和从哪里接受电子 (LUMO)。

矩阵化 (Roothaan-Hall 方法)

直接求解上面的 Hartree-Fock 方程(它是微分-积分方程)非常困难。C. C. J. Roothaan 和 G. G. Hall 提出了一种将其转换为标准矩阵代数问题的方法。

1. LCAO 展开

我们假设未知的分子轨道 |φi> 可以由一组已知的原子轨道 (Atomic Orbitals, AOs) |χμ> 线性组合而成: |φi> = Σ[μ] Cμi |χμ> 其中 Cμi 是我们要解的系数

2. Roothaan-Hall 方程:F C = S C ε

这是将 LCAO 展开代入 HF 方程后得到的矩阵方程

  • F (Fock 矩阵)Fμν = <χμ | f̂ | χν>。这是 Fock 算符 在原子轨道基组下的矩阵表示。
  • C (系数矩阵):矩阵的每一列 i 都是一个分子轨道 φi 的展开系数 Cμi
  • S (重叠矩阵)Sμν = <χμ | χν>。它描述了原子轨道基函数之间的重叠程度。如果基函数是“正交”的,S 就是单位矩阵。
  • ε (轨道能量矩阵):这是一个对角矩阵,对角线上的元素 εi 就是我们要求的分子轨道能量。

3. Fock 矩阵元的计算:Fμν = Hμν^core + Gμν

这是求解的核心,它把 F 矩阵的计算分为两部分:

  • Hμν^core = <χμ | ĥ₁ | χν> (核心哈密顿矩阵元)
    • 物理意义:这是“单电子”部分,只包含电子动能和电子-原子核吸引能。
    • 计算:这部分在整个计算过程中只用计算一次,因为它不依赖于电子的分布。
  • Gμν (双电子积分项)
    • 物理意义:这是“双电子”排斥部分,对应于 Σ[j] (2Ĵj - K̂j)
    • 问题:计算 Gμν 需要知道库仑和交换算符,而这些算符又依赖于分子轨道 |φj>,而 |φj> 又是由我们要求解的系数 C 决定的。这就形成了一个“鸡生蛋,蛋生鸡”的循环问题。

4. 密度矩阵与 SCF 循环

为了解决这个循环问题,我们引入了密度矩阵 P

  • ③ 密度矩阵 (Density Matrix) P
    • 白板上的定义Pαβ = 2 Σ[j=1 to N/2] Cαj* Cβj
    • 物理意义P 描述了电子在原子轨道基组上的分布情况。它由系数矩阵 C 构造。
  • 使用 P 构建 Fock 矩阵 F
    • 白板上的公式Fμν = Hμν^core + Σ[αβ] Pαβ * (...) (白板上 (...) 部分代表了 (μν|αβ) - 1/2(μα|νβ) 这样的双电子积分项)。
    • 关键点F 矩阵(代表电子间的相互作用)现在被表示为密度矩阵 P(代表电子的分布)的函数。F = F(P)

总结:自洽场 (SCF) 的完整流程

完整地描述了 “自洽场” (Self-Consistent Field, SCF) 的计算流程:

  1. 第 0 步:选择原子轨道基组 |χμ>,并计算所有不变的积分,如重叠矩阵 S 和核心哈密顿矩阵 H^core
  2. 第 1 步 (猜测):对系数矩阵 C 做一个初始猜测(例如,通过一个更简单的方法得到),并用它来构建一个初始的密度矩阵 P^(0)
  3. 第 2 步 (构建):使用当前的密度矩阵 P^(k)构建 Fock 矩阵 F^(k)Fμν = Hμν^core + Gμν(P^(k))
  4. 第 3 步 (求解):求解 Roothaan-Hall 矩阵方程 F^(k) C^(k+1) = S C^(k+1) ε^(k+1),得到一组新的系数矩阵 C^(k+1) 和新的轨道能量 ε^(k+1)
  5. 第 4 步 (更新):使用新的系数 C^(k+1) 来计算一个新的密度矩阵 P^(k+1)
  6. 第 5 步 (检查自洽):比较新的密度矩阵 P^(k+1)旧的密度矩阵 P^(k)
    • 如果 P^(k+1)P^(k) 几乎相同(即“自洽”了),说明我们找到的电子分布 P 所产生的“平均场” F,反过来再求解这个 F 得到的电子分布恰好就是 P 本身。计算收敛,循环结束。
    • 如果它们不相同,就令 k = k+1,返回第 2 步,用新的 P 继续迭代。

这个迭代过程,就是 Hartree-Fock 自洽场方法的核心。
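
下面把这个 SCF 流程写成一个最小的 Python 骨架(积分用随机数代替真实的分子积分,只为展示迭代结构;这里直接调用 scipy 的广义本征值求解器,相当于把显式的 S^(-1/2) 正交化交给库来完成):

```python
# SCF 循环骨架(示意):h_core、S、双电子积分均为随机"占位",非真实分子积分
import numpy as np
from scipy.linalg import eigh            # 可直接求解广义本征值问题 F C = S C ε

rng = np.random.default_rng(0)
n_bf, n_occ = 6, 2                       # 基函数数、双占据(空间)轨道数,均为示例

h_core = rng.normal(size=(n_bf, n_bf)); h_core = (h_core + h_core.T) / 2
a = rng.normal(size=(n_bf, n_bf)); s = a @ a.T + n_bf * np.eye(n_bf)   # 正定的 S
eri = rng.normal(size=(n_bf,) * 4)       # 充当 (μν|αβ);真实积分另有对称性,这里不作要求

p = np.zeros((n_bf, n_bf))               # 第 1 步:密度矩阵零猜测(此处 P 不含因子 2)
for cycle in range(50):
    # 第 2 步:由 P 构建 Fock 矩阵 F = H_core + 2J - K
    j = np.einsum("ls,mnls->mn", p, eri)         # J_mn = Σ_ls P_ls (mn|ls)
    k = np.einsum("ls,mlsn->mn", p, eri)         # K_mn = Σ_ls P_ls (ml|sn)
    f = h_core + 2 * j - k
    # 第 3 步:求解 F C = S C ε(等价于先变换成 F'C' = C'ε 再反变换)
    eps, c = eigh(f, s)
    # 第 4 步:用占据轨道的系数构造新的密度矩阵
    c_occ = c[:, :n_occ]
    p_new = c_occ @ c_occ.T
    # 第 5 步:检查自洽
    if np.linalg.norm(p_new - p) < 1e-8:
        print(f"SCF converged in {cycle} cycles")
        break
    p = p_new
```

把 h_core、s、eri 换成真实的分子积分后,这个骨架就是标准的闭壳层 SCF 主循环;实际程序还会加入更好的初猜和收敛加速(如 DIIS)。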

II:

两个关键部分:

  1. 继续详细推导如何使用密度矩阵 (Density Matrix) 来构建 Fock 矩阵中的双电子相互作用项。
  2. 展示了如何从数学上求解 Roothaan-Hall 方程 (F C = S C ε),这是一个“广义本征值问题”,并将其转换为计算机可以轻松处理的“标准本征值问题”。

Fock 矩阵元的最终形式

这部分是整个 HF-SCF 方法中最核心的数学推导之一。它展示了如何将复杂的双电子相互作用(Gμν)表示为密度矩阵 P 和一堆预先计算好的积分的乘积。

  • Pαβ = 2 Σ[j=1 to N/2] Cαj* Cβj
    • 重申密度矩阵 P 的定义。它由系数矩阵 C 构建。
  • 推导 <χμ | Σ[j] ... | χν>
    • 这一长串推导(从 开始,一直到 = Σ[αβ] Pαβ (...))是白板上最复杂的部分。
    • 目标:计算 Fock 矩阵的双电子部分 Gμν = <χμ | Σ[j] (2Ĵj - K̂j) | χν>
    • 步骤
      1. 将分子轨道 |φj> 用原子轨道 |χ> 展开:|φj> = Σ[α] Cαj |χα>
      2. 将这个展开式代入 ĴjK̂j 的积分定义中。
      3. 这会产生涉及四个原子轨道基函数的积分,称为双电子积分 (Two-Electron Integrals),通常写作 (μν|αβ)
      4. 经过复杂的代数重排(将 CΣ[j] 重新组合),推导发现 Gμν 可以被写成: Gμν = Σ[αβ] Pαβ * [ (μν|αβ) - 1/2 (μα|νβ) ]
        • (μν|αβ) 是库仑积分。
        • (μα|νβ) 是交换积分。
  • Fμν = Hμν^core + Σ[αβ] Pαβ (...)
    • 最终的 Fock 矩阵元公式
    • Fμν(Fock 矩阵)= Hμν^core(核心哈密顿矩阵,只算一次)+ Gμν(双电子排斥项)。
    • 关键在于 Gμν密度矩阵 P 的线性函数
    • 这完美地建立了 SCF 的循环关系:CPF → 求解 F 得到新的 C

Roothaan-Hall 方程的求解问题

这部分转向了一个纯粹的数学(线性代数)问题:如何求解我们建立的矩阵方程。

  • S† F C = C ε (笔误)
    • 白板上划掉的 S† F C = C ε 及其旁边的推导 (S†F)† = ... 看起来像是一个错误的尝试或旁注,试图探索这个矩阵的厄米性 (Hermiticity)。
    • 正确的方程(在它下面)是 F C = S C ε
  • F C = S C ε (Roothaan-Hall 方程)
    • 问题:这不是一个“标准本征值问题” (A x = λ x),因为在等式右边多了一个重叠矩阵 SS 不是单位矩阵 I,因为原子轨道基组 |χ> 通常不是正交的。
    • 术语:这被称为“广义本征值问题”。
  • S is positive definite (S 是正定矩阵)
    • 这是一个关键的数学性质。S 是正定的,意味着它所有的本征值(λi)都大于零。
    • 意义:因为 S 是正定的,所以它保证是可逆的(S⁻¹ 存在),并且我们可以对它进行“开方”,即找到 S^(1/2)S^(-1/2)
  • S = R† RS = U† Λ U (S 的分解)
    • 这是对 S 矩阵进行对角化或分解的标准方法。
    • S = U† Λ U谱分解
      • US 的本征向量矩阵。
      • Λ 是由 S 的本征值 λi 组成的对角矩阵。
  • S^(1/2) = U† √Λ US^(-1/2) = U† (1/√Λ) U
    • (右侧白板上有 S^(-1/2) 的定义)。
    • 这是利用 S 的分解来定义它的 1/2 次方和 -1/2 次方矩阵。√Λ 就是简单地将对角矩阵 Λ 上的每个元素 λi 都开方。

变换为标准本征值问题

这部分展示了如何利用 S^(-1/2) 矩阵来“清除”方程中的 S,将其变为标准本征值问题。这个过程称为正交化 (Orthogonalization)

  1. 目标:将 F C = S C ε 变换为 F' C' = C' ε 的形式。
  2. 定义新的系数矩阵 C': 我们定义一组新的、在正交化基组下的系数 C',它与原始系数 C 的关系是: C' = S^(1/2) C (因此 C = S^(-1/2) C'
  3. 代入原方程: 将 C = S^(-1/2) C' 代入 F C = S C εF (S^(-1/2) C') = S (S^(-1/2) C') ε
  4. 两边左乘 S^(-1/2)(S^(-1/2) F S^(-1/2)) C' = (S^(-1/2) S S^(-1/2)) C' ε
  5. 简化
    • 右侧:如白板所示,S^(-1/2) S S^(-1/2) = S^(-1/2) S^(1/2) S^(1/2) S^(-1/2) = I(单位矩阵)。
    • 左侧:我们定义一个新的、变换后的 Fock 矩阵 F'F' = S^(-1/2) F S^(-1/2)
  6. 最终的标准本征值方程F' C' = C' ε

总结:SCF 循环的完整计算步骤

一个完整的 SCF 迭代步骤如下:

  1. 猜测 C^(0) (或 P^(0))。
  2. 构建 P:使用 C^(k) 计算密度矩阵 P^(k)
  3. 构建 F:使用 P^(k) 构建 Fock 矩阵 F^(k)(如左侧推导所示)。
  4. 构建 F’:使用 S^(-1/2)(它在计算开始前就算好了)和 F^(k) 来构建变换后的 Fock 矩阵 F'^(k) = S^(-1/2) F^(k) S^(-1/2)
  5. 求解F'^(k) C'^(k+1) = C'^(k+1) ε^(k+1)。这是一个标准本征值问题,计算机可以高效求解,得到新的 C'ε
  6. 反变换:通过 C^(k+1) = S^(-1/2) C'^(k+1) 得到我们真正的系数矩阵 C
  7. 检查收敛:用 C^(k+1) 计算新的 P^(k+1),与 P^(k) 比较。如果不收敛,返回第 2 步。

在数学上解决了如何在非正交基组下求解 HF 方程的实际计算问题。
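
下面用 numpy 对这一正交化变换做一个数值小验证(S、F 用随机构造的正定/对称矩阵代替;注意 numpy 的 eigh 把本征向量存放在列中,因此分解写作 S = U Λ U†,与白板的行约定只差一个转置):

```python
# 数值验证:S^(-1/2) S S^(-1/2) = I,且 F' = S^(-1/2) F S^(-1/2) 仍是厄米矩阵
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
n = 5
a = rng.normal(size=(n, n))
s = a @ a.T + n * np.eye(n)                       # 构造一个正定的"重叠矩阵" S
f = rng.normal(size=(n, n)); f = (f + f.T) / 2    # 一个实对称(厄米)的"Fock 矩阵"

lam, u = np.linalg.eigh(s)                        # 谱分解:S = U Λ U†,λ_i > 0
s_inv_half = u @ np.diag(1 / np.sqrt(lam)) @ u.T

assert np.allclose(s_inv_half @ s @ s_inv_half, np.eye(n))    # S^{-1/2} S S^{-1/2} = I
f_prime = s_inv_half @ f @ s_inv_half
assert np.allclose(f_prime, f_prime.T)                        # F' 仍然对称(厄米)

# 广义本征值问题 F C = S C ε 与标准问题 F' C' = C' ε 给出相同的轨道能量 ε
eps_general, _ = eigh(f, s)
eps_standard, _ = np.linalg.eigh(f_prime)
assert np.allclose(eps_general, eps_standard)
print("orthogonalization check passed")
```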

III:

两个主要部分:

  1. 上部:完成 Roothaan-Hall 方程的数学求解变换。
  2. 下部:引入一个全新且至关重要的概念——Hartree-Fock (HF) 的总能量

方程的最终求解形式

“标准本征值问题”变换的总结和补充。

  • S^(-1/2) = U† diag(1/√λ₁, …, 1/√λ_m) U
    • 这是 S^(-1/2) 矩阵的明确计算方法,即通过对重叠矩阵 S 进行“谱分解”:
      1. 对角化 S 得到其本征向量矩阵 U 和本征值对角矩阵 Λ (对角元为 λ_i)。
      2. 计算 Λ^(-1/2)(即把每个 λ_i 替换为 1/√λ_i)。
      3. 重新组合 U† Λ^(-1/2) U 得到 S^(-1/2)
  • (blas, lapack)
    • 这是一个非常实际的课堂笔记。BLAS (Basic Linear Algebra Subprograms) 和 LAPACK (Linear Algebra Package) 是用于高性能科学计算(如矩阵对角化、求逆等)的黄金标准软件包
    • 教授在这里的意思是:“这个矩阵运算(S^(-1/2))我们手算不了,但计算机上的量子化学软件会调用 LAPACK 库来高效地完成它。”
  • F' C' = C' ε
    • 最终方程。重申上一张白板的结论:我们已经成功地将广义本征值问题 F C = S C ε 转换为了标准本征值问题 F' C' = C' ε
    • 这是计算机可以(通过 LAPACK)直接求解的。
  • F' = S^(-1/2) F S^(-1/2) = (F')†
    • 这一行在确认一个重要的数学性质:F' 矩阵也是厄米 (Hermitian) 的
    • (dagger) 符号代表“厄米共轭”(转置并取复共轭)。
    • 因为 F 是厄米的 (F† = F),S 也是厄米的,所以 F' 保证是厄米的。这确保了我们求解得到的轨道能量 ε 必定是实数,这在物理上是必需的。

Hartree-Fock (HF) 总能量

这是本张白板的核心。在 SCF 迭代收敛后,我们得到了所有的轨道能量 ε_i。那么,分子的总能量 E_HF 是多少?

一个常见的陷阱:你可能会认为总能量就是所有占据轨道的能量之和(2 Σ ε_i,因为每个轨道 2 个电子)。白板明确指出:这是错误的!

  • E_HF ≠ 2 Σ[i=1 to N/2] ε_i (总能量 ≠ 轨道能量之和)

    • 为什么?
    • 白板上的最后一行给出了答案。轨道能量 ε_i 的定义是: ε_i = <φ_i | f̂ | φ_i> = ε_ii + Σ[j=1 to N/2] (2J_ij - K_ij)
    • 物理含义ε_i 不仅仅是电子 i 的能量,它代表了将一个电子加入到轨道 i 中所需要的能量。这个能量包括了:
      1. ε_ii:电子 i 自己的动能和它与所有原子核的吸引能。
      2. Σ[j] (2J_ij - K_ij):电子 i所有其他 j 轨道电子的库仑排斥和交换作用。
    • 双重计算问题
      • ε_i 包含了 ij 的排斥能。
      • ε_j 包含了 ji 的排斥能。
      • 如果你简单地将它们相加 (ε_i + ε_j),你就把 ij 之间的排斥能计算了两遍
    • 因此,2 Σ ε_i双倍计算所有的电子-电子排斥能,导致结果错误。
  • 正确的 HF 总能量公式 白板给出了两个等价的正确公式:

    1. E_HF = 2 Σ[i] ε_ii + Σ[i,j] (2J_ij - K_ij) (白板第一行)
      • 2 Σ[i] ε_ii:所有电子的动能 + 电子-原子核吸引能(ε_ii 在这里是核心哈密顿积分 h_ii)。
      • Σ[i,j] (2J_ij - K_ij):所有电子对之间的排斥/交换能(只计算一次!)。
    2. E_HF = Σ[i=1 to N/2] (ε_ii + ε_i) (白板中间行)
      • 这是一个更巧妙、更简洁的公式。
      • 它将总能量表示为:对所有占据轨道 i 求和,每一项是(核心哈密顿积分 ε_ii + 轨道能量 ε_i)。
      • 这个公式通过只加一次 ε_ii(单电子项)和一次 ε_i(包含单电子项和双电子项),巧妙地修正了双重计算问题,最终结果与公式 1 完全等价。

总结

从“如何求解”到“如何获取最终能量”的过渡。它展示了求解 F C = S C ε 的实用计算方法,并着重强调了 HF 总能量 E_HF 和轨道能量 ε_i 之间的关键区别。
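
下面用随机生成的 ε_ii、J_ij、K_ij(仅为示意数值)验证白板上两个总能量公式等价,并直观看到 2 Σ ε_i 恰好多算了一份电子-电子排斥能:

```python
# 数值验证:E_HF = 2Σ ε_ii + Σ (2J_ij - K_ij) = Σ (ε_ii + ε_i),而 2Σ ε_i 双倍计入排斥能
import numpy as np

rng = np.random.default_rng(2)
n = 4                                         # N/2 个双占据轨道(示例数字)
h = rng.normal(size=n)                        # ε_ii:核心哈密顿对角积分(示意值)
j = rng.random((n, n)); j = (j + j.T) / 2     # J_ij,对称
k = rng.random((n, n)); k = (k + k.T) / 2     # K_ij,对称

eps = h + np.sum(2 * j - k, axis=1)           # ε_i = ε_ii + Σ_j (2J_ij - K_ij)

e_hf_1 = 2 * h.sum() + np.sum(2 * j - k)      # 公式 1
e_hf_2 = np.sum(h + eps)                      # 公式 2:Σ (ε_ii + ε_i)
print(np.isclose(e_hf_1, e_hf_2))             # True:两个公式等价

double_counted = 2 * eps.sum()                # 2Σ ε_i
print(np.isclose(double_counted - e_hf_1, np.sum(2 * j - k)))  # 差值恰为一份电子-电子排斥能
```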

IV

它用一个清晰的流程图 (Flowchart),将前面包含的所有复杂数学公式和概念总结成了一个完整的计算算法

这就是 Hartree-Fock 自洽场 (Self-Consistent Field, SCF) 迭代循环的标准计算流程

💡 SCF 流程图详解


这个流程图展示了量子化学程序是如何一步步“猜”出正确答案的。

1. 准备工作 (循环开始前)

  • Calculate one, two-electron Integrals
    • 这是计算的“第 0 步”,在循环开始前只需要做一次
    • 程序会计算所有需要的“积木块”:
      1. 单电子积分:重叠矩阵 S 和核心哈密顿矩阵 H_core
      2. 双电子积分:所有 (μν|αβ) 形式的积分。这些积分数量极其庞大,是 HF 计算中最耗时的一步。
  • [ S -> S^(-1/2) ]
    • 利用第一步算出的 S 矩阵,计算出用于正交化的 S^(-1/2) 矩阵。这也只需要做一次

2. SCF 迭代循环 (The Loop)

  • [ P_αβ ] (起始点)
    • 第 1 步:猜测。循环开始,我们必须提供一个初始的密度矩阵 P
    • 白板上的 P_αβ^ini = 0 是一个最简单的“零猜测”,实际程序通常会用更高级的猜测方法(如 H_core 猜测)。
  • [ F_μν ]
    • 第 2 步:构建 Fock 矩阵
    • 使用当前的密度矩阵 P_αβ,根据我们在第二张白板上的公式 F = H_core + G(P) 来构建当前的 Fock 矩阵 F
  • [ F' C' = C' ε ]
    • 第 3 步:求解 Roothaan-Hall 方程
    • 正如第三张白板所示,我们不直接解 F C = S C ε
    • 我们先进行变换:F' = S^(-1/2) F S^(-1/2)
    • 然后求解这个“标准本征值问题”,得到新的轨道能量 ε变换后的系数 C'
  • [ C = S^(-1/2) C' ]
    • 第 4 步:反变换
    • S^(-1/2) 矩阵将 C' 转换回我们真正需要的、在原子轨道基组下的系数矩阵 C
  • [ P_αβ^old ~ P_αβ^new ] (决策点)
    • 第 5 步:检查自洽性 (Convergence Check)
    • 使用第 4 步得到的C,计算出一个新的密度矩阵 P^new
    • 比较这个 P^new 和我们在第 2 步中使用的 P^old
    • N (No): 如果 P^newP^old 差别很大(未收敛),则自洽尚未达成。
      • 循环:将 P^new 作为下一次迭代的 P返回第 2 步[ F_μν ]),用这个新的 P 去构建新的 F
    • Y (Yes,未画出): 如果 P^newP^old 几乎完全相同(差值小于某个阈值,例如 10⁻⁸),则说明“自洽”达成!
      • 循环结束

总结

以上内容从“为什么”(HF 近似)到“是什么”(HF 方程和矩阵)再到“怎么做”(SCF 流程图)。

SCF 流程是所有基于 Hartree-Fock 方法(以及更高级的后 HF 方法)的计算化学软件的核心算法。

PHYS 5120 - 计算能源材料和电子结构模拟 Lecture

Lecturer: Prof.PAN DING

I:

量子化学中的Hartree-Fock (HF) 理论。这是一个核心的近似方法,它将一个复杂的多电子问题简化为一组耦合的单电子问题。


1. N 电子波函数:Slater 行列式 (The \(N\)-electron Wavefunction: \(\Phi\))

  • 概念: 对于一个 \(N\) 电子体系,其总波函数 \(\Phi\) 必须满足泡利不相容原理(Pauli Exclusion Principle),即交换任意两个电子的坐标(包括空间坐标 \(\vec{r}\) 和自旋坐标 \(\omega\)),波函数必须反号(反对称性)。
  • Slater 行列式: 白板上指出 \(\Phi\) 是一个 Slater 行列式 (Slater determinant)。这是满足反对称性的最简单的波函数形式。它由 \(N\) 个单电子自旋-轨道 \(\phi_i\) 构筑而成。
  • 公式: \[ \Phi(x_1, x_2, \dots, x_N) = \frac{1}{\sqrt{N!}} \begin{vmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_N(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_N(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x_N) & \phi_2(x_N) & \cdots & \phi_N(x_N) \end{vmatrix} \]
    • \(x_i = (\vec{r}_i, \omega_i)\) 代表第 \(i\) 个电子的空间和自旋坐标。
    • \(\frac{1}{\sqrt{N!}}\) 是归一化因子。
    • 行列式的性质确保了:(1) 如果交换任意两行(交换两个电子),行列式变号。(2) 如果任意两个轨道相同(两列相同),行列式为零(即两个电子不能处于完全相同的状态)。

2. 自旋-轨道 (Spin-Orbitals: \(\phi_i\))

  • 概念: 白板左侧定义了自旋-轨道 (spin-orbital),它是构成 Slater 行列式的基本单元。每个自旋-轨道 \(\phi_i\) 都是一个空间轨道 \(\psi_j\) 和一个自旋函数 \(\sigma\) 的乘积。
  • 公式: \[\phi_i(x_i) = \psi_j(\vec{r}_i) \sigma(\omega_i)\]
    • \(\psi_j(\vec{r}_i)\) 是空间部分,如 \(1s\), \(2p_z\) 轨道,它描述了电子在空间中的分布。
    • \(\sigma(\omega_i)\) 是自旋部分,只有两种可能:自旋向上 \(\alpha(\omega)\) (spin-up) 或自旋向下 \(\beta(\omega)\) (spin-down)。

3. Hartree-Fock 能量 (The Hartree-Fock Energy: \(E_{HF}\))

  • 概念: 体系的总能量 \(E\) 是通过求解定态薛定谔方程 \(\hat{H}\Phi = E\Phi\) 得到的。在 HF 理论中,我们使用 Slater 行列式 \(\Phi\) 作为真实波函数的近似,并通过变分原理计算其能量期望值。这个能量就是 Hartree-Fock 能量 \(E_{HF}\)
  • 公式: \[E_{HF} = \langle\Phi|\hat{H}|\Phi\rangle = \int \Phi^* \hat{H} \Phi \,d\tau\]
    • \(\hat{H}\) 是体系的总哈密顿算符(Hamiltonian operator),即总能量算符。
    • \(\langle\Phi|...|\Phi\rangle\) 是 bra-ket 记号,代表对所有电子的所有坐标进行积分。

4. 能量的组成 (Components of the Energy)

  • 概念: 将上述 \(\Phi\)\(\hat{H}\)(包含所有电子的动能、核-电吸引、电-电排斥)代入 \(E_{HF}\) 的 积分并化简(这个过程被称为 Slater-Condon 规则),可以得到白板上最重要的核心公式:
  • 公式 (1) - 总结形式: \[E_{HF} = \sum_{i=1}^{N} E_{ii} + \frac{1}{2} \sum_{i,j=1}^{N} (J_{ij} - K_{ij})\]
    • 这个公式将总能量分解为三部分。\(\frac{1}{2}\) 因子是为了在 \(i,j\) 双重求和中避免重复计算电子对(\(i,j\) 对和 \(j,i\) 对)。当 \(i=j\) 时,\(J_{ii} = K_{ii}\),所以 \(J_{ii} - K_{ii} = 0\),求和中可以包含 \(i=j\) 项。

下面我们来详细分解这三项:

A. 单电子能量 (\(E_{ii}\)\(h_{ii}\))

  • 概念: 这是第 \(i\) 个电子在自旋-轨道 \(\phi_i\) 上的核心能量。它包括该电子自身的动能和它与所有原子核之间的库仑吸引能
  • 公式: \[E_{ii} = \langle\phi_i|\hat{h}|\phi_i\rangle = \int \phi_i^*(x_1) \left[ -\frac{\hbar^2}{2m}\nabla_1^2 + V_{\text{nuc}}(\vec{r}_1) \right] \phi_i(x_1) \,dx_1\]
    • \(\hat{h} = -\frac{\hbar^2}{2m}\nabla^2 + V_{\text{nuc}}(\vec{r})\) 是单电子核心哈密顿算符。
    • \(V_{\text{nuc}}(\vec{r}_1)\) 是电子 1 与所有原子核之间的吸引势能(白板上写作 \(V_{\text{ext}}\),即外部势)。

B. 库仑积分 (Coulomb Integral: \(J_{ij}\))

  • 概念: 这是经典的静电排斥能。它代表了处于 \(\phi_i\) 轨道的电子电荷云(密度为 \(|\phi_i|^2\))和处于 \(\phi_j\) 轨道的电子电荷云(密度为 \(|\phi_j|^2\))之间的平均静电排斥能
  • 公式: (如白板中间所推导) \[J_{ij} = \langle\phi_i \phi_j | \frac{e^2}{|\vec{r}_1 - \vec{r}_2|} | \phi_i \phi_j \rangle = \iint |\phi_i(x_1)|^2 \frac{e^2}{|\vec{r}_1 - \vec{r}_2|} |\phi_j(x_2)|^2 \,dx_1 dx_2\]
    • \(J_{ij}\) 总是正值(\(J_{ij} > 0\)),它使体系的总能量升高(排斥作用)。

C. 交换积分 (Exchange Integral: \(K_{ij}\))

  • 概念: 这是纯粹的量子力学效应,没有经典对应。它源于波函数的反对称性(泡利原理)。它只存在于自旋相同的电子对之间(白板上用 \(\delta(\sigma_i, \sigma_j)\)\(\delta_{\sigma_i \sigma_j}\) 表示)。
  • 公式: (如白板中间所推导) \[K_{ij} = \langle\phi_i \phi_j | \frac{e^2}{|\vec{r}_1 - \vec{r}_2|} | \phi_j \phi_i \rangle = \iint \phi_i^*(x_1) \phi_j^*(x_2) \frac{e^2}{|\vec{r}_1 - \vec{r}_2|} \phi_j(x_1) \phi_i(x_2) \,dx_1 dx_2\]
    • 关键区别: 注意 \(J_{ij}\)\(K_{ij}\) 在积分号内右侧的区别。在 \(K_{ij}\) 中,电子 1 和 2 在末态轨道上 “交换” 了位置。
    • \(K_{ij}\) 也是正值(\(K_{ij} > 0\))。但在总能量公式中,它以 \(-K_{ij}\) 出现。
    • 物理意义: 这意味着交换作用使体系的总能量降低(体系更稳定)。

5. 洪德定则 (Hund’s Rule)

  • 概念: 白板底部通过对比 \(\uparrow\uparrow\)\(\uparrow\downarrow\) 的能量标注了洪德定则(Hund’s rule),并指出 \(\uparrow\uparrow\)(自旋平行)能量更低。这正是交换积分 \(K_{ij}\) 的直接物理后果。
  • 解释:
    1. 洪德第一定则:电子在填充简并轨道(能量相同的轨道)时,会优先自旋平行地(\(\uparrow\uparrow\))分占不同轨道,以使总自旋最大。
    2. 原因:
      • 对于 \(\uparrow\uparrow\)(自旋平行)状态:两个电子自旋相同,它们之间的交换积分 \(K_{ij} \neq 0\)。总能量中包含 \(-K_{ij}\) 项,使能量降低
      • 对于 \(\uparrow\downarrow\)(自旋反平行)状态:两个电子自旋不同,它们之间的交换积分 \(K_{ij} = 0\)。没有交换作用带来的能量降低。
    • 因此,自旋平行的 \(\uparrow\uparrow\) 状态能量更低、更稳定,这正是 \(K\) 项(交换能)导致的。

6. Hartree-Fock 方程 (The HF Equation)

  • 概念: 白板最右侧开始写的 “HF eq…”。我们有了总能量 \(E_{HF}\) 的表达式,但我们还不知道 “最好” 的 \(\phi_i\) 是什么。HF 方法的目标就是找到一组 \(\phi_i\),使 \(E_{HF}\) 能量最低。
  • 方法: 使用变分法,在保持轨道 \(\phi_i\) 归一正交的约束下,最小化 \(E_{HF}\)(即 \(\delta E_{HF} = 0\))。
  • 结果: 这样做会得到一组 \(N\) 个耦合的单电子方程,即 Hartree-Fock 方程\[\hat{f}(x_1) \phi_i(x_1) = \epsilon_i \phi_i(x_1)\]
    • \(\epsilon_i\) 是第 \(i\) 个自旋-轨道的轨道能 (orbital energy)
    • \(\hat{f}(x_1)\)Fock 算符 (Fock operator),它是一个等效的单电子哈密顿算符。
  • Fock 算符 \(\hat{f}\): \[\hat{f}(x_1) = \hat{h}(x_1) + \sum_{j=1}^{N} \left[ \hat{J}_j(x_1) - \hat{K}_j(x_1) \right]\]
    • \(\hat{h}(x_1)\):就是之前的单电子核心算符(动能 + 核吸引)。
    • \(\hat{J}_j(x_1)\)库仑算符,代表电子 1 受到 \(\phi_j\) 轨道上电子的平均静电排斥。
    • \(\hat{K}_j(x_1)\)交换算符,代表电子 1 受到 \(\phi_j\) 轨道上电子的(仅当自旋相同时才有的)交换”吸引”。
  • 自洽场 (Self-Consistent Field, SCF): HF 方程的”棘手”之处在于 \(\hat{f}\) 算符本身依赖于它要求解的 \(\phi_j\) 轨道。这必须通过迭代法求解,即 SCF 方法
    1. 猜测一组 \(\phi_j\)
    2. \(\phi_j\) 构建 \(\hat{f}\) 算符。
    3. 求解 \(\hat{f}\phi_i = \epsilon_i \phi_i\) 得到一组新的 \(\phi_i\)
    4. 重复 2-3 步,直到输入的 \(\phi_j\) 和输出的 \(\phi_i\) 不再变化(达到”自洽”)为止。

II

从左到右展示了 Hartree-Fock (HF) 方程的最终形式、如何将其简化为闭壳层 (closed-shell) 情况,以及最终如何通过 LCAO 近似 将其转化为可解的矩阵方程,即 Roothaan-Hall 方程


1. Hartree-Fock (HF) 方程的推导与形式

白板的左侧展示了 HF 方程本身,它是通过变分法推导出来的。

  • 变分法 (Variational Principle): \[\delta \left( \langle\Phi|\hat{H}|\Phi\rangle - \sum_{i,j} \epsilon_{ij} (\langle\phi_i|\phi_j\rangle - \delta_{ij}) \right) = 0\]
    • 这是一个约束下的最小化问题。我们要求总能量 \(E_{HF} = \langle\Phi|\hat{H}|\Phi\rangle\) 的变分(\(\delta\))为零(即找到能量最小值)。
    • 约束条件是要求所有的自旋-轨道 \(\phi_i\) 必须保持归一正交\(\langle\phi_i|\phi_j\rangle = \delta_{ij}\))。
    • \(\epsilon_{ij}\)拉格朗日乘子 (Lagrange multipliers),在最终的方程中,对角项 \(\epsilon_{ii}\) 就被证明是轨道能 \(\epsilon_i\)
  • Hartree-Fock 方程 (HF equation): 对上述变分表达式求解,最终会得到 \(N\) 个耦合的单电子方程,白板上写出了其中第 \(i\) 个方程: \[\left[ -\frac{\hbar^2}{2m}\nabla_i^2 + V_{\text{ext}}(\vec{r}_i) \right] \phi_i(\vec{r}_i) + \sum_{j} \left( \dots \right) - \sum_{j} \left( \dots \right) = \epsilon_i \phi_i(\vec{r}_i)\] 这个方程可以更紧凑地写为: \[\hat{f}(x_1) \phi_i(x_1) = \epsilon_i \phi_i(x_1)\]
    • \(\epsilon_i\) 是第 \(i\) 个自旋-轨道的轨道能
    • \(\hat{f}(x_1)\)Fock 算符 (Fock operator),它是一个等效的单电子哈密顿算符。
  • SCF (自洽场): 如上一张图所述,Fock 算符 \(\hat{f}\) 本身又依赖于所有的 \(\phi_j\) 轨道。因此,这个方程必须通过自洽场 (Self-Consistent Field, SCF) 方法迭代求解。

2. 闭壳层 (Closed-Shell) 近似 (RHF)

白板右上方引入了一个重要的简化:闭壳层 (closed shell)

  • 概念: 假设体系中的 \(N\) 个电子(\(N\) 必须是偶数)两两配对,占据了 \(N/2\)空间轨道 \(\psi_i\)。每个空间轨道 \(\psi_i\) 都被一个自旋向上 (\(\alpha\)) 和一个自旋向下 (\(\beta\)) 的电子同时占据。
  • 能量公式 (RHF Energy): 在这种情况下,总能量公式(如上一张图所示)可以被积分并简化,求和从 \(N\) 个自旋-轨道转变为 \(N/2\) 个空间轨道: \[E_{HF} = 2 \sum_{i=1}^{N/2} E_{ii} + \sum_{i,j=1}^{N/2} (2J_{ij} - K_{ij})\]
    • \(E_{ii} = \langle\psi_i|\hat{h}|\psi_i\rangle\),是空间轨道 \(\psi_i\) 的核心能。因子 2 是因为每个轨道有两个电子。
    • \(J_{ij}\)\(K_{ij}\) 现在是空间轨道 \(\psi_i\)\(\psi_j\) 之间的库仑积分和交换积分。
    • \(2J_{ij}\):代表 \(\psi_i\) 中的两个电子和 \(\psi_j\) 中的两个电子之间的所有 4 种库仑排斥。
    • \(-K_{ij}\):代表 \(\psi_i\)\(\psi_j\) 中自旋相同的电子对(\(\alpha\)\(\alpha\), \(\beta\)\(\beta\))之间的 2 种交换作用。
  • Fock 算符 (RHF Fock Operator): \(\hat{f} |\phi_i \rangle = \epsilon_i |\phi_i \rangle\) 这一行是 HF 方程的抽象形式。在闭壳层情况下,它被改写为空间轨道的方程: \[\hat{f}(\vec{r}_1) \psi_i(\vec{r}_1) = \epsilon_i \psi_i(\vec{r}_1)\] 其中,闭壳层 Fock 算符 \(\hat{f}\) 定义为: \[\hat{f}(\vec{r}_1) = \hat{h}(\vec{r}_1) + \sum_{j=1}^{N/2} [2\hat{J}_j(\vec{r}_1) - \hat{K}_j(\vec{r}_1)]\]
    • \(\hat{h}\) 是单电子核心算符(动能+核吸引)。
    • \(\hat{J}_j\)\(\hat{K}_j\) 是根据空间轨道 \(\psi_j\) 定义的库仑算符交换算符
  • 库仑和交换算符 (Coulomb and Exchange Operators): 白板中间定义了这两个算符如何作用在一个任意函数 \(\psi(\vec{r}_1)\) 上:
    • 库仑算符 \(\hat{J}_j\): \[\hat{J}_j(\vec{r}_1) \psi_i(\vec{r}_1) = \left[ e^2 \int d\vec{r}_2 \frac{|\psi_j(\vec{r}_2)|^2}{|\vec{r}_1 - \vec{r}_2|} \right] \psi_i(\vec{r}_1)\] 它是一个局域 (local) 算符。它代表电子 1 受到 \(\psi_j\) 轨道中电子的平均电荷云 \(|\psi_j|^2\) 所产生的经典静电排斥势。
    • 交换算符 \(\hat{K}_j\): \[\hat{K}_j(\vec{r}_1) \psi_i(\vec{r}_1) = \left[ e^2 \int d\vec{r}_2 \frac{\psi_j^*(\vec{r}_2) \psi_i(\vec{r}_2)}{|\vec{r}_1 - \vec{r}_2|} \right] \psi_j(\vec{r}_1)\] 它是一个非局域 (non-local) 算符。算符 \(\hat{K}_j\)\(\psi_i\)\(\vec{r}_1\) 处的作用,却依赖于 \(\psi_i\)\(\psi_j\) 在所有其他点 \(\vec{r}_2\) 的值。这是 HF 理论的数学难点。

3. LCAO 近似与 Roothaan-Hall 方程

我们仍然无法直接解出 \(\psi_i\)。实际计算中,我们采用 LCAO (Linear Combination of Atomic Orbitals) 近似。

  • LCAO 近似: \[|\psi_i\rangle = \sum_{\mu=1}^{M} c_{\mu i} |\chi_\mu\rangle\]

    • 我们将未知的分子轨道 \(\psi_i\) 展开为一组已知的基函数 (basis functions) \(\chi_\mu\) 的线性组合。
    • 这组基函数通常是中心位于各个原子上的原子轨道(如 1s, 2p 等),共 \(M\) 个。
    • \(c_{\mu i}\)分子轨道系数。我们的问题从“求解一个复杂的函数 \(\psi_i\)”转变为“求解一组最佳的系数 \(c_{\mu i}\)”。
  • 推导 Roothaan-Hall 方程: 我们将 LCAO 展开式代入 HF 方程 \(\hat{f} |\psi_i\rangle = \epsilon_i |\psi_i\rangle\)\[\hat{f} \sum_{\mu} c_{\mu i} |\chi_\mu\rangle = \epsilon_i \sum_{\mu} c_{\mu i} |\chi_\mu\rangle\] 然后,从左侧乘上另一个基函数 \(\langle\chi_\nu|\) 并对全空间积分: \[\sum_{\mu} c_{\mu i} \langle\chi_\nu|\hat{f}|\chi_\mu\rangle = \epsilon_i \sum_{\mu} c_{\mu i} \langle\chi_\nu|\chi_\mu\rangle\] 这在白板的右下角有清晰的展示。

  • Roothaan-Hall 方程 (矩阵形式): 我们可以将上式定义为矩阵元:

    • Fock 矩阵 \(F\): \(F_{\nu\mu} = \langle\chi_\nu|\hat{f}|\chi_\mu\rangle\)
    • Overlap 矩阵 \(S\): \(S_{\nu\mu} = \langle\chi_\nu|\chi_\mu\rangle\) (基函数 \(\chi\) 通常不是正交的,所以 \(S \neq I\))
    • 系数矩阵 \(C\): \(C_{\mu i} = c_{\mu i}\)
    • 轨道能矩阵 \(\epsilon\): \(\epsilon_{ij} = \epsilon_i \delta_{ij}\) (一个对角矩阵)

    于是,上述方程 \(\sum_{\mu} F_{\nu\mu} C_{\mu i} = \sum_{\mu} S_{\nu\mu} C_{\mu i} \epsilon_i\) 就可以写成一个简洁的矩阵方程: \[\mathbf{F} \mathbf{C} = \mathbf{S} \mathbf{C} \mathbf{\epsilon}\] 这正是白板最底部所写的 ( F ) ( C ) = ( S ) ( C ) ( ε )。 这是一个广义本征值问题 (generalized eigenvalue problem)

    由于 Fock 矩阵 \(\mathbf{F}\) 依赖于系数矩阵 \(\mathbf{C}\)(因为 \(\hat{f}\) 依赖于 \(\psi_j\),而 \(\psi_j\) 依赖于 \(c_{\mu j}\)),这个方程仍然必须通过 SCF 迭代 求解,直到 \(C\)\(\epsilon\) 不再变化为止。

III

这张白板继续深入推导,展示了如何将抽象的 Roothaan-Hall 方程 (\(\mathbf{F} \mathbf{C} = \mathbf{S} \mathbf{C} \mathbf{\epsilon}\)) 转化为在实际计算中可以求解的具体形式。

核心思想是:将 Fock 矩阵 \(F_{\mu\nu}\) 的每一个元素,分解为基于已知基函数 \(\chi\) 的积分。


1. Fock 矩阵的构成 (Fock Matrix Elements)

白板左侧首先将 Fock 矩阵的元素 \(F_{\mu\nu}\) 分解为两部分:

\[F_{\mu\nu} = \langle\chi_\mu | \hat{f} | \chi_\nu\rangle = \langle\chi_\mu | \hat{h} + \sum_{j=1}^{N/2} (2\hat{J}_j - \hat{K}_j) | \chi_\nu\rangle\] \[F_{\mu\nu} = H_{\mu\nu}^{\text{core}} + G_{\mu\nu}\]

A. 核心哈密顿矩阵 (\(H_{\mu\nu}^{\text{core}}\))

  • 概念: 这是单电子部分,代表一个电子在基函数 \(\chi_\mu\)\(\chi_\nu\) 之间的”跳跃”能量。它只包含动能和与所有原子核的吸引能,不涉及电子间的排斥。
  • 公式: \[H_{\mu\nu}^{\text{core}} = \langle\chi_\mu | \hat{h} | \chi_\nu\rangle\] 其中,单电子核心哈密顿算符 \(\hat{h}\) 为: \[\hat{h} = -\frac{\hbar^2}{2m}\nabla^2 - \sum_A \frac{Z_A e^2}{|\vec{r} - \vec{R}_A|}\]
    • \(Z_A\)\(\vec{R}_A\) 分别是原子核 A 的电荷和位置。
    • 这一项在计算开始前就可以一次性计算并存储,因为它只取决于基函数和分子结构,不依赖于其他电子。

B. 双电子排斥矩阵 (\(G_{\mu\nu}\))

  • 概念: 这是 \(F_{\mu\nu}\) 中最复杂的部分,包含了所有电子-电子之间的平均库仑排斥和交换作用。
  • 公式: (对应白板左侧最下方标注的两项) \[G_{\mu\nu} = \sum_{j=1}^{N/2} \left( 2 \langle\chi_\mu | \hat{J}_j | \chi_\nu\rangle - \langle\chi_\mu | \hat{K}_j | \chi_\nu\rangle \right)\]
    • 这一项依赖于分子轨道 \(\psi_j\)(通过 \(\hat{J}_j\)\(\hat{K}_j\)),而 \(\psi_j\) 正是我们要解的未知量。

2. 双电子积分 (Two-Electron Integrals, ERIs)

白板的右侧是整个推导的关键。它展示了如何通过代入 LCAO 近似(\(\psi_j = \sum_\alpha C_{\alpha j} \chi_\alpha\))来消除对未知 \(\psi_j\) 的依赖,将其全部转化为对已知基函数 \(\chi\) 的积分。

A. LCAO 展开

我们将 LCAO 展开式代入 \(G_{\mu\nu}\) 中的库仑项和交换项。这会产生一个四重循环,涉及四个基函数。

  • 双电子积分 (ERI) 记号: 白板使用了物理学家记号 (physicist’s notation)\[\langle\mu\alpha|\hat{g}|\nu\beta\rangle = \langle\chi_\mu \chi_\alpha | \frac{e^2}{|\vec{r}_1 - \vec{r}_2|} | \chi_\nu \chi_\beta \rangle\] \[= \iint \chi_\mu^*(\vec{r}_1) \chi_\alpha^*(\vec{r}_2) \frac{e^2}{|\vec{r}_1 - \vec{r}_2|} \chi_\nu(\vec{r}_1) \chi_\beta(\vec{r}_2) \,d\vec{r}_1 d\vec{r}_2\]
    • \(\mu, \nu\) 对应电子 1 的基函数。
    • \(\alpha, \beta\) 对应电子 2 的基函数。

B. 库仑项 (Coulomb Term)

  • 推导: \[2 \sum_{j=1}^{N/2} \langle\chi_\mu | \hat{J}_j | \chi_\nu\rangle = 2 \sum_{j=1}^{N/2} \sum_{\alpha, \beta} C_{\alpha j}^* C_{\beta j} \langle\mu\alpha|\hat{g}|\nu\beta\rangle\]
    • 这对应白板右侧的第一个长公式。
    • 它代表了 \(\chi_\mu \chi_\nu\) 电荷密度(电子1)和所有 \(\chi_\alpha \chi_\beta\) 电荷密度(电子2)之间的库仑排斥。

C. 交换项 (Exchange Term)

  • 推导: \[-\sum_{j=1}^{N/2} \langle\chi_\mu | \hat{K}_j | \chi_\nu\rangle = -\sum_{j=1}^{N/2} \sum_{\alpha, \beta} C_{\alpha j}^* C_{\beta j} \langle\mu\alpha|\hat{g}|\beta\nu\rangle\]
    • 这对应白板右侧的第二个长公式。
    • 请注意索引的变化: 库仑项是 \(\langle\mu\alpha|\hat{g}|\nu\beta\rangle\),交换项是 \(\langle\mu\alpha|\hat{g}|\beta\nu\rangle\)。在交换项中,电子 1 和 2 的末态基函数被交换了 (\(\nu \leftrightarrow \beta\))。

3. 密度矩阵 (Density Matrix, \(P\))

最后,为了简化公式并明确其依赖关系,我们将与系数 \(C\)\(j\) 相关的求和项组合起来,定义一个新矩阵:密度矩阵 (Density Matrix) \(P\)

  • 定义:(基于白板的推导) \[P_{\beta\alpha} = \sum_{j=1}^{N/2} C_{\alpha j}^* C_{\beta j}\] (注:标准的化学定义通常是 \(P_{\beta\alpha} = 2 \sum_{j=1}^{N/2} C_{\alpha j}^* C_{\beta j}\)。白板似乎将因子 2 留在了外面,但最终目的是一样的。)

  • \(G_{\mu\nu}\) 的最终形式: 将 \(P\) 代入,双电子排斥矩阵 \(G_{\mu\nu}\) 可以写为: \[G_{\mu\nu} = \sum_{\alpha, \beta} P_{\beta\alpha} \left[ 2 \langle\mu\alpha|\hat{g}|\nu\beta\rangle - \langle\mu\alpha|\hat{g}|\beta\nu\rangle \right]\]

总结:自洽循环 (SCF Loop)

白板上的推导最终表明,Fock 矩阵 \(F_{\mu\nu}\) 的完整表达式为:

\[F_{\mu\nu} = H_{\mu\nu}^{\text{core}} + \sum_{\alpha, \beta} P_{\beta\alpha} \left[ 2 \langle\mu\alpha|\hat{g}|\nu\beta\rangle - \langle\mu\alpha|\hat{g}|\beta\nu\rangle \right]\]

这完美地展示了 SCF (自洽场) 的核心: 1. \(\mathbf{F}\) 依赖于 \(\mathbf{P}\):要构建 Fock 矩阵 \(\mathbf{F}\),你必须知道密度矩阵 \(\mathbf{P}\)。 2. \(\mathbf{P}\) 依赖于 \(\mathbf{C}\):要构建密度矩阵 \(\mathbf{P}\),你必须知道轨道系数矩阵 \(\mathbf{C}\)(通过 \(P_{\beta\alpha} = \sum_j C_{\alpha j}^* C_{\beta j}\))。 3. \(\mathbf{C}\) 依赖于 \(\mathbf{F}\):要找到系数矩阵 \(\mathbf{C}\),你必须求解 Roothaan-Hall 方程 \(\mathbf{F} \mathbf{C} = \mathbf{S} \mathbf{C} \mathbf{\epsilon}\)

这就形成了一个闭环,必须通过迭代求解: 猜测 \(\mathbf{C}^{(0)}\) \(\to\) 计算 \(\mathbf{P}^{(0)}\) \(\to\) 构建 \(\mathbf{F}^{(0)}\) \(\to\) 求解 \(\mathbf{F}^{(0)}\mathbf{C}^{(1)} = \mathbf{S}\mathbf{C}^{(1)}\mathbf{\epsilon}^{(1)}\) 得到新的 \(\mathbf{C}^{(1)}\) \(\to\) … 循环直到 \(\mathbf{C}\) 不再变化。
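
把上面这个最终公式直接翻译成 numpy 的 einsum(积分用随机数组代替,索引约定按白板的物理学家记号;这是示意写法,并非任何软件包的实现):

```python
# 由密度矩阵 P 构建双电子部分 G 与 Fock 矩阵 F(示意;积分为随机占位)
# 物理学家记号:eri[m, a, n, b] = <μα|g|νβ>(μ,ν 属于电子 1;α,β 属于电子 2)
import numpy as np

rng = np.random.default_rng(3)
n = 5
h_core = rng.normal(size=(n, n)); h_core = (h_core + h_core.T) / 2
eri = rng.normal(size=(n,) * 4)
c_occ = rng.normal(size=(n, 2))              # N/2 = 2 个占据轨道的系数(示意)

p = np.einsum("aj,bj->ba", c_occ, c_occ)     # P_βα = Σ_j C_αj C_βj(不含因子 2,同白板)

coulomb = np.einsum("ba,manb->mn", p, eri)   # Σ_αβ P_βα <μα|g|νβ>
exchange = np.einsum("ba,mabn->mn", p, eri)  # Σ_αβ P_βα <μα|g|βν>
f = h_core + 2 * coulomb - exchange          # F_μν = H_core_μν + G_μν
```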

统计机器学习Lecture-6

Lecturer: Prof.XIA DONG

1. Linear Model Selection and Regularization 线性模型选择与正则化

Summary of Core Concepts

Chapter 6: Linear Model Selection and Regularization, focusing specifically on Section 6.1: Subset Selection. 第六章:线性模型选择与正则化6.1节:子集选择

  • The Problem: You have a dataset with many potential predictor variables (features). If you include all of them (like Model 1 with \(p\) predictors in slide ...221320.png), you risk including “noise” variables. These irrelevant features can decrease model accuracy (overfitting) and make the model difficult to interpret. 数据集包含许多潜在的预测变量(特征)。如果包含所有这些变量(例如幻灯片“…221320.png”中带有\(p\)个预测变量的模型1),则可能会包含“噪声”变量。这些不相关的特征会降低模型的准确率(过拟合),并使模型难以解释。

  • The Goal: Identify a smaller subset of variables that are truly related to the response. This creates a simpler, more interpretable, and often more accurate model (like Model 2 with \(q\) predictors). 找出一个与响应真正相关的较小变量子集。这将创建一个更简单、更易于解释且通常更准确的模型(例如带有\(q\)个预测变量的模型2)。

  • The Main Method Discussed: Best Subset Selection

  • 主要讨论的方法:最佳子集选择 This is an exhaustive search algorithm. It checks every possible combination of predictors to find the “best” model. With \(p\) variables, this means checking \(2^p\) total models. 这是一种穷举搜索算法。它检查所有可能的预测变量组合,以找到“最佳”模型。对于 \(p\) 个变量,这意味着需要检查总共 \(2^p\) 个模型。

    The algorithm (from slide ...221333.png) works in three steps:

    1. Step 1: Fit the “null model” \(M_0\), which has no predictors (it just predicts the average of the response). 拟合“空模型”\(M_0\),它没有预测变量(它只预测响应的平均值)。

    2. Step 2: For each \(k\) (from 1 to \(p\)):

      • Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors. (e.g., fit all models with 1 predictor, then all models with 2 predictors, etc.).

      • 拟合所有包含 \(k\) 个预测变量的 \(\binom{p}{k}\) 个模型。(例如,先拟合所有包含 1 个预测变量的模型,然后拟合所有包含 2 个预测变量的模型,等等)。

      • From this group, select the single best model for that size \(k\). This “best” model is the one with the highest \(R^2\) (or lowest RSS - Residual Sum of Squares) on the training data. Call this model \(M_k\).

      • 从这组中,选择 对于该规模 \(k\) 的最佳模型。这个“最佳”模型是在 训练数据 上具有最高 \(R^2\)(或最低 RSS - 残差平方和)的模型。将此模型称为 \(M_k\)

    3. Step 3: You now have \(p+1\) models: \(M_0, M_1, \dots, M_p\). You must select the single best one from this list. To do this, you cannot use training \(R^2\) (as it will always pick the biggest model \(M_p\)). Instead, you must use a metric that estimates test error, such as: 现在你有 \(p+1\) 个模型:\(M_0, M_1, \dots, M_p\)。你必须从列表中选择一个最佳模型。为此,你不能使用训练 \(R^2\)(因为它总是会选择最大的模型 \(M_p\))。相反,你必须使用一个能够估计测试误差的指标,例如:

      • Cross-Validation (CV) 交叉验证 (CV) (This is what the Python code uses)
      • AIC (Akaike Information Criterion 赤池信息准则)
      • BIC (Bayesian Information Criterion 贝叶斯信息准则)
      • Adjusted \(R^2\) 调整后的 \(R^2\)
  • Key Takeaway: The slides show this “subset selection” concept can be applied beyond linear models. The Python code demonstrates this by applying best subset selection to a K-Nearest Neighbors (KNN) Regressor, a non-linear model. “子集选择”的概念可以应用于线性模型之外;幻灯片中的 Python 代码通过把最佳子集选择应用到非线性的 K 近邻 (KNN) 回归器来演示这一点(其核心模式见下方的示意代码)。
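
下面是按幻灯片描述重建的一个最小可运行草图(并非原始代码:Credit.csv 不在手边,这里用合成数据代替;evaluate_subset、best_subset_selection_parallel 等函数名沿用幻灯片的叫法,内部细节为合理假设):

```python
# Best subset selection + KNN 的最小重建草图(合成数据;函数名与流程按幻灯片描述假设)
from itertools import combinations

import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Income": rng.normal(50, 15, n),
    "Limit": rng.normal(5000, 1500, n),
    "Cards": rng.integers(1, 6, n),
    "Noise1": rng.normal(size=n),          # 故意加入的噪声列
    "Noise2": rng.normal(size=n),
})
y = 3 * X["Income"] + 0.05 * X["Limit"] + rng.normal(0, 10, n)   # 只有前两列真正有用

# KNN 对特征尺度敏感,必须先标准化
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

def evaluate_subset(subset, X, y):
    """对单个特征子集做 5 折 CV,返回 (子集, 平均 neg-MSE)。"""
    model = KNeighborsRegressor(n_neighbors=5)
    scores = cross_val_score(model, X[list(subset)], y,
                             cv=5, scoring="neg_mean_squared_error")
    return subset, scores.mean()

def best_subset_selection_parallel(feature_names, X, y, n_jobs=-1):
    """步骤 2:对每个 k 枚举全部 C(p, k) 个子集并并行评估。"""
    results = []
    for k in range(1, len(feature_names) + 1):
        subsets = combinations(feature_names, k)
        results += Parallel(n_jobs=n_jobs)(
            delayed(evaluate_subset)(s, X, y) for s in subsets)
    return results

results = best_subset_selection_parallel(list(X.columns), X_scaled, y)
best_subset, best_score = max(results, key=lambda r: r[1])       # neg-MSE 越接近 0 越好
print("Best feature subset:", best_subset, " RMSE ≈", round(np.sqrt(-best_score), 2))
```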

Mathematical Understanding & Key Questions 数学理解与关键问题

This section directly answers the questions posed on your slides.

How to compare which model is better?

(From slides ...221320.png and ...221326.png)

You cannot use training error (like \(R^2\) or RSS) to compare models with different numbers of predictors. A model with more predictors will almost always have a better training score, even if those extra predictors are just noise. This is called overfitting. 不能使用训练误差(例如 \(R^2\) 或 RSS)来比较具有不同数量预测变量的模型。具有更多预测变量的模型几乎总是具有更好的训练分数,即使这些额外的预测变量只是噪声。这被称为过拟合

To compare models of different sizes (like Model 1 vs. Model 2, or \(M_2\) vs. \(M_5\)), you must use a method that estimates test error (how the model performs on new, unseen data). The slides mention: 要比较不同大小的模型(例如模型 1 与模型 2,或 \(M_2\)\(M_5\)),您必须使用一种估算测试误差(模型在新的、未见过的数据上的表现)的方法。

  • Cross-Validation (CV): This is the gold standard. You split your data into “folds,” train the model on some folds, and test it on the remaining fold. You repeat this and average the test scores. The model with the best (e.g., lowest) average CV error is chosen. 将数据分成“折叠”,在一些折叠上训练模型,然后在剩余的折叠上测试模型。重复此操作并取测试分数的平均值。选择平均 CV 误差最小(例如,最小)的模型。

  • AIC & BIC: These are mathematical adjustments to the training error (like RSS) that add a penalty for having more predictors. They balance model fit with model complexity. 这些是对训练误差(如 RSS)的数学调整,会因预测变量较多而增加惩罚。它们平衡了模型拟合度和模型复杂度

Why use \(R^2\) in Step 2?

(From slide ...221333.png)

In Step 2, you are only comparing models of the same size (i.e., all models that have exactly \(k\) predictors). For models with the same number of parameters, a higher \(R^2\) (or lower RSS) on the training data directly corresponds to a better fit. You don’t need to penalize for complexity because all models being compared have the same complexity. 只比较大小相同的模型(即所有恰好具有 \(k\) 个预测变量的模型)。对于参数数量相同的模型,训练数据上更高的 \(R^2\)(或更低的 RSS)直接对应着更好的拟合度。您不需要对复杂度进行惩罚,因为所有被比较的模型都具有相同的复杂度

Why can’t we use training error in Step 3?

(From slide ...221333.png)

In Step 3, you are comparing models of different sizes (\(M_0\) vs. \(M_1\) vs. \(M_2\), etc.). As you add predictors, the training \(R^2\) will always go up (or stay the same), and the training RSS will always go down (or stay the same). If you used \(R^2\) to pick the best model in Step 3, you would always pick the most complex model \(M_p\), which is almost certainly overfit. 将比较不同大小的模型(例如 \(M_0\) vs. \(M_1\) vs. \(M_2\) 等)。随着您添加预测变量,训练 \(R^2\)始终上升(或保持不变),而训练 RSS 将始终下降(或保持不变)。如果您在步骤 3 中使用 \(R^2\) 来选择最佳模型,那么您始终会选择最复杂的模型 \(M_p\),而该模型几乎肯定会过拟合。

Therefore, you must use a metric that estimates test error (like CV) or penalizes for complexity (like AIC, BIC, or Adjusted \(R^2\)) to find the right balance between fit and simplicity. 因此,您必须使用一个可以估算测试误差(例如 CV)或惩罚复杂度(例如 AIC、BIC 或调整后的 \(R^2\))的指标来找到拟合度和简单性之间的平衡。

Code Analysis

The Python code (slides ...221249.jpg and ...221303.jpg) implements the Best Subset Selection algorithm using KNN Regression.

Key Functions

  • main():
    1. Loads Data: Reads the Credit.csv file.
    2. Preprocesses Data:
      • Converts categorical features (‘Gender’, ‘Student’, ‘Married’, ‘Ethnicity’) into numerical ones (dummy variables). 将分类特征(“性别”、“学生”、“已婚”、“种族”)转换为数值特征(虚拟变量)。
      • Creates the feature matrix X and target variable y (‘Balance’). 创建特征矩阵 X 和目标变量 y(“余额”)。
      • Scales the features using StandardScaler. This is crucial for KNN, which is sensitive to the scale of features. 用 StandardScaler 对特征进行缩放。这对于 KNN 至关重要,因为它对特征的缩放非常敏感。
    3. Adds Noise (in the second example): Slide ...221303.jpg shows code that adds 20 new “noisy” columns to the data. This is to test if the selection algorithm is smart enough to ignore them. 向数据中添加 20 个新的“噪声”列的代码。这是为了测试选择算法是否足够智能,能够忽略它们。
    4. Runs Selection: Calls best_subset_selection_parallel to do the main work.
    5. Prints Results: Finds the best subset (lowest error) and prints the top 20 best-performing subsets. 找到最佳子集(误差最小),并打印出表现最佳的前 20 个子集。
    6. Final Evaluation: It re-trains a KNN model on only the best subset and calculates the final cross-validated RMSE. 仅基于最佳子集重新训练 KNN 模型,并计算最终的交叉验证 RMSE。
  • evaluate_subset(subset, ...):
    • This is the “worker” function. It’s called for every single possible subset.
    • It takes a subset (a list of feature names, e.g., ['Income', 'Limit']).
    • It creates a new X_subset containing only those columns.
    • It runs 5-fold cross-validation (cross_val_score) on a KNN model using this X_subset.
    • It uses 'neg_mean_squared_error' as the metric. This is negative MSE; a higher score (closer to 0) is better. 它会创建一个新的“X_subset”,仅包含这些列。 它会使用此“X_subset”在 KNN 模型上运行 5 折交叉验证(“cross_val_score”)。 它使用“neg_mean_squared_error”作为度量标准。这是负 MSE;更高的分数(越接近 0)越好。
    • It returns the subset and its average CV score.
  • best_subset_selection_parallel(model, ...):
    • This is the “manager” function.这是“管理器”函数。
    • It iterates from k=1 up to the total number of features.它从“k=1”迭代到特征总数。
    • For each k, it generates all combinations of features of that size (this is the \(\binom{p}{k}\) part). 对于每个“k”,它会生成该大小的特征的所有组合(这是 \(\binom{p}{k}\) 部分)。
    • It uses Parallel and delayed (from joblib) to run evaluate_subset for all these combinations in parallel, speeding up the process significantly. 它使用 Paralleldelayed(来自 joblib)对所有这些组合并行运行 evaluate_subset,从而显著加快了处理速度。
    • It collects all the results and returns them.它收集所有结果并返回。

Analysis of the Output

  • Slide ...221255.png (Original Data):
    • The code runs subset selection on the original dataset.
    • The “Top 20 Best Feature Subsets” are shown. The CV scores are negative (they are neg_mean_squared_error), so the scores closest to zero (smallest magnitude) are best.
    • The Best feature subset is found to be ('Income', 'Limit', 'Rating', 'Student').
    • The final cross-validated RMSE for this model is 105.41.
  • Slide ...221309.png (Data with 20 Noisy Variables):
    • The code is re-run after adding 20 useless “Noisy” features.
    • The algorithm still works. It correctly identifies that the “Noisy” variables are useless.
    • The Best feature subset is now ('Income', 'Limit', 'Student'). (Note: ‘Rating’ was dropped, likely because it’s highly correlated with ‘Limit’, and the noisy data made the simpler model perform slightly better in CV).
    • The final RMSE is 114.94. This is higher than the original 105.41, which is expected—the presence of so many noise variables makes the selection problem harder, but the final model is still good and, most importantly, it successfully excluded all 20 noisy features. 最终的 RMSE 为 114.94。这比最初的 105.41更高,这是预期的——如此多的噪声变量的存在使得选择问题更加困难,但最终模型仍然很好,最重要的是,它成功地排除了所有 20 个噪声特征

Conceptual Overview: The “Why”

Slides cover Chapter 6: Linear Model Selection and Regularization, which is all about a fundamental trade-off in machine learning: the bias-variance trade-off. 该部分主要讨论机器学习中的一个基本权衡:偏差-方差权衡

  • The Problem (Slide ...221320.png): Imagine you have a dataset with 50 predictors (\(p=50\)). You want to predict a response \(y\). 假设你有一个包含 50 个预测变量(p=50)的数据集。你想要预测响应 \(y\)

    • Model 1 (Full Model): You use all 50 predictors. This model is very flexible. It will fit the training data extremely well, resulting in a low bias. However, it’s highly likely that many of those 50 predictors are just “noise” (random, unrelated variables). By fitting to this noise, the model will be overfit. When you show it new, unseen data (the test data), it will perform poorly. This is called high variance. 你使用了所有 50 个预测变量。这个模型非常灵活。它能很好地拟合训练数据,从而产生较低的偏差。然而,这 50 个预测变量中很可能有很多只是“噪声”(随机的、不相关的变量)。由于拟合这些噪声,模型会过拟合。当你向它展示新的、未见过的数据(测试数据)时,它的表现会很差。这被称为高方差
    • Model 2 (Subset Model): You intelligently select only the 3 predictors (\(q=3\)) that are actually related to \(y\). This model is less flexible. It won’t fit the training data as perfectly as Model 1 (it has higher bias). But, because it’s not fitting the noise, it will generalize much better to new data. It will have a much lower variance, and thus a lower overall test error. 你智能地只选择与 \(y\) 真正相关的 3 个预测变量 (\(q=3\))。这个模型的灵活性较差。它对训练数据的拟合度不如模型 1 完美(它的偏差更高)。但是,由于它不去拟合噪声,因此对新数据的泛化能力会好得多。它的方差会更低,因此总体的测试误差也会更低。
  • The Goal: The goal is to find the model that has the lowest test error. We need a formal method to find the best subset (like Model 2) without just guessing. 目标是找到测试误差最低的模型。我们需要一个正式的方法来找到最佳子集(例如模型 2),而不是仅仅靠猜测。

  • Two Main Strategies (Slide ...221314.png):

    1. Subset Selection (Section 6.1): This is what we’re focused on. It’s an “all-or-nothing” approach. You either keep a variable in the model or you discard it completely. The “Best Subset Selection” algorithm is the most extreme, “brute-force” way to do this. 是我们关注的重点。这是一种“全有或全无”的方法。你要么在模型中“保留”一个变量,要么“彻底丢弃”它。“最佳子集选择”算法是最极端、最“暴力”的做法。

    2. Shrinkage/Regularization (Section 6.2): This is a more subtle approach (e.g., Ridge Regression, LASSO). Instead of discarding variables, you keep all \(p\) variables but add a penalty to the model that “shrinks” the coefficients (\(\beta\)) of the useless variables towards zero. 这是一种更巧妙的方法(例如,岭回归、LASSO)。你不是丢弃变量,而是保留所有 \(p\) 个变量,但会给模型添加一个惩罚项,将无用变量的系数(\(\beta\))“收缩”到零。

Questions 🎯

Q1: “How to compare which model is better?”

(From slides ...221320.png and ...221326.png)

This is the most important question. You cannot use metrics based on training data (like \(R^2\) or RSS - Residual Sum of Squares) to compare models with different numbers of predictors. 这是最重要的问题。您不能使用基于训练数据的指标(例如 R^2 或 RSS - 残差平方和)来比较具有不同数量预测变量的模型。

  • The Trap: A model with more predictors will always have a higher \(R^2\) (or lower RSS) on the data it was trained on. \(R^2\) will always increase as you add variables, even if they are pure noise. If you used \(R^2\) to compare a 3-predictor model to a 10-predictor model, the 10-predictor model would always look better on paper, even if it’s terribly overfit. 具有更多预测变量的模型在其训练数据上总是具有更高的 R^2(或更低的 RSS)。随着变量的增加,R^2 会总是增加,即使这些变量是纯噪声。如果您使用 R^2 来比较 3 个预测变量的模型和 10 个预测变量的模型,那么 10 个预测变量的模型在纸面上总是看起来更好,即使它严重过拟合。

  • The Correct Way: You must use a metric that estimates the test error. The slides and code show two ways:您必须使用一个能够估计测试误差的指标。

    1. Cross-Validation (CV): This is the method used in your Python code. It works by:
      • Splitting your training data into \(k\) “folds” (e.g., 5 folds). 将训练数据拆分成 \(k\) 个“折叠”(例如 5 个折叠)。
      • Training the model on 4 folds and testing it on the 5th fold. 使用其中 4 个折叠训练模型,并使用第 5 个折叠进行测试。
      • Repeating this 5 times, so each fold gets to be the test set once. 重复此操作 5 次,使每个折叠都作为测试集一次。
      • Averaging the 5 test errors. 对 5 个测试误差求平均值。 This gives you a robust estimate of how your model will perform on unseen data. You then choose the model with the best (lowest) average CV error. 这可以让你对模型在未见数据上的表现有一个稳健的估计。然后,你可以选择平均 CV 误差最小(最佳)的模型。
    2. Mathematical Adjustments (AIC, BIC, Adjusted \(R^2\)): These are formulas that take the training error (like RSS) and add a penalty for each predictor (\(k\)) you add.
      • \(AIC \approx RSS + 2k\sigma^2\)
      • \(BIC \approx RSS + \log(n)k\sigma^2\) A model with more predictors (larger \(k\)) gets a bigger penalty. To be chosen, a more complex model must significantly improve the RSS to overcome this penalty. 预测变量越多(k 越大)的模型,惩罚越大。要被选中,更复杂的模型必须显著提升 RSS 以克服此惩罚。

Q2: “Why using \(R^2\) for step 2?”

(From slide ...221333.png)

Step 2 of the “Best Subset Selection” algorithm says: “For \(k = 1, \dots, p\): Fit all \(\binom{p}{k}\) models… Pick the best model, that with the largest \(R^2\), … and call it \(M_k\).” “对于 \(k = 1, \dots, p\):拟合所有 \(\binom{p}{k}\) 个模型……选择具有最大 \(R^2\) 的最佳模型……并将其命名为 \(M_k\)。”

  • The Reason: In Step 2, you are only comparing models of the same size. For example, when \(k=3\), you are comparing all possible 3-predictor models: 步骤 2 中,您比较**相同大小的模型。例如,当 \(k=3\) 时,您将比较所有可能的 3 预测变量模型:
    • Model A: (\(X_1, X_2, X_3\))
    • Model B: (\(X_1, X_2, X_4\))
    • Model C: (\(X_1, X_3, X_5\))
    • …and so on.
    Since all these models have the exact same complexity (they all have \(k=3\) predictors), there is no risk of unfairly favoring a more complex model. Therefore, you are free to use a training metric like \(R^2\) (or RSS). The model with the highest \(R^2\) is, by definition, the one that best fits the training data for that specific size \(k\). 由于所有这些模型都具有完全相同的复杂度(它们都具有 \(k=3\) 个预测变量),因此不存在不公平地偏向更复杂模型的风险。因此,您可以自由使用像 \(R^2\)(或 RSS)这样的训练指标。根据定义,具有最高 \(R^2\) 的模型就是在特定大小 \(k\)与训练数据拟合度最高的模型。

Q3: “Cannot use training error in Step 3.” Why not? “步骤 3 中不能使用训练误差。” 为什么?

(From slide ...221333.png)

Step 3 says: “Select a single best model from \(M_0, M_1, \dots, M_p\) by cross validation, AIC, or BIC.”“通过交叉验证、AIC 或 BIC,从 \(M_0、M_1、\dots、M_p\) 中选择一个最佳模型。”

  • The Reason: In Step 3, you are now comparing models of different sizes. You are comparing the best 1-predictor model (\(M_1\)) vs. the best 2-predictor model (\(M_2\)) vs. the best 3-predictor model (\(M_3\)), and so on, all the way up to \(M_p\). 在步骤 3 中,您正在比较不同大小的模型。您正在比较最佳的单预测模型 (\(M_1\))、最佳的双预测模型 (\(M_2\)) 和最佳的三预测模型 (\(M_3\)),依此类推,直到 \(M_p\)

    As explained in Q1, if you used a training error metric like \(R^2\) here, the \(R^2\) would just keep going up, and you would always select the largest, most complex model, \(M_p\). This completely defeats the purpose of model selection. 如问题 1 所述,如果您在此处使用像 \(R^2\) 这样的训练误差指标,那么 \(R^2\) 会持续上升,并且您总是会选择最大、最复杂的模型 \(M_p\)。这完全违背了模型选择的目的。

    Therefore, in Step 3, you must use a method that estimates test error (like Cross-Validation) or one that penalizes for complexity (like AIC or BIC) to find the “sweet spot” model that balances fit and simplicity. 因此,在步骤 3 中,您必须使用一种估算测试误差的方法(例如交叉验证)或惩罚复杂性的方法(例如 AIC 或 BIC),以找到在拟合度和简单性之间取得平衡的“最佳点”模型。

Mathematical Deep Dive 🧮

  • \(Y = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p + \epsilon\): The full linear model. The goal of subset selection is to find a subset of \(X_j\)’s where \(\beta_j \neq 0\) and set all other \(\beta\)’s to 0. 完整的线性模型。子集选择的目标是找到 \(X_j\) 的一个子集,其中 \(\beta_j \neq 0\),并将所有其他 \(\beta\) 设置为 0。
  • \(2^p\) combinations: (Slide ...221333.png) This is the total number of models you have to check. For each of the \(p\) variables, you have two choices: either it is IN the model or it is OUT.这是你需要检查的模型总数。对于每个 \(p\) 个变量,你有两个选择:要么它在模型内部,要么它在模型外部
    • Example: \(p=3\) (variables \(X_1, X_2, X_3\))
    • The \(2^3 = 8\) possible models are:
      1. {} (The null model, \(M_0\))
      2. { \(X_1\) }
      3. { \(X_2\) }
      4. { \(X_3\) }
      5. { \(X_1, X_2\) }
      6. { \(X_1, X_3\) }
      7. { \(X_2, X_3\) }
      8. { \(X_1, X_2, X_3\) } (The full model, \(M_3\))
    • This is why this method is called an “exhaustive search”. It literally checks every single one. For \(p=20\), \(2^{20}\) is over a million models!这就是该方法被称为“穷举搜索”的原因。它实际上会检查每一个模型。对于 \(p=20\)\(2^{20}\) 就超过一百万个模型!
  • \(\binom{p}{k} = \frac{p!}{k!(p-k)!}\): (Slide ...221333.png) This is the “combinations” formula. It tells you how many models you fit in Step 2 for a specific \(k\).这是“组合”公式。它告诉你,对于特定的 \(k\)在步骤 2中,你拟合了 多少 个模型。
    • Example: \(p=10\) total predictors.
    • For \(k=1\): You fit \(\binom{10}{1} = 10\) models.
    • For \(k=2\): You fit \(\binom{10}{2} = \frac{10 \times 9}{2 \times 1} = 45\) models.
    • For \(k=3\): You fit \(\binom{10}{3} = \frac{10 \times 9 \times 8}{3 \times 2 \times 1} = 120\) models.
    • …and so on. The sum of all these \(\binom{p}{k}\) from \(k=0\) to \(k=p\) equals \(2^p\).
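
可以用几行代码验证 \(\sum_{k=0}^{p} \binom{p}{k} = 2^p\),并直观感受穷举搜索的规模(p 的取值仅为示例):

```python
# 验证 Σ_{k=0}^{p} C(p, k) = 2^p,并感受穷举搜索的规模
from math import comb

for p in (3, 10, 20):
    total = sum(comb(p, k) for k in range(p + 1))
    assert total == 2 ** p
    print(f"p = {p:2d}: 需要拟合 {total:,} 个模型")
# p = 20 时已超过一百万个模型;这正是 p 较大时转而使用收缩/正则化等方法的原因
```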

Detailed Code Analysis 💻

Your slides show Python code that applies the Best Subset Selection algorithm to a KNN Regressor. This is a great example of how the selection algorithm is independent of the model type (as mentioned in slide ...221314.png).

Key Functions

  • main()
    1. Load & Preprocess: Reads Credit.csv. The most important step here is converting categorical text (like ‘Male’/‘Female’) into numbers (1/0).
    2. Scale Data: scaler = StandardScaler() and X_scaled = scaler.fit_transform(X).
      • WHY? This is CRITICAL for KNN. KNN works by measuring distance. If ‘Income’ (e.g., 50,000) is on a vastly different scale than ‘Cards’ (e.g., 3), the ‘Income’ feature will completely dominate the distance calculation, making ‘Cards’ irrelevant. Scaling resizes all features to have a mean of 0 and standard deviation of 1, so they all contribute fairly.
    3. Handle Noisy Data (Slide ...221303.jpg): This version of the code intentionally adds 20 columns of useless, random numbers. This is a test to see if the algorithm is smart enough to ignore them.
    4. Run Selection: results_df = best_subset_selection_parallel(...). This function does all the heavy lifting (explained next).
    5. Find Best Model: results_df.sort_values(by='CV_Score', ascending=False).
      • WHY ascending=False? The code uses the metric 'neg_mean_squared_error'. This is MSE, but negative (e.g., -15000). A better model has an error closer to 0 (e.g., -10000). Since -10000 is greater than -15000, you sort in descending (high-to-low) order to put the best models at the top.
    6. Final Evaluation (Step 3): final_scores = cross_val_score(knn, X_best, y, ...)
      • This is the implementation of Step 3. It takes only the single best subset (X_best) and runs a new cross-validation on it. This gives a final, unbiased estimate of how good that one model is.
    7. Print RMSE: final_rmse = np.sqrt(-final_scores). It converts the negative MSE back into a positive RMSE (Root Mean Squared Error), which is in the same units as the target \(y\) (in this case, ‘Balance’ in dollars).
  • best_subset_selection_parallel(model, ...)
    1. This is the “manager” function. It implements the loop from Step 2.
    2. for k in range(1, n_features + 1): This is the loop “For \(k = 1, \dots, p\)”.
    3. subsets = list(combinations(feature_names, k)): This generates the \(\binom{p}{k}\) combinations for the current \(k\).
    4. results = Parallel(n_jobs=n_jobs)(...): This is a non-core, “speed-up” command. It uses the joblib library to run the evaluations on all your computer’s CPU cores at once (in parallel). Without this, checking millions of models would take days.
    5. subset_scores = ... [delayed(evaluate_subset)(...) ...] This line farms out the actual work to the evaluate_subset function for every single subset.
  • evaluate_subset(subset, ...)
    1. This is the “worker” function. It gets called thousands or millions of times.
    2. Its job is to evaluate one single subset (e.g., ('Income', 'Limit', 'Student')).
    3. X_subset = X[list(subset)]: It slices the data to get only these columns.
    4. scores = cross_val_score(model, X_subset, ...): This is the most important line. It takes the subset and performs a full 5-fold cross-validation on it.
    5. return (subset, np.mean(scores)): It returns the subset and its average CV score.
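Below is a minimal sketch of the manager/worker pair just described. It is not the slide's exact code: the function signatures, column handling, and CV settings are assumptions chosen to make the sketch self-contained, with a KNN regressor standing in for the slides' model.

from itertools import combinations

import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

def evaluate_subset(model, X, y, subset):
    # Worker: score ONE candidate subset with 5-fold CV (negative MSE).
    X_subset = X[list(subset)]
    scores = cross_val_score(model, X_subset, y, cv=5,
                             scoring='neg_mean_squared_error')
    return subset, np.mean(scores)

def best_subset_selection_parallel(model, X, y, n_jobs=-1):
    # Manager: loop over subset sizes k = 1..p and farm the work out in parallel.
    feature_names = list(X.columns)
    results = []
    for k in range(1, len(feature_names) + 1):
        subsets = list(combinations(feature_names, k))
        subset_scores = Parallel(n_jobs=n_jobs)(
            delayed(evaluate_subset)(model, X, y, s) for s in subsets
        )
        results.extend(subset_scores)
    return pd.DataFrame(results, columns=['Subset', 'CV_Score'])

# Example usage (X must already be scaled, as discussed above):
# results_df = best_subset_selection_parallel(KNeighborsRegressor(), X_scaled_df, y)
# best = results_df.sort_values(by='CV_Score', ascending=False).iloc[0]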

Summary of Outputs (Slides ...221255.png & ...221309.png)

  • Original Data (Slide ...221255.png):
    • Best Subset: ('Income', 'Limit', 'Rating', 'Student')
    • Final RMSE: ~105.4
  • Data with 20 “Noisy” Variables (Slide ...221309.png):
    • Best Subset: ('Income', 'Limit', 'Student')
    • Result: The algorithm successfully identified that all 20 “Noisy” variables were useless and excluded every single one of them from the best models.
    • Final RMSE: ~114.9
    • Key Takeaway: The RMSE is slightly higher, which makes sense because the selection problem was much harder. But the method worked perfectly. It filtered all the “noise” and found a simple, powerful model, just as the theory on slide ...221320.png predicted.

2. The Core Problem: Training Error vs. Test Error

The central theme of these slides is finding the “best” model. The problem is that a model with more predictors (more complex) will always fit the data it was trained on better. This is a trap.

  • Training Error: How well the model fits the data we used to build it. \(R^2\) and \(RSS\) measure this.
  • Test Error: How well the model predicts new, unseen data. This is what we actually care about. A model that is too complex (e.g., has 10 predictors when only 3 are useful) will have low training error but very high test error. This is called overfitting.

The goal is to choose a model that has the lowest test error. The metrics below (Adjusted \(R^2\), AIC, BIC) are all attempts to estimate this test error without having to actually collect new data. They do this by adding a penalty for complexity.

Basic Metrics (Measures of Fit)

These formulas from slide 13 describe how well a model fits the training data.

Residual (Error)

  • Formula: \(\hat{\epsilon}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_j x_{ij}\)
  • Concept: This is the most basic building block. It’s the difference between the actual observed value (\(y_i\)) and the value your model predicted (\(\hat{y}_i\)). It is the “error” for a single data point.

Residual Sum of Squares (RSS)

  • Formula: \(RSS = \sum_{i=1}^{n} \hat{\epsilon}_i^2\)
  • Concept: This is the overall measure of model error. You square all the individual errors (residuals) to make them positive and then add them all up.
  • Goal: The entire process of linear regression (called “Ordinary Least Squares”) is designed to find the \(\hat{\beta}\) coefficients that make this RSS value as small as possible.
  • The Flaw: \(RSS\) will always decrease (or stay the same) as you add more predictors (\(p\)). A model with all 10 predictors will have a lower \(RSS\) than a model with 9, even if that 10th predictor is useless. Therefore, \(RSS\) is useless for choosing between models of different sizes.

R-squared (\(R^2\))

  • Formula: \(R^2 = 1 - \frac{SS_{error}}{SS_{total}} = 1 - \frac{RSS}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\)
  • Concept: This metric reframes \(RSS\) into a more interpretable percentage.
    • \(SS_{total}\) (the denominator) represents the total variance of the data. It’s the error you would get if your “model” was just guessing the average value (\(\bar{y}\)) for every single observation.
    • \(SS_{error}\) (the \(RSS\)) is the error after using your model.
    • \(R^2\) is the “proportion of total variance explained by the model.” An \(R^2\) of 0.75 means your model can explain 75% of the variation in the response variable.
  • The Flaw: Just like \(RSS\), \(R^2\) will always increase (or stay the same) as you add more predictors. This is visually confirmed in Figure 6.1, where the red line for \(R^2\) only goes up. It will always pick the most complex model (see the short demo after this list).
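A tiny synthetic demo of this flaw (a sketch; the data is random, not the Credit data, and statsmodels is assumed to be available):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
y = 2 * x1 + rng.normal(size=n)          # only x1 is truly related to y
junk = rng.normal(size=(n, 5))           # five useless predictors

fit_small = sm.OLS(y, sm.add_constant(x1)).fit()
fit_big = sm.OLS(y, sm.add_constant(np.column_stack([x1, junk]))).fit()

# The bigger model's R^2 is never lower, even though the extra columns are pure noise.
print(fit_small.rsquared, fit_big.rsquared)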

Advanced Metrics (For Model Selection)

These metrics “fix” the flaw of \(R^2\) by including a penalty for the number of predictors.

Adjusted \(R^2\)

  • Formula: \[ \text{Adjusted } R^2 = 1 - \frac{RSS / (n - p - 1)}{SS_{total} / (n - 1)} \]
  • Mathematical Concept: This formula replaces the “Sum of Squares” (\(SS\)) with “Mean Squares” (\(MS\)).
    • \(MS_{error} = \frac{RSS}{n-p-1}\)
    • \(MS_{total} = \frac{SS_{total}}{n-1}\)
  • The “Penalty” Explained: The penalty is degrees of freedom.
    • \(n\) = number of data points.
    • \(p\) = number of predictors.
    • The term \(n-p-1\) is the degrees of freedom for the residuals. You start with \(n\) data points, but you “use up” one degree of freedom to estimate the intercept (\(\hat{\beta}_0\)) and \(p\) more to estimate the \(p\) slopes.
  • How it Works:
    1. When you add a new predictor (increase \(p\)), \(RSS\) goes down, which makes the numerator (\(MS_{error}\)) smaller.
    2. …But, increasing \(p\) also decreases the denominator (\(n-p-1\)), which makes the numerator (\(MS_{error}\)) larger.
    • This creates a “tug-of-war.” If the new predictor is useful, it will drop \(RSS\) a lot, and Adjusted \(R^2\) will increase. If the new predictor is useless, \(RSS\) will barely change, and the penalty from decreasing the denominator will win, causing Adjusted \(R^2\) to decrease.
  • Goal: You select the model with the highest Adjusted \(R^2\).
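To see the “tug-of-war” numerically, here is the formula applied directly to made-up RSS values (a sketch, not real data):

def adjusted_r2(rss, ss_total, n, p):
    # Adjusted R^2 = 1 - MS_error / MS_total
    ms_error = rss / (n - p - 1)
    ms_total = ss_total / (n - 1)
    return 1 - ms_error / ms_total

n, ss_total = 100, 1000.0
# Useful predictor: RSS drops a lot (400 -> 300), Adjusted R^2 goes up.
print(adjusted_r2(400.0, ss_total, n, p=3), adjusted_r2(300.0, ss_total, n, p=4))
# Useless predictor: RSS barely changes (400 -> 399), Adjusted R^2 goes down.
print(adjusted_r2(400.0, ss_total, n, p=3), adjusted_r2(399.0, ss_total, n, p=4))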

Akaike Information Criterion (AIC)

  • General Formula: \(AIC = -2 \log \ell(\hat{\theta}) + 2d\)
  • Concept Breakdown:
    • \(\ell(\hat{\theta})\): This is the Maximized Likelihood Function.
      • The Likelihood Function \(\ell(\theta)\) asks: “Given a set of model parameters \(\theta\), how probable is the data we observed?”
      • The Maximum Likelihood Estimate (MLE) \(\hat{\theta}\) is the specific set of parameters (the \(\hat{\beta}\)’s) that maximizes this probability.
    • \(\log \ell(\hat{\theta})\): The log-likelihood. This is just a number that represents the best possible fit the model can achieve for the data. A higher number is a better fit.
    • \(-2 \log \ell(\hat{\theta})\): This is the Deviance. Since a higher log-likelihood is better, a lower deviance is better. This term measures poorness-of-fit.
    • \(d\): The number of parameters estimated by the model. (e.g., \(p\) predictors + 1 intercept).
    • \(2d\): This is the Penalty Term.
  • How it Works: \(AIC = (\text{Poorness-of-Fit}) + (\text{Complexity Penalty})\). As you add predictors, the fit gets better (the deviance term goes down), but the penalty term (\(2d\)) goes up.
  • Goal: You select the model with the lowest AIC.

Bayesian Information Criterion (BIC)

  • General Formula: \(BIC = -2 \log \ell(\hat{\theta}) + \log(n)d\)
  • Concept: This has the same structure as AIC; only the penalty term differs.
    • AIC Penalty: \(2d\)
    • BIC Penalty: \(\log(n)d\)
  • Comparison:
    • \(n\) is the number of observations in your dataset.
    • As long as your dataset has 8 or more observations (\(n \ge 8\)), \(\log(n)\) will be greater than 2.
    • This means BIC applies a much harsher penalty for complexity than AIC.
  • Consequence: BIC will tend to choose simpler models (fewer predictors) than AIC.
  • Goal: You select the model with the lowest BIC.
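In practice you rarely compute either criterion by hand: a fitted statsmodels OLS model exposes both as attributes. A small sketch on synthetic data (the last loop just prints \(\log(n)\) to show where BIC's penalty overtakes AIC's):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(100, 3)))
y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=100)

fit = sm.OLS(y, X).fit()
# statsmodels reports both criteria directly; lower is better for each.
print(fit.aic, fit.bic)

# BIC's penalty log(n)*d exceeds AIC's 2*d once n >= 8 (natural log).
for n in (7, 8, 100, 400):
    print(n, np.log(n))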

The Deeper Theory: Why AIC Works

Slide 27 (“Understanding AIC”) gives the deep mathematical justification.

  • Goal: We have a true, unknown process \(p\) that generates our data. We are creating a model \(\hat{p}_j\). We want our model to be as “close” to the truth as possible.
  • Kullback-Leibler (K-L) Distance: This is a function \(K(p, \hat{p}_j)\) that measures the “information lost” when you use your model \(\hat{p}_j\) to approximate the truth \(p\). You want to minimize this distance.
  • The Math:
    1. \(K(p, \hat{p}_j) = \int p(y) \log \left( \frac{p(y)}{\hat{p}_j(y)} \right) dy\)
    2. This splits into: \(K(p, \hat{p}_j) = \underbrace{\int p(y) \log(p(y)) dy}_{\text{Constant}} - \underbrace{\int p(y) \log(\hat{p}_j(y)) dy}_{\text{This is what we need to maximize}}\)
  • The Problem: We can’t calculate that second term because it requires knowing the true function \(p\).
  • Akaike’s Insight: Akaike proved that the log-likelihood we can calculate, \(\log \ell(\hat{\theta})\), is a biased estimator of that target. He also proved that the bias is approximately \(-d\).
  • The Solution: An unbiased estimate of the target is \(\log \ell(\hat{\theta}) - d\).
  • Final Step: For historical and statistical reasons, he multiplied this by \(-2\) to create the final AIC formula.
  • Conclusion: AIC is not just a random formula. It is a carefully derived estimate of how much information your model loses compared to the “truth” (i.e., its expected performance on new data).
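Written out, the last two steps are just one line of algebra:
\[
AIC = -2\left(\log \ell(\hat{\theta}) - d\right) = -2\log \ell(\hat{\theta}) + 2d
\]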

AIC/BIC for Linear Regression

Slide 26 shows how these general formulas simplify for linear regression (assuming normal, Gaussian errors).

  • General Formula: \(AIC = -2 \log \ell(\hat{\theta}) + 2d\)
  • Linear Regression Formula: \(AIC = \frac{1}{n\hat{\sigma}^2}(RSS + 2d\hat{\sigma}^2)\)

Key Insight: For linear regression, the “poorness-of-fit” term (\(-2 \log \ell(\hat{\theta})\)) is directly proportional to the \(RSS\).

This makes it much easier to understand. You can just think of the formulas as:

  • AIC \(\approx RSS + 2d\hat{\sigma}^2\)
  • BIC \(\approx RSS + \log(n)d\hat{\sigma}^2\)

(Here \(\hat{\sigma}^2\) is an estimate of the error variance, which can often be treated as a constant).

This clearly shows the trade-off: We want a model with a low \(RSS\) (good fit) and a low \(d\) (low complexity). These two goals are in direct competition.

Mallow’s \(C_p\): The slide notes that \(C_p\) is equivalent to AIC for linear regression. The \(C_p\) formula is \(C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2_{full})\), where \(\hat{\sigma}^2_{full}\) is the error variance estimated from the full model. Since \(n\) and \(\hat{\sigma}^2_{full}\) are constants, minimizing \(C_p\) is mathematically identical to minimizing \(RSS + 2d\hat{\sigma}^2_{full}\), which is the same logic as AIC.
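A small numeric illustration of this trade-off (a sketch with made-up RSS values and an assumed \(\hat{\sigma}^2_{full}\), just to show both penalized criteria turning upward once the fit stops improving):

import numpy as np

n = 400                       # sample size (the ISLR Credit data has 400 rows)
sigma2_full = 100.0           # assumed: error variance estimated from the full model
rss_by_d = {2: 90000.0, 3: 60000.0, 4: 45000.0, 5: 44900.0}  # made-up RSS values

for d, rss in rss_by_d.items():
    cp = (rss + 2 * d * sigma2_full) / n
    bic_like = (rss + np.log(n) * d * sigma2_full) / n
    print(d, round(cp, 1), round(bic_like, 1))
# Going from d=4 to d=5 barely lowers RSS, so both penalized criteria get worse.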

3. Variable Selection

Core Concept: The Problem of Variable Selection

In regression, we want to model a response variable \(Y\) using a set of \(p\) predictor variables \(X_1, X_2, ..., X_p\).

  • The “Kitchen Sink” Problem: A common temptation is to include all available predictors in the model: \[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon\] This often leads to overfitting. The model may fit the training data well but will perform poorly on new, unseen data. It’s also hard to interpret a model with dozens of predictors.

  • The Solution: Subset Selection. The goal is to find a smaller subset of the predictors that builds a model that is:

    1. Accurate: Has low prediction error.
    2. Parsimonious: Uses the fewest predictors necessary.
    3. Interpretable: Is simple enough for a human to understand.

Your slides present two main methods to achieve this: Best Subset Selection and Forward Stepwise Selection.

Method 1: Best Subset Selection (BSS)

This is the “brute force” approach. It considers every single possible model.

Conceptual Algorithm

  1. Fit all models with \(k=1\) predictor (there are \(p\) of these). Find the best one (lowest RSS) and call it \(M_1\).
  2. Fit all models with \(k=2\) predictors (there are \(\binom{p}{2}\) of these). Find the best one and call it \(M_2\).
  3. Fit the one model with \(k=p\) predictors (the full model), \(M_p\).
  4. You now have a list of \(p\) “best” models: \(M_1, M_2, ..., M_p\).
  5. Use a selection criterion (like Adjusted \(R^2\), BIC, AIC, or \(C_p\)) to choose the single best model from this list.

Mathematical & Computational Cost (from slide 225641.png)

  • For each predictor, there are two possibilities: it’s either IN the model or OUT.
  • With \(p\) predictors, the total number of models to test is \(2 \times 2 \times ... \times 2\) (\(p\) times).
  • Total Models = \(2^p\)
  • This is a “combinatorial explosion.” As the slide notes, if \(p=20\), \(2^{20} = 1,048,576\) models. This is computationally infeasible for large \(p\).

Method 2: Forward Stepwise Selection (FSS)

This is a “greedy” algorithm. It’s an efficient alternative to BSS that does not test every model.

Conceptual Algorithm (from slides 225645.png & 225648.png)

  • Step 1: Start with the null model, \(M_0\), which has no predictors. \[M_0: Y = \beta_0 + \epsilon\] The prediction is just the sample mean of \(Y\).

  • Step 2 (Iterative):

    • For \(k=0\) (to get \(M_1\)): Fit all \(p\) models that add one predictor to \(M_0\). Choose the best one (lowest RSS or highest \(R^2\)). This is \(M_1\). Let’s say it contains \(X_1\).
    • For \(k=1\) (to get \(M_2\)): Keep \(X_1\) in the model. Fit all \(p-1\) models that add one more predictor to \(M_1\) (e.g., \(M_1+X_2\), \(M_1+X_3\), …). Choose the best of these. This is \(M_2\).
    • Repeat: Continue this process, adding one variable at a time, until all \(p\) predictors are in the model \(M_p\).
  • Step 3: You now have a sequence of \(p+1\) models: \(M_0, M_1, ..., M_p\). Choose the single best model from this sequence using Adjusted \(R^2\), AIC, BIC, or \(C_p\).

Mathematical & Computational Cost (from slide 225651.png)

  • To find \(M_1\), you fit \(p\) models.
  • To find \(M_2\), you fit \(p-1\) models.
  • To find \(M_p\), you fit \(1\) model.
  • The null model \(M_0\) is 1 model.
  • Total Models = \(1 + \sum_{k=0}^{p-1} (p-k) = 1 + p + (p-1) + ... + 1 = 1 + \frac{p(p+1)}{2}\)
  • As the slide notes, if \(p=20\), this is only \(1 + 20(21)/2 = 211\) models. This is vastly more efficient than BSS.
  • Key weakness: The method is “greedy.” If it adds \(X_1\) in Step 1, it can never be removed. It’s possible the true best 2-variable model is \((X_2, X_3)\), but if FSS chose \(X_1\) as the best 1-variable model, it will never find \((X_2, X_3)\).
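A minimal sketch of this greedy loop, using statsmodels OLS and RSS as the within-size criterion (the function and variable names here are illustrative, not from the slides):

import statsmodels.api as sm

def forward_stepwise(X, y):
    # Greedy forward selection: returns the sequence M_1, ..., M_p
    # (each entry is the list of columns in that model; M_0 is just the mean of y).
    remaining = list(X.columns)
    selected, path = [], []
    while remaining:
        def rss(cols):
            fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
            return (fit.resid ** 2).sum()
        # Try adding each remaining predictor and keep the one with the lowest RSS.
        best = min(remaining, key=lambda c: rss(selected + [c]))
        selected.append(best)
        remaining.remove(best)
        path.append(list(selected))
    return path

# Usage sketch: models = forward_stepwise(X, y); then compare the p candidate
# models with Adjusted R^2, AIC, BIC, or cross-validation to pick one.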

4. How to Choose the “Best” Model: The Criteria

You can’t use RSS or \(R^2\) to compare models with different numbers of predictors (\(k\)). This is because RSS always decreases (and \(R^2\) always increases) as you add more variables. You must use a criterion that penalizes complexity.

  • RSS (Residual Sum of Squares): Goal is to minimize. \[RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] Good for comparing models of the same size \(k\).

  • Adjusted R-squared (\(Adj. R^2\)): Goal is to maximize. \[Adj. R^2 = 1 - \frac{(1-R^2)(n-1)}{n-p-1}\] This “adjusts” \(R^2\) by adding a penalty for having more predictors (\(p\)). Adding a useless predictor will make \(Adj. R^2\) go down.

  • Mallow’s \(C_p\): Goal is to minimize. \[C_p \approx \frac{1}{n}(RSS + 2p\hat{\sigma}^2)\] Here, \(\hat{\sigma}^2\) is an estimate of the error variance from the full model (with all \(p\) predictors). A good model will have \(C_p \approx p\).

  • AIC (Akaike Information Criterion) & BIC (Bayesian Information Criterion): Goal is to minimize. \[AIC = 2p - 2\ln(\hat{L})\] \[BIC = p\ln(n) - 2\ln(\hat{L})\] Here, \(\hat{L}\) is the maximized likelihood of the model. You don’t need to calculate this by hand; software provides it.

    • Key difference: BIC’s penalty for \(p\) is \(p\ln(n)\), while AIC’s is \(2p\). Since \(\ln(n)\) is almost always \(> 2\) (for \(n>7\)), BIC applies a much heavier penalty for complexity.
    • This means BIC tends to choose smaller, more parsimonious models than AIC or \(Adj. R^2\).

5. Python Code Analysis (Slide 225546.jpg)

This slide shows the Python code for Best Subset Selection (BSS).

# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from itertools import combinations # <-- This is the BSS engine

Block 1: Load the Credit dataset

# 1. Load the Credit dataset
Credit = pd.read_csv('Credit.csv')
Credit['ID'] = Credit['ID'].astype(str)
(num_samples, num_predictors) = Credit.shape

# Convert categorical text data to numerical (dummy variables)
Credit['Gender'] = Credit['Gender'].map({'Male': 1, 'Female': 0})
Credit['Student'] = Credit['Student'].map({'Yes': 1, 'No': 0})
Credit['Married'] = Credit['Married'].map({'Yes': 1, 'No': 0})
Credit['Ethnicity'] = Credit['Ethnicity'].map({'Asian': 1, 'Caucasian': 1, 'African American': 0})
  • pd.read_csv: Reads the data into a pandas DataFrame.
  • .map(): This is a crucial preprocessing step. Regression models require numbers, not text like ‘Yes’ or ‘Male’. This line converts those strings into 1s and 0s.

Block 2: Plot scatterplot matrix

# 2. Plot scatterplot matrix
selected_columns = ['Balance', 'Education', 'Age', 'Cards', 'Rating', 'Limit', 'Income']
sns.set(style="ticks")
sns.pairplot(Credit[selected_columns], diag_kind='kde')
plt.suptitle('Scatterplot Matrix', y=1.02)
plt.show()
  • sns.pairplot: A powerful visualization from the seaborn library. The resulting plot (right side of the slide) is a grid.
    • Diagonal plots (kde): Show the distribution (Kernel Density Estimate) of a single variable (e.g., ‘Balance’ is skewed right).
    • Off-diagonal plots (scatter): Show the relationship between two variables (e.g., ‘Limit’ and ‘Rating’ are almost perfectly linear). This helps you visually spot potentially strong predictors.

Block 3: Best Subset Selection

# 3. Best Subset Selection
# (This code is incomplete on the slide, so the logic is filled in below)

# Define target and predictors (exclude the non-numeric ID column)
target = 'Balance'
predictors = [col for col in Credit.columns if col not in (target, 'ID')]
nvmax = 10  # Max number of predictors to test (up to 10)

# Initialize list to store model statistics
model_stats = []

# Iterate over number of predictors from 1 to nvmax
for k in range(1, nvmax + 1):

    # Generate all possible combinations of predictors of size k
    # This is the core of BSS
    for subset in list(combinations(predictors, k)):

        # Get the design matrix (X)
        X_subset = Credit[list(subset)]

        # Add a constant (intercept) term to the model
        # Y = B0 + B1*X1 -> statsmodels needs B0 to be added manually
        X_subset_const = sm.add_constant(X_subset)

        # Get the target variable (y)
        y_target = Credit[target]

        # Fit the Ordinary Least Squares (OLS) model
        model = sm.OLS(y_target, X_subset_const).fit()

        # Calculate RSS
        RSS = ((model.resid) ** 2).sum()

        # (The full code would also record R-squared, Adj. R-sq, BIC, etc. here)
        model_stats.append({'k': k, 'subset': subset, 'RSS': RSS})
  • for k in range(1, nvmax + 1): This is the outer loop that iterates from \(k=1\) (1 predictor) to \(k=10\) (10 predictors).
  • list(combinations(predictors, k)): This is the inner loop and the most important line. The itertools.combinations function is a highly efficient way to generate all unique subsets.
    • When \(k=1\), it returns [('Income',), ('Limit',), ('Rating',), ...].
    • When \(k=2\), it returns [('Income', 'Limit'), ('Income', 'Rating'), ('Limit', 'Rating'), ...].
    • This is what generates the \(2^p\) (or in this case, \(\sum_{k=1}^{10} \binom{p}{k}\)) models to test.
  • sm.add_constant(X_subset): Your regression equation is \(Y = \beta_0 + \beta_1X_1\). The \(X_1\) is your X_subset. The sm.add_constant function adds a column of 1s to your data, which allows the statsmodels library to estimate the \(\beta_0\) (intercept) term.
  • sm.OLS(y_target, X_subset_const).fit(): This fits the Ordinary Least Squares (OLS) model, which finds the \(\beta\) coefficients that minimize the RSS.
  • model.resid: This attribute of the fitted model contains the residuals (\(e_i = y_i - \hat{y}_i\)) for each data point.
  • ((model.resid) ** 2).sum(): This line is the direct code implementation of the formula \(RSS = \sum e_i^2\).

Synthesizing the Results (The Plots)

After running the BSS code, you get the data used in the plots and the table.

  • Image 225550.png (Adjusted R-squared)

    • Goal: Maximize.
    • What it shows: The gray dots are all the models tested for each \(k\). The red line connects the single best model for each \(k\).
    • Conclusion: The plot shows a sharp “elbow.” The \(Adj. R^2\) increases dramatically up to \(k=4\), then increases very slowly. The maximum is around \(k=6\) or \(k=7\), but the gain after \(k=4\) is minimal.
  • Image 225554.png (BIC)

    • Goal: Minimize.
    • What it shows: BIC heavily penalizes complexity.
    • Conclusion: The plot shows a very clear minimum. The BIC value plummets from \(k=2\) to \(k=3\) and hits its lowest point at \(k=4\). After \(k=4\), the penalty for adding more variables is larger than the benefit in model fit, so the BIC score starts to rise. This is a very strong vote for the 4-predictor model.
  • Image 225635.png (Mallow’s \(C_p\))

    • Goal: Minimize.
    • What it shows: A very similar story to BIC.
    • Conclusion: The \(C_p\) value drops significantly and hits its minimum at \(k=4\).
  • Image 225638.png (Summary Table)

    • This is the most important image for the final conclusion. It summarizes the red line from all the plots.
    • Look at the row for Num_Predictors = 4. The predictors are (Income, Limit, Cards, Student).
    • Now look at the columns for BIC and Cp.
      • BIC: 4841.615607. This is the lowest value in the entire BIC column (the value at \(k=3\) is 4865.352851).
      • Cp: 7.122228. This is also the lowest value in the Cp column.
    • The Adj_R_squared at \(k=4\) is 0.953580, which is very close to its maximum of ~0.954 at \(k=7-10\).

Final Conclusion: All three “penalized” criteria (Adjusted \(R^2\), BIC, and \(C_p\)) point to the same conclusion. While \(Adj. R^2\) is a bit ambiguous, BIC and \(C_p\) provide a clear signal that the best, most parsimonious model is the 4-predictor model using Income, Limit, Cards, and Student.

4. Subset Selection

Summary of Subset Selection

These slides introduce subset selection, a process in statistical learning used to identify the best subset of predictors (variables) for a regression model. The goal is to find a model that has low prediction error and avoids overfitting by excluding irrelevant variables.

The slides cover two main “greedy” (stepwise) algorithms and the criteria used to select the final best model.

Stepwise Selection Algorithms

Instead of testing all \(2^p\) possible models (which is “best subset selection” and computationally unfeasible), stepwise methods build a single path of models.

Forward Stepwise Selection

This is an additive (bottom-up) approach:

  1. Start with the null model (no predictors).
  2. Find the best 1-variable model (the one that gives the lowest Residual Sum of Squares, or RSS).
  3. Add the single variable that, when added to the current model, results in the new best model (lowest RSS).
  4. Repeat this process until all \(p\) predictors are in the model.
  5. This generates a sequence of \(p+1\) models, from \(\mathcal{M}_0\) to \(\mathcal{M}_p\).

Backward Stepwise Selection

This is a subtractive (top-down) approach:

  1. Start with the full model containing all \(p\) predictors.
  2. Find the best \((p-1)\)-variable model by removing the single variable that results in the lowest RSS (or highest \(R^2\)). This variable is considered the least significant.
  3. Remove the next variable that, when removed from the current best model, gives the new best model.
  4. Repeat until only the null model remains.
  5. This also generates a sequence of \(p+1\) models.

Pros and Cons (Backward Selection)

  • Pro: Computationally efficient compared to best subset. It fits \(1 + \sum_{k=0}^{p-1}(p-k) = \mathbf{1 + p(p+1)/2}\) models, which is much less than \(2^p\). (e.g., for \(p=20\), it’s 211 models vs. >1 million).
  • Con: Cannot be used if \(p > n\) (more predictors than observations), because the initial full model cannot be fit.
  • Con (for both): These methods are greedy. A variable added in forward selection is never removed, and a variable removed in backward selection is never added back. This means they are not guaranteed to find the true best model.

Choosing the Final Best Model

Both forward and backward selection give you a set of candidate models (e.g., the best 1-variable model, best 2-variable model, etc.). You must then choose the single best one. The slides show two main approaches:

A. Direct Error Estimation

Use a validation set or cross-validation (CV) to estimate the test error for each model (e.g., the 1-variable, 2-variable… models). Choose the model with the lowest estimated test error.

B. Adjusted Metrics (Penalizing for Complexity)

Standard RSS and \(R^2\) will always improve as you add variables, leading to overfitting. Instead, use metrics that penalize the model for having too many predictors.

  • Mallows’ \(C_p\): An estimate of test Mean Squared Error (MSE). \[C_p = \frac{1}{n} (RSS + 2d\hat{\sigma}^2)\] (where \(d\) is the number of predictors, and \(\hat{\sigma}^2\) is an estimate of the error variance). You want to find the model with the minimum \(C_p\).

  • BIC (Bayesian Information Criterion): \[BIC = \frac{1}{n} (RSS + \log(n)d\hat{\sigma}^2)\] BIC’s penalty \(\log(n)\) is stronger than \(C_p\)’s (or AIC’s) penalty of \(2\), so it tends to select smaller (more parsimonious) models. You want to find the model with the minimum BIC.

  • Adjusted \(R^2\): \[R^2_{adj} = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}\] (where \(TSS\) is the Total Sum of Squares). Unlike \(R^2\), this metric can decrease if adding a variable doesn’t help enough. You want to find the model with the maximum Adjusted \(R^2\).

Python Code Understanding

The slides use the regsubsets() function from the leaps package in R.

# R Code from slides
library(leaps)
# Forward Selection
regfit.fwd <- regsubsets(Balance~., data=Credit, method="forward", nvmax=11)
# Backward Selection
regfit.bwd <- regsubsets(Balance~., data=Credit, method="backward", nvmax=11)

In Python, the standard tool for this is SequentialFeatureSelector from scikit-learn.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Assume 'Credit' is a pandas DataFrame with 'Balance' as the target
X = Credit.drop(columns=['Balance', 'ID'])  # drop the target and the non-numeric ID column
y = Credit['Balance']

# Initialize the linear regression estimator
model = LinearRegression()

# --- Forward Selection ---
# direction='forward' starts with 0 features and adds them.
# To get the best 4-variable model, for example:
sfs_forward = SequentialFeatureSelector(
    model,
    n_features_to_select=4,
    direction='forward',
    cv=None  # defaults to 5-fold cross-validation; use e.g. cv=10 for 10-fold
)
sfs_forward.fit(X, y)
print("Forward selection best 4 features:")
print(sfs_forward.get_feature_names_out())


# --- Backward Selection ---
# direction='backward' starts with all features and removes them
sfs_backward = SequentialFeatureSelector(
    model,
    n_features_to_select=4,
    direction='backward',
    cv=None
)
sfs_backward.fit(X, y)
print("\nBackward selection best 4 features:")
print(sfs_backward.get_feature_names_out())

# Note: To replicate the plots, you would loop this process,
# changing 'n_features_to_select' from 1 to p,
# record the model scores (e.g., RSS, AIC, BIC) at each step,
# and then plot the results.
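One way to flesh out that closing note, continuing directly from the model, X, and y defined above (a sketch; it scores each candidate size with 10-fold CV rather than RSS/AIC/BIC):

from sklearn.model_selection import cross_val_score

cv_mse = {}
for k in range(1, X.shape[1]):            # the full model (k = p) is just all of X
    sfs = SequentialFeatureSelector(model, n_features_to_select=k,
                                    direction='forward', cv=10)
    sfs.fit(X, y)
    cols = list(sfs.get_feature_names_out())
    scores = cross_val_score(model, X[cols], y, cv=10,
                             scoring='neg_mean_squared_error')
    cv_mse[k] = -scores.mean()

best_k = min(cv_mse, key=cv_mse.get)
print(best_k, cv_mse[best_k])
# Plotting cv_mse against k reproduces the "CV error vs. number of variables" curve.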

Important Images

  1. Slide ...230014.png (Forward Selection Plots) & ...230036.png (Backward Selection Plots):

    • What they are: These \(2 \times 2\) plot grids are the most important visuals. They show Residual Sum of Squares (RSS), Adjusted \(R^2\), BIC, and Mallows’ \(C_p\) plotted against the Number of Variables.
    • Why they’re important: They are the decision-making tool. You use these plots to choose the best model.
      • You look for the “elbow” or minimum value for BIC and \(C_p\).
      • You look for the “peak” or maximum value for Adjusted \(R^2\).
      • (RSS is not used for selection as it always decreases).
  2. Slide ...230040.png (Find the best model):

    • What it is: This slide shows a close-up of the \(C_p\), BIC, and Adjusted \(R^2\) plots, with the “best” model (the min/max) marked with a blue ‘x’.
    • Why it’s important: It explicitly states the selection criteria. The text highlights that BIC suggests a 4-variable model, while the other two are “rather flat” after 4, making the choice less obvious but pointing to a simple model.
  3. Slide ...230045.png (BIC vs. Validation vs. CV):

    • What it is: This shows three plots for selecting the best model using different criteria: BIC, Validation Set Error, and Cross-Validation Error.
    • Why it’s important: It shows that different selection criteria can lead to different “best” models. Here, BIC (a mathematical adjustment) picks a 4-variable model, while validation and CV (direct error estimation) both pick a 6-variable model.

The slides use the Credit dataset to demonstrate two key tasks: 1. Running different subset selection algorithms (forward, backward, best). 2. Using various statistical metrics (BIC, \(C_p\), CV error) to choose the single best model.

Comparing Selection Algorithms (The Path)

This part of the example compares the sequence of models selected by “Forward Stepwise” selection versus “Best Subset” selection.

Key Result (from Table 6.1):

This table is the most important result for comparing the algorithms.

Variables   Best Subset                      Forward Stepwise
one         rating                           rating
two         rating, income                   rating, income
three       rating, income, student          rating, income, student
four        cards, income, student, limit    rating, income, student, limit

Summary of this result:

  • Identical for 1, 2, and 3 variables: Both methods agree on the best one-variable model (rating), the best two-variable model (rating, income), and the best three-variable model (rating, income, student).
  • They Diverge at 4 variables:
    • Forward selection is greedy. It started with rating, income, student and was “stuck” with them. It then added limit, as that was the best variable to add to its existing 3-variable model.
    • Best subset selection is not greedy. It tests all possible 4-variable combinations. It discovered that the model cards, income, student, limit has a slightly lower RSS than the model forward selection found.
  • Main Takeaway: This demonstrates the limitation of a greedy algorithm. Forward selection missed the “true” best 4-variable model because it was locked into its previous choices and couldn’t “swap out” rating for cards.

Choosing the Single Best Model (The Destination)

This is the most critical part of the analysis. After running a selection algorithm (like forward, backward, or best subset), you get a list of the “best” models for each size (best 1-variable, best 2-variable, etc.). Now you must decide: is the best model the 4-variable one, the 6-variable one, or another?

The slides show several plots to help make this decision, all plotted against the “Number of Predictors.”

Summary of Plot Results:

Here’s what each plot tells you:

  • Residual Sum of Squares (RSS) (e.g., in slide ...230014.png, top-left)
    • What it shows: RSS always decreases as you add more variables. It drops sharply until 4 variables, then flattens out.
    • Conclusion: This plot is not useful for picking the best model because it will always pick the full model, which is overfit. It’s only used to see the diminishing returns of adding new variables.
  • Adjusted \(R^2\) (e.g., in slide ...230040.png, right)
    • What it shows: This metric penalizes adding useless variables. The plot rises quickly, then flattens, peaking at its maximum value around 6 or 7 variables.
    • Conclusion: This metric suggests a 6 or 7-variable model.
  • Mallows’ \(C_p\) (e.g., in slide ...230040.png, left)
    • What it shows: This is an estimate of test error. We want the model with the minimum \(C_p\). The plot drops to a low value at 4 variables and stays low, with its absolute minimum around 6 or 7 variables.
    • Conclusion: This metric also suggests a 6 or 7-variable model.
  • BIC (Bayesian Information Criterion) (e.g., in slide ...230040.png, center)
    • What it shows: This is another estimate of test error, but it has a stronger penalty for model complexity. The plot shows a clear “U” shape, reaching its minimum value at 4 variables and then increasing afterward.
    • Conclusion: This metric strongly suggests a 4-variable model.
  • Validation Set & Cross-Validation (CV) Error (Slide ...230045.png)
    • What it shows: These plots show the direct estimate of test error (not a mathematical adjustment like BIC or \(C_p\)). Both the validation set error and the 10-fold CV error show a “U” shape.
    • Conclusion: Both methods reach their minimum error at 6 variables. This is considered a very reliable result.

Final Summary of Results

The analysis of the Credit dataset reveals two strong candidates for the “best” model, depending on your goal:

  1. The 6-Variable Model: This model is supported by the Adjusted \(R^2\), Mallows’ \(C_p\), and (most importantly) the Validation Set and 10-fold Cross-Validation results. These metrics all indicate that the 6-variable model has the lowest prediction error on new data.

  2. The 4-Variable Model: This model is supported by BIC. Because BIC penalizes complexity more heavily, it selects a simpler (more parsimonious) model.

Overall Conclusion: If your primary goal is maximum predictive accuracy, you should choose the 6-variable model. If your goal is a simpler, more interpretable model that is still very good (and avoids any risk of overfitting), the 4-variable model is an excellent choice.

5. Two main strategies for controlling model complexity in linear regression

This presentation covers two main strategies for controlling model complexity in linear regression: Subset Selection (choosing which variables to include) and Shrinkage Methods (keeping all variables but reducing the impact of their coefficients).

Subset Selection

This method involves selecting a subset of the \(p\) total predictors to use in the model.

Key Concepts & Formulas

  • The Model: The standard linear regression model is represented in matrix form: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\] The goal of subset selection is to find a coefficient vector \(\boldsymbol{\beta}\) that is sparse, meaning it has many zero entries.

  • Forward Selection: This is a greedy algorithm that starts with an empty model and iteratively adds the single predictor that most improves the fit.

  • Theoretical Guarantee: Can forward selection find the true sparse set of variables?

    • Yes, if the predictors are not strongly correlated.
    • This is quantified by the Mutual Coherence Condition. Assuming the predictors \(\mathbf{x}_i\) are normalized, the method is guaranteed to work if: \[\mu = \max_{i \neq j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle| < \frac{1}{2s - 1}\] where \(s\) is the number of true non-zero coefficients and \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\) represents the correlation between predictors.
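A direct way to compute \(\mu\) for a given design matrix (a sketch; the matrix here is random, and the columns are normalized to unit length first):

import numpy as np

def mutual_coherence(X):
    # Normalize each column to unit length, then take the largest
    # absolute off-diagonal entry of the Gram matrix.
    Xn = X / np.linalg.norm(X, axis=0)
    G = np.abs(Xn.T @ Xn)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
mu = mutual_coherence(X)
s = 3
print(mu, 1 / (2 * s - 1))   # forward selection is guaranteed to work if mu < 1/(2s-1)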

Practical Application: Finding the Best Model Size

How do you know whether to choose a model with 3, 4, or 5 variables? You use Cross-Validation (CV).

  • Important Image: The plot titled “10-fold CV” (from the first slide) is the most important visual. It plots the estimated test error (CV Error) on the y-axis against the number of variables in the model on the x-axis.

  • The “One Standard Deviation Rule”: Looking at the plot, the error drops sharply and then flattens. The absolute minimum error might be at 6 variables, but it’s only slightly better than the 3-variable model.

    1. Find the model with the lowest CV error.
    2. Calculate the standard error for that error estimate.
    3. Select the simplest model (fewest variables) whose error is within one standard deviation of the minimum.
    4. This follows Occam’s razor: choose the simplest explanation (model) that fits the data well enough. In the example given, this rule selects the 3-variable model.
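A minimal sketch of the rule itself, given per-size CV errors and their standard errors (both arrays below are made-up numbers chosen so the rule lands on 3 variables, matching the example):

import numpy as np

cv_err = np.array([290, 180, 120, 118, 117, 115, 116, 117, 118, 119], dtype=float)
cv_se = np.full_like(cv_err, 5.0)       # assumed standard errors of the CV estimates

best = np.argmin(cv_err)                # model size with the minimum CV error
threshold = cv_err[best] + cv_se[best]  # tolerance: minimum + one standard error
# Simplest model (fewest variables) whose error is within the tolerance:
one_se_choice = np.min(np.where(cv_err <= threshold)[0]) + 1
print(one_se_choice)                    # -> 3 variables for these made-up numbers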

Code Interpretation (R vs. Python)

The R code in the first slide performs this 10-fold CV manually for forward selection:

  1. It loops from p = 1 to 10 (model sizes).
  2. Inside the loop, it identifies the p variables chosen by a pre-computed forward selection model (regfit.fwd).
  3. It fits a new model (glm.fit) using only those p variables.
  4. It runs 10-fold CV (cv.glm) on that specific model to get its test error.
  5. It stores the error in CV10.err[p].
  6. Finally, it plots the results.

In Python (with scikit-learn): This entire process is often automated.

  • You would use sklearn.feature_selection.RFECV (Recursive Feature Elimination with Cross-Validation).
  • RFECV automatically performs cross-validation to find the optimal number of features, effectively producing the same plot and result as the R code.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Conceptual Python equivalent for finding the best model size
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_regression

# X, y = load_your_data()
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=10, random_state=42)

estimator = LinearRegression()
# RFECV will test models with 1 feature, 2 features, etc.,
# and use cross-validation (cv=10) to find the best number.
selector = RFECV(estimator, step=1, cv=10, scoring='neg_mean_squared_error')
selector = selector.fit(X, y)

print(f"Optimal number of features: {selector.n_features_}")
# You can plot selector.cv_results_['mean_test_score'] to get the CV curve

Shrinkage Methods (Regularization)

Instead of explicitly removing variables, shrinkage methods keep all \(p\) variables but shrink their coefficients \(\beta_j\) towards zero.

Ridge Regression

Ridge regression is a prime example of a shrinkage method.

  • Objective Function: It finds the coefficients \(\boldsymbol{\beta}\) that minimize a new quantity: \[\underbrace{\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2}_{\text{RSS (Goodness of Fit)}} + \underbrace{\lambda \sum_{j=1}^{p} \beta_j^2}_{\text{$\ell_2$ Penalty (Shrinkage)}}\]

  • The \(\lambda\) Tuning Parameter: This parameter controls the strength of the penalty:

    • If \(\lambda = 0\): The penalty term disappears. Ridge regression is identical to standard Ordinary Least Squares (OLS).
    • If \(\lambda \to \infty\): The penalty is “infinitely” strong. To minimize the function, all coefficients \(\beta_j\) (for \(j=1...p\)) are forced to be zero. The model becomes an intercept-only model.
    • Note: The intercept \(\beta_0\) is not penalized.
  • The Bias-Variance Trade-off: This is the core concept of regularization.

    • Standard OLS has low bias but can have high variance (it overfits).
    • Ridge regression adds a small amount of bias (the coefficients are “wrong” on purpose) to significantly reduce the model’s variance.
    • This trade-off often leads to a model with a lower overall test error.
  • Matrix Solution: The discussion slide asks “What is the solution?”. While OLS has the solution \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\), the Ridge solution is: \[\hat{\boldsymbol{\beta}}^R = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\] where \(\mathbf{I}\) is the identity matrix. The \(\lambda \mathbf{I}\) term adds a “ridge” to the diagonal, making the matrix invertible even if \(\mathbf{X}^T\mathbf{X}\) is singular (which happens if \(p > n\) or predictors are collinear).
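A quick numerical check of this closed form against scikit-learn's Ridge (a sketch on synthetic data; fit_intercept=False so that both sides solve exactly the same penalized problem):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(size=100)
lam = 2.0

# Closed-form Ridge solution: (X^T X + lambda I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn's Ridge minimizes ||y - Xb||^2 + alpha * ||b||^2, the same objective
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_closed, beta_sklearn))   # True (up to numerical tolerance)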

An Essential Step: Standardization

  • Problem: The \(\ell_2\) penalty \(\lambda \sum \beta_j^2\) is applied equally to all coefficients. If predictor \(x_1\) (e.g., house size in sq-ft) is on a much larger scale than \(x_2\) (e.g., number of rooms), its coefficient \(\beta_1\) will naturally be much smaller than \(\beta_2\). The penalty will unfairly punish \(\beta_2\) more.
  • Solution: You must standardize your inputs before fitting a Ridge model.
  • Formula: For each predictor \(X_j\), all its observations \(x_{ij}\) are rescaled: \[\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}\] (where \(\bar{x}_j\) is the mean of the predictor and \(\sigma_j\) is its standard deviation). This puts all predictors on a common scale (mean=0, std=1).

In Python (with scikit-learn):

  • You use sklearn.preprocessing.StandardScaler to standardize your data.
  • You use sklearn.linear_model.Ridge to fit the model.
  • You use sklearn.linear_model.RidgeCV to automatically find the best value for \(\lambda\) (called alpha in scikit-learn) using cross-validation.
# Conceptual Python code for Ridge Regression
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X, y = load_your_data()

# Create a pipeline that first standardizes the data,
# then fits a Ridge model.
# RidgeCV tests a range of alphas (lambdas) automatically.
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0], scoring='neg_mean_squared_error')
)

model.fit(X, y)

print(f"Best alpha (lambda): {model.named_steps['ridgecv'].alpha_}")
print(f"Model coefficients: {model.named_steps['ridgecv'].coef_}")

Subset Selection

This section is about choosing which predictors (variables) to include in your linear model. The main idea is to find a “sparse” model (one with few variables) that performs well.

The Model and The Goal

  • Slide: “Forward selection in Linear Regression”
  • Formula: The standard linear regression model is \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\)
    • \(\mathbf{y}\) is the \(n \times 1\) vector of outcomes.
    • \(\mathbf{X}\) is the \(n \times (p+1)\) matrix of predictors (with a leading column of 1s for the intercept).
    • \(\boldsymbol{\beta}\) is the \((p+1) \times 1\) vector of coefficients (\(\beta_0, \beta_1, ..., \beta_p\)).
    • \(\boldsymbol{\epsilon}\) is the \(n \times 1\) vector of irreducible error.
  • Key Question: “If \(\boldsymbol{\beta}\) is sparse with at most \(s\) non-zero entries, can forward selection find those variables?”
    • Sparse means most coefficients are zero.
    • Forward Selection is a greedy algorithm:
      1. Start with no variables.
      2. Add the one variable that gives the best fit.
      3. Add the next best variable to the existing model.
      4. Repeat until you have a model with \(s\) variables.
    • The slide suggests the answer is yes, but only under certain conditions.

The Condition for Success

  • Slide: “Orthogonal Matching Pursuit”
  • Key Concept: Forward selection can provably find the correct variables if those variables are not strongly correlated.
  • Formula: This is formalized by the Mutual Coherence Condition: \[\mu = \max_{i \neq j} |\langle \mathbf{x}_i, \mathbf{x}_j \rangle| < \frac{1}{2s - 1}\]
    • What it means:
      • “Assuming the \(\mathbf{x}_i\)’s are normalized” means we have scaled each predictor to have length 1.
      • \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\) is the dot product, which is just their correlation since they are normalized.
      • \(\mu\) (mu) is the largest absolute correlation you can find between any two different predictors.
      • \(s\) is the true number of important variables.
    • In English: If the maximum correlation between any of your predictors is less than this threshold, the greedy forward selection algorithm is guaranteed to find the true, sparse set of variables.

How to Choose the Model Size (Practice)

The theory is nice, but in practice, you don’t know \(s\). How many variables should you pick?

  • Slide: “10-fold CV Errors”

  • This is the most important practical slide for this section.

  • What the plot shows:

    • X-axis: “Number of Variables” (from 1 to 10).
    • Y-axis: “CV Error” (the 10-fold cross-validated Mean Squared Error).
    • The Curve: The error drops very fast as we add the first 2-3 variables. Then, it flattens out. Adding more than 3 variables doesn’t really help much.
  • Slide: “The one standard deviation rule”

  • This rule helps you pick the “best” model from the CV plot.

    1. Find the model with the absolute minimum CV error (in the plot, this looks to be around 6 or 7 variables).
    2. Calculate the standard error of that minimum CV error.
    3. Draw a “tolerance” line at (minimum error) + (one standard error).
    4. Choose the simplest model (fewest variables) whose CV error is below this tolerance line.
    • The slide states this rule “gives the model with 3 variable” for this example. This is because the 3-variable model is much simpler than the 6-variable one, and its error is “good enough” (within one standard deviation of the minimum). This is an application of Occam’s razor.

Code: R vs. Python

The R code on the “10-fold CV Errors” slide generates that exact plot.

  • R Code Explained:

    • library(boot): Loads the cross-validation library.
    • CV10.err=rep(0,10): Creates an empty vector to store the 10 error scores.
    • for(p in 1:10): A loop that will test model sizes from 1 to 10.
    • x<-which(summary(regfit.fwd)$which[p,]): Gets the names of the \(p\) variables chosen by a pre-run forward selection (regfit.fwd).
    • glm.fit=glm(Balance~.,data=newCred): Fits a model using only those \(p\) variables.
    • cv.err=cv.glm(newCred,glm.fit,K=10): Performs 10-fold CV on that specific \(p\)-variable model.
    • CV10.err[p]<-cv.err$delta[1]: Stores the CV error.
    • plot(...): Plots the 10 errors against the 10 model sizes.
  • Python Equivalent (Conceptual):

    • In scikit-learn, this process is often automated. You wouldn’t write the CV loop yourself.
    • You would use sklearn.feature_selection.RFECV (Recursive Feature Elimination with Cross-Validation). This tool automatically wraps a model (like LinearRegression), performs cross-validation, and finds the optimal number of features, effectively producing the same plot and result.
# --- Python equivalent for 6.1 ---
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Assume X and y are your data

# 1. Create a pipeline
# (Note: It's good practice to scale, even for OLS, if you're comparing)
pipeline = make_pipeline(
    StandardScaler(),
    LinearRegression()
)

# 2. Create the RFECV (Recursive Feature Elimination w/ CV) object
# This is an *alternative* to forward selection, but serves the same purpose
# It will test models with 1, 2, 3... features using 10-fold CV
feature_selector = RFECV(
    estimator=pipeline,
    # A Pipeline has no coef_ of its own, so point RFECV at the fitted regression coefficients
    importance_getter='named_steps.linearregression.coef_',
    min_features_to_select=1,
    step=1,
    cv=10,
    scoring='neg_mean_squared_error' # We want to minimize error
)

# 3. Fit it
feature_selector.fit(X, y)

print(f"Optimal number of features found: {feature_selector.n_features_}")

# You could then plot feature_selector.cv_results_['mean_test_score']
# to replicate the R plot.
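
If you want a closer analogue of the slides' forward selection (RFECV eliminates features backwards), scikit-learn also provides SequentialFeatureSelector. A short sketch, again assuming X and y are already loaded:

# --- Forward selection, closer to the R workflow ---
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Greedily add one feature at a time, scoring each candidate set with 10-fold CV
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,       # e.g. the size suggested by the one-SE rule
    direction='forward',
    cv=10,
    scoring='neg_mean_squared_error'
)
sfs.fit(X, y)

print("Selected features (boolean mask):", sfs.get_support())

# 10-fold CV error of the resulting 3-variable model
X_selected = sfs.transform(X)
cv_mse = -cross_val_score(LinearRegression(), X_selected, y,
                          cv=10, scoring='neg_mean_squared_error').mean()
print(f"10-fold CV MSE of the selected model: {cv_mse:.2f}")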

Shrinkage Methods by Regularization

This is a different approach. Instead of removing variables, we keep all \(p\) variables but shrink their coefficients \(\beta_j\) towards 0.

Ridge Regression: The Core Idea

  • Slide: “Ridge regression”
  • Formula: Ridge regression minimizes a new objective function: \[\min_{\boldsymbol{\beta}} \left( \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right)\]
    • Term 1: \(\text{RSS}\) (Residual Sum of Squares). This is the original OLS “goodness of fit” term. We want this to be small.
    • Term 2: \(\lambda \sum \beta_j^2\). This is the \(\ell_2\) penalty or “shrinkage penalty”. It adds a “cost” for having large coefficients.
  • The \(\lambda\) (lambda) Parameter:
    • This is the tuning parameter that controls the trade-off between fit and simplicity.
    • \(\lambda = 0\): No penalty. The objective is just to minimize RSS. The solution \(\hat{\boldsymbol{\beta}}^R\) is identical to the OLS solution \(\hat{\boldsymbol{\beta}}\).
    • \(\lambda = \infty\): Infinite penalty. The only way to minimize the cost is to make all \(\beta_j = 0\) (for \(j \ge 1\)). The model becomes an intercept-only model.
    • Large \(\lambda\): Heavy penalty, more shrinkage.
    • Crucial Note: The intercept \(\beta_0\) is not penalized. This is because \(\beta_0\) just represents the mean of \(y\) when all \(x\)’s are 0; shrinking it makes no sense.

The Need for Standardization

  • Slide: “Standardize the inputs”
  • Problem: The penalty \(\lambda \sum \beta_j^2\) is applied to all coefficients. But what if \(x_1\) is “house size in sq-ft” (values 1000-5000) and \(x_2\) is “number of bedrooms” (values 1-5)?
    • The coefficient \(\beta_1\) for house size will naturally be tiny, while the coefficient \(\beta_2\) for bedrooms will be large, even if they are equally important.
    • Ridge regression would unfairly and heavily penalize \(\beta_2\) while barely touching \(\beta_1\).
  • Solution: You must standardize all predictors before fitting a Ridge model.
  • Formula: For each observation \(i\) of each predictor \(j\): \[\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{(1/n) \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}\]
    • This formula rescales every predictor to have a mean of 0 and a standard deviation of 1.
    • Now, all coefficients \(\beta_j\) are on a “level playing field” and can be penalized fairly.
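
This is exactly what scikit-learn's StandardScaler computes (it also uses the \(1/n\) version of the standard deviation in the denominator). A small sketch with made-up data to confirm the manual formula and the scaler agree:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two predictors on very different scales: "sq-ft" and "bedrooms"
X = rng.normal(loc=[2000.0, 3.0], scale=[800.0, 1.2], size=(100, 2))

# Manual standardization, matching the slide's formula (note the 1/n inside the square root)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

# scikit-learn's StandardScaler does the same thing
X_sklearn = StandardScaler().fit_transform(X)

print("Max absolute difference:", np.abs(X_manual - X_sklearn).max())  # ~0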

Answering the Discussion Questions

  • Slide: “DISCUSSION”
    • What is the solution of Ridge regression?
    • What is the bias and the variance?

1. What is the solution of Ridge regression?

The solution can be written in matrix form, which is very elegant.

  • Standard OLS Solution: The coefficients \(\hat{\boldsymbol{\beta}}^{\text{OLS}}\) that minimize RSS are found by: \[\hat{\boldsymbol{\beta}}^{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

  • Ridge Regression Solution: The coefficients \(\hat{\boldsymbol{\beta}}^{R}\) that minimize the Ridge objective are: \[\hat{\boldsymbol{\beta}}^{R} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

    • Explanation:
      • \(\mathbf{I}\) is the identity matrix (a matrix of 1s on the diagonal, 0s everywhere else).
      • By adding \(\lambda\mathbf{I}\), we are adding a positive value \(\lambda\) to the diagonal of the \(\mathbf{X}^T\mathbf{X}\) matrix.
      • This addition stabilizes the matrix. \(\mathbf{X}^T\mathbf{X}\) might not be invertible (if \(p > n\) or if predictors are perfectly collinear), but \((\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})\) is always invertible for \(\lambda > 0\).
      • This addition is what mathematically “shrinks” the coefficients toward zero.
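
The closed-form solution is short enough to compute directly in NumPy. A minimal sketch, assuming the columns of X are already standardized and y is centered so the intercept can be ignored:

import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge coefficients via (X^T X + lambda I)^{-1} X^T y (X standardized, y centered)."""
    p = X.shape[1]
    # Solve the linear system instead of forming an explicit inverse (more stable)
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Tiny demo with a known sparse-ish coefficient vector
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.standard_normal(50)

print("lambda = 0    :", np.round(ridge_closed_form(X, y, 0.0), 2))    # OLS solution
print("lambda = 10   :", np.round(ridge_closed_form(X, y, 10.0), 2))   # shrunk toward zero
print("lambda = 1000 :", np.round(ridge_closed_form(X, y, 1000.0), 2)) # nearly all zero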

2. What is the bias and the variance?

This is the most important concept in regularization. It’s the bias-variance trade-off.

  • Standard OLS (where \(\lambda=0\)):

    • Bias: Low. The OLS estimator is unbiased, meaning that if you took many samples and fit many OLS models, their average \(\hat{\boldsymbol{\beta}}\) would be the true \(\boldsymbol{\beta}\).
    • Variance: High. The OLS solution can be highly sensitive to the training data. If you change a few data points, the coefficients can swing wildly. This is especially true if \(p\) is large or predictors are correlated. This “sensitivity” is high variance, which leads to overfitting.
  • Ridge Regression (where \(\lambda > 0\)):

    • Bias: High(er). Ridge regression is a biased estimator. By adding the penalty, we are purposefully pulling the coefficients away from the OLS solution and towards zero. The average \(\hat{\boldsymbol{\beta}}^R\) from many samples will not equal the true \(\boldsymbol{\beta}\). We have introduced bias into our model.
    • Variance: Low(er). In exchange for this bias, we get a massive reduction in variance. The \(\lambda\mathbf{I}\) term stabilizes the solution. The coefficients won’t change wildly even if the training data changes. The model is more robust and less sensitive.

The Trade-off: The total expected test error of a model is: \(\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\)

By using Ridge regression, we increase the \(\text{Bias}^2\) term a little, but we decrease the \(\text{Variance}\) term a lot. The goal is to find a \(\lambda\) where the total error is minimized. Ridge regression reduces variance at the cost of increased bias.
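
For reference, the decomposition behind that statement, for a prediction \(\hat{f}(x_0)\) of \(y_0 = f(x_0) + \epsilon\) with noise variance \(\sigma^2\), is:

\[
\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Irreducible Error}}
\]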

Python Equivalent for 6.2

# --- Python equivalent for 6.2 ---
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Assume X and y are your data

# 1. Create a pipeline that AUTOMATICALLY
# - Standardizes the data
# - Fits a Ridge Regression model
# - Uses Cross-Validation to find the BEST lambda (alpha in scikit-learn)
alphas_to_test = [0.01, 0.1, 1.0, 10.0, 100.0]

# RidgeCV handles everything for us
pipeline = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=alphas_to_test, scoring='neg_mean_squared_error', cv=10)
)

# 2. Fit the pipeline
pipeline.fit(X, y)

# 3. Get the results
best_lambda = pipeline.named_steps['ridgecv'].alpha_
ridge_coefficients = pipeline.named_steps['ridgecv'].coef_
intercept = pipeline.named_steps['ridgecv'].intercept_

print(f"Best lambda (alpha) found by CV: {best_lambda}")
print(f"Model intercept (beta_0): {intercept}")
print(f"Model coefficients (beta_j): {ridge_coefficients}")

6. The “Why” of Ridge Regression

Core Concepts: The “Why” of Ridge Regression

Your slides explain that ridge regression is a “shrinkage method” designed to solve a major problem with standard Ordinary Least Squares (OLS) regression: high variance.

The Bias-Variance Tradeoff (Slide 3)

This is the most important theoretical concept. In prediction, the total error (Mean Squared Error, or MSE) of a model is composed of three parts: \(\text{Error} = \text{Variance} + \text{Bias}^2 + \text{Irreducible Error}\)

  • Ordinary Least Squares (OLS): Aims to be unbiased (low bias). However, when you have many predictors (\(p\)), especially if they are correlated, or if \(p\) is large compared to the number of samples \(n\) (\(p \approx n\) or \(p > n\)), the OLS model becomes highly unstable. A small change in the training data can cause the coefficients to change wildly. This is high variance. (See Slide 6, “Remarks”).
  • Ridge Regression: By adding a penalty, ridge intentionally introduces a small amount of bias (it pulls coefficients away from their “true” OLS values). In return, it achieves a massive reduction in variance.

As Slide 3 shows:

  • The green line (Variance) starts very high for low \(\lambda\) (left side) and drops quickly.
  • The black line (Squared Bias) starts at zero (for OLS at \(\lambda=0\)) and slowly increases as \(\lambda\) grows.
  • The purple line (Test MSE) is the sum of the two. It’s U-shaped. The goal of ridge is to find the \(\lambda\) (marked by the ‘x’) at the bottom of this “U,” which gives the lowest possible total error.

Why Is It Called “Ridge”? The 3D Spatial Meaning (Slide 5)

This slide explains the problem of collinearity and the origin of the name.

  • Left Plot (Least Squares): Imagine a model with two correlated predictors, \(\beta_1\) and \(\beta_2\). The y-axis (SS1) is the error (RSS). Because the predictors are correlated, there isn’t one single “point” that is the minimum. Instead, there’s a long, flat valley or trough (marked “unstable”). Many different combinations of \(\beta_1\) and \(\beta_2\) along this valley give a similarly low error. The OLS solution is unstable because it can pick any point in this flat-bottomed valley.
  • Right Plot (Ridge): The ridge objective function adds a penalty term: \(\lambda(\beta_1^2 + \beta_2^2)\). This penalty term, by itself, is a perfect circular bowl centered at (0,0). When you add this “bowl” to the OLS “valley,” it stabilizes the function. It pulls the minimum towards (0,0) and creates a single, stable, well-defined minimum.
  • The “Ridge” Name: The penalty \(\lambda\mathbf{I}\) (from the matrix formula) adds a “ridge” of values to the diagonal of the \(\mathbf{X}^T\mathbf{X}\) matrix, which geometrically turns the unstable flat valley into a stable bowl.
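
A tiny NumPy check makes the name concrete: with perfectly collinear columns, \(\mathbf{X}^T\mathbf{X}\) is singular, and adding the \(\lambda\mathbf{I}\) "ridge" to its diagonal restores invertibility (toy data, not the Credit example):

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(100)
X = np.column_stack([x1, 2 * x1])        # second column is perfectly collinear with the first

XtX = X.T @ X
lam = 1.0

print("rank(X^T X):", np.linalg.matrix_rank(XtX))                               # 1 -> singular, OLS breaks
print("rank(X^T X + lambda*I):", np.linalg.matrix_rank(XtX + lam * np.eye(2)))  # 2 -> invertible
print("condition number with ridge:", np.linalg.cond(XtX + lam * np.eye(2)))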

Mathematical Formulas

The key difference between OLS and Ridge is the function they try to minimize.

  1. OLS Objective Function: Minimize the Residual Sum of Squares (RSS). \[\text{RSS} = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2\]

  2. Ridge Objective Function (Slide 6): Minimize the RSS plus an L2 penalty term. \[\text{Minimize: } \left[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \right] + \lambda \sum_{j=1}^{p} \beta_j^2\]

    • \(\lambda\) is the tuning parameter controlling the penalty strength.
    • \(\sum_{j=1}^{p} \beta_j^2\) is the L2-norm (squared) of the coefficients. It penalizes large coefficients.
  3. L2 Norm (Slide 1): The L2 norm of a vector \(\mathbf{a}\) is its standard Euclidean length. The plot on Slide 1 uses this to show the total magnitude of the ridge coefficients. \[\|\mathbf{a}\|_2 = \sqrt{\sum_{j=1}^p a_j^2}\]

  4. Matrix Solution (Slide 6): This is the “closed-form” solution for the ridge coefficients \(\hat{\beta}^R\). \[\hat{\beta}^R = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

    • \(\mathbf{I}\) is the identity matrix.
    • The term \(\lambda\mathbf{I}\) is what stabilizes the \(\mathbf{X}^T\mathbf{X}\) matrix, making it invertible even if it’s singular (due to \(p > n\) or collinearity).

Walkthrough of the “Credit Data” Example (All Slides)

Here is the logical story of the R code, from start to finish.

Step 1: Data Preparation (Slide 8)

  • x=scale(model.matrix(Balance~., Credit)[,-1])
    • model.matrix(...) creates the predictor matrix x.
    • scale(...) is critically important. It standardizes all predictors to have a mean of 0 and a standard deviation of 1. This is necessary because the ridge penalty \(\lambda \sum \beta_j^2\) is unit-dependent. If Income (in 10,000s) and Cards (1-10) were unscaled, the penalty would unfairly crush the Income coefficient. Scaling puts all predictors on a level playing field.
  • y=Credit$Balance
    • This sets the y (target) variable.

Step 2: Fit the Ridge Model (Slide 8)

  • grid=10^seq(4,-2,length=100)
    • This creates a grid of 100 \(\lambda\) values to test, ranging from \(10^4\) (a huge penalty) down to \(10^{-2}\) (a tiny penalty).
  • ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
    • This is the main command. It fits a separate ridge model for every single \(\lambda\) in the grid.
    • alpha=0 is the specific command that tells glmnet to perform Ridge Regression. (Setting alpha=1 would be LASSO).
  • coef(ridge.mod)[,50]
    • This inspects the model. It pulls out the vector of coefficients for the 50th \(\lambda\) in the grid (which is \(\lambda=10.72\)).

Step 3: Visualize the Coefficient “Solution Path” (Slides 1, 4, 9)

These plots all show the same thing: how the coefficients change as \(\lambda\) changes.

  • Slide 9 Plot: This plots the standardized coefficients for 4 predictors (Income, Limit, Rating, Student) against the index (1 to 100). Index 1 (left) is the largest \(\lambda\), and index 100 (right) is the smallest \(\lambda\) (closest to OLS). You can see the coefficients “grow” from 0 as the penalty (\(\lambda\)) gets smaller.
  • Slide 1 (Left Plot): This is the same plot as Slide 9, but more professional. It plots the coefficients against \(\lambda\) on a log scale. You can clearly see all coefficients (gray lines) being “shrunk” toward zero as \(\lambda\) increases (moves right). The key predictors (Income, Rating, etc.) are highlighted.
  • Slide 1 (Right Plot): This is the exact same data again, but with a different x-axis: \(\|\hat{\beta}_\lambda^R\|_2 / \|\hat{\beta}\|_2\).
    • 1.0 on the right means \(\lambda=0\). The ratio of the ridge norm to the OLS norm is 1 (they are the same).
    • 0.0 on the left means \(\lambda=\infty\). The ridge coefficients are all 0, so their norm is 0.
    • This axis shows the “fraction” of the full OLS coefficient magnitude that the model is using.
  • Slide 4 Plot: This plots the total L2 norm of all coefficients (\(\|\hat{\beta}_\lambda^R\|_2\)) against the index. As the index goes from 1 to 100 (i.e., \(\lambda\) gets smaller), the total magnitude of the coefficients gets larger, which is exactly what we expect.

Step 4: Find the Best \(\lambda\) using Cross-Validation (Slides 4 & 7)

We have 100 models. Which one is best?

  • The “Manual” Way (Slide 4):

    • The code splits the data into a train and test set.
    • It fits a model only on the train set.
    • It tests two \(\lambda\) values:
      • s=4: Gives a test MSE of 10293.33.
      • s=10: Gives a test MSE of 168981.1 (much worse!).
    • This shows that \(\lambda=4\) is better than \(\lambda=10\), but we don’t know if it’s the best.
  • The “Automatic” Way (Slide 7):

    • cv.out=cv.glmnet(x[train,], y[train], alpha=0)
    • This runs 10-fold Cross-Validation on the training set. It automatically splits the training set into 10 “folds,” trains on 9, tests on 1, and repeats this 10 times for every \(\lambda\).
    • The Plot: The plot on this slide is the result. It shows the average MSE (y-axis) for each \(\log(\lambda)\) (x-axis). This is the real-data version of the theoretical purple curve from Slide 3.
    • bestlam=cv.out$lambda.min
    • This command finds the \(\lambda\) at the very bottom of the U-shaped curve. The output shows bestlam is 41.6.
    • ridge.pred=predict(ridge.mod, s=bestlam, newx=x[test,])
    • Now, we use this one best \(\lambda\) to make predictions on our held-out test set.
    • mean((ridge.pred-y.test)^2)
    • The final, reliable test MSE is 16129.68. This is our best estimate of how the model will perform on new, unseen data.

Python (scikit-learn) Equivalents

Here is how you would perform the entire R workflow from your slides in Python.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# --- 1. Load and Prepare Data (like Slide 8) ---
# Assuming 'Credit' is a pandas DataFrame
# X = Credit.drop('Balance', axis=1)
# y = Credit['Balance']
# ... (need to handle categorical variables first, e.g., with pd.get_dummies) ...
# For this example, let's assume X and y are already loaded and numeric.

# Standardize the predictors (CRITICAL)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- 2. Train/Test Split (like Slide 4) ---
# test_size=0.5 and random_state=1 mimic the R code
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.5, random_state=1
)

# --- 3. Find Best Lambda (alpha) with Cross-Validation (like Slide 7) ---
# Create the same log-spaced grid of lambdas (sklearn calls it 'alpha')
lambda_grid = np.logspace(4, -2, 100)

# RidgeCV performs cross-validation to find the best alpha
# Note: store_cv_values=True (needed to plot the CV error curve below) is only
# supported with RidgeCV's default efficient leave-one-out CV, so we drop cv=10 here
cv_model = RidgeCV(alphas=lambda_grid, store_cv_values=True)
cv_model.fit(X_train, y_train)

# Get the best lambda found
best_lambda = cv_model.alpha_
print(f"Best lambda (alpha) found by CV: {best_lambda}")

# Plot the CV error curve (like Slide 7 plot)
# cv_model.cv_values_ has shape (n_samples, n_alphas)
# We need to average over the samples for each alpha
mse_path = np.mean(cv_model.cv_values_, axis=0)
plt.figure()
plt.plot(np.log10(lambda_grid), mse_path, marker='o')  # alphas are kept in the order we passed them
plt.xlabel("Log(lambda)")
plt.ylabel("Mean Squared Error")
plt.title("Cross-Validation Error Path")
plt.show()

# --- 4. Evaluate on Test Set (like Slide 7) ---
# 'cv_model' is already refit on the full training set using the best_lambda
test_pred = cv_model.predict(X_test)
final_test_mse = mean_squared_error(y_test, test_pred)
print(f"Final Test MSE with best lambda: {final_test_mse}")

# --- 5. Get Final Coefficients (like Slide 7, bottom) ---
# The coefficients from the CV-trained model:
print(f"Intercept: {cv_model.intercept_}")
print("Coefficients:")
for coef, feature in zip(cv_model.coef_, X.columns):
print(f" {feature}: {coef}")

# --- 6. Plot the Solution Path (like Slide 1) ---
# To do this, we fit a Ridge model for each lambda and store the coefficients
coefs = []
for lam in lambda_grid:
    model = Ridge(alpha=lam)
    model.fit(X_scaled, y) # Fit on all data
    coefs.append(model.coef_)

# Plot
plt.figure()
plt.plot(np.log10(lambda_grid), coefs)
plt.xlabel("Log(lambda)")
plt.ylabel("Standardized Coefficients")
plt.title("Ridge Solution Path")
plt.show()

7. Shrinkage Methods (Regularization)

These slides cover Shrinkage Methods, also known as Regularization, which are techniques used to improve on the standard least squares model, particularly when dealing with many variables or multicollinearity. The main focus is on LASSO regression.

Key Mathematical Formulas

The slides present two main, but equivalent, ways to formulate these methods.

1. Penalized Formulation (Slide 1)

This is the most common formulation. The goal is to minimize a function that is a combination of the Residual Sum of Squares (RSS) and a penalty term. The penalty discourages large coefficients.

  • LASSO (Least Absolute Shrinkage and Selection Operator): The goal is to find coefficients (\(\beta_0, \beta_j\)) that minimize: \[\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} |\beta_j|\]
    • Penalty: The \(L_1\) norm (\(\|\beta\|_1\)), which is the sum of the absolute values of the coefficients.
    • Key Property: This penalty can force some coefficients to be exactly zero, effectively performing automatic variable selection.

2. Constrained Formulation (Slide 2)

This alternative formulation minimizes the RSS subject to a constraint (a “budget”) on the size of the coefficients.

  • For Lasso: Minimize RSS subject to: \[\sum_{j=1}^{p} |\beta_j| \le s\] (The sum of the absolute values of the coefficients must be less than some budget \(s\).)

  • For Ridge: Minimize RSS subject to: \[\sum_{j=1}^{p} \beta_j^2 \le s\] (The sum of the squares of the coefficients (\(L_2\) norm) must be less than \(s\).)

Equivalence (Slide 3): For any penalty value \(\lambda\) used in the first formulation, there is a corresponding budget \(s\) in the second formulation that will give the exact same set of coefficients. \(\lambda\) and \(s\) are inversely related: a large \(\lambda\) (high penalty) corresponds to a small \(s\) (small budget).
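
One standard way to see this equivalence (a sketch of the usual Lagrangian argument, not something shown on the slides) is:

\[
\min_{\boldsymbol{\beta}:\, \|\boldsymbol{\beta}\|_1 \le s} \text{RSS}(\boldsymbol{\beta})
\quad\Longleftrightarrow\quad
\min_{\boldsymbol{\beta}} \left[ \text{RSS}(\boldsymbol{\beta}) + \lambda\big(\|\boldsymbol{\beta}\|_1 - s\big) \right]
\quad \text{for the multiplier } \lambda \ge 0 \text{ that makes the constraint active.}
\]

Since the term \(\lambda s\) does not depend on \(\boldsymbol{\beta}\), dropping it leaves exactly the penalized objective \(\text{RSS}(\boldsymbol{\beta}) + \lambda\|\boldsymbol{\beta}\|_1\); a tighter budget \(s\) corresponds to a larger multiplier \(\lambda\), and vice versa.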

Important Plots and Interpretation

Your slides show the two most important plots for understanding and using LASSO.

1. The Cross-Validation (CV) Plot (Slide 5)

This plot is crucial for choosing the best tuning parameter (\(\lambda\)).

  • X-axis: \(\text{Log}(\lambda)\). This is the penalty strength.
    • Right side (high \(\lambda\)): High penalty, simple model (many coefficients are 0), high bias, high Mean-Squared Error (MSE).
    • Left side (low \(\lambda\)): Low penalty, complex model (like standard linear regression), high variance, MSE starts to increase (overfitting).
  • Y-axis: Mean-Squared Error (MSE) from cross-validation.
  • Goal: Find the \(\lambda\) at the bottom of the “U” shape, which gives the lowest MSE. This is the optimal trade-off between bias and variance. The top axis shows how many variables are included in the model at each \(\lambda\).

2. The Coefficient Path Plot (Slide 6)

This plot is the best visualization for understanding what LASSO does.

  • Left Plot (vs. \(\lambda\)):
    • X-axis: The penalty strength \(\lambda\).
    • Y-axis: The standardized value of each coefficient.
    • How to read it: Start from the right (high \(\lambda\)). All coefficients are 0. As you move left, \(\lambda\) decreases, and the penalty is relaxed. Variables “enter” the model one by one (their coefficients become non-zero). You can see that ‘Rating’, ‘Income’, and ‘Student’ are the most important variables, as they are the first to become non-zero.
  • Right Plot (vs. \(L_1\) Norm Ratio):
    • This shows the exact same information as the left plot, but the x-axis is reversed and rescaled. An axis value of 0.0 means full penalty (all \(\beta=0\)), and 1.0 means no penalty.

Code Understanding (R to Python)

The slides use the glmnet package in R. The equivalent and most popular library in Python is scikit-learn.

1. Finding the Best \(\lambda\) (CV)

The R code cv.out=cv.glmnet(x[train,],y[train],alpha=1) performs cross-validation to find the best \(\lambda\).

  • Python Equivalent: Use LassoCV. It does the same thing: tests many \(\lambda\) values (called alphas in scikit-learn) and picks the best one.
from sklearn.linear_model import LassoCV

# Create the LassoCV object
# cv=5 means 5-fold cross-validation
lasso_cv = LassoCV(cv=5, random_state=0)

# Fit the model to the training data
lasso_cv.fit(X_train, y_train)

# Get the best lambda (called alpha_ in sklearn)
best_lambda = lasso_cv.alpha_
print(f"Best lambda (alpha): {best_lambda}")

# Get the MSEs
# This is what's plotted in the CV plot
print(lasso_cv.mse_path_)

2. Fitting with the Best \(\lambda\) and Getting Coefficients

The R code lasso.coef=predict(out,type="coefficients",s=bestlam) gets the coefficients for the best \(\lambda\).

  • Python Equivalent: The LassoCV object is already refitted on the full training data using the best \(\lambda\). You can also fit a new Lasso model with that specific \(\lambda\).
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# --- Option 1: Use the already-fitted LassoCV object ---
print("Coefficients from LassoCV:")
print(lasso_cv.coef_)

# Make predictions on the test set
y_pred = lasso_cv.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {test_mse}")


# --- Option 2: Fit a new Lasso model with the best lambda ---
final_lasso = Lasso(alpha=best_lambda)
final_lasso.fit(X_train, y_train)

# Get coefficients (Slide 7 shows this)
# Note how some are 0!
print("\nCoefficients from new Lasso model:")
print(final_lasso.coef_)

The Core Problem: Two Equivalent Formulas

The slides show two ways of writing the same problem. Understanding this equivalence is key.

Formulation 1: The Penalized Method (Slides 1 & 4)

  • Formula: \[\min_{\beta} \left( \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \beta)^2 + \lambda \|\beta\|_1 \right)\]

    • \(\sum (y_i - \mathbf{x}_i^T \beta)^2\): This is the normal Residual Sum of Squares (RSS). We want to make this small (fit the data well).
    • \(\lambda \|\beta\|_1\): This is the \(L_1\) penalty.
      • \(\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|\) is the sum of the absolute values of the coefficients.
      • \(\lambda\) (lambda) is a tuning parameter. Think of it as a “penalty knob”.
  • How to think about \(\lambda\):

    • If \(\lambda = 0\): There is no penalty. This is just standard Ordinary Least Squares (OLS) regression. The model will likely overfit.
    • If \(\lambda\) is small: There’s a small penalty. Coefficients will shrink a little bit.
    • If \(\lambda\) is very large: The penalty is severe. The only way to make the penalty term small is to make the coefficients (\(\beta\)) themselves small. The model will eventually shrink all coefficients to exactly 0.

Formulation 2: The Constrained Method (Slides 2 & 3)

  • Formula: \[\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \beta)^2 \quad \text{subject to} \quad \|\beta\|_1 \le s\]

  • How to think about \(s\):

    • This says: “Find the best-fitting model (minimize RSS) but you have a limited ‘budget’ \(s\) for the total size of your coefficients.”
    • If \(s\) is very large: The budget is huge. This constraint does nothing. You get the standard OLS solution.
    • If \(s\) is small: The budget is tight. You must shrink your coefficients to stay under the budget \(s\). To get the best fit, the model will be forced to set unimportant coefficients to 0 and only “spend” its budget on the most important variables.

The Equivalence: These two forms are equivalent. For any \(\lambda\) you pick, there’s a corresponding budget \(s\) that gives the exact same solution.

  • High \(\lambda\) (strong penalty) \(\iff\) Small \(s\) (tight budget)
  • Low \(\lambda\) (weak penalty) \(\iff\) Large \(s\) (loose budget)

This equivalence is why you see plots with both \(\lambda\) and \(L_1\) Norm on the x-axis. They are just two different ways of looking at the same “penalty” spectrum.

Detailed Plot & Code Analysis

Let’s look at the plots and code, which answer the practical questions: (1) How do we pick the best \(\lambda\)? and (2) What does LASSO do to the coefficients?

Question 1: How to pick the best \(\lambda\)? (Slide 5)

This is the Cross-Validation (CV) Plot. Its one and only job is to help you find the optimal \(\lambda\).

  • R Code: cv.out=cv.glmnet(x[train,],y[train],alpha=1)
    • cv.glmnet: This R function automatically does K-fold cross-validation. alpha=1 explicitly tells it to use LASSO (alpha=0 would be Ridge).
    • It tries a whole range of \(\lambda\) values, calculates the Mean-Squared Error (MSE) for each, and stores the results in cv.out.
  • Plot Analysis:
    • X-axis: \(\text{Log}(\lambda)\). The penalty strength. Right = High Penalty (simple model), Left = Low Penalty (complex model).
    • Y-axis: Mean-Squared Error (MSE). Lower is better.
    • Red Dots: The average MSE for each \(\lambda\).
    • Gray Bars: The error bars (standard error).
    • The “U” Shape: This is the classic bias-variance trade-off.
      • Right Side (High \(\lambda\)): The model is too simple (too many coefficients are 0). It’s “underfitting.” The error is high (high bias).
      • Left Side (Low \(\lambda\)): The model is too complex (low penalty, like OLS). It’s “overfitting” the training data. The error on new data is high (high variance).
      • Bottom of the “U”: This is the “sweet spot.” The \(\lambda\) at the very bottom (marked by the left vertical dotted line) gives the lowest possible MSE. This is lambda.min.

Answer: You pick the \(\lambda\) that corresponds to the lowest point on this graph.

Question 2: What does LASSO do? (Slides 5, 6, 7)

These slides all show the effect of LASSO.

A. The Coefficient Path Plots (Slides 5 & 6)

These plots visualize how coefficients change. They show the same information just with different x-axes.

  • Left Plot (Slide 6) vs. \(\lambda\):
    • How to read: Read from RIGHT to LEFT.
    • At the far right (\(\lambda\) is large), all coefficients are 0.
    • As you move left, \(\lambda\) gets smaller, and the penalty is relaxed. Variables “enter” the model one by one as their coefficients become non-zero.
    • You can see ‘Rating’ (red-dashed), ‘Student’ (black-solid), and ‘Income’ (blue-dotted) are the first to enter, suggesting they are the most important predictors.
  • Right Plot (Slide 6) vs. \(L_1\) Norm Ratio:
    • This is the same plot, just flipped and rescaled. The x-axis is \(\|\hat{\beta}_\lambda\|_1 / \|\hat{\beta}_{OLS}\|_1\).
    • How to read: Read from LEFT to RIGHT.
    • At 0.0: This is a “0% budget” (like \(s=0\) or \(\lambda=\infty\)). All coefficients are 0.
    • At 1.0: This is a “100% budget” (like \(s=\infty\) or \(\lambda=0\)). This is the full OLS model.
    • This view clearly shows the coefficients “growing” from 0 as their “budget” (\(L_1\) Norm) increases.

B. The Code Output (Slide 7) - This is the most important “answer”

This slide explicitly demonstrates variable selection by comparing the coefficients from two different \(\lambda\) values.

  • First Block (The “Optimal” Model):

    • bestlam.cv <- cv.out$lambda.min: This gets the \(\lambda\) from the bottom of the “U” in the CV plot.
    • lasso.conf <- predict(out,type="coefficients",s=bestlam.cv)[1:12,]: This gets the coefficients using that best \(\lambda\).
    • lasso.conf[lasso.conf!=0]: This R command filters the list to show only the non-zero coefficients.
    • Result: The optimal model still keeps 10 variables (‘Income’, ‘Limit’, ‘Rating’, etc.). It has shrunk them, but it hasn’t set many to 0.
  • Second Block (The “High Penalty” Model):

    • The slide text says “if we choose a larger regularization parameter.” Here, they’ve picked an arbitrary larger value, s=10. (Note: R’s predict.glmnet can be confusing; s=10 here means \(\lambda=10\)).
    • lasso.conf <- predict(out,type="coefficients",s=10)[1:12,]: This gets the coefficients using a stronger penalty (\(\lambda=10\)).
    • lasso.conf[lasso.conf!=0]: Again, show only the non-zero coefficients.
    • Result: Look! The list is much shorter. The coefficients for ‘Age’, ‘Education’, ‘GenderFemale’, ‘MarriedYes’, and ‘Ethnicity’ are all gone (shrunk to 0.000000). The model has decided these are not important enough to “spend” budget on.

Conclusion: LASSO performs automatic variable selection. By increasing \(\lambda\), you create a sparser (simpler) model. Slide 7 is the concrete proof.

Python Equivalents (in more detail)

Here is how you would replicate the entire workflow from the slides in Python.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV, lasso_path
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# --- Assume X_train, y_train, X_test, y_test are loaded ---
# Example:
# data = pd.read_csv('Credit.csv')
# X = pd.get_dummies(data.drop(['ID', 'Balance'], axis=1), drop_first=True)
# y = data['Balance']
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# It's CRITICAL to scale data before regularization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
feature_names = X.columns


# 1. Replicate the CV Plot (Slide 5: ...000200.png)
# LassoCV does what cv.glmnet does: finds the best lambda (alpha)
print("Running LassoCV to find best lambda (alpha)...")
# 'alphas' is the list of lambdas to try. We can let it choose automatically.
# cv=10 means 10-fold cross-validation.
lasso_cv = LassoCV(cv=10, random_state=1, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)

# The best lambda found
best_lambda = lasso_cv.alpha_
print(f"Best lambda (alpha) found: {best_lambda}")

# --- Plotting the CV (MSE vs. Log(Lambda)) ---
# This recreates the R plot
plt.figure(figsize=(10, 6))
# lasso_cv.mse_path_ is a (n_alphas, n_folds) array of MSEs
# We take the mean across the folds (axis=1)
mean_mses = np.mean(lasso_cv.mse_path_, axis=1)
log_lambdas = np.log10(lasso_cv.alphas_)

plt.plot(log_lambdas, mean_mses, 'r.-')
plt.xlabel('Log(Lambda / Alpha)')
plt.ylabel('Mean-Squared Error')
plt.title('LASSO Cross-Validation Path (Replicating R Plot)')
# Plot a vertical line at the best lambda
plt.axvline(np.log10(best_lambda), linestyle='--', color='k', label=f'Best Lambda (alpha) = {best_lambda:.2f}')
plt.legend()
plt.gca().invert_xaxis() # High lambda is on the right in R plot
plt.show()


# 2. Replicate the Coefficient Path Plot (Slide 6: ...000206.png)
# We can use the lasso_path function, or just use the CV object

# Recompute the full coefficient path over the same alphas that LassoCV used
coefs = lasso_cv.path(X_train_scaled, y_train, alphas=lasso_cv.alphas_)[1].T

plt.figure(figsize=(10, 6))
for i in range(X_train_scaled.shape[1]):
    plt.plot(log_lambdas, coefs[:, i], label=feature_names[i])

plt.xlabel('Log(Lambda / Alpha)')
plt.ylabel('Standardized Coefficients')
plt.title('LASSO Coefficient Path (Replicating R Plot)')
plt.legend(loc='upper right')
plt.gca().invert_xaxis()
plt.show()


# 3. Replicate the Code Output (Slide 7: ...000202.png)
print("\n--- Replicating R Output ---")

# --- First Block: Coefficients with BEST lambda ---
print(f"Coefficients using best lambda (alpha = {best_lambda:.4f}):")
# The lasso_cv object is already fitted with the best lambda
best_coefs = lasso_cv.coef_
coef_series_best = pd.Series(best_coefs, index=feature_names)
# This is like R's `lasso.conf[lasso.conf != 0]`
print(coef_series_best[coef_series_best != 0])


# --- Second Block: Coefficients with a LARGER lambda ---
# Let's pick a larger lambda, e.g., 10 (like the slide)
large_lambda = 10
lasso_high_penalty = Lasso(alpha=large_lambda)
lasso_high_penalty.fit(X_train_scaled, y_train)

print(f"\nCoefficients using larger lambda (alpha = {large_lambda}):")
high_pen_coefs = lasso_high_penalty.coef_
coef_series_high = pd.Series(high_pen_coefs, index=feature_names)
# This is the second R command: `lasso.conf[lasso.conf != 0]`
print(coef_series_high[coef_series_high != 0])

# --- Final Prediction ---
# This is R's `mean((lasso.pred-y.test)^2)`
y_pred = lasso_cv.predict(X_test_scaled)
test_mse = mean_squared_error(y_test, y_pred)
print(f"\nTest MSE using best lambda: {test_mse:.2f}")

The “Game” of Regularization

First, let’s understand what these plots are showing. This is a “map” of a constrained optimization problem.

  • The Red Ellipses (RSS Contours): Think of these as contour lines on a topographic map.
    • The Center (\(\hat{\beta}\)): This point is the “bottom of the valley.” It represents the perfect, unconstrained solution—the standard Ordinary Least Squares (OLS) coefficients. This point has the lowest possible Residual Sum of Squares (RSS), or error.
    • The Lines: Every point on a single red ellipse has the exact same RSS. As the ellipses get bigger (moving away from the center \(\hat{\beta}\)), the error gets higher.
  • The Blue Shaded Area (Constraint Region): This is the “rule” of the game.
    • This is our “budget.” We are only allowed to pick a solution (\(\beta_1, \beta_2\)) from inside or on the boundary of this blue shape.
    • LASSO: The constraint is \(|\beta_1| + |\beta_2| \le s\). This equation forms a diamond (or a rotated square).
    • Ridge: The constraint is \(\beta_1^2 + \beta_2^2 \le s\). This equation forms a circle.
  • The Goal: Find the “best” point that is inside the blue area.
    • The “best” point is the one with the lowest possible error (RSS).
    • Geometrically, this means we start at the center (\(\hat{\beta}\)) and expand our ellipse outward. The very first point where the ellipse touches the blue constraint region is our solution.

Why LASSO Performs Variable Selection (The Diamond) 🎯

This is the most important concept. Look at the LASSO diagrams.

  • The Shape: The LASSO constraint is a diamond.
  • The Key Feature: This diamond has sharp corners (vertices). And most importantly, these corners lie exactly on the axes.
    • The top corner is at \((\beta_1=0, \beta_2=s)\).
    • The right corner is at \((\beta_1=s, \beta_2=0)\).
  • The “Collision”: Now, imagine the red ellipses (representing our error) expanding from the OLS solution (\(\hat{\beta}\)). They will almost always “hit” the blue diamond at one of its sharp corners.
    • Look at your textbook diagram (slide ...000304.png). The ellipse clearly makes contact with the diamond at the top corner, where \(\beta_1 = 0\).
    • Look at your example (slide ...000259.jpg). The center of the ellipses is at (4, 0.1). The closest point on the diamond that the expanding ellipses will hit is the corner at (2, 0). At this solution, \(y\) is exactly 0.

Conclusion: Because the \(L_1\) “diamond” has corners on the axes, the optimal solution is very likely to land on one of them. When it does, the coefficient for the other axis is set to exactly zero. This is the variable selection property.

Why Ridge Regression Only Shrinks (The Circle) 🤏

Now, look at the Ridge regression diagram.

  • The Shape: The Ridge constraint is a circle.
  • The Key Feature: A circle is perfectly smooth and has no corners.
  • The “Collision”: Imagine the same ellipses expanding and hitting the blue circle. The contact point will be a tangent point.
    • Because the circle is round, this tangent point can be anywhere on its circumference.
    • It is extremely unlikely that the contact point will be exactly on an axis (e.g., at \((\beta_1=0, \beta_2=s)\)). This would only happen if the OLS solution \(\hat{\beta}\) was already perfectly aligned with that axis.
  • Conclusion: The Ridge solution will find a point where both \(\beta_1\) and \(\beta_2\) are non-zero. The coefficients are “shrunk” (pulled in from \(\hat{\beta}\) towards the origin), but they never become zero. This is why Ridge is called a “shrinkage” method, but not a “variable selection” method.

Summary: Diamond vs. Circle

Feature               LASSO (\(L_1\) Norm)                     Ridge (\(L_2\) Norm)
Constraint Shape      Diamond (or hyper-rhombus)               Circle (or hypersphere)
Key Feature           Sharp corners on the axes                Smooth curve with no corners
Geometric Solution    Ellipses hit the corners                 Ellipses hit a smooth part
Result                Forces some coefficients to exactly 0    Shrinks all coefficients towards 0
Name                  Variable Selection                       Shrinkage

The “space meaning” is that the sharp corners of the \(L_1\) diamond are what make variable selection possible. The smooth circle of the \(L_2\) norm does not have these corners and thus cannot force coefficients to zero.
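
This geometry can be verified numerically. The sketch below minimizes a toy error surface with spherical contours centered at the example point (4, 0.1) under an \(L_1\) and an \(L_2\) constraint with the same budget \(s = 2\); the quadratic surface and the scipy solver are simplifying assumptions, not the slides' exact setup.

import numpy as np
from scipy.optimize import minimize

# Toy error surface: spherical RSS contours centered at the unconstrained (OLS-like) optimum
beta_hat = np.array([4.0, 0.1])                 # as in the slide's example
rss = lambda b: np.sum((b - beta_hat) ** 2)

s = 2.0                                         # the coefficient "budget"

# LASSO-style constraint: |b1| + |b2| <= s (the diamond)
l1_constraint = {"type": "ineq", "fun": lambda b: s - np.abs(b).sum()}
sol_l1 = minimize(rss, x0=[1.0, 0.5], constraints=[l1_constraint])

# Ridge-style constraint: b1^2 + b2^2 <= s^2 (the circle)
l2_constraint = {"type": "ineq", "fun": lambda b: s**2 - np.sum(b**2)}
sol_l2 = minimize(rss, x0=[1.0, 0.5], constraints=[l2_constraint])

print("L1-constrained solution:", np.round(sol_l1.x, 3))  # ~(2, 0): hits the corner, beta_2 is zeroed out
print("L2-constrained solution:", np.round(sol_l2.x, 3))  # ~(2.0, 0.05): both coefficients stay non-zero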

8. Shrinkage Methods (Lasso vs. Ridge)

Core Concept: Shrinkage Methods

Both Ridge (L2) and Lasso (L1) are regularization techniques used to improve upon standard Ordinary Least Squares (OLS) regression.

Their main goal is to manage the bias-variance tradeoff. OLS often has low bias but very high variance, especially when you have many predictors (\(p\)) or when predictors are correlated. Ridge and Lasso improve prediction accuracy by shrinking the regression coefficients towards zero. This adds a small amount of bias but significantly reduces the variance, leading to a lower overall Test Mean Squared Error (MSE).

The Key Difference: Math & How They Shrink

The slides show that the two methods use different penalties, which leads to very different mathematical forms and practical outcomes.

  • Ridge Regression (L2 Penalty): Minimizes \(RSS + \lambda \sum_{j=1}^{p} \beta_j^2\)
  • Lasso Regression (L1 Penalty): Minimizes \(RSS + \lambda \sum_{j=1}^{p} |\beta_j|\)

Slide 80 provides the exact formulas for their coefficient estimates in a simple, orthogonal case (where predictors are independent):

Ridge Regression (Proportional Shrinkage)

  • Formula: \(\hat{\beta}_j^R = \hat{\beta}_j^{LSE} / (1 + \lambda)\)
  • What this means: Ridge shrinks every least squares coefficient by a proportional amount. It will make coefficients smaller, but it will never set them to exactly zero (unless \(\lambda\) is \(\infty\)).

Lasso Regression (Soft-Thresholding)

  • Formula: \(\hat{\beta}_j^L = \text{sign}(\hat{\beta}_j^{LSE})(|\hat{\beta}_j^{LSE}| - \lambda/2)_+\)
  • What this means: This is a “soft-thresholding” operator.
    • If the original coefficient \(\hat{\beta}_j^{LSE}\) is small (its absolute value is less than \(\lambda/2\)), Lasso sets it to exactly zero.
    • If the coefficient is large, Lasso subtracts \(\lambda/2\) from its absolute value, shrinking it towards zero.
  • Key Property: Because of this, Lasso performs automatic feature selection by eliminating predictors.
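
A quick numeric check of both formulas, side by side (the coefficient values and \(\lambda\) below are made up for illustration): with \(\hat{\beta}^{LSE} = 3.0\) and \(\lambda = 2\), ridge gives \(3.0/(1+2) = 1.0\), while lasso gives \(\text{sign}(3.0)\,(3.0 - 2/2)_+ = 2.0\); a small coefficient like 0.8 is set exactly to zero by lasso.

import numpy as np

def ridge_shrink(beta_lse, lam):
    """Orthogonal-design ridge estimate: proportional shrinkage."""
    return beta_lse / (1 + lam)

def lasso_shrink(beta_lse, lam):
    """Orthogonal-design lasso estimate: soft-thresholding."""
    return np.sign(beta_lse) * np.maximum(0.0, np.abs(beta_lse) - lam / 2)

beta_lse = np.array([3.0, 0.8, -2.5, 0.1])   # hypothetical least squares coefficients
lam = 2.0

print("LSE   :", beta_lse)
print("Ridge :", ridge_shrink(beta_lse, lam))   # all shrunk, none exactly zero
print("Lasso :", lasso_shrink(beta_lse, lam))   # [2.0, 0.0, -1.5, 0.0]: small ones set to 0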

Important Images Explained

Most Important: Figure 6.10 (Slide 82)

This is the best visual for understanding the mathematical difference from the formulas above.

  • Left (Ridge): The red line shows the Ridge estimate vs. the OLS estimate. It’s a straight, diagonal line with a slope less than 1. It shrinks everything proportionally.
  • Right (Lasso): The red line shows the Lasso estimate. It’s “flat” at zero for a range, showing it sets small coefficients to zero. Then, it slopes up, but it’s shifted (it shrinks the large coefficients by a fixed amount).

Scenario 1: Figure 6.8 (Slide 76)

This plot shows what happens when all 45 predictors are truly related to the response.

  • Result (Slide 77): Ridge performs slightly better (has a lower minimum MSE, shown by the dotted purple line).
  • Why: Lasso’s assumption (that some coefficients are zero) is wrong in this case. By forcing some relevant predictors to zero, it adds too much bias. Ridge, by just shrinking all of them, finds a better balance.

Scenario 2: Figure 6.9 (Slide 78)

This plot shows the opposite scenario: only 2 out of 45 predictors are truly related (a “sparse” model).

  • Result: Lasso performs much better (its solid purple line has a much lower minimum MSE).
  • Why: Lasso’s assumption is correct. It successfully sets the 43 “noise” predictors to zero, which dramatically reduces variance, while correctly keeping the 2 important ones.

Python & Code Understanding

The slides don’t contain Python code, but they describe the exact concepts you would use, primarily in scikit-learn.

  • Implementing Ridge & Lasso:

    from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    # It's crucial to scale data before regularization
    # alpha is the same as the λ (lambda) in your slides

    # --- Ridge ---
    # The math for Ridge is a "closed-form solution" (Slide 80)
    # ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

    # --- Lasso ---
    # Lasso requires a numerical solver (like coordinate descent)
    # lasso_model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
  • The Soft-Thresholding Formula: The math from Slide 80, \(\text{sign}(y)(|y| - \lambda/2)_+\), is the core operation in the “coordinate descent” algorithm used to solve Lasso. You could write it in Python/Numpy:

    import numpy as np

    def soft_threshold(x, lambda_val):
    """Implements the Lasso soft-thresholding formula."""
    return np.sign(x) * np.maximum(0, np.abs(x) - (lambda_val / 2))

    # Example:
    # ols_coefficient = 1.5
    # threshold = 4.0
    # lasso_coefficient = soft_threshold(ols_coefficient, threshold)
    # print(lasso_coefficient) # Output: 0.0

    # ols_coefficient = 3.0
    # threshold = 4.0
    # lasso_coefficient = soft_threshold(ols_coefficient, threshold)
    # print(lasso_coefficient) # Output: 1.0 (it was 3.0, shrunk by 4/2 = 2)
  • Choosing \(\lambda\) (alpha): Slide 79 says to “Use cross validation to determine which one has better prediction.” In scikit-learn, this is done for you with RidgeCV and LassoCV, which automatically test a range of alpha values.

Summary: Lasso vs. Ridge

  • Penalty: Ridge uses the \(L_2\) norm penalty \(\lambda \sum \beta_j^2\); Lasso uses the \(L_1\) norm penalty \(\lambda \sum |\beta_j|\).
  • Coefficient shrinkage: Ridge shrinks all coefficients proportionally but never to exactly zero; Lasso soft-thresholds and can force coefficients to be exactly zero.
  • Feature selection: Ridge, no; Lasso, yes (this is its main advantage).
  • Interpretability: Ridge is less interpretable (keeps all \(p\) variables); Lasso is more interpretable (produces a “sparse” model with fewer variables).
  • Best used when: Ridge, when most predictors are useful (e.g., Slide 76: 45/45 relevant); Lasso, when many predictors are “noise” and only a few are strong (e.g., Slide 78: 2/45 relevant).
  • Computation: Ridge has a simple, closed-form solution; Lasso requires numerical optimization (e.g., coordinate descent).

9. Shrinkage Methods (Ridge & LASSO)

Summary of Shrinkage Methods (Ridge & LASSO)

These slides introduce shrinkage methods, also known as regularization, a technique used in regression (like linear regression) to improve model performance. The main idea is to add a penalty to the model’s loss function to “shrink” the size of the coefficients. This helps to reduce model variance and prevent overfitting, especially when you have many features.

The two main methods discussed are Ridge Regression (\(L_2\) penalty) and LASSO (\(L_1\) penalty).

Key Mathematical Formulas

  1. Standard Linear Model: The problem starts with the standard linear regression model (from slide 1):

    \[\mathbf{y} = \mathbf{X}\beta + \epsilon\]

    • \(\mathbf{y}\) is the \(n \times 1\) vector of observed outcomes.

    • \(\mathbf{X}\) is the \(n \times p\) matrix of \(p\) predictor features for \(n\) observations.
    • \(\beta\) is the \(p \times 1\) vector of coefficients (what we want to find).
    • \(\epsilon\) is the \(n \times 1\) vector of random errors.
    • The goal of standard “Ordinary Least Squares” (OLS) regression is to find the \(\beta\) that minimizes the loss: \(\|\mathbf{X}\beta - \mathbf{y}\|^2_2\).
  2. LASSO (L1 Regularization): LASSO (Least Absolute Shrinkage and Selection Operator) adds a penalty based on the absolute value of the coefficients (the \(L_1\)-norm). This is the key formula from slide 1:

    \[\hat{\beta}(\lambda) \leftarrow \arg \min_{\beta} \left( \|\mathbf{X}\beta - \mathbf{y}\|^2_2 + \lambda\|\beta\|_1 \right)\]

    • \(\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|\)

    • \(\lambda\) (lambda) is the tuning parameter that controls the strength of the penalty. A larger \(\lambda\) means more shrinkage.
    • Key Property (Variable Selection): The \(L_1\) penalty can force some coefficients (\(\beta_j\)) to become exactly zero. This means LASSO simultaneously performs feature selection by automatically removing irrelevant predictors.
    • Support (Slide 1): The question “Can it recover the support of \(\beta\)?” is asking if LASSO can correctly identify the set of true non-zero coefficients (defined as \(S := \{j : \beta_j \neq 0\}\)).
  3. Ridge Regression (L2 Regularization): Ridge regression (mentioned on slide 2, shown on slide 3) adds a penalty based on the squared value of the coefficients (the \(L_2\)-norm).

    \[\hat{\beta}(\lambda) \leftarrow \arg \min_{\beta} \left( \|\mathbf{X}\beta - \mathbf{y}\|^2_2 + \lambda\|\beta\|^2_2 \right)\]

    • \(\|\beta\|^2_2 = \sum_{j=1}^{p} \beta_j^2\)

    • Key Property (Shrinkage): The \(L_2\) penalty shrinks coefficients towards zero but never sets them to exactly zero (unless \(\lambda = \infty\)). It is effective at handling multicollinearity.

Important Images & Concepts

The most important images are the plots from slides 3 and 4. They illustrate the two most critical concepts: how to choose \(\lambda\) and what the penalty does to the coefficients.

Tuning Parameter Selection (Slides 3 & 4, Left Plots)

  • Problem: How do you find the best value for \(\lambda\)?
  • Solution: Cross-Validation (CV). The slides show 10-fold CV.
  • What the Plots Show: The left plots on slides 3 and 4 show the Cross-Validation Error (like MSE) for different values of the penalty.
    • The x-axis represents the penalty strength (either \(\lambda\) itself or a related measure like the shrinkage ratio \(\|\hat{\beta}_\lambda\|_1 / \|\hat{\beta}\|_1\)).
    • The y-axis is the prediction error.
    • The curve is typically U-shaped. The vertical dashed line marks the minimum of this curve. This minimum point corresponds to the optimal \(\lambda\), which provides the best balance between bias and variance, leading to the best-performing model on unseen data.

Coefficient Paths (Slides 3 & 4, Right Plots)

These “trace” plots are crucial for understanding the difference between Ridge and LASSO. They show how the value of each coefficient (y-axis) changes as the penalty strength (x-axis) changes.

  • Slide 3 (Ridge): As \(\lambda\) increases (moving right), all coefficient values are smoothly shrunk towards zero, but none of them actually hit zero.
  • Slide 4 (LASSO): As the penalty increases (moving from right to left, as the ratio \(s\) goes from 1.0 to 0.0), you can see coefficients “drop off” and become exactly zero one by one. The model with the optimal \(\lambda\) (vertical line) has selected only a few non-zero coefficients (the pink and teal lines), while all the grey lines have been set to zero. This is feature selection in action.

Key Discussion Points (Slide 2)

  • Non-linear models: You can apply these methods to non-linear models by first creating non-linear features (e.g., \(x_1^2\), \(x_2^2\), \(x_1 \cdot x_2\)) and then feeding them into a LASSO or Ridge model. The regularization will then select which of these linear or non-linear terms are important (a minimal sketch follows this list).
  • Correlated Features (Multicollinearity): The question “If \(x_j \approx x_k\), how does LASSO behave?” is a key weakness of LASSO.
    • LASSO: Tends to arbitrarily select one of the correlated features and set the others to zero. This can make the model unstable.
    • Ridge: Tends to shrink the coefficients of correlated features together, giving them similar (but smaller) values.
    • Elastic Net (not shown) is a hybrid of Ridge and LASSO that is often used to get the best of both worlds: it can select groups of correlated variables.
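
As mentioned in the non-linear models point above, the expansion-plus-LASSO idea is straightforward to sketch with scikit-learn's PolynomialFeatures; here X (a DataFrame of numeric predictors) and y are assumed to exist, and the step names come from make_pipeline's defaults.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline

# Expand the original predictors into squares and pairwise interactions
poly = PolynomialFeatures(degree=2, include_bias=False)

model = make_pipeline(poly, StandardScaler(), LassoCV(cv=10, max_iter=10000))
model.fit(X, y)

# Which of the expanded (linear + non-linear) terms survived the L1 penalty?
names = model.named_steps['polynomialfeatures'].get_feature_names_out(X.columns)
coefs = model.named_steps['lassocv'].coef_
selected = [(name, coef) for name, coef in zip(names, coefs) if coef != 0]
print(selected)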

Python Code Understanding (using scikit-learn)

Here is how you would implement these concepts in Python.

# Import necessary libraries
import numpy as np
from sklearn.linear_model import Lasso, Ridge, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# --- Assume you have your data ---
# X: your feature matrix (e.g., shape 100, 20)
# y: your target vector (e.g., shape 100,)
# X, y = ... load your data ...

# 1. It's crucial to scale your data before regularization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Find the optimal lambda (alpha) using Cross-Validation
# scikit-learn uses 'alpha' instead of 'lambda' for the tuning parameter.

# --- For LASSO ---
# LassoCV automatically performs cross-validation (e.g., cv=10)
# to find the best alpha.
lasso_cv_model = LassoCV(cv=10, random_state=0)
lasso_cv_model.fit(X_scaled, y)

# Get the best alpha (lambda)
best_alpha_lasso = lasso_cv_model.alpha_
print(f"Optimal alpha (lambda) for LASSO: {best_alpha_lasso}")

# Get the final coefficients
lasso_coeffs = lasso_cv_model.coef_
print(f"LASSO coefficients: {lasso_coeffs}")
# You will see that many of these are exactly 0.0

# --- For Ridge ---
# RidgeCV works similarly. It's often good to test alphas on a log scale.
ridge_alphas = np.logspace(-3, 3, 100) # 100 values from 0.001 to 1000
ridge_cv_model = RidgeCV(alphas=ridge_alphas, store_cv_values=True)
ridge_cv_model.fit(X_scaled, y)

# Get the best alpha (lambda)
best_alpha_ridge = ridge_cv_model.alpha_
print(f"Optimal alpha (lambda) for Ridge: {best_alpha_ridge}")

# Get the final coefficients
ridge_coeffs = ridge_cv_model.coef_
print(f"Ridge coefficients: {ridge_coeffs}")
# You will see these are small, but not exactly zero.

Bias-variance tradeoff

Key Mathematical Formulas & Concepts

LASSO: Sign Consistency

This is the “ideal” scenario for LASSO. Sign consistency means that, with enough data, the LASSO model not only selects the correct set of features (it recovers the “support” \(S\)) but also correctly identifies the sign (positive or negative) of their coefficients.

  • The Goal (Slide 1):

    \[\text{sign}(\hat{\beta}(\lambda)) = \text{sign}(\beta)\]

    This means the signs of our estimated coefficients \(\hat{\beta}(\lambda)\) match the signs of the true underlying coefficients \(\beta\).

  • The “Irrepresentable Condition” (Slide 1): This is the mathematical guarantee required for LASSO to achieve sign consistency.

    \[\left\| \mathbf{X}_{S^c}^\top \mathbf{X}_S (\mathbf{X}_S^\top \mathbf{X}_S)^{-1} \text{sign}(\beta_S) \right\|_\infty < 1\]

    • Plain English: This formula is a complex way of saying: the irrelevant features (\(\mathbf{X}_{S^c}\)) cannot be too strongly correlated with the true, relevant features (\(\mathbf{X}_S\)).

    • If an irrelevant feature is very similar (highly correlated) to a true feature, LASSO can get “confused” and might pick the wrong one, or its estimate will be unstable. This condition fails.

Ridge Regression: The Bias-Variance Tradeoff

  • The Formula (Slide 3):

    \[\hat{\beta}_{\text{ridge}}(\lambda) \leftarrow \arg \min_{\beta} \left( \|\mathbf{y} - \mathbf{X}\beta\|^2 + \lambda\|\beta\|^2 \right)\]

    (Note: This is the \(L_2\) penalty, so \(\|\beta\|^2 = \sum \beta_j^2\).)

  • The Problem it Solves: Collinearity (Slide 2) When features are strongly correlated (e.g., \(x_i \approx x_j\)), regular methods fail:

    • LSE (OLS): Fails because the matrix \(\mathbf{X}^\top \mathbf{X}\) is “non-invertible” (or singular), so the math for the solution \(\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\) breaks down.
    • LASSO: Fails because the Irrepresentable Condition is violated. LASSO will tend to arbitrarily pick one of the correlated features and set the others to zero.
  • The Ridge Solution (Slide 3):

    1. Always has a solution: Adding the \(\lambda\) penalty makes the matrix math work, even if \(\mathbf{X}^\top \mathbf{X}\) is non-invertible.
    2. Groups variables: This is the key takeaway. Instead of arbitrarily picking one feature, Ridge tends to shrink the coefficients of collinear variables together.
    3. Bias-Variance Tradeoff: Ridge introduces bias into the estimates (they are “wrong” on purpose) to massively reduce variance (they are more stable and less sensitive to the specific training data). This trade-off usually leads to a much lower overall error (Mean Squared Error).

Important Images & Key Takeaways

  1. Slide 2 (Collinearity Failures): This is the most important “problem” slide. It clearly explains why you can’t always use standard LSE or LASSO. The fact that all three methods (LSE, LASSO, Forward Selection) fail with strong collinearity motivates the need for Ridge.

  2. Slide 3 (Ridge Properties): This is the most important “solution” slide. The two most critical points are:

    • Always unique solution for λ > 0
    • Collinear variables tend to be grouped! (This is the “fix” for the problem on Slide 2).

Python Code Understanding

Let’s demonstrate the key difference (Slide 3) in how LASSO and Ridge handle collinear features.

We will create two features, x1 and x2, that are nearly identical.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# 1. Create a dataset with 2 strongly correlated features
np.random.seed(0)
n_samples = 100
# x1: a standard feature
x1 = np.random.randn(n_samples)
# x2: almost identical to x1
x2 = x1 + 0.01 * np.random.randn(n_samples)

# Combine into our feature matrix X
X = np.c_[x1, x2]

# y: The target variable (let's say y = 2*x1 + 2*x2)
y = 2 * x1 + 2 * x2 + np.random.randn(n_samples)

# 2. Fit LASSO (alpha is the same as lambda)
# We use a moderate alpha
lasso_model = Lasso(alpha=1.0)
lasso_model.fit(X, y)

# 3. Fit Ridge (alpha is the same as lambda)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X, y)

# 4. Compare the coefficients
print("--- Results for Correlated Features ---")
print(f"True Coefficients: [2.0, 2.0]")
print(f"LASSO Coefficients: {np.round(lasso_model.coef_, 2)}")
print(f"Ridge Coefficients: {np.round(ridge_model.coef_, 2)}")

Example Output:

--- Results for Correlated Features ---
True Coefficients: [2.0, 2.0]
LASSO Coefficients: [3.89 0. ]
Ridge Coefficients: [1.95 1.94]

Code Explanation:

  • LASSO: As predicted by the slides, LASSO failed to find the true model. It arbitrarily picked x1, gave it a large coefficient, and set x2 to zero. This is unstable and not what we wanted.
  • Ridge: As predicted by Slide 3, Ridge handled the collinearity perfectly. It identified that both x1 and x2 were important and “grouped” them by assigning them nearly identical, stable coefficients (1.95 and 1.94), which are very close to the true values of 2.0.

10. Elastic Net

Overall Summary

These slides introduce Elastic Net, a modern regression method that solves the major weaknesses of its two predecessors, Ridge and LASSO regression.

  • Ridge is good for collinearity (correlated features) but can’t do variable selection (it can’t set any feature’s coefficient to exactly zero).
  • LASSO is good for variable selection (it creates sparse models by setting coefficients to zero) but behaves unstably when features are correlated (it tends to randomly pick one and discard the others).

Elastic Net combines the L1 penalty of LASSO and the L2 penalty of Ridge. The result is a single, flexible model that:

  1. Performs variable selection (like LASSO).
  2. Handles correlated features stably by grouping them together (like Ridge).
  3. Can select more features than samples (\(p > n\)), which LASSO cannot do.

Slide 1: The Definition and Formula (File: ...020245.png)

This slide explains why Elastic Net was created and defines it mathematically.

  • The Problem: It states the exact trade-off:
    • “Ridge regression can handle collinearity, but cannot perform variable selection;”
    • “LASSO can perform variable selection, but performs poorly when collinearity;”
  • The Solution (The Formula): The core of the method is this optimization formula: \[\hat{\beta}_{eNet}(\lambda, \alpha) \leftarrow \arg \min_{\beta} \left( \underbrace{\|\mathbf{y} - \mathbf{X}\beta\|^2}_{\text{Loss}} + \lambda \left( \underbrace{\alpha\|\beta\|_1}_{\text{L1 Penalty}} + \underbrace{\frac{1-\alpha}{2}\|\beta\|_2^2}_{\text{L2 Penalty}} \right) \right)\]
  • Breaking Down the Formula:
    • \(\|\mathbf{y} - \mathbf{X}\beta\|^2\): This is the standard “Residual Sum of Squares” (RSS). We want to find coefficients (\(\beta\)) that make the model’s predictions (\(X\beta\)) as close as possible to the true values (\(y\)).
    • \(\lambda\) (Lambda): This is the master knob for total regularization strength. A larger \(\lambda\) means a bigger penalty, which “shrinks” all coefficients more.
    • \(\alpha\) (Alpha): This is the mixing parameter that balances L1 and L2. This is the key innovation.
      • \(\alpha\|\beta\|_1\): This is the L1 (LASSO) part. It forces weak coefficients to become exactly zero, thus selecting variables.
      • \(\frac{1-\alpha}{2}\|\beta\|_2^2\): This is the L2 (Ridge) part. It shrinks all coefficients and, crucially, encourages correlated features to have similar coefficients (the grouping effect).
  • The Special Cases:
    • If \(\alpha = 0\), the L1 term vanishes, and the model becomes pure Ridge Regression.
    • If \(\alpha = 1\), the L2 term vanishes, and the model becomes pure LASSO Regression.
    • If \(0 < \alpha < 1\), you get Elastic Net, which “encourages grouping of correlated variables” and “can perform variable selection.”

Slide 2: The Intuition and The Grouping Effect (File: ...020249.jpg)

This slide gives you the visual intuition and the practical proof of why Elastic Net works. It has two parts.

Part 1: The Three Graphs (Geometric Intuition)

These graphs show the constraint region (the shaded shape) for each penalty. The model tries to find the best coefficients (\(\theta_{opt}\)), and the final solution (the green dot) is the first point where the cost function (the blue ellipses) “touches” the constraint region.

  • L1 Norm (LASSO): The region is a diamond. Because of its sharp corners, the ellipses are very likely to hit a corner first. At a corner, one of the coefficients (e.g., \(\theta_1\)) is zero. This is a visual explanation of how LASSO creates sparsity (variable selection).
  • L2 Norm (Ridge): The region is a circle. It has no corners. The ellipses will hit a “smooth” point on the circle, shrinking both coefficients (\(\theta_1\) and \(\theta_2\)) but not setting either to zero. This is weight sharing.
  • L1 + L2 (Elastic Net): The region is a “rounded square”. It’s the perfect compromise.
    • It has “corners” (like LASSO) so it can still set coefficients to zero.
    • It has “curved edges” (like Ridge) so it’s more stable and handles correlated variables by finding a solution on an edge rather than a single sharp corner.

Part 2: The Formula (The Grouping Effect)

The text at the bottom explains Elastic Net’s “grouping effect.”

  • The Implication: “If \(x_j \approx x_k\), then \(\hat{\beta}_j \approx \hat{\beta}_k\).”
  • Meaning: If two features (\(x_j\) and \(x_k\)) are highly correlated (their values are very similar), Elastic Net will force their coefficients (\(\hat{\beta}_j\) and \(\hat{\beta}_k\)) to also be very similar.
  • Why this is good: This is the opposite of LASSO. LASSO would be unstable and might arbitrarily set \(\hat{\beta}_j\) to a large value and \(\hat{\beta}_k\) to zero. Elastic Net “groups” them: it will either keep both in the model with similar importance, or it will shrink both of them out of the model together. This is a much more stable and realistic result.
  • The Warning: “LASSO may be unstable in this case!” This directly highlights the problem that Elastic Net solves. (A short numerical demo of the grouping effect follows this list.)
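
As a quick illustration of the grouping effect, here is a minimal sketch re-using the collinear x1/x2 setup from the earlier LASSO-vs-Ridge demo (the exact coefficient values will vary with the random seed and penalty strength):

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Two nearly identical features, as in the earlier collinearity demo
np.random.seed(0)
n_samples = 100
x1 = np.random.randn(n_samples)
x2 = x1 + 0.01 * np.random.randn(n_samples)
X = np.c_[x1, x2]
y = 2 * x1 + 2 * x2 + np.random.randn(n_samples)

lasso = Lasso(alpha=1.0).fit(X, y)
# l1_ratio=0.5 mixes the L1 and L2 penalties (the slide's alpha = 0.5)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

print(f"LASSO coefficients:       {np.round(lasso.coef_, 2)}")  # tends to keep only one feature
print(f"Elastic Net coefficients: {np.round(enet.coef_, 2)}")   # tends to keep both, with similar values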

Slide 3: The Feature Comparison Table (File: ...020255.png)

This table is your “cheat sheet” for choosing the right model. It compares Ridge, LASSO, and Elastic Net on all their key properties.

  • Penalty: Shows the L2, L1, and combined penalties.
  • Sparsity: Can the model set coefficients to 0?
    • Ridge: No ❌
    • LASSO: Yes ✅
    • Elastic Net: Yes ✅
  • Variable Selection: This is a crucial row.
    • LASSO: Yes ✅, BUT it has a major limitation: if you have more features than samples (\(p > n\)), LASSO can select at most \(n\) features.
    • Elastic Net: Yes ✅, and it can select more than \(n\) variables. This makes it the clear choice for “wide” data problems (e.g., in genomics, where \(p=20,000\) features and \(n=100\) samples).
  • Grouping Effect: How does it handle correlated features?
    • Ridge: Strong ✅
    • LASSO: Weak ❌ (it “picks one”)
    • Elastic Net: Strong ✅
  • Solution Uniqueness: Is the answer stable?
    • Ridge: Always ✅
    • LASSO: No ❌ (not if \(X\) is “rank-deficient,” e.g., \(p > n\) or correlated features)
    • Elastic Net: Always ✅ (as long as \(\alpha < 1\), the Ridge component guarantees a unique, stable solution).
  • Use Case: When should you use each?
    • Ridge: For prediction, especially with multicollinearity.
    • LASSO: For interpretability and creating sparse models (when you think only a few features matter).
    • Elastic Net: The best all-arounder. Use it for correlated predictors, when \(p \gg n\), or when you need both sparsity + stability.

Code Understanding (Python scikit-learn)

When you use this in Python, be aware of a common confusion in the parameter names:

Concept (from your slides) | scikit-learn Parameter | Description
---------------------------|------------------------|-----------------------------------------
\(\lambda\) (Lambda)       | alpha                  | The overall strength of regularization.
\(\alpha\) (Alpha)         | l1_ratio               | The mixing parameter between L1 and L2.

Example: An l1_ratio of 0 is Ridge. An l1_ratio of 1 is LASSO. An l1_ratio of 0.5 is a 50/50 mix.

from sklearn.linear_model import ElasticNet, ElasticNetCV

# 1. Initialize a specific model
# This uses l1_ratio=0.5 (the slide's alpha) and alpha=0.1 (the slide's lambda)
model = ElasticNet(alpha=0.1, l1_ratio=0.5)

# 2. A much better way: Find the best parameters automatically
# This will test l1_ratios of 0.1, 0.5, and 0.9
# and automatically find the best 'alpha' (strength) for each.
cv_model = ElasticNetCV(
    l1_ratio=[.1, .5, .9],
    cv=5  # 5-fold cross-validation
)

# 3. Fit the model to your data (X_train, y_train)
# cv_model.fit(X_train, y_train)

# 4. See the best parameters it found
# print(f"Best l1_ratio (slide's alpha): {cv_model.l1_ratio_}")
# print(f"Best alpha (slide's lambda): {cv_model.alpha_}")

11. High-Dimensional Data Analysis

The Core Problem: Large \(p\), Small \(n\)

The slides introduce the challenge of high-dimensional data, which is defined by having many more features (predictors) \(p\) than observations (samples) \(n\). This is often written as \(p \gg n\).

  • Example: Predicting blood pressure (the response \(y\)) using millions of genetic markers (SNPs) as features \(X\), but only having data from a few hundred patients.
  • Troubles:
    • Overfitting: Models become “too flexible” and learn the noise in the training data, rather than the true underlying pattern.
    • Non-Unique Solution: When \(p > n\), the standard least squares linear regression model doesn’t even have a unique solution.
    • Misleading Metrics: This leads to a common symptom: a very small training error (or high \(R^2\)) but a very large test error.

Most Important Image: The Overfitting Trap (Figure 6.23)

Figure 6.23 (from the first uploaded image) is the most critical visual for understanding the problem. It shows what happens when you add features (variables) that are completely unrelated to the outcome.

  • Left Plot (R²): The \(R^2\) on the training data increases towards 1. This looks like a perfect fit.
  • Center Plot (Training MSE): The Mean Squared Error on the training data decreases to 0. This also looks perfect.
  • Right Plot (Test MSE): The Mean Squared Error on the test data (new, unseen data) explodes. This reveals the model is garbage and has just memorized the training set.

⚠️ This is the key takeaway: In high dimensions, \(R^2\) and training MSE are useless and misleading metrics for model quality.
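
A small simulation in the spirit of Figure 6.23 makes the point (the sample sizes and coefficients here are invented for illustration): as irrelevant noise features are added, training MSE shrinks toward zero while test MSE grows.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_train, n_test = 20, 1000

# One real feature; y depends only on it
x_train = rng.normal(size=(n_train, 1))
x_test = rng.normal(size=(n_test, 1))
y_train = 2 * x_train[:, 0] + rng.normal(size=n_train)
y_test = 2 * x_test[:, 0] + rng.normal(size=n_test)

for p_noise in [0, 5, 10, 15]:
    # Append p_noise pure-noise features that are unrelated to y
    X_train = np.hstack([x_train, rng.normal(size=(n_train, p_noise))])
    X_test = np.hstack([x_test, rng.normal(size=(n_test, p_noise))])
    model = LinearRegression().fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"p = {1 + p_noise:2d}: train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")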

The Solution: Regularization & Model Selection

To combat overfitting, we must use less flexible models. The main strategy is regularization (also called shrinkage), which involves adding a penalty term to the cost function to “shrink” the model coefficients (\(\beta\)).

Mathematical Formulas & Python Code 🐍

The standard Least Squares cost function you try to minimize is: \[\text{RSS} = \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 \quad \text{or} \quad \|y - X\beta\|^2_2\] This fails when \(p > n\). The solutions modify this:

A. Ridge Regression (\(L_2\) Penalty)

  • Concept: Shrinks all coefficients towards zero, but never to zero. It’s good when many features are related to the outcome.
  • Math Formula: \[\text{Minimize: } \left( \|y - X\beta\|^2_2 + \lambda \sum_{j=1}^p \beta_j^2 \right)\]
    • The \(\lambda \sum_{j=1}^p \beta_j^2\) is the \(L_2\) penalty.
    • \(\lambda\) (lambda) is a tuning parameter that controls the penalty strength. A larger \(\lambda\) means more shrinkage.
  • Python (Scikit-learn):
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # alpha is the lambda (λ) tuning parameter
    # We find the best alpha using cross-validation
    ridge_model = Ridge(alpha=1.0)

    # Fit the model
    ridge_model.fit(X_train, y_train)

    # Evaluate using test error (e.g., MSE on test set)
    # NOT with training R-squared
    test_score = ridge_model.score(X_test, y_test)

B. The Lasso (\(L_1\) Penalty)

  • Concept: This is a very important method. The \(L_1\) penalty can force coefficients to be exactly zero. This means Lasso performs automatic feature selection, creating a sparse model.
  • Math Formula: \[\text{Minimize: } \left( \|y - X\beta\|^2_2 + \lambda \sum_{j=1}^p |\beta_j| \right)\]
    • The \(\lambda \sum_{j=1}^p |\beta_j|\) is the \(L_1\) penalty.
    • Again, \(\lambda\) is the tuning parameter.
  • Python (Scikit-learn):
    from sklearn.linear_model import Lasso

    # alpha is the lambda (λ) tuning parameter
    lasso_model = Lasso(alpha=0.1)

    # Fit the model
    lasso_model.fit(X_train, y_train)

    # The model automatically selects features
    # Coefficients that are zero were 'dropped'
    print(lasso_model.coef_)

C. Other Methods

The slides also mention:

  • Forward Stepwise Selection: A different approach where you start with no features and add them one by one, picking the one that improves the model most (based on a criterion like cross-validation error).
  • Principal Components Regression (PCR): A dimensionality reduction technique.

The Curse of Dimensionality (Figure 6.24)

This example (Figures 6.24 and its description) shows a more subtle problem.

  • Setup: A model with \(n=100\) observations and 20 true features.
  • Plots: They test Lasso by adding more and more irrelevant features:
    • \(p=20\) (Left): Lasso performs well. The lowest test MSE is found with minimal regularization.
    • \(p=50\) (Center): Lasso still works well, but it needs more regularization (a smaller “Degrees of Freedom”) to filter out the 30 junk features.
    • \(p=2000\) (Right): This is the curse of dimensionality. Even with a good method like Lasso, the 1,980 irrelevant features add so much noise that the model performs poorly regardless of the tuning parameter. The true signal is “lost in the noise.”

Summary: Cautions for \(p > n\)

The final slide gives the most important rules to follow:

  1. Beware Extreme Multicollinearity: When \(p > n\), your features are mathematically guaranteed to be linearly related, which breaks standard regression.
  2. Don’t Overstate Results: A model you find (e.g., with Lasso) is just one of many potentially good models.
  3. 🚫 DO NOT USE training \(R^2\), \(p\)-values, or training MSE to justify your model. As Figure 6.23 showed, they are misleading.
  4. ✅ DO USE test error and cross-validation error to choose your model and assess its performance.

The Core Problem: \(p \gg n\) (The “Troubles” Slide)

This slide (filename: ...020259.png) sets up the entire problem. The issue isn’t just “overfitting”; it’s a fundamental mathematical breakdown of standard methods.

  • “Large \(p\) makes our linear regression model too flexible”: This is an understatement. It leads to a problem called an underdetermined system.
  • “If \(p > n\), the LSE is not even uniquely determined”: This is the most important technical point.
    • Mathematical Reason: The standard solution for Ordinary Least Squares (OLS) is \(\hat{\beta} = (X^T X)^{-1} X^T y\).
    • \(X\) is the data matrix with \(n\) rows (observations) and \(p\) columns (features).
    • The matrix \(X^T X\) has dimensions \(p \times p\).
    • When \(p > n\), the \(X^T X\) matrix is singular, which means its determinant is zero and it cannot be inverted. The \((X^T X)^{-1}\) term does not exist.
    • “Extreme multicollinearity” (from slide ...020744.png) is the direct cause. When \(p > n\), the columns of \(X\) (the features) are guaranteed to be linearly dependent, so there are infinitely many coefficient combinations that explain the training data equally well. (A two-line numerical check follows this list.)
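
A quick numpy check (toy sizes chosen for illustration) shows that \(X^\top X\) is rank-deficient, hence non-invertible, whenever \(p > n\):

import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                      # more features than observations
X = rng.normal(size=(n, p))

XtX = X.T @ X                      # p x p, but its rank is at most n
print(XtX.shape, np.linalg.matrix_rank(XtX))  # (50, 50) 10 -> singular, so (X^T X)^{-1} does not exist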

The Simplest Example: \(n=2\) (Figure 6.22)

This slide (filename: ...020728.png) is the perfect illustration of the “not uniquely determined” problem.

  • Left Plot (Low-D): Many points (\(n\)), only two parameters (\(p=2\): intercept \(\beta_0\) and slope \(\beta_1\)). The line is a “best fit” that balances the errors. The training error (RSS) is non-zero.
  • Right Plot (High-D): We have \(n=2\) observations and \(p=2\) parameters.
    • You have two equations (one for each point) and two unknowns (\(\beta_0\) and \(\beta_1\)).
    • The model has exactly enough flexibility to pass perfectly through both points.
    • The result is zero training error.
    • This “perfect” fit is an illusion. If you got a new data point, this line would almost certainly be a terrible predictor. This is the essence of overfitting.

The Consequence: Misleading Metrics (Figure 6.23)

This slide (filename: ...020730.png) scales up the problem from \(n=2\) to \(n=20\) and shows why you must be cautious.

  • The Setup: \(n=20\) observations. We start with 1 feature and add more and more irrelevant, junk features.
  • Left Plot (\(R^2\)): The \(R^2\) on the training data steadily increases towards 1 as we add features. This is because, by pure chance, each new junk feature can explain a tiny bit more of the noise in the training set.
  • Center Plot (Training MSE): The training error drops to 0. This is the same as the \(n=2\) plot. Once the number of features (\(p\)) gets close to the number of observations (\(n=20\)), the model can perfectly fit the 20 data points, even if the features are random noise.
  • Right Plot (Test MSE): This is the “truth.” The actual error on new, unseen data gets worse and worse. By adding noise features, we are just “memorizing” the training set, and our model’s ability to generalize is destroyed.
  • Key Lesson: (from slide ...020744.png) This is why you must “Avoid using… \(p\)-values, \(R^2\), or other traditional measures of model on training as evidence of good fit.” They are guaranteed to lie to you when \(p > n\).

The Solutions (The “Deal with…” Slide)

This slide (filename: ...020734.png) lists the strategies to fix this. The core idea is regularization (or shrinkage). We add a “penalty” to the cost function to stop the \(\beta\) coefficients from getting too large or too numerous.

A. Ridge Regression (\(L_2\) Penalty)

  • Concept: Keeps all \(p\) features, but shrinks their coefficients. It’s excellent for handling multicollinearity.
  • Math: \(\text{Minimize: } \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^p \beta_j^2\)
    • The first part is the standard RSS.
    • The \(\lambda \sum \beta_j^2\) is the \(L_2\) penalty. It punishes large coefficient values.
  • \(\lambda\) (Lambda): This is the tuning parameter.
    • If \(\lambda=0\), it’s just OLS (which fails).
    • If \(\lambda \to \infty\), all \(\beta\)’s are shrunk to 0.
    • The right \(\lambda\) is chosen via cross-validation.

B. The Lasso (\(L_1\) Penalty)

  • Concept: This is often preferred because it performs automatic feature selection. It shrinks many coefficients to be exactly zero.
  • Math: \(\text{Minimize: } \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^p |\beta_j|\)
    • The \(\lambda \sum |\beta_j|\) is the \(L_1\) penalty. This absolute value penalty is what allows coefficients to become exactly 0.
  • Benefit: The final model is sparse (e.g., it might say “out of 2,000 features, only these 15 matter”).

C. Tuning Parameter Choice (The Real Work)

How do you pick the best \(\lambda\)? You must use the data you have. The slides mention this and “cross validation error” (from ...020744.png).

  • Python Code (Scikit-learn): You don’t just guess alpha (which is \(\lambda\) in scikit-learn). You use a tool like LassoCV or GridSearchCV to find the best one.
    import numpy as np
    from sklearn.linear_model import LassoCV
    from sklearn.datasets import make_regression

    # Create a high-dimensional dataset
    X, y = make_regression(n_samples=100, n_features=500, n_informative=10, noise=0.1)

    # LassoCV automatically performs cross-validation to find the best alpha (lambda)
    # cv=10 means 10-fold cross-validation
    lasso_cv_model = LassoCV(cv=10, random_state=0, max_iter=10000)

    # Fit the model
    lasso_cv_model.fit(X, y)

    # This is the best lambda (alpha) it found:
    print(f"Best alpha (lambda): {lasso_cv_model.alpha_}")

    # You can now see the coefficients
    # Most of the 500 coefficients will be 0.0
    print(f"Number of non-zero features: {np.sum(lasso_cv_model.coef_ != 0)}")

A Final Warning: The Curse of Dimensionality (Figure 6.24)

This final set of slides (filenames: ...020738.png and ...020741.jpg) provides a crucial, subtle warning: Regularization is not magic.

  • The Setup: \(n=100\) observations. There are 20 real features that truly affect the response.
  • The Experiment: They run Lasso three times, adding more and more noise features:
    • Left Plot (\(p=20\)): All 20 features are real. The lowest test MSE is found with minimal regularization (high “Degrees of Freedom,” meaning many non-zero coefficients). This makes sense; you want to keep all 20 real features.
    • Center Plot (\(p=50\)): Now we have 20 real features + 30 noise features. Lasso still works! The best model is found with more regularization (fewer “Degrees of Freedom”). Lasso successfully “zeroed out” many of the 30 noise features.
    • Right Plot (\(p=2000\)): This is the curse of dimensionality. We have 20 real features + 1980 noise features. The noise has completely overwhelmed the signal. Lasso fails. The test MSE is high no matter what tuning parameter you choose. The model cannot distinguish the 20 real features from the 1980 junk ones.

Final Takeaway: Even with advanced methods like Lasso, if your \(p \gg n\) problem is too extreme (i.e., the signal-to-noise ratio is too low), it may be impossible to build a good predictive model.

The Goal: “Collaborative Filtering”

The first slide (...021218.png) uses the term Collaborative Filtering. This is the key concept. The model “collaborates” by using the ratings of all users to fill in the blanks for a single user.

  • How it works: The model assumes your “taste” (vector \(\mathbf{u}_i\)) can be described as a combination of \(r\) “latent features” (e.g., \(r=3\): % action, % comedy, % drama). It also assumes each movie (vector \(\mathbf{v}_j\)) has a profile on these same features.
  • Your predicted rating for a movie is the dot product of your taste vector and the movie’s feature vector.
  • The model finds the best “taste” vectors \(\mathbf{U}\) and “movie” vectors \(\mathbf{V}\) that explain all the known ratings simultaneously. It’s collaborative because Lee’s ratings help define the features of “Bullet Train” (\(\mathbf{v}_2\)), which in turn helps predict Yang’s rating for that same movie.

The Hard Problem (and its 2 Flavors)

The second slide (...021222.png) presents the intuitive, but computationally very hard, way to frame the problem.

Detail 1: Noise vs. No Noise

The slide shows \(\mathbf{Y} = \mathbf{M} + \mathbf{E}\). This is critical.

  • \(\mathbf{M}\) is the “true,” “clean,” underlying low-rank matrix of everyone’s “true” preferences.
  • \(\mathbf{E}\) is a matrix of random noise (e.g., your true rating is 4.3, but you entered a 4; or you were in a bad mood and rated a 3).
  • \(\mathbf{Y}\) is the noisy data we actually observe.

Because of this noise, we don’t expect to find a matrix \(\mathbf{N}\) that perfectly matches our data. Instead, we try to find a low-rank \(\mathbf{N}\) that is as close as possible. This leads to the formula: \[\underset{\text{rank}(\mathbf{N}) \le r}{\text{minimize}} \quad \left\| \mathcal{P}_{\mathcal{O}}(\mathbf{Y} - \mathbf{N}) \right\|_{\text{F}}^2\] This says: “Find a matrix \(\mathbf{N}\) (of rank \(r\) or less) that minimizes the sum of squared errors only on the ratings we observed (\(\mathcal{O}\)).”

Detail 2: Why is \(\text{rank}(\mathbf{N}) \le r\) a “Non-convex constraint”?

This is the “difficult to optimize” part. A convex problem is (simplistically) one with a single valley, making it easy to find the single lowest point. A non-convex problem has many local valleys, and an algorithm can get stuck in a “pretty good” valley instead of the “best” one.

The rank constraint is non-convex. For example, the average of two rank-1 matrices is not necessarily a rank-1 matrix (it could be rank-2). This lack of a “smooth valley” property makes the problem NP-hard.

Detail 3: The Number of Parameters: \(r(d_1 + d_2)\)

The slide asks, “how many entries are needed?” The answer is based on the number of unknown parameters.

  • A rank-\(r\) matrix \(\mathbf{M}\) can be factored into \(\mathbf{U}\) (which is \(d_1 \times r\)) and \(\mathbf{V}^T\) (which is \(r \times d_2\)).
  • The number of entries in \(\mathbf{U}\) is \(d_1 \times r\).
  • The number of entries in \(\mathbf{V}\) is \(d_2 \times r\).
  • Total “unknowns” to solve for: \(d_1 r + d_2 r = r(d_1 + d_2)\).
  • This means we must have at least \(r(d_1 + d_2)\) observed ratings to have any hope of uniquely solving for \(\mathbf{U}\) and \(\mathbf{V}\). If our number of observations \(|\mathcal{O}|\) is less than this, the problem is hopelessly underdetermined. (A tiny numerical illustration of this count follows.)
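
A tiny numpy illustration of the parameter count (the dimensions d1, d2, r here are arbitrary toy values): a rank-\(r\) matrix built from factors \(\mathbf{U}\) and \(\mathbf{V}\) has \(d_1 d_2\) entries but only \(r(d_1+d_2)\) free parameters.

import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 1000, 400, 5

U = rng.normal(size=(d1, r))       # d1 x r factor
V = rng.normal(size=(d2, r))       # d2 x r factor
M = U @ V.T                        # d1 x d2 matrix of rank r

print("rank(M)         =", np.linalg.matrix_rank(M))  # 5
print("entries in M    =", d1 * d2)                   # 400,000
print("free parameters =", r * (d1 + d2))             # 7,000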

The “Magic” Solution: Convex Relaxation

The final slide (...021225.png) presents the groundbreaking solution from Candès and Recht. This solution cleverly changes the problem to one that is convex and solvable.

Detail 1: The L1-Norm Analogy (This is the most important concept)

This is the key to understanding why this works.

  • In Vectors (Lasso):
    • Hard Problem: Find the sparsest vector \(\beta\) (fewest non-zeros). This is \(L_0\) norm, \(\text{minimize } \|\beta\|_0\). This is non-convex.
    • Easy Problem: Minimize the \(L_1\) norm, \(\text{minimize } \|\beta\|_1 = \sum |\beta_j|\). This is convex, and it’s a “relaxation” that also produces sparse solutions.
  • In Matrices (Matrix Completion):
    • Hard Problem: Find the lowest-rank matrix \(\mathbf{X}\). Rank is the number of non-zero singular values. This is \(\text{minimize } \text{rank}(\mathbf{X})\). This is non-convex.
    • Easy Problem: Minimize the Nuclear Norm, \(\text{minimize } \|\mathbf{X}\|_* = \sum \sigma_i(\mathbf{X})\) (where \(\sigma_i\) are the singular values). This is convex, and it’s the “matrix equivalent” of the \(L_1\) norm. It’s a relaxation that also produces low-rank solutions. (A short numerical sketch of this analogy follows this list.)
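
A short numpy sketch of the analogy (toy matrix chosen for illustration): the rank counts the non-zero singular values, while the nuclear norm sums them.

import numpy as np

rng = np.random.default_rng(0)
# Build a 6 x 5 matrix of rank 2
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))

sigma = np.linalg.svd(A, compute_uv=False)  # singular values, largest first
print("singular values:", np.round(sigma, 3))
print("rank (number of non-zero singular values):", np.sum(sigma > 1e-10))
print("nuclear norm (sum of singular values):    ", round(sigma.sum(), 3))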

Detail 2: Noiseless vs. Noisy (Again)

Notice the constraint in this new problem: \[\text{Minimize } \quad \|\mathbf{X}\|_*\] \[\text{Subject to } \quad X_{ij} = M_{ij}, \quad (i, j) \in \mathcal{O}\]

This formulation is for the noiseless case. It assumes the \(M_{ij}\) we observed are perfectly accurate. It demands that our solution \(\mathbf{X}\) exactly matches the known ratings. This is different from the optimization problem on the previous slide, which just tried to get close to the noisy data \(\mathbf{Y}\).

(In practice, you solve a noisy-aware version that combines both ideas, but the slide shows the original, “exact completion” problem.)

Detail 3: The Guarantee (What the math at the bottom means)

\[\text{If } \mathcal{O} \text{ is sampled uniformly at random and } |\mathcal{O}| \gg r(d_1+d_2)\log(d_1+d_2), \text{ then (with high probability) the nuclear-norm solution is unique and equal to } \mathbf{M}.\]

This is the punchline. The Candès paper proved that if you have enough (but still very few) randomly sampled ratings, solving this easy convex problem (minimizing the nuclear norm) will magically give you the exact, true, low-rank matrix \(\mathbf{M}\).

  • \(|\mathcal{O}| \gg r(d_1+d_2)\): This part makes sense. We need at least as many observations as our \(r(d_1+d_2)\) degrees of freedom.
  • \(\log(d_1+d_2)\): This “log” factor is the “price” we pay for not knowing where the information is. It’s an astonishingly small price.
  • Example: For a 1,000,000 user x 10,000 movie matrix (like Netflix) with \(r=10\), you don’t need \(\approx 10^{10}\) ratings. You need a number closer to \(10 \times (10^6 + 10^4) \times \log(\dots)\), which is dramatically smaller. This is why this method is practical. (The arithmetic is spelled out below.)
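
To make “dramatically smaller” concrete, here is a rough back-of-the-envelope version of that bound (taking the natural log and ignoring constant factors): \[ r(d_1+d_2)\log(d_1+d_2) = 10 \times (10^6 + 10^4) \times \log(1.01 \times 10^6) \approx 1.01 \times 10^7 \times 13.8 \approx 1.4 \times 10^8, \] which is only on the order of 1% of the \(d_1 d_2 = 10^{10}\) entries in the full ratings matrix.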

统计机器学习Lecture-5

Lecturer: Prof.XIA DONG

1. Resampling

Resampling is a statistical tool for assessing the accuracy of models. Its main goal is to estimate the test error (a model’s performance on new, unseen data), because the training error is overly optimistic due to overfitting.

重采样是一种统计工具,用于评估模型的准确性,其主要目标是估计测试误差(模型在新的、未见过的数据上的表现),因为由于过拟合导致训练误差过于乐观。

Key Concepts

  • Resampling: The process of repeatedly drawing samples from a dataset. The two main types mentioned are Cross-validation (to estimate model test error) and Bootstrap (to quantify the uncertainty of estimates). 从数据集中反复抽取样本的过程。主要提到的两种类型是交叉验证(用于估计模型测试误差)和自举(用于量化估计的不确定性)。
  • Data Splitting (Ideal Scenario): In a “data-rich” situation, you split your data into three parts: 在“数据丰富”的情况下,您可以将数据拆分为三部分:
    1. Training Data: Used to fit and train the parameters of various models.用于拟合和训练各种模型的参数。
    2. Validation Data: Used to assess the trained models, tune hyperparameters (e.g., choose the polynomial degree), and select the best model. This helps prevent overfitting.用于评估已训练的模型、调整超参数(例如,选择多项式的次数)并选择最佳模型。这有助于防止过度拟合。
    3. Test Data: Used only once on the final, selected model to get an unbiased estimate of its real-world performance. 在最终选定的模型上仅使用一次,以获得其实际性能的无偏估计。
  • Validation vs. Test Data: The slides emphasize this difference (Slide 7). The validation set is part of the model-building and selection process. The test set is kept separate and is only used for the final report card after all decisions are made.验证集是模型构建和选择过程的一部分。测试集是独立的,仅在所有决策完成后用于最终报告。

The Validation Set Approach

This is the simplest cross-validation method.这是最简单的交叉验证方法。

  1. Split: The total dataset is randomly divided into two parts: a training set and a validation set (often a 50/50 or 70/30 split).将整个数据集随机分成两部分:训练集验证集(通常为 50/50 或 70/30 的比例)。
  2. Train: Various models are fit only on the training set.各种模型训练集上进行拟合。
  3. Validate: The performance of each trained model is evaluated using the validation set. 使用验证集评估每个训练模型的性能。
  4. Select: The model with the best performance (e.g., the lowest error) on the validation set is chosen as the final model. 选择在验证集上性能最佳(例如,误差最小)的模型作为最终模型。

Important Image: Schematic (Slide 10)

This diagram clearly shows a set of \(n\) observations being randomly split into a training set (blue, with observations 7, 22, 13) and a validation set (beige, with observation 91). The model learns from the blue set and is tested on the beige set. 此图清晰地展示了一组 \(n\) 个观测值被随机分成训练集(蓝色,观测值编号为 7、22、13)和验证集(米色,观测值编号为 91)。模型从蓝色数据集进行学习,并在米色数据集上进行测试。

Example: Auto Data (Formulas & Code)

The slides use the Auto dataset to decide the best polynomial degree to predict mpg from horsepower.

Mathematical Models

The models being compared are polynomials of different degrees. For example:

  • Linear: \(mpg = \beta_0 + \beta_1(horsepower)\)

  • Quadratic: \(mpg = \beta_0 + \beta_1(horsepower) + \beta_2(horsepower)^2\)

  • Cubic: \(mpg = \beta_0 + \beta_1(horsepower) + \beta_2(horsepower)^2 + \beta_3(horsepower)^3\)

  • 线性\(mpg = \beta_0 + \beta_1(马力)\)

  • 二次\(mpg = \beta_0 + \beta_1(马力) + \beta_2(马力)^2\)

  • 三次\(mpg = \beta_0 + \beta_1(马力) + \beta_2(马力)^2 + \beta_3(马力)^3\)

The performance metric used is the Mean Squared Error (MSE) on the validation set: 使用的性能指标是验证集上的均方误差 (MSE): \[MSE_{val} = \frac{1}{n_{val}} \sum_{i \in val} (y_i - \hat{f}(x_i))^2\] where \(n_{val}\) is the number of observations in the validation set, \(y_i\) is the true mpg value, and \(\hat{f}(x_i)\) is the model’s prediction for the \(i\)-th observation in the validation set. 其中 \(n_{val}\) 是验证集中的观测值数量,\(y_i\) 是真实的 mpg 值,\(\hat{f}(x_i)\) 是模型对验证集中第 \(i\) 个观测值的预测。

Important Image: Polynomial Fits (Slide 8) 多项式拟合(幻灯片 8)

This plot is crucial. It shows the Auto data with linear (red), quadratic (green), and cubic (blue) regression lines.

  • The linear fit is clearly poor.
  • The quadratic and cubic fits follow the data’s curve much better.
  • The inset box shows the MSE calculated on the full dataset (this is training MSE):
    • Linear MSE: ~26.42
    • Quadratic MSE: ~21.60
    • Cubic MSE: ~21.51

This suggests a non-linear fit is necessary, but it doesn’t tell us which one will generalize better.

这张图至关重要。它用线性(红色)、二次(绿色)和三次(蓝色)回归线展示了 Auto 数据。

  • 线性拟合明显较差。
  • 二次和三次拟合更能贴合数据曲线。
  • 插图显示了基于完整数据集计算的均方误差(这是训练均方误差):
    • 线性均方误差:~26.42
    • 二次均方误差:~21.60
    • 三次均方误差:~21.51

这表明非线性拟合是必要的,但它并没有告诉我们哪种拟合方式的泛化效果更好。

Code Analysis

The slides show two different approaches in code:

1. Python Code (Slide 9): Model Selection Criteria

  • What it does: This Python code (using pandas and statsmodels) does not implement the validation set approach. Instead, it fits polynomial models (degrees 1 through 5) to the entire dataset.
  • How it works: It calculates statistical criteria like BIC, Mallow’s \(C_p\), and Adjusted \(R^2\). These are mathematical adjustments to the training error that estimate the test error without needing a validation set. 它计算统计标准,例如 BIC、Mallow 的 \(C_p\) 和调整后的 \(R^2\)。这些是对训练误差的数学调整,无需验证集即可估算测试误差。
  • Key line (logic): sm.OLS(y, X).fit() is used to fit the model, and then metrics like model.bic and model.rsquared_adj are extracted.
  • Result: The table shows that the model with [horsepower, horsepower2] (quadratic) has the lowest BIC and \(C_p\) values, suggesting it’s the best model according to these criteria.
  • 结果:表格显示,带有 [马力, 马力2](二次函数)的模型具有最低的 BIC 和 \(C_p\) 值,这表明根据这些标准,它是最佳模型。

2. R Code (Slides 14 & 15): The Validation Set Approach

  • What it does: This R code directly implements the validation set approach described on Slide 13.
  • How it works:
    1. set.seed(...): Sets a random seed to make the split reproducible.
    2. train=sample(392, 196): Randomly selects 196 indices (out of 392) to be the training set.
    3. lm.fit=lm(mpg~poly(horsepower, 2), ..., subset=train): Fits a quadratic model only using the train data.
    4. mean((mpg-predict(lm.fit,Auto))[-train]^2): This is the key calculation.
      • predict(lm.fit, Auto): Predicts mpg for all data.
      • [-train]: Selects only the predictions for the validation set (the data not in train).
      • mean(...): Calculates the MSE on the validation set.
  • Result: The code is run three times with different seeds (1, 2022, 1997).
    • Seed 1: Quadratic MSE (18.71) is lowest.
    • Seed 2022: Quadratic MSE (19.70) is lowest.
    • Seed 1997: Quadratic MSE (19.08) is lowest.
  • Main Takeaway: In all random splits, the quadratic model gives the lowest validation set MSE. This provides evidence that the quadratic model is the best choice for generalizing to new data. The fact that the MSE values change with each seed also highlights a key disadvantage of this simple method: the results can be variable depending on the random split. 主要结论:在所有随机拆分中,二次模型的验证集 MSE 最低。这证明了二次模型是推广到新数据的最佳选择。MSE 值随每个种子变化的事实也凸显了这种简单方法的一个关键缺点:结果可能会因随机拆分而变化。

2. The Validation Set Approach 验证集方法

This method is a simple way to estimate a model’s performance on new, unseen data (the “test error”). 这种方法是一种简单的方法,用于评估模型在新的、未见过的数据(“测试误差”)上的性能。 The core idea is to randomly split your available data into two parts: 其核心思想是将可用数据随机拆分为两部分:

  1. Training Set: Used to fit (or “train”) your model. 用于拟合(或“训练”)模型。
  2. Validation Set (or Test Set): Used to evaluate the trained model’s performance. You calculate the error (like Mean Squared Error) on this set. 用于评估训练后的模型性能。计算此集合的误差(例如均方误差)。

Python Code Explained (Slide 1)

The first slide shows a Python example using the Auto dataset to predict mpg from horsepower.

  1. Setup & Data Loading:
    • import statements load libraries like pandas (for data), sklearn.model_selection.train_test_split (the key function for this method), and sklearn.linear_model.LinearRegression.
    • Auto = pd.read_csv(...) loads the data.
    • X = Auto['horsepower'].values and y = Auto['mpg'].values select the variables of interest.
  2. The Split:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=007)
    • This is the most important line for this method. It splits the data X and y into training and testing (validation) sets.
    • train_size=0.5 means 50% of the data is for training and 50% is for validation.
    • random_state=007 ensures the split is “random” but “reproducible” (using the same seed 007 will always produce the same split).
  3. Model Fitting & Evaluation:
    • The code fits three different polynomial models, but it only uses the training data (X_train, y_train) to do so.
    • Linear (Degree 1): A simple LinearRegression.
    • Quadratic (Degree 2): Uses PolynomialFeatures(2) to create \(x\) and \(x^2\) terms, then fits a linear model to them.
    • Cubic (Degree 3): Uses PolynomialFeatures(3) to create \(x\), \(x^2\), and \(x^3\) terms.
    • It then calculates the Mean Squared Error (MSE) for all three models using the test data (X_test, y_test).
  4. Results (from the text on the slide):
    • Linear MSE: \(\approx 23.3\)
    • Quadratic MSE: \(\approx 19.4\)
    • Cubic MSE: \(\approx 19.4\)
    • Conclusion: The quadratic model gives a significantly lower error than the linear model. The cubic model does not offer any real improvement over the quadratic one. (A minimal sketch of this validation-set workflow appears after this list.)
    结果(来自幻灯片上的文字):
    • 线性均方误差:约 23.3
    • 二次均方误差:约 19.4
    • 三次均方误差:约 19.4
    • 结论:二次模型的误差显著低于线性模型。三次模型与二次模型相比并没有任何实质性的改进。
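
A minimal sketch of the workflow just described, assuming a local, already-cleaned numeric copy of Auto.csv; random_state is written as 7 here because a literal 007 is not valid in Python 3:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Auto = pd.read_csv("Auto.csv")            # assumed local copy of the Auto dataset
X = Auto[["horsepower"]].values
y = Auto["mpg"].values

# 50/50 split, as on the slide
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=7)

for degree in [1, 2, 3]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)                                 # fit only on the training half
    mse = mean_squared_error(y_test, model.predict(X_test))    # evaluate on the validation half
    print(f"degree {degree}: validation MSE = {mse:.1f}")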

Key Images: The Problem with a Single Split

The most important images are on slide 9 (labeled “Figure” and “Page 20”).

  • Plot on the Left (Single Split): This graph shows the validation MSE for polynomial degrees 1 through 10, based on the single random split from the R code (slide 2). Just like the Python example, it shows that the MSE drops sharply from degree 1 to 2, and then stays relatively low. Based on this one chart, you might pick degree 2 (quadratic) as the best model.

此图显示了多项式次数为 1 至 10 的验证均方误差,基于 R 代码(幻灯片 2)中的单次随机分割。与 Python 示例一样,它显示 MSE 从 1 阶到 2 阶急剧下降,然后保持在相对较低的水平。基于这张图,您可能会选择 2 阶(二次)作为最佳模型。

  • Plot on the Right (Ten Splits): This is the most critical plot. It shows the results of repeating the entire process 10 times, each with a new random split (from R code on slide 3).
    • You can see 10 different error curves.
    • While they all agree that degree 1 (linear) is bad, they do not agree on the best model. Some curves suggest degree 2 is best, others suggest 3, 4, or even 6.
    这是最关键的图表。它显示了重复整个过程 10 次的结果,每次都使用新的随机分割(来自幻灯片 3 上的 R 代码)。
    • 您可以看到 10 条不同的误差曲线。
    • 虽然他们都认为 1 阶(线性)模型不好,但他们对最佳模型的看法并不一致。有些曲线表明 2 阶最佳,而另一些则表明 3 阶、4 阶甚至 6 阶最佳。

Summary of Drawbacks (Slides 7, 8, 9, 23, 25)

The slides repeatedly emphasize the two main drawbacks of this simple validation set approach:

  1. High Variability 高变异性: The estimated test MSE can be highly variable, depending on which observations happen to land in the training set versus the validation set. The plot with 10 curves (slide 9, right) proves this perfectly. 估计的测试 MSE 可能高度变异,具体取决于哪些观测值恰好落在训练集和验证集中。包含 10 条曲线的图表(幻灯片 9,右侧)完美地证明了这一点。

  2. Overestimation of Test Error 高估测试误差:

    • The model is only trained on a subset (e.g., 50%) of the available data. The validation data is “wasted” and not used for model building.
    • Statistical methods tend to perform worse when trained on fewer observations.
    • Therefore, the model trained on just the training set is likely worse than a model trained on the entire dataset.
    • This “worse” model will have a higher error rate on the validation set. This means the validation set MSE tends to overestimate the true test error you would get from a model trained on all your data.
    • 该模型仅基于可用数据的子集(例如 50%)进行训练。验证数据被“浪费”了,并未用于模型构建。
    • 统计方法在较少的观测值上进行训练时往往表现较差。
    • 因此,仅基于训练集训练的模型可能比基于整个数据集训练的模型更差
    • 这个“更差”的模型在验证集上的错误率会更高。这意味着验证集的 MSE 倾向于高估基于所有数据训练的模型的真实测试误差。

3. Cross-Validation: The Solution 交叉验证:解决方案

The slides introduce Cross-Validation (CV) as the method to overcome these drawbacks. The core idea is to use all data points for both training and validation, just at different times. 交叉验证 (CV),以此来克服这些缺点。其核心思想是将所有数据点用于训练和验证,只是使用的时间不同。

Leave-One-Out Cross-Validation (LOOCV) 留一法交叉验证 (LOOCV)

This is the first type of CV introduced (slide 10, page 26). For a dataset with \(n\) data points:

  1. Hold out the 1st data point (this is your validation set). 保留第一个数据点(这是你的验证集)。
  2. Train the model on the other \(n-1\) data points. 使用其他 \(n-1\) 个数据点训练模型。
  3. Calculate the error (e.g., \(\text{MSE}_1\)) using only that 1st held-out point. 仅使用第一个保留点计算误差(例如,\(\text{MSE}_1\))。
  4. Repeat this \(n\) times, holding out the 2nd point, then the 3rd, and so on, until every point has been used as the validation set exactly once. 重复此操作 \(n\) 次,保留第二个点,然后是第三个点,依此类推,直到每个点都作为验证集使用一次。
  5. Your final test error estimate is the average of all \(n\) errors. 最终的测试误差估计是所有 \(n\) 个误差的平均值

Key Formula (from Slide 10)

The formula for the \(n\)-fold LOOCV error estimate is: \(n\) 倍 LOOCV 误差估计公式为: \[\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{MSE}_i\]

Where: * \(n\) is the total number of data points. 是数据点的总数。 * \(\text{MSE}_i\) is the Mean Squared Error calculated on the \(i\)-th data point when it was held out. 是保留第 \(i\) 个数据点时计算的均方误差。

3.What is LOOCV (Leave-One-Out Cross Validation)

Leave-One-Out Cross Validation (LOOCV) is a method for estimating the test error of a model. For a dataset with \(n\) observations, you: 留一交叉验证 (LOOCV) 是一种估算模型测试误差的方法。对于包含 \(n\) 个观测值的数据集,您需要:

  1. Fit the model \(n\) times. 对模型进行 \(n\) 次拟合
  2. For each fit \(i\) (from \(1\) to \(n\)), you train the model on all data points except for observation \(i\). 对于每个拟合 \(i\) 个样本(从 \(1\)\(n\)),您需要在除观测值 \(i\) 之外的所有数据点上训练模型。
  3. You then use this trained model to make a prediction for the single observation \(i\) that was left out. 然后,您需要使用这个训练好的模型对被遗漏的单个观测值 \(i\) 进行预测。
  4. The final LOOCV error is the average of the \(n\) prediction errors (typically the Mean Squared Error, or MSE). 最终的 LOOCV 误差是 \(n\) 个预测误差的平均值(通常为均方误差,简称 MSE)。

This process is shown visually in the slide titled “LOOCV” (slide 27), which is a key image for understanding the concept. Pros & Cons (from slide 28): * Pro: It has low bias because the training set (\(n-1\) samples) is almost identical to the full dataset.由于训练集(\(n-1\) 个样本)与完整数据集几乎完全相同,因此偏差较低。 * Pro: It produces a stable, non-random error estimate (unlike \(k\)-fold CV, which depends on the random fold assignments). 它能产生稳定的非随机误差估计(不同于 k 倍交叉验证,后者依赖于随机折叠分配)。 * Con: It can be extremely computationally expensive, as the model must be refit \(n\) times. 由于模型必须重新拟合 \(n\) 次,计算成本极其高昂。 * Con: The \(n\) error estimates can be highly correlated, which can sometimes lead to high variance in the final \(CV\) estimate. 这 \(n\) 个误差估计可能高度相关,有时会导致最终 \(CV\) 估计值出现较大方差。

Key Mathematical Formulas

The main challenge of LOOCV (being computationally expensive) has a very efficient solution for linear models. LOOCV 的主要挑战(计算成本高昂)对于线性模型来说,有一个非常有效的解决方案。

1. The Standard (Slow) Formula

As defined on slide 33, the LOOCV estimate of the MSE is:

\[CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i^{(i)})^2\]

  • \(y_i\) is the true value of the \(i\)-th observation. 是第 \(i\) 个观测值的真实值。
  • \(\hat{y}_i^{(i)}\) is the predicted value for \(y_i\) from a model trained on all data except observation \(i\). 是使用除观测值 \(i\) 之外的所有数据训练的模型对 \(y_i\) 的预测值。

Calculating \(\hat{y}_i^{(i)}\) requires refitting the model \(n\) times. 计算 \(\hat{y}_i^{(i)}\) 需要重新拟合模型 \(n\) 次。

2. The Shortcut (Fast) Formula

Slide 34 provides a much simpler formula that only requires fitting the model once on the entire dataset: 只需对整个*数据集进行一次模型拟合**:

\[CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2\]

  • \(\hat{y}_i\) is the prediction for \(y_i\) from the model trained on all \(n\) data points. 是使用所有 \(n\) 个数据点训练的模型对 \(y_i\) 的预测值。
  • \(h_i\) is the leverage of the \(i\)-th observation. 是第 \(i\) 个观测值的杠杆率

3. What is Leverage (\(h_i\))?

Slide 35 defines leverage:

  • Hat Matrix (\(\mathbf{H}\)): In a linear model, the fitted values \(\hat{\mathbf{y}}\) are related to the true values \(\mathbf{y}\) by the hat matrix: \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\).

  • Formula: The hat matrix is defined as \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\).

  • Leverage (\(h_i\)): The leverage for the \(i\)-th observation is simply the \(i\)-th diagonal element of the hat matrix, \(h_{ii}\) (often just written as \(h_i\)).

    • \(h_i = \mathbf{x}_i^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_i\)
  • Meaning: Leverage measures how “influential” an observation’s \(x_i\) value is in determining its own predicted value \(\hat{y}_i\). A high leverage score means that point has a lot of influence on the model’s fit.

  • 帽子矩阵 (\(\mathbf{H}\)):在线性模型中,拟合值 \(\hat{\mathbf{y}}\) 与真实值 \(\mathbf{y}\) 之间存在帽子矩阵关系:\(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\)

  • 公式:帽子矩阵定义为 \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)

  • 杠杆率 (\(h_i\)):\(i\) 个观测值的杠杆率就是帽子矩阵的第 \(i\) 个对角线元素 \(h_{ii}\)(通常写为 \(h_i\))。

  • \(h_i = \mathbf{x}_i^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_i\)

  • 含义:杠杆率衡量观测值的 \(x_i\) 值对其自身预测值 \(\hat{y}_i\) 的“影响力”。杠杆率得分高意味着该点对模型拟合有很大影响。

This shortcut formula is extremely important because it makes LOOCV as fast to compute as a single model fit.这个快捷公式非常重要,因为它使得 LOOCV 的计算速度与单个模型拟合一样快。
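
A small numerical check of this identity (toy data invented here, with an intercept column included in \(\mathbf{X}\)): the brute-force leave-one-out residuals match \(e_i/(1-h_i)\) computed from a single fit.

import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
beta_true = rng.normal(size=p + 1)
y = X @ beta_true + rng.normal(size=n)

# Single full fit: residuals e_i and leverages h_i from the hat matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
cv_fast = np.mean((e / (1 - h)) ** 2)

# Brute force: refit n times, each time leaving one observation out
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    errs.append(y[i] - X[i] @ b_i)
cv_slow = np.mean(np.array(errs) ** 2)

print(f"fast formula: {cv_fast:.6f}")
print(f"brute force : {cv_slow:.6f}")  # identical up to floating-point error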

Python Code Explained (Slide 29)

This slide shows how to use LOOCV to select the best polynomial degree for predicting mpg from horsepower.

  1. Imports: It imports standard libraries (pandas, matplotlib) and key modules from sklearn:
    • LinearRegression: The model to be fit.
    • PolynomialFeatures: A tool to create polynomial terms (e.g., \(x, x^2, x^3\)).
    • LeaveOneOut: The LOOCV cross-validation strategy object.
    • cross_val_score: A function that automatically runs a cross-validation test.
  2. Setup:
    • It loads the Auto.csv data.
    • It defines \(X\) (horsepower) and \(y\) (mpg).
    • It creates a LeaveOneOut object: loo = LeaveOneOut().
  3. Looping through Degrees:
    • The code loops degree from 1 to 10.
    • make_pipeline: For each degree, it creates a model using make_pipeline. This pipeline is a crucial concept:
      • It first runs PolynomialFeatures(degree) to transform \(X\) into \([X, X^2, ..., X^{\text{degree}}]\).
      • It then feeds those features into LinearRegression() to fit the model.
    • cross_val_score: This is the most important line.
      • scores = cross_val_score(model, X, y, cv=loo, scoring='neg_mean_squared_error')
      • This function automatically does the entire LOOCV process. It takes the model (the pipeline), the data \(X\) and \(y\), and the CV strategy (cv=loo).
      • sklearn’s cross_val_score uses the “fast” leverage method internally for linear models, so it doesn’t actually fit the model \(n\) times.
      • It uses scoring='neg_mean_squared_error' because the scoring function assumes “higher is better.” By calculating the negative MSE, the best model will have the highest score (i.e., closest to 0).
    • Storing Results: It calculates the mean of the scores (which is the \(CV_{(n)}\)) and stores it.
  4. Visualization:
    • The code then plots the final cv_errors (after flipping the sign back to positive) against the degree.
    • The resulting plot (also on slide 32) shows the test MSE, allowing you to visually pick the best degree (where the error is minimized).
    • 生成的图(也在幻灯片 32 上)显示了测试 MSE,让您可以直观地选择最佳 degree(误差最小化的 degree)。

Important Images

  • Slide 27 (.../103628.png): This is the best conceptual image. It visually demonstrates how LOOCV splits the data \(n\) times, with each observation getting one turn as the validation set. 这是最佳概念图。它直观地展示了 LOOCV 如何将数据拆分 \(n\) 次,每个观察值都会被旋转一次作为验证集。

  • Slide 34 (.../103711.png): This slide presents the most important formula: the “Easy formula” or shortcut, \(CV_{(n)} = \frac{1}{n} \sum (\frac{y_i - \hat{y}_i}{1 - h_i})^2\). This is the key takeaway for computing LOOCV efficiently in linear models. 这张幻灯片展示了最重要的公式:“简单公式”或捷径,\(CV_{(n)} = \frac{1}{n} \sum (\frac{y_i - \hat{y}_i}{1 - h_i})^2\)。这是在线性模型中高效计算 LOOCV 的关键要点。

  • Slide 32 (.../103701.jpg): This is the key results image. It contrasts the LOOCV error curve (left) with the 10-fold CV error curves (right). It clearly shows that LOOCV produces a single, stable error curve, while 10-fold CV results vary slightly each time it’s run due to the random data splits. 这是关键结果图。它将 LOOCV 误差曲线(左)与 10 倍 CV 误差曲线(右)进行了对比。它清楚地表明,LOOCV 产生了单一、稳定的误差曲线,而由于数据分割的随机性,10 倍 CV 的结果每次运行时都会略有不同。

4. Cross-Validation Overview

These slides explain Cross-Validation (CV), a method used to estimate the test error of a model, helping to select the best level of flexibility (e.g., the best polynomial degree). It’s an improvement over a single validation set because it uses all the data for both training and validation at different times. 这是一种用于估算模型测试误差的方法,有助于选择最佳的灵活性(例如,最佳多项式次数)。它比单个验证集有所改进,因为它在不同时间使用所有数据进行训练和验证。

The two main types discussed are K-fold CV and Leave-One-Out CV (LOOCV). 主要讨论的两种类型是K 折交叉验证留一法交叉验证 (LOOCV)

K-Fold Cross-Validation K 折交叉验证

This is the most common method.

The Process

As shown in the slides, the K-fold CV process is:

  1. Divide the dataset randomly into \(K\) non-overlapping groups (or “folds”), usually of equal size. Common choices are \(K=5\) or \(K=10\). 将数据集随机划分为 \(K\) 个不重叠的组(或“折”),通常大小相等。常见的选择是 \(K=5\) 或 \(K=10\)。
  2. Iterate \(K\) times: In each iteration \(i\), use the \(i\)-th fold as the validation set and all other \(K-1\) folds combined as the training set. 迭代 \(K\) 次:在每次迭代 \(i\) 中,使用第 \(i\) 个样本集作为验证集,并将所有其他 \(K-1\) 个样本集合并作为训练集。
  3. Calculate the Mean Squared Error (\(MSE_i\)) on the validation fold. 计算验证集的均方误差 (\(MSE_i\))。
  4. Average all \(K\) error estimates to get the final CV score. 平均所有 \(K\) 个误差估计值,得到最终的 CV 分数。

Key Formula

The final K-fold CV error estimate is the average of the errors from each fold: 最终的 K 折 CV 误差估计值是每个样本集误差的平均值: \[CV_{(K)} = \frac{1}{K} \sum_{i=1}^{K} MSE_i\]
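
A minimal scikit-learn sketch of this procedure (synthetic data used here in place of the Auto dataset; the data-generating coefficients are invented):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data with a quadratic relationship
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 - 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=1)  # K = 10 folds

for degree in range(1, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
    print(f"degree {degree}: CV(10) MSE = {-scores.mean():.3f}")  # average of the 10 fold MSEs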

Important Image: The Concept

The diagram in slide 104145.png is the most important for understanding the concept of K-fold CV. It shows a dataset split into 5 folds (\(K=5\)). The process is repeated 5 times, with a different fold (in beige) held out as the validation set in each run, while the rest (in blue) is used for training. 它展示了一个被分成 5 个样本集 (\(K=5\)) 的数据集。该过程重复 5 次,每次运行都会保留一个不同的折叠(米色)作为验证集,其余折叠(蓝色)用于训练。

Leave-One-Out Cross-Validation (LOOCV)

LOOCV is just a special case of K-fold CV where \(K = n\) (the total number of observations). LOOCV 只是 K 折交叉验证的一个特例,其中 \(K = n\)(观测值总数)。

  • You create \(n\) “folds,” each containing just one data point. 创建 \(n\) 个“折叠”,每个折叠仅包含一个数据点。
  • You train the model \(n\) times, each time leaving out a single different observation and then calculating the error for that one point. 对模型进行 \(n\) 次训练,每次都省略一个不同的观测值,然后计算该点的误差。

Key Formulas

  1. Standard Definition: The LOOCV error is the average of the \(n\) squared errors: \[CV = \frac{1}{N} \sum_{i=1}^{N} e_{[i]}^2\] where \(e_{[i]} = y_i - \hat{y}_{[i]}\) is the prediction error for the \(i\)-th observation, calculated from a model that was trained on all data except the \(i\)-th observation. This looks computationally expensive. LOOCV 误差是 \(n\) 个平方误差的平均值: \[CV = \frac{1}{N} \sum_{i=1}^{N} e_{[i]}^2\] 其中 \(e_{[i]} = y_i - \hat{y}_{[i]}\) 是第 \(i\) 个观测值的预测误差,该误差由一个使用除第 \(i\) 个观测值以外的所有数据训练的模型计算得出。这看起来计算成本很高。

  2. Fast Computation (for Linear Regression): A key point from the slides is that for linear regression, you don’t need to re-fit the model \(N\) times. You can fit the model once on all \(N\) data points and use the following shortcut: \[CV = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{e_i}{1 - h_i} \right)^2\]

    • \(e_i = y_i - \hat{y}_i\) is the standard residual (from the model fit on all data).
    • \(h_i\) is the leverage statistic for the \(i\)-th observation (the \(i\)-th diagonal entry of the “hat matrix” \(H\)). This makes LOOCV as fast to compute as a single model fit. 对于线性回归,您无需重新拟合模型 \(N\) 次。您可以对所有 \(N\) 个数据点一次性地拟合模型,并使用以下快捷方式: \[CV = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{e_i}{1 - h_i} \right)^2\]
    • \(e_i = y_i - \hat{y}_i\) 是标准残差(来自对所有数据的模型拟合)。
    • \(h_i\) 是第 \(i\) 个观测值(“帽子矩阵”\(H\) 的第 \(i\) 个对角线元素)的杠杆统计量。 这使得 LOOCV 的计算速度与单次模型拟合一样快。

Python Code & Results

The Python code in slide 104156.jpg shows how to use 10-fold CV to find the best polynomial degree for a model.

Code Understanding (Slide 104156.jpg)

Here’s a breakdown of the key sklearn parts:

  1. from sklearn.pipeline import make_pipeline: This is used to chain steps. The pipeline make_pipeline(PolynomialFeatures(degree), LinearRegression()) first creates polynomial features (like \(x\), \(x^2\), \(x^3\)) and then fits a linear model to them.
  2. from sklearn.model_selection import KFold: This object is used to define the \(K\)-fold split strategy. kf = KFold(n_splits=10, shuffle=True, random_state=1) creates a 10-fold splitter that shuffles the data first.
  3. from sklearn.model_selection import cross_val_score: This is the most important function.
    • scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
    • This one function does all the work: it takes the model (the pipeline), the data X and y, and the CV splitter kf. It automatically trains and evaluates the model 10 times and returns an array of 10 scores (one for each fold).
    • scoring='neg_mean_squared_error' is used because cross_val_score expects a higher score to be better. Since we want to minimize MSE, we use negative MSE.
  4. avg_mse = -scores.mean(): The code averages the 10 scores and flips the sign back to positive to get the final CV (MSE) estimate for that polynomial degree.

Important Image: The Results

The plots in slides 104156.jpg (Python) and 104224.png (R) show the key result.

  • X-axis: Degree of Polynomial (model complexity).多项式的次数(模型复杂度)。
  • Y-axis: Estimated Test Error (CV Error / MSE).估计测试误差(CV 误差 / MSE)。
  • Interpretation: The plot shows a clear “U” shape. The error is high for degree 1 (a simple line), drops to its minimum at degree 2 (a quadratic \(ax^2 + bx + c\)), and then starts to rise again for higher degrees. This rise indicates overfitting—the more complex models are fitting the training data’s noise, leading to worse performance on unseen validation data. 该图呈现出清晰的“U”形。1 次(一条简单的直线)时误差较大,在2 次(二次 \(ax^2 + bx + c\))时降至最小,然后随着次数的增加,误差再次上升。这种上升表明过拟合——更复杂的模型会拟合训练数据的噪声,导致在未见过的验证数据上的性能下降。
  • Conclusion: The 10-fold CV analysis suggests that a quadratic model (degree 2) is the best choice, as it provides the lowest estimated test error. 10 倍 CV 分析表明二次模型(2 次)是最佳选择,因为它提供了最低的估计测试误差。

Let’s dive into the details of that proof.

Detailed Summary: The “Fast Computation of LOOCV” Proof

The most mathematically dense and important part of your slides is the proof (spanning slides 104126.jpg, 104132.png, and 104136.png) that LOOCV, which seems computationally very expensive, can be calculated quickly for linear regression. LOOCV 虽然计算成本看似非常高,但对于线性回归来说,它可以快速计算。

The Goal

The goal is to prove that the LOOCV statistic, which is defined as: \[CV = \frac{1}{N} \sum_{i=1}^{N} e_{[i]}^2 \quad \text{where } e_{[i]} = y_i - \hat{y}_{[i]}\] (Here, \(\hat{y}_{[i]}\) is the prediction for \(y_i\) from a model trained on all data except point \(i\)).(其中,\(\hat{y}_{[i]}\) 表示基于除点 \(i\) 之外的所有数据训练的模型对 \(y_i\) 的预测)。

…can be computed without re-fitting the model \(N\) times, using this “fast” formula: 无需重新拟合模型 \(N\) 次即可计算,使用以下“快速”公式: \[CV = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{e_i}{1 - h_i} \right)^2\] (Here, \(e_i\) is the standard residual and \(h_i\) is the leverage, both from a single model fit on all data).

The entire proof boils down to showing one identity: \(e_{[i]} = e_i / (1 - h_i)\).

Key Definitions (The Matrix Algebra Setup) (矩阵代数设置)

  • Model 模型: \(\mathbf{Y} = \mathbf{X}\beta + \mathbf{e}\)
  • Full Data Estimate 完整数据估计 (\(\hat{\beta}\)): \(\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\)
  • Hat Matrix 帽子矩阵 (\(\mathbf{H}\)): \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)
  • Full Data Residual 完整数据残差 (\(e_i\)): \(e_i = y_i - \hat{y}_i = y_i - \mathbf{x}_i^T\hat{\beta}\)
  • Leverage (\(h_i\)) 杠杆 (\(h_i\)): The \(i\)-th diagonal element of \(\mathbf{H}\). \(h_i = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\)
  • Leave-One-Out Estimate (\(\hat{\beta}_{[i]}\)): \(\hat{\beta}_{[i]} = (\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1}\mathbf{X}_{[i]}^T\mathbf{Y}_{[i]}\)
    • \(\mathbf{X}_{[i]}\) and \(\mathbf{Y}_{[i]}\) are the data with the \(i\)-th row removed.
  • LOOCV Residual LOOCV 残差 (\(e_{[i]}\)): \(e_{[i]} = y_i - \mathbf{x}_i^T\hat{\beta}_{[i]}\)

The Proof Step-by-Step

Here is the logic from your slides, broken down:

Step 1: Relating the Matrices (Slide 104132.png)

The proof’s “trick” is to relate the “full data” matrix \((\mathbf{X}^T\mathbf{X})\) to the “leave-one-out” matrix \((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})\). 证明的“技巧”是将“全数据”矩阵 \((\mathbf{X}^T\mathbf{X})\) 与“留一法”矩阵 \((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})\) 关联起来。

  • The full sum-of-squares matrix is just the leave-one-out matrix plus the one observation’s contribution: 完整的平方和矩阵就是留一法矩阵加上一个观测值的贡献:

    \[\mathbf{X}^T\mathbf{X} = \mathbf{X}_{[i]}^T\mathbf{X}_{[i]} + \mathbf{x}_i\mathbf{x}_i^T\]

  • This means: \(\mathbf{X}_{[i]}^T\mathbf{X}_{[i]} = \mathbf{X}^T\mathbf{X} - \mathbf{x}_i\mathbf{x}_i^T\)

Step 2: The Key Matrix Trick (Slide 104132.png)

We need the inverse \((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1}\) to calculate \(\hat{\beta}_{[i]}\). Finding this inverse directly is hard. Instead, we use the Sherman-Morrison-Woodbury formula, which tells us how to find the inverse of a matrix that’s been “updated” (in this case, by subtracting \(\mathbf{x}_i\mathbf{x}_i^T\)).

我们需要逆\((\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1}\) 来计算 \(\hat{\beta}_{[i]}\)。直接求这个逆矩阵很困难。因此,我们使用 Sherman-Morrison-Woodbury 公式,它告诉我们如何求一个“更新”后的矩阵的逆矩阵(在本例中,是通过减去 \(\mathbf{x}_i\mathbf{x}_i^T\) 来实现的)。

The slide applies this formula to get: \[(\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_i}\]

  • This is the most complex step, but it is a standard matrix identity. It is crucial because it expresses the “leave-one-out” inverse in terms of the “full data” inverse \((\mathbf{X}^T\mathbf{X})^{-1}\), which we already have.

Step 3: Finding \(\hat{\beta}_{[i]}\) (Slide 104136.png)

Now we can write a new formula for \(\hat{\beta}_{[i]}\) by substituting the result from Step 2. We also note that \(\mathbf{X}_{[i]}^T\mathbf{Y}_{[i]} = \mathbf{X}^T\mathbf{Y} - \mathbf{x}_i y_i\).

\[\hat{\beta}_{[i]} = (\mathbf{X}_{[i]}^T\mathbf{X}_{[i]})^{-1} (\mathbf{X}_{[i]}^T\mathbf{Y}_{[i]})\] \[\hat{\beta}_{[i]} = \left[ (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}}{1 - h_i} \right] (\mathbf{X}^T\mathbf{Y} - \mathbf{x}_i y_i)\]

The slide then shows the algebra to simplify this big expression. When you expand and simplify everything, you get a much cleaner result:

\[\hat{\beta}_{[i]} = \hat{\beta} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \frac{e_i}{1 - h_i}\]

  • This is a beautiful result! It says the LOOCV coefficient vector is just the full coefficient vector minus a small adjustment term related to the \(i\)-th observation’s residual (\(e_i\)) and leverage (\(h_i\)).
  • 这是一个非常棒的结果!它表明 LOOCV 系数向量就是完整的系数向量减去一个与第 \(i\) 个观测值的残差 (\(e_i\)) 和杠杆率 (\(h_i\)) 相关的小调整项。

Step 4: Finding \(e_{[i]}\) (Slide 104136.png)

This is the final step. We use the definition of \(e_{[i]}\) and the result from Step 3. 这是最后一步。我们使用 \(e_{[i]}\) 的定义和步骤 3 的结果。

  • Start with the definition: \(e_{[i]} = y_i - \mathbf{x}_i^T\hat{\beta}_{[i]}\)
  • Substitute \(\hat{\beta}_{[i]}\): \(e_{[i]} = y_i - \mathbf{x}_i^T \left[ \hat{\beta} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \frac{e_i}{1 - h_i} \right]\)
  • Distribute \(\mathbf{x}_i^T\): \(e_{[i]} = (y_i - \mathbf{x}_i^T\hat{\beta}) + \left( \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i \right) \frac{e_i}{1 - h_i}\)
  • Recognize the terms!
    • The first term is just the standard residual: \((y_i - \mathbf{x}_i^T\hat{\beta}) = e_i\)
    • The second term in parentheses is the definition of leverage: \((\mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i) = h_i\)
  • Substitute back: \(e_{[i]} = e_i + h_i \left( \frac{e_i}{1 - h_i} \right)\)
  • Get a common denominator: \(e_{[i]} = \frac{e_i(1 - h_i) + h_i e_i}{1 - h_i}\)
  • Simplify the numerator: \(e_{[i]} = \frac{e_i - e_ih_i + e_ih_i}{1 - h_i}\)

This gives the final, simple relationship: \[e_{[i]} = \frac{e_i}{1 - h_i}\]

Conclusion

By proving this identity, the slides show that to get all \(N\) of the “leave-one-out” errors, you only need to: 1. Fit one linear regression model on all the data. 2. Calculate the standard residuals \(e_i\) and the leverage values \(h_i\) for all \(N\) points. 3. Apply the formula \(e_i / (1 - h_i)\) for each point.

This turns a procedure that looked like it would take \(N\) times the work into a procedure that takes only 1 model fit. This is why LOOCV is a practical and efficient method for linear regression.

通过证明这个恒等式,幻灯片显示,要获得所有 \(N\) 个“留一法”误差,您只需: 1. 对所有数据拟合一个线性回归模型。 2. 计算所有 \(N\) 个点的标准残差 \(e_i\) 和杠杆值 \(h_i\)。 3. 对每个点应用公式 \(e_i / (1 - h_i)\)

这将一个看似需要 \(N\) 倍工作量的过程变成了只需 1 次模型拟合的过程。这就是为什么 LOOCV 是一种实用且高效的线性回归方法。
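
A quick numerical check of this identity takes only a few lines. The sketch below uses synthetic data (the design and coefficients are illustrative assumptions): it fits a single OLS model, applies \(e_i / (1 - h_i)\), and compares the result with brute-force leave-one-out refits.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# One fit on all the data
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat                                  # ordinary residuals e_i
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)           # leverages h_i = x_i^T (X^T X)^{-1} x_i
cv_fast = np.mean((e / (1 - h)) ** 2)                 # the "fast" LOOCV formula

# Brute-force LOOCV: refit n times, leaving one point out each time
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append(y[i] - X[i] @ b_i)
cv_slow = np.mean(np.square(errs))

print(cv_fast, cv_slow)   # the two numbers agree up to floating-point error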

5. Main Goal of Cross-Validation 交叉验证的主要目标

The central purpose of cross-validation is to estimate the true test error of a machine learning model. This is crucial for:

  1. Model Assessment: Evaluating how well a model will perform on new, unseen data. 评估模型在新的、未见过的数据上的表现。
  2. Model Selection: Choosing the best level of model flexibility (e.g., the degree of a polynomial or the value of \(K\) in KNN) to avoid overfitting. 选择最佳的模型灵活性水平(例如,多项式的次数或 KNN 中的 \(K\) 值),以避免过拟合

As the slides show, training error (the error on the data the model was trained on) consistently decreases as model complexity increases. However, the test error follows a U-shape: it first decreases (as the model learns the true signal) and then increases (as the model starts fitting the noise, or “overfitting”). CV helps find the minimum point of this U-shaped test error curve. 训练误差(模型训练数据的误差)随着模型复杂度的增加而持续下降。然而,测试误差呈现 U 形:它先下降(当模型学习真实信号时),然后上升(当模型开始拟合噪声,即“过拟合”时)。交叉验证有助于找到这条 U 形测试误差曲线的最小值。

Important Images 🖼️

The most important image is on Slide 61.

These two plots perfectly illustrate the concept:

  • Blue Line (Training Error): Always goes down.
  • Brown Line (True Test Error): Forms a “U” shape. This is what we want to find the minimum of, but it’s unknown in practice.
  • Black Line (10-fold CV Error): This is our estimate of the test error. Notice how closely it tracks the brown line. The minimum of the CV curve (marked with an ‘x’) is very close to the minimum of the true test error.

This shows why CV works: it provides a reliable estimate to guide our choice of model (e.g., polynomial degree 3-4 for logistic regression, or \(K \approx 10\) for KNN).

  • 蓝线(训练误差):始终向下。
  • 棕线(真实测试误差):呈“U”形。这正是我们想要找到的最小值,但在实际应用中无法确定。
  • 黑线(10 倍 CV 误差):这是我们对测试误差的估计。注意它与棕线的吻合程度。CV 曲线的最小值(标有“x”)非常接近真实测试误差的最小值。

这说明了 CV 为什么有效:它提供了可靠的估计值来指导我们选择模型(例如,逻辑回归的多项式次数为 3-4,KNN 的 \(K \approx 10\))。

Key Formulas for Classification

For regression, we often use Mean Squared Error (MSE). For classification, the slides introduce the classification error rate.

For Leave-One-Out Cross-Validation (LOOCV), the error for a single observation \(i\) is: \[Err_i = I(y_i \neq \hat{y}_i^{(i)})\]

  • \(y_i\) is the true label for observation \(i\).
  • \(\hat{y}_i^{(i)}\) is the model’s prediction for observation \(i\) when the model was trained on all other observations except \(i\).
  • \(I(\dots)\) is an indicator function: it’s \(1\) if the condition is true (prediction is wrong) and \(0\) if false (prediction is correct).

The total CV error is simply the average of these individual errors, which is the overall fraction of incorrect classifications: \[CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i\] The slides also show examples using Log Loss (Slide 64), which is another common and sensitive metric for classification. The logistic regression model itself is defined by: \[P(Y=1 | X) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 X_1 - \beta_2 X_2 - \dots)}\]

对于回归,我们通常使用均方误差 (MSE)。对于分类,幻灯片介绍了分类错误率

对于留一交叉验证 (LOOCV),单个观测值 \(i\) 的误差为: \[Err_i = I(y_i \neq \hat{y}_i^{(i)})\]

  • \(y_i\) 是观测值 \(i\) 的真实标签。
  • \(\hat{y}_i^{(i)}\) 是模型在除 \(i\) 之外的所有其他观测值上进行训练后,对观测值 \(i\) 的预测。
  • \(I(\dots)\) 是一个指示函数:如果条件为真(预测错误),则为 \(1\);如果条件为假(预测正确),则为 \(0\)。

CV误差只是这些单个误差的平均值,也就是错误分类的总体比例: \[CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i\] 幻灯片还展示了使用对数损失(幻灯片64)的示例,这是另一个常见且敏感的分类指标。逻辑回归模型本身的定义如下: \[P(Y=1 | X) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 X_1 - \beta_2 X_2 - \dots)}\]
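
As a small illustration of this error rate, the sketch below runs LOOCV for a logistic regression on synthetic two-class data (the data-generating rule is an assumption for illustration) and averages the 0/1 indicator errors:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

errs = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    y_hat = model.predict(X[test_idx])
    errs.append(int(y_hat[0] != y[test_idx][0]))   # Err_i = I(y_i != y_hat_i^(i))

cv_error = np.mean(errs)                           # fraction of misclassified points
print(f"LOOCV classification error rate: {cv_error:.3f}")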

Python Code Explained 🐍

The slides provide two key Python examples. Both manually implement K-fold cross-validation to show how it works.

1. KNN Regression (Slide 52) KNN 回归

  • Goal: Find the best n_neighbors (K) for a KNeighborsRegressor. 为 KNeighborsRegressor 找到最佳的 n_neighbors (K)。
  • Logic:
    1. It creates a KFold object to split the data into 10 folds (n_splits=10). 创建一个 KFold 对象,将数据拆分成 10 个折叠(n_splits=10)。
    2. It has an outer loop that iterates through different values of \(K\) (from 1 to 10). 它有一个 外循环,迭代不同的 \(K\) 值(从 1 到 10)。
    3. It has an inner loop that iterates through the 10 folds (for train_index, test_index in kfold.split(X)). 它有一个 内循环,迭代这 10 个折叠(for train_index, test_index in kfold.split(X))。
    4. Inside the inner loop:
      • It trains a KNeighborsRegressor on the 9 training folds (X_train, y_train).
      • It makes predictions on the 1 held-out test fold (X_test).
      • It calculates the mean squared error for that fold and stores it.
      • 在 9 个训练折叠(X_train, y_train)上训练 KNeighborsRegressor
      • 它对保留的那 1 折测试数据 (X_test) 进行预测。
      • 它计算该折的均方误差并存储。
    5. After the inner loop: It averages the 10 error scores (one from each fold) to get the final CV error for that specific \(K\). 对 10 个误差分数(每折一个)求平均值,得到该特定 \(K\) 的最终 CV 误差。
    6. The final plot shows this CV error vs. \(K\), allowing us to pick the \(K\) with the lowest error. 最终图表显示了 CV 误差与 \(K\) 的关系,使我们能够选择误差最小的 \(K\)
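
The sketch below mirrors this outer-loop/inner-loop structure on synthetic data (the data-generating function and the range of \(K\) values are assumptions, not the slides’ exact setup):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
cv_errors = []
for n_k in range(1, 11):                                  # outer loop over K (number of neighbors)
    mse_errors_k = []
    for train_index, test_index in kfold.split(X):        # inner loop over the 10 folds
        knn = KNeighborsRegressor(n_neighbors=n_k)
        knn.fit(X[train_index], y[train_index])
        mse_errors_k.append(mean_squared_error(y[test_index], knn.predict(X[test_index])))
    cv_errors.append(np.mean(mse_errors_k))               # average over folds for this K

best_k = int(np.argmin(cv_errors)) + 1
print("n_neighbors with lowest CV error:", best_k)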

2. Logistic Regression with Polynomials (Slide 64) 使用多项式的逻辑回归

  • Goal: Find the best degree for PolynomialFeatures used with LogisticRegression.
  • Logic: This is very similar to the KNN example but uses a different model and error metric.
    1. It sets up a 10-fold split (kf = KFold(...)).
    2. An outer loop iterates through the degree \(d\) (from 1 to 10).
    3. An inner loop iterates through the 10 folds.
    4. Inside the inner loop:
      • It creates PolynomialFeatures of degree \(d\).
      • It transforms the 9 training folds (X_train) into polynomial features (X_train_poly).
      • It trains a LogisticRegression model on X_train_poly.
      • It transforms the 1 held-out test fold (X_test) using the same polynomial transformer.
      • It calculates the log_loss on the test fold.
    5. After the inner loop: It averages the 10 log_loss scores to get the final CV error for that degree.
    6. The plot shows CV error vs. degree, and the minimum is clearly at degree=3.
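
A compact sketch of this loop is shown below; the synthetic non-linear boundary and the degree range are assumptions, chosen so that a degree above 1 should win:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] - 0.5 + rng.normal(scale=0.3, size=300) > 0).astype(int)

kf = KFold(n_splits=10, shuffle=True, random_state=1)
for d in range(1, 6):                                     # outer loop over polynomial degree
    fold_losses = []
    for train_index, test_index in kf.split(X):           # inner loop over the 10 folds
        poly = PolynomialFeatures(degree=d)
        X_train_poly = poly.fit_transform(X[train_index])
        X_test_poly = poly.transform(X[test_index])       # same transformer for the held-out fold
        clf = LogisticRegression(max_iter=1000).fit(X_train_poly, y[train_index])
        fold_losses.append(log_loss(y[test_index], clf.predict_proba(X_test_poly), labels=[0, 1]))
    print(f"degree={d}: 10-fold CV log loss = {np.mean(fold_losses):.3f}")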

The Bias-Variance Trade-off in CV CV 中的偏差-方差权衡

This is a key theoretical point from Slide 54 that answers the questions on Slide 65. It compares LOOCV (\(K=n\)) with K-fold CV (\(K=5\) or \(10\)). 这是幻灯片 54中的一个关键理论点,它回答了幻灯片 65 中的问题。它比较了 LOOCV(K=n)和 K 倍 CV(K=5 或 10)。

  • LOOCV (K=n):
    • Bias: Very low. The model is trained on \(n-1\) samples, which is almost the full dataset. The resulting error estimate is nearly unbiased for the true test error. 该模型基于 \(n-1\) 个样本进行训练,这几乎是整个数据集。得到的误差估计对于真实测试误差几乎没有偏差。
    • Variance: Very high. You are training \(n\) models that are almost identical to each other (they only differ by one data point). Averaging these highly correlated error estimates doesn’t reduce the variance much, making the CV estimate unstable. 非常。您正在训练 \(n\) 个彼此几乎相同的模型(它们仅相差一个数据点)。对这些高度相关的误差估计求平均值并不能显著降低方差,从而导致 CV 估计不稳定。
  • K-Fold CV (K=5 or 10):
    • Bias: Slightly higher than LOOCV. The models are trained on, for example, 90% of the data. Since they are trained on less data, they might perform slightly worse. This means K-fold CV tends to slightly overestimate the true test error (Slide 66).
    • Variance: Much lower than LOOCV. The 10 models are trained on more different “chunks” of data (they overlap less), so their error estimates are less correlated. Averaging less-correlated estimates significantly reduces the overall variance.

Conclusion: We generally prefer 10-fold CV over LOOCV. It gives a much more stable (low-variance) estimate of the test error, even if it’s slightly more biased (overestimating the error, which is a safe/conservative estimate). 我们通常更喜欢10 倍交叉验证而不是 LOOCV。它能给出更稳定(低方差)的测试误差估计值,即使它的偏差略大(高估了误差,这是一个安全/保守的估计值)。

The Core Problem & Scenarios (Slides 47-51)

These slides use three scenarios to show why we need cross-validation (CV). The goal is to pick the right level of model flexibility (e.g., the degree of a polynomial or the complexity of a spline) to minimize the Test MSE (Mean Squared Error), which we can’t see in real life. 这些幻灯片使用了三种场景来说明为什么我们需要交叉验证 (CV)。目标是选择合适的模型灵活性(例如,多项式的次数或样条函数的复杂度),以最小化测试均方误差(Mean Squared Error),而这在现实生活中是无法观察到的。

  • The Curves (Slide 47): This slide is central.

    • True Test MSE (Blue) 真实测试均方误差(蓝色): This is the real error on new data. It has a U-shape. Error is high for simple models (high bias), drops as the model fits the data, and rises again for overly complex models (high variance, or overfitting). 这是新数据的真实误差。它呈 U 形。对于简单模型(高偏差),误差较高;随着模型拟合数据的深入,误差会下降;对于过于复杂的模型(高方差或过拟合),误差会再次上升。

    • LOOCV (Black Dashed) & 10-Fold CV (Orange) LOOCV(黑色虚线)和 10 倍 CV(橙色): These are our estimates of the true test MSE. Notice how closely they track the blue curve. The ‘x’ marks the minimum of the CV curve, which is our best guess for the model with the minimum test MSE. 这些是我们对真实测试 MSE 的估计。请注意它们与蓝色曲线的吻合程度。“x”标记 CV 曲线的最小值,这是我们对具有最小测试 MSE 的模型的最佳猜测

  • Scenario 1 (Slide 48): The true relationship is non-linear. The right-hand plot shows that the test MSE (red curve) is high for the simple linear model (blue square), but lower for the more flexible smoothing splines (teal squares). CV helps us find the “sweet spot.” 真实的关系是非线性的。右侧图表显示,对于简单的线性模型(蓝色方块),测试 MSE(红色曲线)较高,而对于更灵活的平滑样条函数(蓝绿色方块),测试 MSE 较低。CV 帮助我们找到“最佳点”。

  • Scenario 2 (Slide 49): The true relationship is linear. Here, the test MSE (red curve) is lowest for the simplest model (the linear one, blue square). CV correctly identifies this, and its error estimate (blue square) is lowest for that model. 真实的关系是线性的。在这里,对于最简单的模型(线性模型,蓝色方块),测试 MSE(红色曲线)最低。CV 正确地识别了这一点,并且其误差估计(蓝色方块)是该模型中最低的。

  • Scenario 3 (Slide 50): The true relationship is highly non-linear. The linear model (orange) is a very poor fit. The test MSE (red curve) is minimized by the most flexible model (teal square). CV again finds this. 真实的关系是高度非线性的。线性模型(橙色)拟合度很差。测试 MSE(红色曲线)被最灵活的模型(蓝绿色方块)最小化。CV 再次发现了这一点。

  • Key Takeaway (Slide 51): We use CV to find the tuning parameter (like polynomial degree) that minimizes the test error. We care less about the actual value of the CV error and more about where its minimum is. 我们使用 CV 来找到最小化测试误差的调整参数(例如多项式次数)。我们不太关心 CV 误差的实际值,而更关心它的最小值

CV for Classification (Slides 55-61)

This section shifts from regression (predicting a number, using MSE) to classification (predicting a category, like “blue” or “orange”). 本节从回归(使用 MSE 预测数字)转向分类(预测类别,例如“蓝色”或“橙色”)。

  • New Error Metric (Slide 55): We can’t use MSE. A natural choice is the classification error rate. 我们不能使用 MSE。一个自然的选择是分类错误率
    • \(Err_i = I(y_i \neq \hat{y}_i^{(i)})\)
    • This is an indicator function: it is 1 if the prediction for the \(i\)-th data point (when trained without it) is wrong, and 0 if it’s correct. 如果对第 \(i\) 个数据点的预测(在没有它的情况下训练时)错误,则为 1;如果正确,则为 0
    • The final CV error is just the average of these 0s and 1s, giving the total fraction of misclassified points: \(CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i\) 最终的 CV 误差就是这些 0 和 1 的平均值,即错误分类点的总比例:\(CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i\)
  • The Example (Slides 56-61):
    • Slides 56-58: We are shown a “true” (but unknown) non-linear boundary (purple dashed line) separating two classes. We then try to estimate this boundary using logistic regression with different polynomial degrees (degree 1, 2, 3, 4). 我们看到了一条“真实”(但未知)的非线性边界(紫色虚线),它将两个类别分开。然后,我们尝试使用不同次数(1、2、3、4 次)的逻辑回归来估计这条边界。
    • Slides 59-60: This is a crucial point. In this simulated example, we do know the true test error rates. The true errors are [0.201, 0.197, 0.160, 0.162]. The lowest error is for the 3rd-degree polynomial. But in a real-world problem, we can never know these true errors. 这一点至关重要。在这个模拟示例中,我们确实知道真实的测试错误率。真实误差为 [0.201, 0.197, 0.160, 0.162]。最小误差出现在三次多项式中。但在实际问题中,我们永远无法知道这些真实误差
    • Slide 61 (The Solution): This is the most important image. It shows how CV solves the problem from slide 60.展示了 CV 如何解决幻灯片 60 中的问题。
      • Brown Curve (Test Error): This is the true test error (from slide 59). We can’t see this in practice. Its minimum is at degree 3. 这是真实的测试误差(来自幻灯片 59)。我们在实践中看不到它。它的最小值在 3 次方处。

      • Black Curve (10-fold CV Error): This is what we can calculate. It’s our estimate of the test error. Crucially, its minimum is also at degree 3.

      • 黑色曲线(10 倍 CV 误差):这是我们可以计算出来的。这是我们对测试误差的估计。至关重要的是,它的最小值也在 3 次方处。

      • This proves that CV successfully found the best model (degree 3) without ever seeing the true test error. The same logic is shown for the KNN classifier on the right.

      • 这证明 CV 成功地找到了最佳模型(3 次方),而从未看到真实的测试误差。右侧的 KNN 分类器也显示了相同的逻辑。

Python Code Explained (Slides 52, 63, 64)

The slides show how to manually implement K-fold CV. This is great for understanding, even though libraries like GridSearchCV can do this automatically.

  • KNN Regression (Slide 52):
    1. kfold = KFold(n_splits=10, ...): Creates an object that knows how to split the data into 10 folds.
    2. for n_k in neighbors:: This is the outer loop to test different \(K\) values (e.g., \(K\)=1, 2, 3…).
    3. for train_index, test_index in kfold.split(X):: This is the inner loop. For a single \(K\), it loops 10 times.
    4. Inside the inner loop:
      • It splits the data into a 9-fold training set (X_train) and a 1-fold test set (X_test).
      • It trains a KNeighborsRegressor on X_train.
      • It makes predictions on X_test and calculates the error (mean_squared_error).
    5. cv_errors.append(np.mean(mse_errors_k)): After the inner loop finishes 10 runs, it averages the 10 error scores for that \(K\) and stores it.
    6. The final plot shows cv_errors vs. neighbors, letting you pick the \(K\) with the lowest average error.
  • Logistic Regression Classification (Slides 63-64):
    • This code is almost identical, but with three key differences:
      1. The model is LogisticRegression.
      2. It uses PolynomialFeatures to create new features (\(X^2, X^3,\) etc.) inside the loop.
      3. The error metric is log_loss (a common, more sensitive metric than the simple 0/1 error rate).
    • The plot on slide 64 shows the 10-fold CV error (using Log Loss) vs. the Degree of the Polynomial. The minimum is clearly at Degree = 3, matching the finding from slide 61.

Answering the Key Questions (Slides 54 & 65)

Slide 65 asks two critical questions, which are answered directly by the concepts on Slide 54 (Bias and variance trade-off).

Q1: How does K affect the bias and variance of the CV error?

This refers to \(K\) in K-fold CV (not to be confused with \(K\) in KNN). K 如何影响 CV 误差的偏差和方差?

  • Bias:
    • LOOCV (K = n): This has very low bias. The model is trained on \(n-1\) samples, which is almost the full dataset. So, the error estimate \(CV_{(n)}\) is an almost-unbiased estimate of the true test error. 它的偏差非常低。该模型基于 \(n-1\) 个样本进行训练,这几乎是整个数据集。因此,误差估计 \(CV_{(n)}\) 是对真实测试误差的几乎无偏估计。

    • K-Fold (K < n, e.g., K=10): This has slightly higher bias. The models are trained on, for example, 90% of the data. Because they are trained on less data, they might perform slightly worse than a model trained on 100% of the data. This “pessimism” is the source of the bias. 偏差略高。例如,这些模型是基于 90% 的数据进行训练的。由于它们基于较少的数据进行训练,因此它们的性能可能会比基于 100% 数据进行训练的模型略差。这种“悲观”正是偏差的根源。

  • Variance:
    • LOOCV (K = n): This has very high variance. You are training \(n\) models that are almost identical (they only differ by one data point). Averaging \(n\) highly-correlated error estimates doesn’t reduce the variance much. This makes the final \(CV_{(n)}\) estimate unstable. 这种模型的方差非常高。您正在训练 \(n\) 个几乎相同的模型(它们只有一个数据点不同)。对 \(n\) 个高度相关的误差估计取平均值并不能显著降低方差。这使得最终的 \(CV_{(n)}\) 估计值不稳定。

    • K-Fold (K < n, e.g., K=10): This has much lower variance. The 10 models are trained on more different “chunks” of data (they overlap less). Their error estimates are less correlated, and averaging 10 less-correlated numbers gives a much more stable (low-variance) final estimate. 这种方法的方差低得多。这 10 个模型基于更多不同的数据“块”进行训练(它们重叠较少)。它们的误差估计值相关性较低,对 10 个相关性较低的数取平均值可以得到更稳定(低方差)的最终估计值。

Conclusion (The Trade-off): We prefer K-fold CV (K=5 or 10) over LOOCV. It gives a much more stable (low-variance) estimate, and we are willing to accept a tiny increase in bias to get it. 我们更喜欢 K 倍交叉验证(K=5 或 10),而不是 LOOCV(留一交叉验证)。它能给出更稳定(低方差)的估计值,并且我们愿意接受偏差的轻微增加来获得它。

Q2: Does Cross Validation over-estimate or under-estimate the true test error?

交叉验证会高估还是低估真实测试误差?

Based on the bias discussion above:

Cross-validation (especially K-fold) generally over-estimates the true test error. 交叉验证(尤其是 K 倍交叉验证)通常会高估真实测试误差

Reasoning:

  1. The “true test error” is the error of a model trained on the entire dataset (\(n\) samples).
  2. K-fold CV trains its models on subsets of the data (e.g., \(n \times (K-1)/K\) samples).
  3. Since these models are trained on less data, they are (on average) slightly worse than the final model trained on all the data.
  4. Because the CV models are slightly worse, their error rates will be slightly higher.
  5. Therefore, the final CV error score is a slightly “pessimistic” or high estimate. This is considered a good thing, as it’s a conservative estimate of how our model will perform.

理由:

  1. “真实测试误差”是指在整个数据集(\(n\) 个样本)上训练的模型的误差。
  2. K 折交叉验证 (K-fold CV) 在数据子集上训练其模型(例如,\(n \times (K-1)/K\) 个样本)。
  3. 由于这些模型基于较少的数据进行训练,因此它们(平均而言)比基于所有数据训练的最终模型略差。
  4. 由于 CV 模型略差,其错误率会略高。
  5. 因此,最终的 CV 错误率是一个略微“悲观”或偏高的估计。这被认为是一件好事,因为它是对模型性能的保守估计。

6. Summary of Bootstrap

Bootstrap is a resampling technique used to estimate the uncertainty (like standard error or confidence intervals) of a statistic. Its key idea is to treat your original data sample as a proxy for the true population. It then simulates the process of drawing new samples by instead sampling with replacement from your original sample. Bootstrap 是一种重采样技术,用于估计统计数据的不确定性(例如标准误差或置信区间)。其核心思想是将原始数据样本视为真实总体的替代样本。然后,它通过从原始样本中进行有放回的抽样来模拟抽取新样本的过程。

The Problem

You have a single data sample (e.g., \(n=100\) people) and you calculate a statistic, like the sample mean (\(\bar{x}\)) or a regression coefficient (\(\hat{\beta}\)). You want to know how accurate this statistic is. How much would it vary if you could repeat your experiment many times? This variation is measured by the standard error (SE). 您有一个数据样本(例如,\(n=100\) 人),并计算一个统计数据,例如样本均值 (\(\bar{x}\)) 或回归系数 (\(\hat{\beta}\))。您想知道这个统计数据的准确度。如果可以多次重复实验,它会有多少变化?这种变化可以用标准误差 (SE) 来衡量。

The Bootstrap Solution

Since you can’t re-run the whole experiment, you simulate it using the one sample you have. 由于您无法重新运行整个实验,因此您可以使用现有的一个样本进行“模拟”。

The Process:

  1. Original Sample (\(Z\)) 原始样本 (\(Z\)): You have your one dataset with \(n\) observations.
  2. Bootstrap Sample (\(Z^{*1}\)) Bootstrap 样本 (\(Z^{*1}\)): Create a new dataset of size \(n\) by randomly pulling observations from your original sample with replacement. (This means some original observations will be picked multiple times, and some not at all).
  3. Calculate Statistic (\(\hat{\theta}^{*1}\)) 计算统计量 (\(\hat{\theta}^{*1}\)): Calculate your statistic of interest (e.g., the mean, \(\hat{\alpha}\), regression coefficients) on this new bootstrap sample.
  4. Repeat 重复: Repeat steps 2 and 3 a large number of times (\(B\), e.g., \(B=1000\)). This gives you \(B\) bootstrap statistics: \(\hat{\theta}^{*1}, \hat{\theta}^{*2}, ..., \hat{\theta}^{*B}\).
  5. Analyze the Bootstrap Distribution 分析自举分布: This collection of \(B\) statistics is your “bootstrap distribution.”
    • Standard Error 标准误差: The standard deviation of this bootstrap distribution is your estimate of the standard error of your original statistic.
    • Confidence Interval 置信区间: A 95% confidence interval can be found by taking the 2.5th and 97.5th percentiles of this bootstrap distribution.

Why use it? It’s powerful because it doesn’t rely on strong theoretical assumptions (like data being normally distributed). It can be applied to almost any statistic, even very complex ones (like the prediction from a KNN model), for which a simple mathematical formula for standard error doesn’t exist. 它非常强大,因为它不依赖于严格的理论假设(例如数据服从正态分布)。它几乎可以应用于任何统计数据,即使是非常复杂的统计数据(例如 KNN 模型的预测),因为这些统计数据没有简单的标准误差数学公式。

Mathematical Understanding

The core idea is to use the empirical distribution (your sample) as an estimate for the true population distribution. 其核心思想是使用经验分布(你的样本)来估计真实的总体分布

Example: Estimating \(\alpha\)

Your slides provide an example of finding the \(\alpha\) that minimizes the variance of a portfolio, \(var(\alpha X + (1-\alpha)Y)\). 用于计算使投资组合方差最小化的 \(\alpha\),即 \(var(\alpha X + (1-\alpha)Y)\)

  1. True Population Parameter (\(\alpha\)) 真实总体参数 (\(\alpha\)): The true \(\alpha\) is a function of the population variances and covariance: 真实 \(\alpha\)总体方差和协方差的函数: \[\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}\] We can never know this value exactly unless we know the entire population. 除非我们了解整个总体,否则我们永远无法准确知道这个值。

  2. Sample Statistic (\(\hat{\alpha}\)) 样本统计量 (\(\hat{\alpha}\)): We estimate \(\alpha\) using our sample, creating the statistic \(\hat{\alpha}\) by plugging in our sample variances and covariance: 我们使用样本估计 \(\alpha\),通过代入样本方差和协方差来创建统计量 \(\hat{\alpha}\)\[\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}\] This \(\hat{\alpha}\) is just one number from our single sample. How confident are we in it? We need its standard error, \(SE(\hat{\alpha})\). 这个 \(\hat{\alpha}\) 只是我们单个样本中的一个数字。我们对它的置信度有多高?我们需要它的标准误差,\(SE(\hat{\alpha})\)

  3. Bootstrap Statistic (\(\hat{\alpha}^*\)) 自举统计量 (\(\hat{\alpha}^*\)): We apply the bootstrap process:

    • Create a bootstrap sample (by resampling with replacement). 创建一个自举样本(通过放回重采样)。
    • Calculate \(\hat{\alpha}^*\) using the sample (co)variances of this new bootstrap sample. 使用这个新自举样本的样本(协)方差计算 \(\hat{\alpha}^*\)
    • Repeat \(B\) times to get \(B\) values: \(\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, ..., \hat{\alpha}^{*B}\). 重复 \(B\) 次,得到 \(B\) 个值:\(\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, ..., \hat{\alpha}^{*B}\)
  4. Estimating the Standard Error 估算标准误差: The standard error of our original estimate \(\hat{\alpha}\) is estimated by the standard deviation of all our bootstrap estimates: 我们原始估计值 \(\hat{\alpha}\) 的标准误差是通过所有自举估计值的标准差来“估算”的: \[SE_{boot}(\hat{\alpha}) = \sqrt{\frac{1}{B-1} \sum_{j=1}^{B} (\hat{\alpha}^{*j} - \bar{\alpha}^*)^2}\] where \(\bar{\alpha}^*\) is the average of all \(B\) bootstrap estimates. \(\bar{\alpha}^*\) 是所有 \(B\) 个自举估计值的平均值。
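
The sketch below runs this recipe end to end for \(\hat{\alpha}\) in Python; the covariance matrix of the simulated returns and the sample size are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
# One "original" sample of returns for two assets X and Y
cov = np.array([[1.0, 0.5],
                [0.5, 1.25]])
returns = rng.multivariate_normal(mean=[0, 0], cov=cov, size=100)

def alpha_hat(sample):
    # Plug-in estimate: (var(Y) - cov(X,Y)) / (var(X) + var(Y) - 2 cov(X,Y))
    S = np.cov(sample, rowvar=False)
    return (S[1, 1] - S[0, 1]) / (S[0, 0] + S[1, 1] - 2 * S[0, 1])

B = 1000
n = returns.shape[0]
alpha_boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)          # resample row indices with replacement
    alpha_boot[b] = alpha_hat(returns[idx])

print("alpha_hat on the original sample:", alpha_hat(returns))
print("bootstrap SE:", alpha_boot.std(ddof=1))
print("95% percentile interval:", np.percentile(alpha_boot, [2.5, 97.5]))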

The slides (p. 73, 77-78) show this visually. The “sampling from population” histogram (left) is the true sampling distribution, which we can only create in a simulation. The “Bootstrap” histogram (right) is the bootstrap distribution created from one sample. They look very similar, which shows the method works. “从总体抽样”直方图(左图)是真实的抽样分布,我们只能在模拟中创建它。“自举”直方图(右图)是从一个样本创建的自举分布。它们看起来非常相似,这表明该方法有效。

Code Analysis

R: \(\alpha\) Example (Slides 75 & 77)

  • Slide 75 (The R code): This is a SIMULATION, not Bootstrap.
    • for(i in 1:m){...}: This loop runs m=1000 times.
    • returns <- rmvnorm(...): Inside the loop, it draws a brand new sample from the true population every time.
    • alpha[i] <- ...: It calculates \(\hat{\alpha}\) for each new sample.
    • Purpose: This code shows the true sampling distribution of \(\hat{\alpha}\) (the “Histogram of alpha”). You can only do this if you know the true population, as in a simulation.
  • Slide 77 (The R code): This IS Bootstrap.
    • returns <- rmvnorm(...): Outside the loop, this is done only once to get one original sample.
    • for(i in 1:B){...}: This is the bootstrap loop.
    • sample(1:nrow(returns), n, replace = T): This is the key line. It randomly selects row numbers with replacement from the single returns dataset.
    • returns_boot <- returns[sample(...), ]: This creates the bootstrap sample.
    • alpha_bootstrap[i] <- ...: It calculates \(\hat{\alpha}^*\) on the returns_boot sample.
    • Purpose: This code generates the bootstrap distribution (the “Bootstrap” histogram on slide 78) to estimate the true sampling distribution.

R: Linear Regression Example (Slides 79 & 81)

  • Slide 79:
    • boot.fn <- function(data, index){ ... }: Defines a function that the boot package needs. It takes data and an index vector.
    • lm(mpg~horsepower, data=data, subset=index): This is the core. It fits a linear model only on the data points specified by the index. The boot function will automatically supply this index as a resampled-with-replacement vector.
    • boot(Auto, boot.fn, R=1000): This runs the bootstrap. It calls boot.fn 1000 times, each time with a new resampled index, and collects the coefficients.
  • Slide 81:
    • summary(lm(...)): Shows the standard output. The “Std. Error” column (e.g., 0.860, 0.006) is calculated using mathematical theory.
    • boot.res: Shows the bootstrap output. The “std. error” column (e.g., 0.841, 0.007) is the standard deviation of the 1000 bootstrap estimates.
    • Main Point: The standard errors from the bootstrap are very close to the theoretical ones. This confirms the uncertainty. If the model assumptions were violated, the bootstrap SE would be more trustworthy.
    • The histograms show the bootstrap distributions for the intercept (t1*) and the slope (t2*). The arrows show the 95% percentile confidence interval.
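
For readers working in Python, a rough analogue of boot.fn and boot is sketched below; the synthetic mpg/horsepower data and the helper name boot_fn are assumptions, not the slides’ R code:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for the Auto data (mpg vs. horsepower)
horsepower = rng.uniform(50, 220, size=392)
mpg = 40 - 0.15 * horsepower + rng.normal(scale=3, size=392)
X, y = horsepower.reshape(-1, 1), mpg

def boot_fn(X, y, index):
    # Fit the regression only on the resampled rows, like boot.fn in the R slides
    fit = LinearRegression().fit(X[index], y[index])
    return np.r_[fit.intercept_, fit.coef_]

B = 1000
coefs = np.array([boot_fn(X, y, rng.integers(0, len(y), size=len(y))) for _ in range(B)])
print("bootstrap SEs (intercept, slope):", coefs.std(axis=0, ddof=1))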

Python: KNN Regression Example (Slide 80)

This shows how to get a confidence interval for a single prediction.

  • for i in range(n_bootstraps):: The bootstrap loop.
  • indices = np.random.choice(train_samples.shape[0], train_samples.shape[0], replace=True): This is the key line in Python (like sample in R). It gets a new set of indices with replacement.
  • X_boot, y_boot = ...: Creates the bootstrap sample.
  • model.fit(X_boot, y_boot): A new KNN model is trained on this bootstrap sample.
  • bootstrap_preds.append(model.predict(predict_point)): The model (trained on \(Z^{*i}\)) makes a prediction for the same fixed point. This is repeated 1000 times.
  • Result: You get a distribution of predictions for that one point. The 2.5th and 97.5th percentiles of this distribution give you a 95% confidence interval for that specific prediction. 你会得到该点的预测分布。该分布的 2.5 和 97.5 百分位数为该特定预测提供了 95% 的置信区间。

Python: KNN on Auto data (Slide 82)

  • BE CAREFUL: This slide does NOT show Bootstrap. It shows K-Fold Cross-Validation (CV).
  • Purpose: The goal here is not to find uncertainty. The goal is to find the best hyperparameter (the best value for \(k\), the number of neighbors).
  • Method:
    • kf = KFold(n_splits=10): Splits the data into 10 chunks (“folds”).
    • for train_index, test_index in kf.split(X):: It loops 10 times. Each time, it trains on 9 chunks and tests on 1 chunk.
  • Key Difference for Exam:
    • Bootstrap: Samples with replacement to estimate uncertainty/standard error.
    • Cross-Validation: Splits data without replacement into \(K\) folds to estimate model performance/prediction error and tune hyperparameters.
    • 自举法:使用有放回的样本来估计不确定性/标准误差
    • 交叉验证:将数据无放回地分成 \(K\) 份,以估计模型性能/预测误差并调整超参数。

7. The mathematical theory of Bootstrap and the extension to Cross-Validation (CV).

1. Code Analysis: Bootstrap for a KNN Prediction (Slide 85)

This Python code shows a different use of bootstrap: finding the confidence interval for a single prediction, not for a model coefficient.

  • Goal: To estimate the uncertainty of a KNN model’s prediction for a specific new data point (predict_point).
  • Process:
    1. Train Full Model: A KNN model (knn) is first trained on the entire dataset. It makes one prediction (knpred) for predict_point. This is our \(\hat{f}(x_0)\).
    2. Bootstrap Loop (for i in range(n_bootstraps)):
      • indices = np.random.choice(...): This is the core bootstrap step. It creates a new list of indices by sampling with replacement from the original data.
      • X_boot, y_boot = ...: This creates the new bootstrap dataset (\(Z^{*i}\)).
      • km.fit(X_boot, y_boot): A new KNN model (km) is trained only on this bootstrap sample.
      • bootstrap_preds.append(km.predict(predict_point)): This newly trained model makes a prediction for the same predict_point. This value is \(\hat{f}^{*i}(x_0)\).
    3. Analyze Distribution: After 1000 loops, bootstrap_preds contains 1000 different predictions for the same point.
    4. Confidence Interval:
      • np.percentile(bootstrap_preds, [2.5, 97.5]): This finds the 2.5th and 97.5th percentiles of the 1000 bootstrap predictions.
      • The resulting [lower_bound, upper_bound] (e.g., [13.70, 15.70]) forms the 95% confidence interval for the prediction.
  • Histogram Plot: The plot on the right visually confirms this. It shows the distribution of the 1000 bootstrap predictions, with the 95% confidence interval marked by the red dashed lines.
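
A condensed version of that loop might look like the sketch below (synthetic data, a fixed n_neighbors=10, and the query point x=5 are illustrative assumptions):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = X[:, 0] ** 1.5 + rng.normal(scale=1.0, size=200)
predict_point = np.array([[5.0]])             # the fixed point we want a CI for

n_bootstraps = 1000
bootstrap_preds = np.empty(n_bootstraps)
for i in range(n_bootstraps):
    indices = rng.integers(0, X.shape[0], size=X.shape[0])   # resample with replacement
    km = KNeighborsRegressor(n_neighbors=10).fit(X[indices], y[indices])
    bootstrap_preds[i] = km.predict(predict_point)[0]

lower, upper = np.percentile(bootstrap_preds, [2.5, 97.5])
print(f"95% CI for the prediction at x=5: [{lower:.2f}, {upper:.2f}]")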

2. Mathematical Understanding: Why Does Bootstrap Work? (Slides 87-88)

This is the theoretical justification for the entire method. It’s based on an analogy. 这是整个方法的理论依据。它基于一个类比。

The “True” World (Slide 87, Top)

  • Population: There is a true, unknown population distribution \(F\). 存在一个真实的、未知的总体分布 \(F\)

  • Parameter: We want to know a true parameter, \(\theta\), which is a function of \(F\) (e.g., the true population mean). 我们想知道一个真实的参数 \(\theta\),它是 \(F\) 的函数(例如,真实的总体均值)。

  • Sample: We get one sample \(X_1, ..., X_n\) from \(F\). 我们从 \(F\) 中获取一个样本 \(X_1, ..., X_n\)

  • Statistic: We calculate our best estimate \(\hat{\theta}\) from our sample. (e.g., the sample mean \(\bar{x}\)). \(\hat{\theta}\) is our proxy for \(\theta\). 我们从样本中计算出最佳估计值 \(\hat{\theta}\)。(例如,样本均值 \(\bar{x}\))。\(\hat{\theta}\)\(\theta\) 的替代值。

  • The Problem: We want to know the accuracy of \(\hat{\theta}\). How much would \(\hat{\theta}\) vary if we could draw many samples? We want the sampling distribution of \(\hat{\theta}\) around \(\theta\), specifically the distribution of the error: \((\hat{\theta} - \theta)\). 我们想知道 \(\hat{\theta}\) 的准确率。如果我们可以抽取多个样本,\(\hat{\theta}\) 会有多少变化?我们想要 \(\hat{\theta}\) 围绕 \(\theta\)抽样分布,具体来说是误差的分布:\((\hat{\theta} - \theta)\)

  • CLT: The Central Limit Theorem states that \(\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{\text{dist}} N(0, Var_F(\theta))\).

  • 中心极限定理\(\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{\text{dist}} N(0, Var_F(\theta))\)

  • The Catch: This is UNKNOWN because we don’t know \(F\).这是未知的,因为我们不知道 \(F\)

The “Bootstrap” World (Slide 87, Bottom)

  • Population: We pretend our original sample is the population. We call its distribution the “empirical distribution,” \(\hat{F}_n\). 我们假设原始样本就是总体。我们称其分布为“经验分布”,即 \(\hat{F}_n\)
  • Parameter: In this new world, the “true” parameter is our original statistic, \(\hat{\theta}\) (which is a function of \(\hat{F}_n\)). 在这个新世界中,“真实”参数是我们原始的统计量 \(\hat{\theta}\)(它是 \(\hat{F}_n\) 的函数)。
  • Sample: We draw many bootstrap samples \(X_1^*, ..., X_n^*\) from \(\hat{F}_n\) (i.e., sampling with replacement from our original sample). 我们从 \(\hat{F}_n\) 中抽取许多自举样本 \(X_1^*, ..., X_n^*\)(即从原始样本中进行有放回抽样)。
  • Statistic: From each bootstrap sample, we calculate a bootstrap statistic, \(\hat{\theta}^*\). 从每个自举样本中,我们计算一个 自举统计量,即 \(\hat{\theta}^*\)
  • The Solution: We can now empirically find the distribution of \(\hat{\theta}^*\) around \(\hat{\theta}\). We look at the distribution of the bootstrap error: \((\hat{\theta}^* - \hat{\theta})\). 我们现在可以 凭经验 找到 \(\hat{\theta}^*\) 围绕 \(\hat{\theta}\) 的分布。我们来看看自举误差的分布:\((\hat{\theta}^* - \hat{\theta})\)
  • CLT: The CLT also states that \(\sqrt{n}(\hat{\theta}^* - \hat{\theta}) \xrightarrow{\text{dist}} N(0, Var_{\hat{F}_n}(\theta))\).
  • The Power: This distribution is ESTIMABLE! We just run the bootstrap \(B\) times and we get \(B\) values of \(\hat{\theta}^*\). We can then calculate their variance, standard deviation, and percentiles directly. 这个分布是可估计的!我们只需运行 \(B\) 次自举程序,就能得到 \(B\)\(\hat{\theta}^*\) 值。然后我们可以直接计算它们的方差、标准差和百分位数。

The Core Approximation (Slide 88)

The entire method relies on the assumption that the (knowable) bootstrap distribution is a good approximation of the (unknown) true sampling distribution. 整个方法依赖于以下假设:(已知的)自举分布能够很好地近似(未知的)真实抽样分布

The distribution of the bootstrap error approximates the distribution of the true error. 自举误差的分布近似于真实误差的分布。

\[\text{distribution of } \sqrt{n}(\hat{\theta}^* - \hat{\theta}) \approx \text{distribution of } \sqrt{n}(\hat{\theta} - \theta)\]

This is why:

  • The standard deviation of the \(\hat{\theta}^*\) values is our estimate for the standard error of \(\hat{\theta}\). \(\hat{\theta}^*\) 值的标准差是我们对 \(\hat{\theta}\) 的标准误差的估计值。
  • The percentiles of the \(\hat{\theta}^*\) distribution (e.g., 2.5th and 97.5th) can be used to build a confidence interval for the true parameter \(\theta\). \(\hat{\theta}^*\) 分布的百分位数(例如,第 2.5 个和第 97.5 个)可用于为真实参数 \(\theta\) 建立置信区间。

3. Extension: Cross-Validation (CV) Analysis

CV for Hyperparameter Tuning (Slide 84) 超参数调优的 CV

This plot is the result of the 10-fold CV code shown in the previous set of slides (slide 82).

  • Purpose: To find the optimal hyperparameter \(k\) (number of neighbors) for the KNN model.
  • X-axis: Number of Neighbors (\(k\)).
  • Y-axis: CV Error (Mean Squared Error).
  • Analysis:
    • Low \(k\) (e.g., \(k=1, 2\)): High error. The model is too complex and overfitting to the training data.
    • High \(k\) (e.g., \(k>40\)): Error slowly increases. The model is too simple and underfitting (e.g., averaging too many neighbors).
    • Optimal \(k\): The “sweet spot” is at the bottom of the “U” shape, around \(k \approx 20-30\), which gives the lowest CV error.

  • 目的:为 KNN 模型找到最优超参数 \(k\)(邻居数)。
  • X 轴:邻居数 (\(k\))。
  • Y 轴:CV 误差(均方误差)。
  • 分析:
  • \(k\)(例如,\(k=1, 2\)):误差较大。模型过于复杂,并且与训练数据过拟合
  • \(k\)(例如,\(k>40\)):误差缓慢增加。模型过于简单且欠拟合(例如,对太多邻居进行平均)。
  • 最优 \(k\)“最佳点”位于“U”形的底部,大约为\(k \approx 20-30\),此时 CV 误差最低。

Why CV Over-Estimates Test Error (Slide 89)

This is a subtle but important theoretical point.

  • Our Goal: We want to know the test error of our final model (\(\hat{f}^{\text{full}}\)), which we will train on the full dataset (all \(n\) observations). 我们想知道最终模型 (\(\hat{f}^{\text{full}}\)) 的测试误差,我们将在完整数据集(所有 \(n\) 个观测值)上训练该模型。
  • What CV Measures: \(k\)-fold CV does not test the final model. It tests \(k\) different models (\(\hat{f}^{(k)}\)), each trained on a smaller dataset (of size \(\frac{k-1}{k} \times n\)). \(k\) 倍 CV 不测试最终模型。它测试了 \(k\) 个不同的模型 (\(\hat{f}^{(k)}\)),每个模型都基于一个较小的数据集(大小为 \(\frac{k-1}{k} \times n\))进行训练。

  • The Logic:
    1. Models trained on less data generally perform worse than models trained on more data. 基于较少数据训练的模型通常比基于较多数据训练的模型表现更差
    2. The CV error is the average error of models trained on \(\frac{k-1}{k} n\) observations. CV 误差是使用 \(\frac{k-1}{k} n\) 个观测值训练的模型的平均误差。
    3. The “true test error” is the error of the model trained on \(n\) observations. “真实测试误差”是使用 \(n\) 个观测值训练的模型的误差。
  • Conclusion: Since the CV models are trained on smaller datasets, they will, on average, have a slightly higher error than the final model. Therefore, the CV error score is a slightly pessimistic estimate (it over-estimates) the true test error of the final model. 由于 CV 模型是在较小的数据集上训练的,因此它们的平均误差会略高于最终模型。因此,CV 误差分数是一个略微悲观的估计(它高估了)最终模型的真实测试误差。

Correction of CV Error (Slides 90-91)

  • Theory (Slide 91): Advanced theory suggests the expected test error \(R(n)\) behaves like \(R(n) = R^* + c/n\), where \(R^*\) is the irreducible error and \(n\) is the sample size. This formula mathematically confirms that error decreases as sample size \(n\) increases. 高级理论表明,预期测试误差 \(R(n)\) 的行为类似于 \(R(n) = R^* + c/n\),其中 \(R^*\) 是不可约误差,\(n\) 是样本量。该公式从数学上证实了误差会随着样本量 \(n\) 的增加而减小

  • R Code (Slide 90): The cv.glm function from the boot library automatically provides this.

    • cv.err$delta: This output vector contains two values.
    • [1] 24.23151 (Raw CV Error): This is the standard Leave-One-Out CV (LOOCV) error.
    • [2] 24.23114 (Adjusted CV Error): This is a bias-corrected estimate that accounts for the overestimation problem. It’s slightly lower, representing a more accurate guess for the error of the final model trained on all \(n\) data points.

# The “Correction of CV Error” extension.

Summary

This section provides a deeper mathematical look at why k-fold cross-validation (CV) slightly over-estimates the true test error. 本节从数学角度更深入地阐述了 为什么 k 折交叉验证 (CV) 会略微高估真实测试误差。

  1. The Overestimation 高估: CV trains on \(\frac{k-1}{k}\) of the data, which is less than the full dataset (size \(n\)). Models trained on less data are generally worse. Therefore, the average error from CV (\(CV_k\)) is slightly higher (more pessimistic) than the true error of the final model trained on all \(n\) data (\(R(n)\)). CV 训练的数据为 \(\frac{k-1}{k}\),小于完整数据集(大小为 \(n\))。使用较少数据训练的模型通常更差。因此,CV 的平均误差 (\(CV_k\)) 略高于(更悲观地)基于所有 \(n\) 个数据训练的最终模型的真实误差 (\(R(n)\))。

  2. A Simple Correction 简单修正: A mathematical formula, \(\tilde{CV_k} = \frac{k-1}{k} \cdot CV_k\), is proposed to “correct” this overestimation.

  3. The Critical Flaw 关键缺陷: This correction is derived assuming the irreducible error (\(R^*\)) is zero.此修正是在假设不可约误差 (\(R^*\)) 为零的情况下得出的。

  4. The Takeaway 要点 (Code Analysis): The Python code demonstrates a real-world scenario where there is noise (noise_std = 0.5), meaning \(R^* > 0\). In this case, the simple correction fails—it produces an error (0.217) that is less accurate and further from the true error (0.272) than the original raw CV error (0.271).

Python 代码演示了一个存在噪声(noise_std = 0.5)的真实场景,即 \(R^* > 0\)。在这种情况下,简单修正失败——它产生的误差 (0.217) 精度较低,并且与真实误差 (0.272) 的距离比原始 CV 误差 (0.271) 更远。

Exam Conclusion: For most real-world problems (which have noise), the raw \(k\)-fold CV error is a better and more reliable estimate of the true test error than the simple (and flawed) correction. 对于大多数实际问题(包含噪声),原始 \(k\) 倍 CV 误差比简单(且有缺陷的)修正方法更能准确、可靠地估计真实测试误差

Mathematical Understanding

This section explains the theory of why \(CV_k > R(n)\) and derives the simple correction. 本节解释了为什么 \(CV_k > R(n)\),并推导出简单的修正方法。

  1. Assumed Error Behavior 假设误差行为: We assume the test error \(R(n)\) for a model trained on \(n\) data points behaves like: 我们假设基于 \(n\) 个数据点训练的模型的测试误差 \(R(n)\) 的行为如下: \[R(n) = R^* + \frac{c}{n}\]

    • \(R^*\): The irreducible error (the “noise floor” you can never beat). 不可约误差(即你永远无法克服的“本底噪声”)。
    • \(c/n\): The model variance, which decreases as sample size \(n\) increases. 模型方差,随着样本量 \(n\) 的增加而减小。
  2. Test Error vs. CV Error 测试误差 vs. CV 误差:

    • Test Error of Interest: This is the error of our final model trained on all \(n\) points: \[R(n) = R^* + \frac{c}{n}\]
    • 感兴趣的测试误差:这是我们在所有 \(n\) 个点上训练的最终模型的误差:
    • k-fold CV Error: This is the average error of \(k\) models, each trained on a smaller sample of size \(n' = (\frac{k-1}{k})n\).
    • k 倍 CV 误差:这是 \(k\) 个模型的平均误差,每个模型都使用一个较小的样本(大小为 \(n' = (\frac{k-1}{k})n\))进行训练。 \[CV_k \approx R(n') = R\left(\frac{k-1}{k}n\right) = R^* + \frac{c}{\left(\frac{k-1}{k}\right)n} = R^* + \frac{ck}{(k-1)n}\]
  3. The Overestimation 高估: Let’s compare \(CV_k\) and \(R(n)\): \[CV_k \approx R^* + \left(\frac{k}{k-1}\right) \frac{c}{n}\] \[R(n) = R^* + \left(\frac{k-1}{k-1}\right) \frac{c}{n}\] Since \(k > (k-1)\), the factor \(\left(\frac{k}{k-1}\right)\) is greater than 1. This means the \(CV_k\) error term is larger than the \(R(n)\) error term. Thus: \(CV_k > \text{Test error of interest } R(n)\) 由于 \(k > (k-1)\),因子 \(\left(\frac{k}{k-1}\right)\) 大于 1。这意味着 \(CV_k\) 误差项大于 \(R(n)\) 误差项。因此: \(CV_k > \text{目标测试误差 } R(n)\)

  4. Deriving the (Flawed) Correction 推导(有缺陷的)修正: This correction makes a strong assumption: \(R^* \approx 0\) (the model is perfectly specified, and there is no noise). 此修正基于一个强假设:\(R^* \approx 0\)(模型完全正确,且无噪声)。

    • If \(R^* = 0\), then \(R(n) \approx \frac{c}{n}\)
    • If \(R^* = 0\), then \(CV_k \approx \frac{ck}{(k-1)n}\)

    Now, look at the ratio between them: \[\frac{R(n)}{CV_k} \approx \frac{c/n}{ck/((k-1)n)} = \frac{c}{n} \cdot \frac{(k-1)n}{ck} = \frac{k-1}{k}\]

    This gives us the correction formula by isolating \(R(n)\): 通过分离 \(R(n)\),我们得到了校正公式: \[R(n) \approx \left(\frac{k-1}{k}\right) \cdot CV_k\] This corrected version is denoted \(\tilde{CV_k}\).这个校正版本表示为 \(\tilde{CV_k}\)

Code Analysis (Slides 92-93)

The Python code is an experiment designed to test the correction formula.

  • Goal: Compare the “Raw CV Error” (\(CV_k\)), the “Corrected CV Error” (\(\tilde{CV_k}\)), and the “True Test Error” (\(R(n)\)) in a realistic setting.

  • Key Setup:

    1. def f(x): Defines the true, underlying function \(y = x^2 + 15\sin(x)\).
    2. noise_std = 0.5: This is the most important line. It adds significant random noise to the data. This ensures that the irreducible error \(R^*\) is large and \(R^* > 0\).
    3. y = f(...) + np.random.normal(...): Creates the noisy training data (the blue dots).
  • CV Calculation (Standard K-Fold):

    • kf = KFold(...): Sets up 5-fold CV (\(k=5\)).
    • for train_index, val_index in kf.split(x):: This is the standard loop. It trains on 4 folds and validates on 1 fold.
    • cv_error = np.mean(cv_mse_list): Calculates the raw \(CV_5\) error. This is the first result (e.g., 0.2715).
  • Correction Calculation:

    • correction_factor = (k_splits - 1) / k_splits: This is \(\frac{k-1}{k}\), which is \(4/5 = 0.8\).
    • corrected_cv_error = correction_factor * cv_error: This applies the flawed formula from the math section (\(0.2715 \times 0.8\)). This is the second result (e.g., 0.2172).
  • “True” Test Error Calculation:

    • knn.fit(x, y): Trains the final model on the entire noisy dataset.
    • n_test = 1000: Creates a new, large test set to estimate the true error.
    • true_test_error = mean_squared_error(...): Calculates the error of the final model on this new test set. This is our best estimate of \(R(n)\) (e.g., 0.2725).
  • Analysis of Results (Slide 93):

    • Raw 5-Fold CV MSE: 0.2715
    • True test error: 0.2725
    • Corrected 5-Fold CV MSE: 0.2172

    The Raw CV Error (0.2715) is an excellent estimate of the True Test Error (0.2725). The Corrected Error (0.2172) is much worse. This experiment proves that when noise (\(R^*\)) is present, the simple correction formula should not be used.
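
A self-contained sketch of this experiment is given below. It follows the setup described above (f(x) = x² + 15·sin(x), noise_std = 0.5, 5-fold CV); the sample sizes, the x-range, and the choice of n_neighbors are assumptions, so the printed numbers will differ from the slides, but the run should show the same qualitative pattern:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def f(x):
    return x ** 2 + 15 * np.sin(x)

rng = np.random.default_rng(0)
noise_std = 0.5
n = 200
x = rng.uniform(-3, 3, size=(n, 1))
y = f(x[:, 0]) + rng.normal(scale=noise_std, size=n)

# Raw 5-fold CV error
k_splits = 5
kf = KFold(n_splits=k_splits, shuffle=True, random_state=0)
cv_mse_list = []
for train_index, val_index in kf.split(x):
    knn = KNeighborsRegressor(n_neighbors=10).fit(x[train_index], y[train_index])
    cv_mse_list.append(mean_squared_error(y[val_index], knn.predict(x[val_index])))
cv_error = np.mean(cv_mse_list)

# The (flawed) correction: multiply by (k-1)/k
corrected_cv_error = (k_splits - 1) / k_splits * cv_error

# "True" test error: final model on all data, evaluated on a fresh noisy test set
knn = KNeighborsRegressor(n_neighbors=10).fit(x, y)
x_test = rng.uniform(-3, 3, size=(1000, 1))
y_test = f(x_test[:, 0]) + rng.normal(scale=noise_std, size=1000)
true_test_error = mean_squared_error(y_test, knn.predict(x_test))

print("raw CV:", cv_error, "corrected:", corrected_cv_error, "true test:", true_test_error)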

统计机器学习Lecture-4

Lecturer: Prof. XIA DONG

1. What is Classification?

Classification is a type of supervised machine learning where the goal is to predict a categorical or qualitative response. Unlike regression where you predict a continuous numerical value (like a price or temperature), classification assigns an input to a specific category or class. 分类是一种监督式机器学习,其目标是预测分类或定性响应。与预测连续数值(例如价格或温度)的回归不同,分类将输入分配到特定的类别或类别。

Key characteristics:

  • Goal: Predict the class of a subject based on input features.

  • Output (Response): The output is a category, such as ‘Yes’/‘No’, ‘Spam’/‘Not Spam’, or ‘High’/‘Medium’/‘Low’.

  • Applications: Common examples include email spam detectors, medical diagnosis (e.g., virus carrier vs. non-carrier), and fraud detection.

    • 目标:根据输入特征预测主题的类别。
    • 输出(响应):输出是一个类别,例如“是”/“否”、“垃圾邮件”/“非垃圾邮件”或“高”/“中”/“低”。
    • 应用:常见示例包括垃圾邮件检测器、医学诊断(例如,病毒携带者与非病毒携带者)和欺诈检测。

The example used in the slides is a credit card Default dataset. The goal is to predict whether a customer will default (‘Yes’ or ‘No’) on their payments based on their monthly income and account balance.

## Why Not Use Linear Regression?为什么不使用线性回归?

At first, it might seem possible to use linear regression for classification. For a binary (two-class) problem like the default dataset, you could code the outcomes as numbers, for example:

  • Default = ‘No’ => \(y = 0\)
  • Default = ‘Yes’ => \(y = 1\)

You could then fit a standard linear regression model: \(Y \approx \beta_0 + \beta_1 X\). In this context, we would interpret the prediction \(\hat{y}\) as the probability of default, so we’d be modeling \(P(Y=1|X) = \beta_0 + \beta_1 X\).

However, this approach has two major problems: 然而,这种方法有两个主要问题:

1. The Output Is Not a Probability: A linear model can produce outputs that are less than 0 or greater than 1. This doesn’t make sense for a probability, which must always be between 0 and 1.

The image below is the most important one for understanding this issue. The left plot shows a linear regression line fit to the 0/1 default data. You can see the line goes below 0 and would eventually go above 1 for higher balances. The right plot shows a logistic regression curve, which always stays between 0 and 1.

  • Left (Linear Regression): The straight blue line predicts probabilities < 0 for low balances.
  • Right (Logistic Regression): The S-shaped blue curve correctly constrains the probability output between 0 and 1.

2. It Doesn’t Work for Multi-Class Problems: If you have more than two categories (e.g., ‘mild’, ‘moderate’, ‘severe’), you might code them as 0, 1, and 2. A linear regression model would incorrectly assume that the “distance” between ‘mild’ and ‘moderate’ is the same as the distance between ‘moderate’ and ‘severe’, which is usually not a valid assumption.

1. 输出不是概率 线性模型可以产生小于 0 或大于 1 的输出。这对于概率来说毫无意义,因为概率必须始终介于 0 和 1 之间。

下图是理解这个问题最重要的图。左图显示了与 0/1 默认数据拟合的线性回归线。您可以看到,该线低于 0,并且最终会随着余额的增加而高于 1。右图显示了逻辑回归曲线,它始终保持在 0 和 1 之间。

  • 左图(线性回归):蓝色直线预测低余额的概率小于 0。
  • 右图(逻辑回归):S 形蓝色曲线正确地将概率输出限制在 0 和 1 之间。

2.它不适用于多类别问题 如果您有两个以上的类别(例如,“轻度”、“中度”、“重度”),您可能会将它们编码为 0、1 和 2。线性回归模型会错误地假设“轻度”和“中度”之间的“距离”与“中度”和“重度”之间的距离相同,这通常不是一个有效的假设。

## The Solution: Logistic Regression

Instead of modeling the response \(y\) directly, logistic regression models the probability that \(y\) belongs to a particular class. To solve the issue of the output not being a probability, it uses the logistic function (also known as the sigmoid function).

This function takes any real-valued input and squeezes it into an output between 0 and 1.

The formula for the probability in a logistic regression model is: \[P(Y=1|X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\] This S-shaped function, shown in the right-hand plot above, ensures that the output is always a valid probability. We can then set a threshold (e.g., 0.5) to make the final class prediction. If \(P(Y=1|X) > 0.5\), we predict ‘Yes’; otherwise, we predict ‘No’.

## 解决方案:逻辑回归

逻辑回归不是直接对响应 \(y\) 进行建模,而是对 \(y\) 属于特定类别的概率进行建模。为了解决输出不是概率的问题,它使用了逻辑函数(也称为 S 型函数)。

此函数接受任何实值输入,并将其压缩为介于 0 和 1 之间的输出。

逻辑回归模型中的概率公式为: \[P(Y=1|X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\] 如上图右侧所示,这个 S 形函数确保输出始终是有效概率。然后,我们可以设置一个阈值(例如 0.5)来进行最终的类别预测。如果 \(P(Y=1|X) > 0.5\),则预测“是”;否则,预测“否”。

## Data Visualization & Code in Python

The slides use R to visualize the data. The boxplots are particularly important because they show which variable is a better predictor.

  • Balance vs. Default: The boxplots for balance show a clear difference. The median balance for those who default (‘Yes’) is much higher than for those who do not (‘No’). This suggests balance is a strong predictor.

  • Income vs. Default: The boxplots for income show a lot of overlap. The median incomes for both groups are very similar. This suggests income is a weak predictor.

  • 余额 vs. 违约:余额的箱线图显示出明显的差异。违约者(“是”)的余额中位数远高于未违约者(“否”)。这表明余额是一个强有力的预测指标

  • 收入 vs. 违约:收入的箱线图显示出很大的重叠。两组的收入中位数非常相似。这表明收入是一个弱的预测指标

Here’s how you could perform similar analysis and modeling in Python using seaborn and scikit-learn.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assume 'default_data.csv' has columns: 'default' (Yes/No), 'balance', 'income'
# You would load your data like this:
# df = pd.read_csv('default_data.csv')

# For demonstration, let's create some sample data
data = {
    'balance': [1200, 2100, 800, 1800, 500, 1600, 2200, 1900],
    'income': [45000, 60000, 30000, 55000, 25000, 48000, 70000, 65000],
    'default': ['No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes']
}
df = pd.DataFrame(data)

# --- 1. Data Visualization (like the slides) ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('Predictor Analysis for Default')

# Boxplot for Balance
sns.boxplot(ax=axes[0], x='default', y='balance', data=df)
axes[0].set_title('Balance vs. Default Status')

# Boxplot for Income
sns.boxplot(ax=axes[1], x='default', y='income', data=df)
axes[1].set_title('Income vs. Default Status')

plt.show()

# --- 2. Logistic Regression Modeling ---

# Convert categorical 'default' column to 0s and 1s
df['default_encoded'] = df['default'].apply(lambda x: 1 if x == 'Yes' else 0)

# Define features (X) and target (y)
X = df[['balance', 'income']]
y = df['default_encoded']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on new data
# For example, a person with a $2000 balance and $50,000 income
new_customer = [[2000, 50000]]
predicted_prob = model.predict_proba(new_customer)
prediction = model.predict(new_customer)

print(f"Customer data: Balance=2000, Income=50000")
print(f"Probability of No Default vs. Default: {predicted_prob}") # [[P(No), P(Yes)]]
print(f"Final Prediction (0=No, 1=Yes): {prediction}")

2. The Mathematical Foundation of Logistic Regression

This set of slides explains the mathematical foundation of logistic regression, how its parameters are estimated using Maximum Likelihood Estimation (MLE), and how an iterative algorithm called Newton-Raphson is used to perform this estimation.

逻辑回归的数学基础、如何使用最大似然估计 (MLE) 估计其参数,以及如何使用名为 Newton-Raphson 的迭代算法进行估计。

2.1 The Logistic Regression Model: From Probabilities to Log-Odds逻辑回归模型:从概率到对数几率

The core of logistic regression is transforming a linear model into a valid probability. This is done using the logistic function, also known as the sigmoid function. 逻辑回归的核心是将线性模型转换为有效的概率。这可以通过逻辑函数(也称为 S 型函数)来实现。

#### Key Mathematical Formulas

  1. Probability of Class 1: The model assumes the probability of an observation \(\mathbf{x}\) belonging to class 1 is given by the sigmoid function: \[ P(y=1|\mathbf{x}) = \frac{1}{1 + \exp(-\beta^T \mathbf{x})} = \frac{\exp(\beta^T \mathbf{x})}{1 + \exp(\beta^T \mathbf{x})} \] This function always outputs a value between 0 and 1, making it perfect for modeling probabilities.

  2. Odds: The odds are the ratio of the probability of an event happening to the probability of it not happening. \[ \text{Odds} = \frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \exp(\beta^T \mathbf{x}) \]

  3. Log-Odds (Logit): By taking the natural logarithm of the odds, we get a linear relationship with the predictors. This is called the logit transformation. \[ \text{logit}(P(y=1|\mathbf{x})) = \log\left(\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})}\right) = \beta^T \mathbf{x} \] This final equation is the heart of the model. It states that the log-odds of the outcome are a linear function of the predictors. This provides a great interpretation: a one-unit increase in a predictor \(x_j\) changes the log-odds by \(\beta_j\).

  4. 类别 1 的概率:该模型假设观测值 \(\mathbf{x}\) 属于类别 1 的概率由 S 型函数给出: \[ P(y=1|\mathbf{x}) = \frac{1}{1 + \exp(-\beta^T \mathbf{x})} = \frac{\exp(\beta^T \mathbf{x})}{1 + \exp(\beta^T \mathbf{x})} \] 此函数的输出值始终介于 0 和 1 之间,非常适合用于概率建模。

  5. 几率:几率是事件发生的概率与不发生的概率之比。 \[ \text{Odds} = \frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})} = \exp(\beta^T \mathbf{x}) \]

  6. 对数概率 (Logit):通过对概率取自然对数,我们可以得到概率与预测变量之间的线性关系。这被称为logit 变换\[ \text{logit}(P(y=1|\mathbf{x})) = \log\left(\frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})}\right) = \beta^T \mathbf{x} \] 最后一个方程是模型的核心。它指出结果的对数概率是预测变量的线性函数。这提供了一个很好的解释:预测变量 \(x_j\) 每增加一个单位,对数概率就会改变 \(\beta_j\)
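
To see the probability/odds/log-odds relationship numerically, here is a tiny sketch (the coefficient values and the balance value are made-up, purely illustrative numbers, not fitted estimates):

import numpy as np

beta = np.array([-4.0, 0.002])        # hypothetical (intercept, slope) for a balance predictor
x = np.array([1.0, 2500.0])           # intercept term plus a balance of 2500

logit = beta @ x                       # beta^T x, the linear score (log-odds)
p = 1 / (1 + np.exp(-logit))           # sigmoid -> P(y=1|x)
odds = p / (1 - p)                     # equals exp(beta^T x)

print(p, odds, np.log(odds))           # np.log(odds) recovers the linear score beta^T x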

2.2 Fitting the Model: Maximum Likelihood Estimation (MLE) 拟合模型:最大似然估计 (MLE)

Unlike linear regression, which uses least squares to find the best-fit line, logistic regression uses Maximum Likelihood Estimation (MLE). The goal of MLE is to find the parameter values (the \(\beta\) coefficients) that maximize the probability of observing the actual data that we have. 与使用最小二乘法寻找最佳拟合线的线性回归不同,逻辑回归使用最大似然估计 (MLE)。MLE 的目标是找到使观测到实际数据的概率最大化的参数值(\(\beta\) 系数)。

  1. Likelihood Function: This is the joint probability of observing all the data points in our sample. Assuming each observation is independent, it is the product of the individual probabilities: \[ L(\beta) = \prod_{i=1}^{n} P(y_i|\mathbf{x}_i) \] A clever way to write this for a binary (0/1) outcome is: \[ L(\beta) = \prod_{i=1}^{n} \frac{\exp(y_i \beta^T \mathbf{x}_i)}{1 + \exp(\beta^T \mathbf{x}_i)} \]

  2. Log-Likelihood Function: Products are difficult to work with mathematically, so we work with the logarithm of the likelihood, which turns the product into a sum. Maximizing the log-likelihood is the same as maximizing the likelihood. \[ \ell(\beta) = \log(L(\beta)) = \sum_{i=1}^{n} \left[ y_i \beta^T \mathbf{x}_i - \log(1 + \exp(\beta^T \mathbf{x}_i)) \right] \] Key Takeaway: There is no closed-form solution for the \(\hat{\beta}\) that maximizes this function. We must find it with a numerical optimization algorithm.
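As a concrete sanity check on the formula, here is a small sketch that evaluates \(\ell(\beta)\) and its gradient on a made-up toy dataset:

```python
import numpy as np

def log_likelihood(beta, X, y):
    # ell(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

def gradient(beta, X, y):
    # d ell / d beta = X^T (y - p), where p_i = sigmoid(beta^T x_i)
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

# Tiny made-up dataset: first column of ones is the intercept
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0, 0, 1, 1])
beta = np.zeros(2)

print("log-likelihood at beta = 0:", log_likelihood(beta, X, y))
print("gradient at beta = 0:", gradient(beta, X, y))
```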

2.3 The Algorithm: Newton-Raphson

The slides introduce the Newton-Raphson algorithm as the method to find the optimal \(\hat{\beta}\). It’s an efficient iterative algorithm for finding the roots of a function (i.e., where \(f(x)=0\)).

How does this apply to logistic regression? To maximize the log-likelihood function \(\ell(\beta)\), we need to find the point where its derivative (gradient) is equal to zero. So, Newton-Raphson is used to solve \(\frac{d\ell(\beta)}{d\beta} = 0\).


The General Newton-Raphson Method

The algorithm starts with an initial guess, \(x^{old}\), and iteratively refines it using the following update rule, which is based on a Taylor series approximation: \[ x^{new} = x^{old} - \frac{f(x^{old})}{f'(x^{old})} \] where \(f'(x)\) is the derivative of \(f(x)\). You repeat this step until the value of \(x\) converges.


Important Image: Newton-Raphson Example (\(x^3 - 4 = 0\))

[Image showing iterations of Newton-Raphson]

This slide is a great illustration of the algorithm's power.

  • Goal: Find \(x\) such that \(f(x) = x^3 - 4 = 0\).
  • Function: \(f(x) = x^3 - 4\)
  • Derivative: \(f'(x) = 3x^2\)
  • Update Rule: \(x^{new} = x^{old} - \frac{(x^{old})^3 - 4}{3(x^{old})^2}\)

Starting with a guess of \(x^{old} = 2\), the algorithm converges to the true answer (\(4^{1/3} \approx 1.5874\)) in just 4 steps.
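A quick numerical check of this example (a minimal sketch, independent of the slides' code):

```python
# Newton-Raphson for f(x) = x^3 - 4, starting from x = 2
x = 2.0
for i in range(4):
    x = x - (x**3 - 4) / (3 * x**2)   # update rule from the slide
    print(f"iteration {i + 1}: x = {x:.6f}")
# After 4 iterations x agrees with 4^(1/3) ~ 1.5874 to four decimal places
```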

Code Understanding (Python)

The slides show Python code implementing Newton-Raphson. Let’s break down the key function.

import numpy as np

# Define the function we want to find the root of
def f(x):
    return np.exp(x) - x*x + 3 * np.sin(x)

# Define its derivative
def f_prime(x):
    return np.exp(x) - 2*x + 3 * np.cos(x)

# Newton-Raphson method
def newton_raphson(x0, tol=1e-10, max_iter=100):
    x = x0  # Start with the initial guess
    for i in range(max_iter):
        fx = f(x)          # Calculate f(x_old)
        fpx = f_prime(x)   # Calculate f'(x_old)

        if fpx == 0:  # Cannot divide by zero
            print("Zero derivative. No solution found.")
            return None

        # This is the core update rule
        x_new = x - fx / fpx

        # Check if the change is small enough to stop
        if abs(x_new - x) < tol:
            print(f"Converged to {x_new} after {i+1} iterations.")
            return x_new

        # Update x for the next iteration
        x = x_new

    print("Exceeded maximum iterations. No solution found.")
    return None

# Initial guess and execution
x0 = 0.5
root = newton_raphson(x0)

The slides show that with a good initial guess (x0 = 0.5), the algorithm converges quickly. With a bad one (x0 = 50), it still converges but takes many more steps. This highlights the importance of the starting point. The slides also show an implementation of Gradient Descent, another popular optimization algorithm which uses the update rule x_new = x - learning_rate * gradient.
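The slides' gradient-descent code is not reproduced here, but the idea can be sketched by minimizing the squared residual \(h(x) = f(x)^2\) of the same function, whose gradient is \(2 f(x) f'(x)\); the learning rate below is an arbitrary illustrative choice:

```python
import numpy as np

def f(x):
    return np.exp(x) - x * x + 3 * np.sin(x)

def f_prime(x):
    return np.exp(x) - 2 * x + 3 * np.cos(x)

# Gradient descent on h(x) = f(x)^2, whose minimum (h = 0) is a root of f
x = 0.5               # initial guess
learning_rate = 0.01  # illustrative step size
for i in range(2000):
    gradient = 2 * f(x) * f_prime(x)   # h'(x)
    x_new = x - learning_rate * gradient
    if abs(x_new - x) < 1e-10:
        break
    x = x_new

print(f"Approximate root after {i + 1} iterations: {x:.6f}, f(x) = {f(x):.2e}")
```

Compared with Newton-Raphson, this takes many more (but cheaper) steps, because gradient descent uses only first-order information scaled by a fixed learning rate.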

These slides provide a great case study on logistic regression, particularly on the important concept of confounding variables. Here is a summary covering the math, code, and key insights.

3. Core Concept: Logistic Regression 📈

Logistic regression is a statistical method used for binary classification, which means predicting an outcome that can only be one of two things (e.g., Yes/No, True/False, 1/0).

In this example, the goal is to predict the probability that a customer will default on a loan (Yes or No) based on factors like their account balance, income, and whether they are a student.

The core of logistic regression is the sigmoid (or logistic) function, which takes any real-valued number and squishes it to a value between 0 and 1, representing a probability.

\[ \hat{P}(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + ... + \beta_p X_p)}} \]

  • \(\hat{P}(Y=1|X)\) is the predicted probability of the outcome being “Yes” (e.g., default).
  • \(\beta_0\) is the intercept.
  • \(\beta_1, ..., \beta_p\) are the coefficients for each input variable (\(X_1, ..., X_p\)). The model’s job is to find the best values for these \(\beta\) coefficients.


3.1 How the Model “Learns” (Mathematical Foundation)

The slides show that the model's coefficients (\(\beta\)) are found using an algorithm like Newton-Raphson. This is an iterative process that finds the values that maximize the log-likelihood function. Think of this as finding the coefficient values that make the observed data most probable.

The key slide for this is the one titled "Newton-Raphson Iterative Algorithm". It shows the formulas for:

  • The Gradient (\(\nabla\ell\)): the direction of steepest ascent of the log-likelihood function.
  • The Hessian (\(H\)): the curvature of the log-likelihood function.

The updating rule is given by: \[ \beta^{new} = \beta^{old} - H^{-1}\nabla\ell \] This formula is applied repeatedly until the coefficient values stop changing significantly, meaning the algorithm has converged to the best fit. This process is also referred to as Iteratively Reweighted Least Squares (IRLS).
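A minimal sketch of this update for logistic regression, using the gradient \(X^\top(y - p)\) and Hessian \(-X^\top W X\) with \(W = \mathrm{diag}(p_i(1-p_i))\); the tiny dataset is made up for illustration:

```python
import numpy as np

def newton_logistic(X, y, n_iter=10):
    """Fit logistic regression by Newton-Raphson (a.k.a. IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # predicted probabilities
        grad = X.T @ (y - p)                    # gradient of the log-likelihood
        W = np.diag(p * (1 - p))                # IRLS weights
        H = -X.T @ W @ X                        # Hessian of the log-likelihood
        beta = beta - np.linalg.solve(H, grad)  # beta_new = beta_old - H^{-1} grad
    return beta

# Tiny made-up dataset (intercept column + one predictor), with overlapping classes
X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0], [1, 2.5], [1, 3.0]], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)

print("Estimated coefficients:", newton_logistic(X, y))
```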


3.2 The Puzzle: A Tale of Two Models 🕵️‍♂️

The most important story in these slides is how the effect of being a student changes depending on the model. This is a classic example of a confounding variable.

Model 1: Simple Logistic Regression (Default vs. Student)

When predicting default using only student status, the model is: default ~ student

From the slides, the coefficients are:

  • Intercept (\(\beta_0\)): -3.5041
  • student[Yes] (\(\beta_1\)): 0.4049 (positive)

The equation for the log-odds is: \[ \log\left(\frac{P(\text{default})}{1-P(\text{default})}\right) = -3.5041 + 0.4049 \times (\text{is\_student}) \]

Conclusion: The positive coefficient (0.4049) suggests that students are more likely to default than non-students. The slides calculate the probabilities:

  • Student Default Probability: 4.31%
  • Non-Student Default Probability: 2.92%

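These two probabilities can be reproduced directly from the reported coefficients:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

intercept, beta_student = -3.5041, 0.4049   # coefficients from the slides

p_student = sigmoid(intercept + beta_student * 1)      # is_student = 1
p_non_student = sigmoid(intercept + beta_student * 0)  # is_student = 0

print(f"Student default probability:     {p_student:.2%}")      # about 4.31%
print(f"Non-student default probability: {p_non_student:.2%}")  # about 2.92%
```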

3.3 Model 2: Multiple Logistic Regression (Default vs. All Variables)

When we add balance and income to the model, it becomes: default ~ student + balance + income

From the slides, the new coefficients are:

  • Intercept (\(\beta_0\)): -10.8690
  • balance (\(\beta_1\)): 0.0057
  • income (\(\beta_2\)): 0.0030
  • student[Yes] (\(\beta_3\)): -0.6468 (negative)

The Shocking Twist! The coefficient for student[Yes] is now negative.

Conclusion: When we control for balance and income, students are actually less likely to default than non-students with the same balance and income.

Why the Change? The Confounding Variable Explained

The key insight, explained on the slide with multi-colored text bubbles, is that students, on average, have higher credit card balances.

  • In the simple model, the student variable was inadvertently capturing the risk associated with having a high balance. The model mistakenly concluded “being a student causes default.”
  • In the multiple model, the balance variable properly accounts for the risk from a high balance. With that effect isolated, the student variable can show its true, underlying relationship with default, which is negative.

This demonstrates why it’s crucial to consider multiple relevant variables to avoid drawing incorrect conclusions. The most important slides are the ones that present this paradox and its explanation.
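To see the sign flip in action, the multiple-model coefficients can be plugged into the sigmoid for a student and a non-student with identical finances. The balance and income values below are made up, and income is assumed to be measured in thousands of dollars to match the scale of the 0.0030 coefficient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Coefficients from the slides: intercept, balance, income, student[Yes]
b0, b_balance, b_income, b_student = -10.8690, 0.0057, 0.0030, -0.6468

balance, income = 1500, 40   # hypothetical customer: $1500 balance, income in thousands

for is_student in (1, 0):
    logit = b0 + b_balance * balance + b_income * income + b_student * is_student
    label = "Student" if is_student else "Non-student"
    print(f"{label}: P(default) = {sigmoid(logit):.2%}")
# At the same balance and income, the student's predicted probability is lower,
# because the student coefficient is negative once balance and income are controlled for.
```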



Code Implementation: R vs. Python

The slides use R’s glm() (Generalized Linear Model) function. Here’s how you would replicate this in Python.

R Code (from slides)

# Simple Model
glmod2 <- glm(default ~ student, data=Default, family=binomial)
summary(glmod2)

# Multiple Model
glmod3 <- glm(default ~ ., data=Default, family=binomial) # '.' means all other variables
summary(glmod3)

Python Equivalent

We can use two popular libraries: statsmodels (which gives R-style summaries) and scikit-learn (the standard for machine learning).

import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# Assume 'Default' is a pandas DataFrame with columns:
# 'default' (0/1), 'student' (0/1), 'balance', 'income'

# --- Using statsmodels (recommended for interpretation) ---

# Prepare the data
# For statsmodels, we need to manually add the intercept
X_simple = Default[['student']]
X_simple = sm.add_constant(X_simple)
y = Default['default']

X_multiple = Default[['student', 'balance', 'income']]
X_multiple = sm.add_constant(X_multiple)

# Simple Model: default ~ student
model_simple = sm.Logit(y, X_simple).fit()
print("--- Simple Model ---")
print(model_simple.summary())

# Multiple Model: default ~ student + balance + income
model_multiple = sm.Logit(y, X_multiple).fit()
print("\n--- Multiple Model ---")
print(model_multiple.summary())


# --- Using scikit-learn (recommended for prediction tasks) ---

# Prepare the data (scikit-learn adds intercept by default)
X_simple_sk = Default[['student']]
y_sk = Default['default']

X_multiple_sk = Default[['student', 'balance', 'income']]

# Simple Model
clf_simple = LogisticRegression().fit(X_simple_sk, y_sk)
print(f"\nSimple Model Intercept (scikit-learn): {clf_simple.intercept_}")
print(f"Simple Model Coefficient (scikit-learn): {clf_simple.coef_}")

# Multiple Model
clf_multiple = LogisticRegression().fit(X_multiple_sk, y_sk)
print(f"\nMultiple Model Intercept (scikit-learn): {clf_multiple.intercept_}")
print(f"Multiple Model Coefficients (scikit-learn): {clf_multiple.coef_}")

4. Making Predictions and the Decision Boundary 🎯

Once the model is trained (i.e., we have the coefficients \(\hat{\beta}\)), we can make predictions.

## Math Behind Predictions

The model outputs the log-odds, which can be converted into a probability. A key concept is the decision boundary, the threshold at which the model is maximally uncertain (probability = 50%).

  1. The Estimated Odds: The core output of the linear part of the model is the exponential of the linear equation, which gives the odds of the outcome being 'Yes' (or 1). \[ \frac{\hat{P}(y=1|\mathbf{x}_0)}{\hat{P}(y=0|\mathbf{x}_0)} = \exp(\hat{\beta}^\top \mathbf{x}_0) \]

  2. The Decision Rule: We classify a new observation \(\mathbf{x}_0\) by comparing its predicted odds to a threshold \(\delta\).

    • Predict \(y=1\) if \(\exp(\hat{\beta}^\top \mathbf{x}_0) > \delta\)
    • Predict \(y=0\) if \(\exp(\hat{\beta}^\top \mathbf{x}_0) < \delta\)

    A common default is \(\delta=1\), which means we predict 'Yes' if the probability is greater than 0.5.

  3. The Linear Boundary: The decision boundary itself is where the odds are exactly equal to the threshold. Taking the logarithm shows that this boundary is a linear equation, which is why logistic regression is called a linear classifier. \[ \hat{\beta}^\top \mathbf{x} = \log(\delta) \] For \(\delta=1\), the boundary is simply \(\hat{\beta}^\top \mathbf{x} = 0\).

This concept is visualized perfectly in the slide titled "Linear Classifier," which shows a straight line neatly separating two classes of data points.
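To make the decision rule concrete, here is a minimal sketch with made-up coefficients and observations (\(\delta = 1\) recovers the usual 0.5-probability boundary):

```python
import numpy as np

beta_hat = np.array([-2.0, 1.5])          # hypothetical fitted coefficients [intercept, slope]
X0 = np.array([[1.0, 0.5], [1.0, 2.0]])   # two new observations (first column = intercept)

delta = 1.0                               # threshold on the odds; delta = 1 <=> P > 0.5
odds = np.exp(X0 @ beta_hat)              # exp(beta^T x0)
pred = np.where(odds > delta, 1, 0)

print("odds:", odds)
print("predictions:", pred)   # 1 where odds > delta, i.e. where beta^T x0 > log(delta) = 0
```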

Visualizing the Confounding Effect

The most important image in this set is Figure 4.3, as it visually explains the confounding puzzle from the first set of slides.

  • Right Panel (Boxplots): This shows that students (Yes) tend to have higher credit card balances than non-students (No). This is the source of the confounding.
  • Left Panel (Default Rates):
    • The dashed lines show the overall default rates. The orange line (students) is higher than the blue line (non-students). This matches our simple model (default ~ student).
    • The solid S-shaped curves show the probability of default as a function of balance. For any given balance, the blue curve (non-students) is slightly higher than the orange curve (students). This means that at the same level of debt, students are less likely to default. This matches our multiple regression model (default ~ student + balance + income).

This single figure brilliantly illustrates how a variable can appear to have one effect in isolation but the opposite effect when controlling for a confounding factor.

An Important Edge Case: Perfect Separation ⚠️

What happens if the data can be perfectly separated by a straight line?

One might think this is the ideal scenario, but it causes a problem for the logistic regression algorithm. The model will try to find coefficients that push the probabilities for each class as close to 1 and 0 as possible. To do this, the magnitude of the coefficients (\(\hat{\beta}\)) must grow infinitely large.

The slide “Non-convergence for perfectly separated case” demonstrates this:

  • The Code: It generates two distinct, non-overlapping clusters of data points using Python’s scikit-learn.

  • Parameter Estimates Graph: It shows the Intercept, Coefficient 1, and Coefficient 2 values increasing or decreasing without limit as the algorithm runs through more iterations. They never converge to a stable value.

  • Decision Boundary Graph: The decision boundary itself might look reasonable, but the underlying coefficients are unstable.


Key Takeaway: If your logistic regression model fails to converge, the first thing you should check for is perfect separation in your training data.
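A minimal way to see this divergence without any library is plain gradient ascent on the log-likelihood for a perfectly separated one-dimensional dataset (made up for illustration): the slope coefficient keeps growing instead of settling down.

```python
import numpy as np

# Perfectly separated data: every x < 0 is class 0, every x > 0 is class 1
X = np.array([[1, -2.0], [1, -1.0], [1, -0.5], [1, 0.5], [1, 1.0], [1, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

beta = np.zeros(2)
lr = 0.5
for i in range(1, 2001):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    beta += lr * X.T @ (y - p)          # gradient ascent on the log-likelihood
    if i % 500 == 0:
        print(f"iteration {i}: beta = {beta}")
# The slope coefficient keeps increasing (the log-likelihood keeps improving),
# which is exactly the non-convergence described on the slide.
```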

Code Understanding

The slides provide useful code snippets in both R and Python.

R Code (Plotting Predictions)

This code generates the plot with the two S-shaped curves (one for students, one for non-students) showing the probability of default as balance increases.

# Create a data frame for prediction with a range of balances
# One version for students, one for non-students
Default.st <- data.frame(balance=seq(500, 2500, by=1), student="Yes")
Default.nonst <- data.frame(balance=seq(500, 2500, by=1), student="No")

# Use the trained multiple regression model (glmod3) to predict probabilities
pred.st <- predict(glmod3, Default.st, type="response")
pred.nonst <- predict(glmod3, Default.nonst, type="response")

# Plot the results
plot(Default.st$balance, pred.st, type="l", col="red", ...)   # Students
lines(Default.nonst$balance, pred.nonst, col="blue", ...)     # Non-students

Python Code (Visualizing the Decision Boundary)

This Python code uses scikit-learn and matplotlib to create the plot showing the linear decision boundary.

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 1. Generate synthetic data with two classes
X, y = make_classification(...)

# 2. Initialize and fit the logistic regression model
model = LogisticRegression()
model.fit(X, y)

# 3. Create a mesh grid of points to make predictions over the entire plot area
xx, yy = np.meshgrid(...)

# 4. Predict the probability for each point on the grid
probs = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

# 5. Plot the decision boundary where the probability is 0.5
plt.contour(xx, yy, probs.reshape(xx.shape), levels=[0.5], ...)

# 6. Scatter plot the actual data points
plt.scatter(X[:, 0], X[:, 1], c=y, ...)
plt.show()

Other Important Remarks

The “Remarks” slide briefly mentions some key extensions:

  • Probit Model: An alternative to logistic regression that uses the cumulative distribution function (CDF) of the standard normal distribution instead of the sigmoid function. The results are often very similar.

  • Softmax Regression: An extension of logistic regression used for multi-class classification (when there are more than two possible outcomes).


5. Here is a summary of the slides on Linear Discriminant Analysis (LDA), including the key mathematical formulas, visual explanations, and how to implement it in Python.

The Main Idea: Classification Using Probabilities

Linear Discriminant Analysis (LDA) is a classification method. For a given input \(x\), it calculates the probability that \(x\) belongs to each class and then assigns \(x\) to the class with the highest probability.

It does this using Bayes' Theorem, which provides a formula for the posterior probability \(P(Y=k | X=x)\), the probability that the class is \(k\) given the input \(x\): \[ p_k(x) = P(Y=k|X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)} \]

  • \(p_k(x)\) is the posterior probability we want to maximize.
  • \(\pi_k = P(Y=k)\) is the prior probability of class \(k\) (how common the class is overall).
  • \(f_k(x) = f(x|Y=k)\) is the class-conditional probability density function of observing input \(x\) if it belongs to class \(k\).

To classify a new observation \(x\), we simply find the class \(k\) that makes \(p_k(x)\) the largest.


Key Assumptions of LDA

LDA's power comes from a specific, simplifying assumption about the data's distribution.

  1. Gaussian Distribution: LDA assumes that the data within each class \(k\) follows a p-dimensional multivariate normal (or Gaussian) distribution, denoted as \(X|Y=k \sim \mathcal{N}(\mu_k, \Sigma)\).

  2. Common Covariance: A crucial assumption is that all classes share the same covariance matrix \(\Sigma\). This means that while the classes may have different centers (means, \(\mu_k\)), their shape and orientation (covariance, \(\Sigma\)) are identical.


The probability density function for a class \(k\) is: \[ f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp \left( -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right) \]

The image above (from the slide "Knowing normal distribution") illustrates this. The two "bells" have different centers (different \(\mu_k\)) but similar shapes. The one on the right is "tilted," indicating correlation between the variables, which is captured in the shared covariance matrix \(\Sigma\).


The Math Behind LDA: The Discriminant Function

Since we only need to find the class \(k\) that maximizes the posterior probability \(p_k(x)\), we can simplify the math. The denominator in Bayes' theorem is the same for all classes, so we only need to maximize the numerator \(\pi_k f_k(x)\). Taking the logarithm (which does not change which class is maximal) and removing constant terms gives us the linear discriminant function, \(\delta_k(x)\):

\[ \delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k) \]

This function is linear in \(x\), which is why the method is called Linear Discriminant Analysis. The decision boundary between any two classes, say class \(k\) and class \(l\), is the set of points where \(\delta_k(x) = \delta_l(x)\), which defines a linear hyperplane.

The image above (from the "Graph of LDA" slide) is very important.

  • Left: The ellipses show the true 95% probability contours for three Gaussian classes. The dashed lines are the ideal Bayes decision boundaries, which are perfectly linear because the assumption of common covariance holds.
  • Right: This shows a sample of data points drawn from those distributions. The solid lines are the LDA decision boundaries estimated from the sample; they approximate the ideal boundaries very well.

Practical Implementation: Estimating the Parameters

In a real-world scenario, we do not know the true parameters (\(\mu_k\), \(\Sigma\), \(\pi_k\)). Instead, we estimate them from our training data (\(n\) total samples, with \(n_k\) samples in class \(k\)).

  • Prior Probability (\(\hat{\pi}_k\)): The proportion of training samples in class \(k\). \[\hat{\pi}_k = \frac{n_k}{n}\]
  • Class Mean (\(\hat{\mu}_k\)): The average of the training samples in class \(k\). \[\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i=k} x_i\]
  • Common Covariance (\(\hat{\Sigma}\)): A weighted average of the sample covariance matrices for each class. This is often called the “pooled” covariance. \[\hat{\Sigma} = \frac{1}{n-K} \sum_{k=1}^{K} \sum_{i: y_i=k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T\]

We then plug these estimates into the discriminant function to get \(\hat{\delta}_k(x)\) and classify a new observation \(x\) to the class with the largest score.
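A compact NumPy sketch of these estimation steps and of the resulting discriminant scores (the toy data below is made up for illustration):

```python
import numpy as np

def fit_lda(X, y):
    """Estimate pi_k, mu_k and the pooled covariance from training data."""
    classes = np.unique(y)
    n, K = len(y), len(classes)
    priors = {k: np.mean(y == k) for k in classes}             # pi_k = n_k / n
    means = {k: X[y == k].mean(axis=0) for k in classes}       # mu_k
    Sigma = sum((X[y == k] - means[k]).T @ (X[y == k] - means[k])
                for k in classes) / (n - K)                    # pooled covariance
    return priors, means, np.linalg.inv(Sigma)

def discriminant_scores(x, priors, means, Sigma_inv):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log(pi_k)
    return {k: x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(priors[k])
            for k, mu in means.items()}

# Toy two-class data with a shared covariance
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 2], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

priors, means, Sigma_inv = fit_lda(X, y)
scores = discriminant_scores(np.array([1.8, 1.9]), priors, means, Sigma_inv)
print(scores, "-> predicted class:", max(scores, key=scores.get))
```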

Evaluating Performance

After training the model, we evaluate its performance using a confusion matrix.

This matrix shows the true classes versus the predicted classes.

  • Diagonal elements (9644, 81) are correct predictions.
  • Off-diagonal elements (23, 252) are errors.

From this matrix, we can calculate key metrics:

  • Overall Error Rate: total incorrect predictions / total predictions. Example: \((252 + 23) / 10000 = 2.75\%\).
  • Sensitivity (True Positive Rate): correctly predicted positives / total actual positives. It answers: "Of all the people who actually defaulted, what fraction did we catch?" Example: \(81 / 333 = 24.3\%\).
  • Specificity (True Negative Rate): correctly predicted negatives / total actual negatives. It answers: "Of all the people who did not default, what fraction did we correctly identify?" Example: \(9644 / 9667 = 99.8\%\).

The example in your slides shows a high error rate for “default” people (75.7%) because the classes are unbalanced—there are far fewer defaulters. This highlights the importance of looking at class-specific metrics, not just the overall error rate.
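These metrics follow directly from the four cells of the confusion matrix, as this small check shows:

```python
# Confusion matrix cells from the slides (actual class in rows, predicted in columns)
tn, fp = 9644, 23    # actual 'No':  correctly predicted No, wrongly predicted Yes
fn, tp = 252, 81     # actual 'Yes': wrongly predicted No, correctly predicted Yes

total = tn + fp + fn + tp
error_rate = (fp + fn) / total   # (23 + 252) / 10000 = 2.75%
sensitivity = tp / (tp + fn)     # 81 / 333  ~ 24.3%
specificity = tn / (tn + fp)     # 9644 / 9667 ~ 99.8%

print(f"Error rate:  {error_rate:.2%}")
print(f"Sensitivity: {sensitivity:.1%}")
print(f"Specificity: {specificity:.1%}")
```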


Python Code Understanding

In Python, you can easily implement LDA using the scikit-learn library. The code conceptually mirrors the steps we discussed.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Assume you have your data X (features) and y (labels)
# X = features (e.g., balance, income)
# y = labels (e.g., 0 for 'no-default', 1 for 'default')

# 1. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Create an instance of the LDA model
lda = LinearDiscriminantAnalysis()

# 3. Fit the model to the training data
# This is where the model calculates the estimates:
# - Prior probabilities (pi_k)
# - Class means (mu_k)
# - Pooled covariance matrix (Sigma)
lda.fit(X_train, y_train)

# 4. Make predictions on new, unseen data
predictions = lda.predict(X_test)

# 5. Evaluate the model's performance
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))

print("\nClassification Report:")
print(classification_report(y_test, predictions))
  • LinearDiscriminantAnalysis() creates the classifier object.
  • lda.fit(X_train, y_train) is the core training step where the model learns the \(\hat{\pi}_k\), \(\hat{\mu}_k\), and \(\hat{\Sigma}\) parameters from the data.
  • lda.predict(X_test) uses the learned discriminant function \(\hat{\delta}_k(x)\) to classify each sample in the test set.
  • confusion_matrix and classification_report are tools to evaluate the results, just like in the slides.

6. Here is a summary of the provided slides on Linear Discriminant Analysis (LDA), focusing on mathematical concepts, Python code interpretation, and key visuals.

Core Concept: LDA for Classification

Linear Discriminant Analysis (LDA) is a classification method that models the probability that an observation belongs to a certain class. It works by finding a linear combination of features that best separates two or more classes.

The decision is based on Bayes' theorem. For a given observation with features \(X=x\), LDA calculates the posterior probability, \(p_k(x) = Pr(Y=k|X=x)\), for each class \(k\). This is the probability that the observation belongs to class \(k\) given its features.

By default, the Bayes classifier assigns an observation to the class with the highest posterior probability. For a binary (two-class) problem like 'Yes' vs. 'No', this means:

  • Assign to ‘Yes’ if \(Pr(Y=\text{Yes}|X=x) > 0.5\)
  • Assign to ‘No’ otherwise

Modifying the Decision Threshold

The default 0.5 threshold is not always optimal. In many real-world scenarios, the cost of one type of error is much higher than the other. For example, in credit card default prediction:

  • False Negative: Incorrectly classifying a person who will default as someone who won’t. (The bank loses money).
  • False Positive: Incorrectly classifying a person who won’t default as someone who will. (The bank loses a potential customer).

A bank might decide that missing a defaulter is much worse than denying a good customer. To catch more potential defaulters, it can lower the probability threshold.

A modified rule could be: \[ Pr(\text{default}=\text{Yes}|X=x) > 0.2 \] This makes the model more "sensitive" to flagging potential defaulters, even at the cost of misclassifying more non-defaulters.

This decision leads to a trade-off between two key performance metrics:

  • Sensitivity (True Positive Rate): the ability to correctly identify positive cases (correctly identified defaulters / total actual defaulters).
  • Specificity (True Negative Rate): the ability to correctly identify negative cases (correctly identified non-defaulters / total actual non-defaulters).

Lowering the threshold increases sensitivity but decreases specificity.

## Python Code Explained

The slides show how to implement and adjust LDA using Python’s scikit-learn library.

Basic LDA Implementation

# Import the necessary library
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Initialize and train the LDA model
lda = LinearDiscriminantAnalysis()
lda_train = lda.fit(X, y)

# Get predictions using the default 0.5 threshold
y_pred = lda.predict(X)

This code trains an LDA model and makes predictions using the standard 50% probability boundary.

Adjusting the Prediction Threshold

To use a custom threshold (e.g., 0.2), you don’t use the .predict() method. Instead, you get the class probabilities with .predict_proba() and apply the threshold manually.

import numpy as np

# 1. Get the probabilities for each class
# lda.predict_proba(X) returns an array like [[P(No), P(Yes)], ...]
# We select the second column [:, 1] for the 'Yes' class probability
lda_probs = lda.predict_proba(X)[:, 1]

# 2. Define a custom threshold
threshold = 0.2

# 3. Apply the threshold to get new predictions
# This creates a boolean array (True where prob > 0.2, else False)
# We then convert True/False to 'Yes'/'No' labels
lda_pred1 = np.where(lda_probs > threshold, "Yes", "No")

This is the core technique for tuning the classifier’s behavior to meet specific business needs, as demonstrated on slides 55 and 56 for both LDA and Logistic Regression.

Important Images to Understand

  1. Confusion Matrix (Slide 49): This table is crucial. It breaks down the model's predictions into True Positives, True Negatives, False Positives, and False Negatives. All key metrics like error rate, sensitivity, and specificity are calculated from this matrix.
  2. LDA Decision Boundaries (Slide 51): This plot provides a powerful visual intuition. It shows the data points for two classes and the decision boundary line. The different parallel lines show how changing the threshold from 0.5 to 0.1 or 0.9 shifts the boundary, making the model classify more or fewer points into the minority class.
  3. Error Rate Tradeoff Curve (Slide 53): This graph is the most important for understanding the business implication of changing the threshold. It clearly shows that as the threshold changes, the error rate for one class goes down while the error rate for the other goes up. The overall error is minimized at a certain point, but that may not be the optimal point from a business perspective.
  4. ROC Curve (Slides 54 & 55): The Receiver Operating Characteristic (ROC) curve plots Sensitivity vs. (1 - Specificity) for all possible thresholds. An ideal classifier has a curve that "hugs" the top-left corner, indicating high sensitivity and high specificity. It is a standard way to visualize and compare the overall performance of different classifiers.

7. Here is a summary of the provided slides on Linear and Quadratic Discriminant Analysis, including the key formulas, Python code equivalents, and explanations of the important concepts.

Key Goal: Classification

Both Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are classification algorithms. Their main goal is to find a decision boundary that separates the different classes (e.g., "default" vs. "not default") in the data.

## Linear Discriminant Analysis (LDA)

LDA creates a linear decision boundary between classes.

Core Idea (Fisher’s Interpretation)

Imagine you have data points for different classes in a 3D space. Fisher's idea is to find the best angle to shine a "flashlight" on the data so that its shadow is projected onto a 2D wall (or a 1D line). The "best" projection is the one where the shadows of the different classes are as far apart from each other as possible, while the shadows within each class are as tightly packed as possible.

  • Maximize: the distance between the means of the projected classes (Between-Class Variance).
  • Minimize: the spread or variance within each projected class (Within-Class Variance).

This is the most important image for understanding the intuition behind LDA. It shows how projecting the data onto a specific line (defined by the vector w) can make the two classes clearly separable.

Key Mathematical Formulas

To achieve this, LDA maximizes a ratio called the Rayleigh quotient.

  1. Within-Class Covariance (\(\hat{\Sigma}_W\)): measures the spread of data inside each class. \[\hat{\Sigma}_W = \frac{1}{n-K} \sum_{k=1}^{K} \sum_{i: y_i=k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top\]
  2. Between-Class Covariance (\(\hat{\Sigma}_B\)): measures the spread between the means of the different classes. \[\hat{\Sigma}_B = \sum_{k=1}^{K} n_k (\hat{\mu}_k - \hat{\mu})(\hat{\mu}_k - \hat{\mu})^\top\]
  3. Objective Function: find the projection vector \(w\) that maximizes the ratio of between-class variance to within-class variance, as sketched in the code after this list. \[\max_w \frac{w^\top \hat{\Sigma}_B w}{w^\top \hat{\Sigma}_W w}\]
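For two classes the maximizer of this ratio has a closed form, \(w \propto \hat{\Sigma}_W^{-1}(\hat{\mu}_1 - \hat{\mu}_2)\). The following is a minimal sketch on made-up two-class data; it uses scatter matrices rather than normalized covariances, which changes \(w\) only by a constant factor:

```python
import numpy as np

rng = np.random.default_rng(1)
X0 = rng.normal([0, 0], [1.0, 2.0], size=(100, 2))   # class 0
X1 = rng.normal([3, 1], [1.0, 2.0], size=(100, 2))   # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class (pooled) scatter matrix
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Fisher's direction: maximizes between-class over within-class variance
w = np.linalg.solve(Sw, mu1 - mu0)
w /= np.linalg.norm(w)

# The projected class means are far apart relative to the projected spread
z0, z1 = X0 @ w, X1 @ w
print("projected means:   ", z0.mean(), z1.mean())
print("projected std devs:", z0.std(), z1.std())
```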

LDA’s Main Assumption

The key assumption of LDA is that all classes share the same covariance matrix (\(\Sigma\)). They can have different means (\(\mu_k\)), but their spread and orientation must be identical. This assumption is what results in a linear decision boundary.

## Quadratic Discriminant Analysis (QDA)

QDA is a more flexible extension of LDA that creates a quadratic (curved) decision boundary.

#### Core Idea & Key Assumption

QDA starts with the same principles as LDA but drops the key assumption: QDA assumes that each class has its own unique covariance matrix (\(\Sigma_k\)).

This means each class can have its own spread, shape, and orientation. This additional flexibility allows for a more complex, curved decision boundary.

Key Mathematical Formula

The classification is made using a discrimination function, \(\delta_k(x)\). We assign a data point \(x\) to the class \(k\) for which \(\delta_k(x)\) is largest. The function for QDA is: \[\delta_k(x) = -\frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log(|\Sigma_k|) + \log \pi_k\] The term containing \(x^\top \Sigma_k^{-1} x\) makes this function a quadratic function of \(x\).

## LDA vs. QDA: The Trade-Off

The choice between LDA and QDA is a classic bias-variance trade-off.

  • Use LDA when:

    • The assumption of a common covariance matrix is reasonable (the classes have similar shapes).
    • You have a small amount of training data, as LDA is less prone to overfitting.
    • Simplicity is preferred. LDA is less flexible (high bias) but has lower variance.

  • Use QDA when:

    • The classes have clearly different shapes and spreads (different covariance matrices).
    • You have a large amount of training data to properly estimate the separate covariance matrices for each class.
    • More flexibility is needed. QDA is more flexible (low bias) but can have high variance, meaning it might overfit on smaller datasets.

Rule of Thumb: If the class variances are equal or close, LDA is better. Otherwise, QDA is better.

## Code Understanding (Python Equivalent)

The slides show code in R. Here’s how you would perform LDA and evaluate it in Python using the popular scikit-learn library.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, auc
import matplotlib.pyplot as plt

# Assume 'df' is your DataFrame with features and a 'target' column
# X = df.drop('target', axis=1)
# y = df['target']

# 1. Split data into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Fit an LDA model (equivalent to lda() in R)
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# 3. Make predictions (equivalent to predict() in R)
y_pred_lda = lda.predict(X_test)

# To fit a QDA model, the process is identical:
# qda = QuadraticDiscriminantAnalysis()
# qda.fit(X_train, y_train)
# y_pred_qda = qda.predict(X_test)

# 4. Create a confusion matrix (equivalent to table())
print("LDA Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_lda))

# 5. Plot the ROC Curve (equivalent to the R code for ROC)
# Get prediction probabilities for the positive class
y_pred_proba = lda.predict_proba(X_test)[:, 1]

# Calculate ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Calculate Area Under the Curve (AUC)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'LDA ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--') # Random guess line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

Understanding the ROC Curve

The ROC Curve is another important image. It helps you visualize a classifier's performance across all possible classification thresholds.

  • The Y-axis is the True Positive Rate (Sensitivity): “Of all the actual positives, how many did we correctly identify?”
  • The X-axis is the False Positive Rate: “Of all the actual negatives, how many did we incorrectly label as positive?”
  • A perfect classifier would have a curve that goes straight up to the top-left corner (100% TPR, 0% FPR). The diagonal line represents a random guess. The Area Under the Curve (AUC) summarizes the model’s performance; a value closer to 1.0 is better.

8. Here is a summary of the provided slides on Quadratic Discriminant Analysis (QDA), including the key formulas, code explanations with Python equivalents, and a guide to the most important images.

## Core Concept: QDA vs. LDA

The main difference between Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) lies in their assumptions about the data.

  • LDA assumes that all classes share the same covariance matrix (\(\Sigma\)). It models each class as a normal distribution with a different mean (\(\mu_k\)) but the same shape and orientation. This results in a linear decision boundary between classes.
  • QDA is more flexible. It assumes that each class \(k\) has its own, separate covariance matrix (\(\Sigma_k\)). This allows each class's distribution to have a unique shape, size, and orientation. This flexibility results in a quadratic decision boundary (like a parabola, hyperbola, or ellipse).

Analogy 💡: Imagine you are drawing boundaries around different clusters of stars. LDA gives you only straight lines to separate the clusters. QDA gives you curved lines (circles, ellipses), which can create a much better fit if the clusters themselves are elliptical and point in different directions.

## The Math Behind QDA

QDA classifies a new observation \(x\) to the class \(k\) that has the highest discriminant score, \(\delta_k(x)\). The formula for this score is what makes the boundary quadratic.

The discriminant function for class \(k\) is: \[\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log(|\Sigma_k|) + \log(\pi_k)\]

Let’s break it down:

  • \((x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)\): This is a quadratic term (since it involves \(x^T \Sigma_k^{-1} x\)). It measures the squared Mahalanobis distance from \(x\) to the class mean \(\mu_k\), scaled by that class’s specific covariance \(\Sigma_k\).
  • \(\log(|\Sigma_k|)\): A term that penalizes classes with larger variance.
  • \(\log(\pi_k)\): The prior probability of class \(k\). This is our initial belief about how likely class \(k\) is, before seeing the data.
Because each class \(k\) has its own \(\Sigma_k\), the quadratic term does not cancel out when comparing scores between classes, which leads to a quadratic boundary; a minimal code sketch of this score appears after the trade-off notes below.

Key Trade-off:

  • If the class variances (\(\Sigma_k\)) are truly different, QDA is better.
  • If the class variances are similar, LDA is often better because it is less flexible and less likely to overfit, especially with a small number of training samples.
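A minimal NumPy sketch of this score, assuming the per-class parameters \(\mu_k\), \(\Sigma_k\), and \(\pi_k\) have already been estimated (the numbers below are made up for illustration):

```python
import numpy as np

def qda_score(x, mu, Sigma, pi):
    # delta_k(x) = -0.5 (x - mu)^T Sigma^{-1} (x - mu) - 0.5 log|Sigma| + log(pi)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(pi))

# Two classes with different covariance matrices (hence a quadratic boundary)
params = {
    0: (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.7),
    1: (np.array([2.0, 2.0]), np.array([[2.0, 0.8], [0.8, 0.5]]), 0.3),
}

x_new = np.array([1.5, 1.0])
scores = {k: qda_score(x_new, mu, Sigma, pi) for k, (mu, Sigma, pi) in params.items()}
print(scores, "-> predicted class:", max(scores, key=scores.get))
```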

## Code Implementation: R and Python

The slides provide R code for fitting a QDA model and evaluating it. Below is an explanation of the R code and its equivalent in Python using the popular scikit-learn library.

R Code (from the slides)

The code uses the MASS library for QDA and the ROCR library for evaluation.

# ######## QDA ##########

# 1. Fit the model on the training data
# This formula `Default~.` means "predict 'Default' using all other variables".
qda.fit.mod2 <- qda(Default~., data=Default, subset=train.ids)

# 2. Make predictions on the test data
# We are interested in the posterior probabilities for the ROC curve
qda.fit.pred3 <- predict(qda.fit.mod2, Default_test)$posterior[,2]

# 3. Evaluate using ROC and AUC
# 'prediction' and 'performance' are functions from the ROCR library
perf <- performance(prediction(qda.fit.pred3, Default_test$Default), "auc")

# 4. Get the AUC value
auc_value <- perf@y.values[[1]]
# Result from slide: 0.9638683

Python Equivalent (scikit-learn)

Here’s how you would perform the same steps in Python.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Assume 'Default' is your DataFrame and 'default' is the target column
# (preprocessing 'student' and 'default' columns to numbers)
# Default['default_num'] = Default['default'].apply(lambda x: 1 if x == 'Yes' else 0)
# X = Default[['balance', 'income', ...]]
# y = Default['default_num']

# 1. Split data into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# 2. Initialize and fit the QDA model
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

# 3. Predict probabilities on the test set
# We need the probability of the positive class ('Yes') for the AUC calculation
y_pred_proba = qda.predict_proba(X_test)[:, 1]

# 4. Calculate the AUC score
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"AUC Score for QDA: {auc_score:.7f}")

# You can also plot the ROC curve
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
# plt.plot(fpr, tpr)
# plt.show()

## Model Evaluation: ROC and AUC

The slides correctly emphasize using the ROC curve and the Area Under the Curve (AUC) to compare model performance.

  • ROC Curve (Receiver Operating Characteristic): This plot shows how well a model can distinguish between two classes. It plots the True Positive Rate (y-axis) against the False Positive Rate (x-axis) at all possible classification thresholds. A better model has a curve that is closer to the top-left corner.

  • AUC (Area Under the Curve): This is a single number that summarizes the entire ROC curve.

    • AUC = 1: Perfect classifier.
    • AUC = 0.5: A useless classifier (equivalent to random guessing).
    • AUC > 0.7: Generally considered an acceptable model.

The slides show that for the Default dataset, LDA's AUC (0.9647) was slightly higher than QDA's (0.9639). This suggests that the assumption of a common covariance matrix (LDA) was a slightly better fit for this particular test set, possibly because QDA's extra flexibility was not needed and it may have slightly overfit the training data.

## Key Takeaways and Important Images

Here’s a ranking of the most important visual aids in your slides:

  1. Slide 68/69 (Model Assumption & Formula): These are the most critical slides. They present the core theoretical difference between LDA and QDA and provide the mathematical foundation (the discriminant function formula). Understanding these is key to understanding QDA.

  2. Slide 73 (ROC Comparison): This is the most important image for practical evaluation. It visually compares the performance of LDA and QDA side-by-side, making it easy to see which one performs better on this specific dataset. The concept of AUC is introduced here as the method for comparison.

  3. Slide 71 (Decision Boundaries with Different Thresholds): This is an excellent conceptual image. It shows how the quadratic decision boundary (the curved lines) separates the data points. It also illustrates how changing the probability threshold (from 0.1 to 0.5 to 0.9) shifts the boundary, trading off between precision and recall.

Here is a summary of the remaining slides, which compare QDA to other popular classification models such as Logistic Regression and K-Nearest Neighbors (KNN).


Visualizing the Core Trade-off: LDA vs. QDA

This is the most important concept in these slides. The choice between LDA and QDA depends entirely on the underlying structure of your data.

The slide shows two scenarios:

  1. Left Plot (\(\Sigma_1 = \Sigma_2\)): When the true covariance matrices of the classes are the same, the optimal decision boundary (the Bayes classifier) is a straight line. LDA, which assumes equal covariances, creates a linear boundary that approximates this optimal boundary very well. QDA's flexible, curved boundary is unnecessarily complex and might overfit the training data. In this case, LDA is better.
  2. Right Plot (\(\Sigma_1 \neq \Sigma_2\)): When the true covariance matrices are different, the optimal decision boundary is a curve. QDA's quadratic model can capture this non-linearity much better than LDA's rigid linear model. In this case, QDA is better.

This perfectly illustrates the bias-variance tradeoff. LDA has higher bias (it’s less flexible) but lower variance. QDA has lower bias (it’s more flexible) but higher variance.


Comparing Performance on the “Default” Dataset

The slides compare four different models on the same classification task. Let’s look at their performance using the Area Under the Curve (AUC), where a higher score is better.

  • LDA AUC: 0.9647
  • QDA AUC: 0.9639
  • Logistic Regression AUC: 0.9645
  • K-Nearest Neighbors (KNN): The plot shows test error vs. K. The error is lowest around K=4, but it’s not directly converted to an AUC score in the slides.

Interestingly, for this particular dataset, LDA, QDA, and Logistic Regression perform almost identically. This suggests that the decision boundary for this problem is likely very close to linear, meaning the extra flexibility of QDA isn’t providing much benefit.


Pros and Cons: Which Model to Choose?

The final slide asks for a comparison of the models. Here’s a summary of their key characteristics:

| Model | Type | Decision Boundary | Key Pro | Key Con |
|---|---|---|---|---|
| Logistic Regression | Parametric | Linear | Highly interpretable; no strong assumptions about the data distribution. | Inflexible; cannot capture non-linear relationships. |
| Linear Discriminant Analysis (LDA) | Parametric | Linear | More stable than logistic regression when classes are well separated. | Assumes data is normally distributed with equal covariance matrices for all classes. |
| Quadratic Discriminant Analysis (QDA) | Parametric | Quadratic (curved) | More flexible than LDA; can model non-linear boundaries. | Requires more data to estimate parameters and is more prone to overfitting. Assumes normality. |
| K-Nearest Neighbors (KNN) | Non-parametric | Highly non-linear | Extremely flexible; makes no assumptions about the data's distribution. | Can be slow on large datasets and suffers from the "curse of dimensionality." Less interpretable. |

Summary of the Comparison:

  • Linear Models (Logistic Regression & LDA): Choose these for simplicity, interpretability, and when you believe the relationship between predictors and the class is linear. LDA often outperforms Logistic Regression if its normality assumptions are met.
  • Non-Linear Models (QDA & KNN): Choose these when the decision boundary is likely more complex. QDA is a good middle ground, offering more flexibility than LDA without being as completely data-driven as KNN. KNN is the most flexible but requires careful tuning of the parameter K to avoid overfitting or underfitting.

9. Here is a more detailed, slide-by-slide analysis of the presentation.

4.6 Four Classification Methods: Comparison by Simulation

This section (slides 81-87) introduces four classification methods and systematically compares their performance on six different simulated datasets. The goal is to see which method works best under different conditions (e.g., linear vs. non-linear boundaries, normal vs. non-normal data).

The four methods being compared are:

  • Logistic Regression: a linear method that models the log-odds as a linear function of the predictors.
  • Linear Discriminant Analysis (LDA): another linear method. It also assumes a linear decision boundary but makes stronger assumptions than logistic regression (e.g., that data within each class is normally distributed with a common covariance matrix).
  • Quadratic Discriminant Analysis (QDA): a non-linear method. It assumes the log-odds are a quadratic function, which creates a more flexible, curved decision boundary. It assumes data within each class is normally distributed, but without a common covariance matrix.
  • K-Nearest Neighbors (KNN): a non-parametric, highly flexible method. Two versions are tested:
    • KNN-1 (\(K=1\)): a very flexible (high-variance) model.
    • KNN-CV: a tuned model where the best \(K\) is chosen via cross-validation.

Analysis of Simulation Scenarios

The performance is measured by the test error rate (lower is better), shown in the boxplots for each scenario.
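A minimal sketch of this kind of simulation for a Scenario-1-style setup (two Gaussian classes, shared identity covariance, hence a linear Bayes boundary); the sample sizes, mean shift, and number of repetitions are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def simulate(n):
    """Two Gaussian classes with a shared identity covariance (linear Bayes boundary)."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(0, 1, size=(n, 2)) + y[:, None] * np.array([1.5, 1.5])
    return X, y

models = {
    "Logistic": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN-1": KNeighborsClassifier(n_neighbors=1),
}

errors = {name: [] for name in models}
for rep in range(20):                       # 20 simulated datasets
    X_train, y_train = simulate(100)
    X_test, y_test = simulate(1000)
    for name, model in models.items():
        model.fit(X_train, y_train)
        errors[name].append(np.mean(model.predict(X_test) != y_test))

for name, errs in errors.items():
    print(f"{name:10s} mean test error: {np.mean(errs):.3f}")
```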

  • Scenario 1 (Slide 82):
    • Setup: A linear decision boundary. Data is normally distributed with uncorrelated predictors.
    • Result: LDA and Logistic Regression perform best. Their test error rates are low and similar. This is expected, as the setup perfectly matches their core assumption (linear boundary). QDA is slightly worse because its extra flexibility (being quadratic) is unnecessary. KNN-1 is the worst, as its high flexibility leads to high variance (overfitting).
  • Scenario 2 (Slide 83):
    • Setup: Same as Scenario 1 (linear boundary, normal data), but now the two predictors have a correlation of 0.5.
    • Result: Almost no change from Scenario 1. LDA and Logistic Regression are still the best. This shows that these linear methods are robust to correlation between predictors.
  • Scenario 3 (Slide 84):
    • Setup: A linear decision boundary, but the data is drawn from a t-distribution (which is non-normal and has “heavy tails,” or more extreme outliers).
    • Result: Logistic Regression is the clear winner. LDA’s performance gets worse because its assumption of normality is violated by the t-distribution. QDA’s performance deteriorates significantly due to the non-normality. This highlights a key difference: logistic regression is more robust to violations of the normality assumption.
  • Scenario 4 (Slide 85):
    • Setup: A quadratic decision boundary. Data is normally distributed with different correlations in each class.
    • Result: QDA is the clear winner by a large margin. This setup perfectly matches QDA’s assumption (quadratic boundary from normal data with different covariance structures). All other methods (LDA, Logistic, KNN) are linear or not flexible enough, so they perform poorly.
  • Scenario 5 (Slide 86):
    • Setup: Another quadratic boundary, but generated in a different way (using a logistic function of quadratic terms).
    • Result: QDA performs best again, closely followed by the flexible KNN-CV. The linear methods (LDA, Logistic) have poor performance because they cannot capture the curve.
  • Scenario 6 (Slide 87):
    • Setup: A complex, non-linear decision boundary (more complex than a simple quadratic curve).
    • Result: The flexible KNN-CV method is the winner. Its non-parametric nature allows it to approximate the complex shape. QDA is not flexible enough and performs worse. This slide highlights the bias-variance trade-off: the overly simple KNN-1 is the worst, but the tuned KNN-CV is the best.
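The qualitative pattern of these six scenarios can be reproduced with a short script. The sketch below is a rough illustration rather than the R code behind the slides: the make_moons generator, the sample size, the noise level, and the grid of K values are my own choices, but the fit-and-score loop mirrors the comparison above (Logistic Regression, LDA, QDA, KNN-1, and a cross-validated KNN).

```python
# Illustrative sketch (not the slides' R code): compare the four classifiers on one
# simulated dataset with a non-linear boundary and report test error rates.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "Logistic": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN-1": KNeighborsClassifier(n_neighbors=1),
    # KNN-CV: choose K on the training set via 5-fold cross-validation
    "KNN-CV": GridSearchCV(KNeighborsClassifier(),
                           {"n_neighbors": list(range(1, 52, 2))}, cv=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    test_error = 1 - model.score(X_test, y_test)   # test error rate, as in the boxplots
    print(f"{name:8s} test error rate = {test_error:.3f}")
```

On a dataset like this (a curved boundary), the linear methods typically land near each other, while the tuned KNN and QDA tend to do better, which is the same trade-off the six scenarios illustrate.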

4.7 R Example on Smarket Data

This section (slides 88-93) applies Logistic Regression and LDA to the Smarket dataset from the ISLR package to predict the stock market’s Direction (Up or Down).

Data Preparation (Slides 88, 89, 90)

  1. Load Data: The ISLR library is loaded, and the Smarket dataset is explored. It contains daily percentage returns (Lag1Lag5 for the previous 5 days, Today), Volume, and the Year.
  2. Explore Data: A correlation matrix (cor(Smarket[,-9])) is computed, and a plot of Volume over time is generated.
  3. Split Data: The data is split into a training set (Years 2001-2004) and a test set (Year 2005).
    • train <- (Year<2005)
    • Smarket.2005 <- Smarket[!train,]
    • Direction.2005 <- Direction[!train]
    • The test set has 252 observations.

Model 1: Logistic Regression (All Predictors) (Slide 90)

  • Model: A logistic regression model is fit on the training data using all predictors.
    • glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=Smarket, family=binomial, subset=train)
  • Prediction: The model is used to predict the direction for the 2005 test data.
    • glm.probs <- predict(glm.fit, Smarket.2005, type="response")
    • A threshold of 0.5 is used to classify: if \(P(\text{Up}) > 0.5\), predict “Up”.
  • Results:
    • Test Error Rate: 0.5198 (or 48.0% accuracy).
    • Conclusion: This is “not good!”—it’s worse than flipping a coin. This suggests the model is either too complex or the predictors are not useful.

Model 2: Logistic Regression (Lag1 & Lag2) (Slide 91)

  • Model: Based on the poor results, a simpler model is tried, using only Lag1 and Lag2.
    • glm.fit <- glm(Direction ~ Lag1 + Lag2, data=Smarket, family=binomial, subset=train)
  • Prediction: Predictions are made on the 2005 test set.
  • Results:
    • Test Error Rate: 0.4404 (or 55.95% accuracy). This is an improvement.
    • Confusion Matrix:

      |           | True Down | True Up |
      | :-------- | :-------- | :------ |
      | Pred Down | 77        | 69      |
      | Pred Up   | 35        | 71      |
    • ROC and AUC: The ROC (Receiver Operating Characteristic) curve is plotted, and the AUC (Area Under the Curve) is calculated.
    • AUC Value: 0.5584. This is very close to 0.5 (which represents a random-chance model), indicating that the model has very weak predictive power, even though its accuracy is above 50%.
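For readers who prefer Python to R, the snippet below sketches the same evaluation step (threshold at 0.5, confusion matrix, accuracy, AUC) using scikit-learn. It is not the slides' code: probs_2005 and direction_2005 are hypothetical placeholders for the predicted Up-probabilities and the true 2005 labels, and the values shown are made up only so the snippet runs.

```python
# Rough Python counterpart (not the slides' R code) of the 0.5-threshold /
# confusion-matrix / AUC evaluation on a held-out test set.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

probs_2005 = np.array([0.52, 0.48, 0.55, 0.47, 0.51, 0.44])           # hypothetical P(Up)
direction_2005 = np.array(["Up", "Down", "Down", "Up", "Up", "Down"])  # hypothetical truth

pred_2005 = np.where(probs_2005 > 0.5, "Up", "Down")                   # threshold at 0.5

print(confusion_matrix(direction_2005, pred_2005, labels=["Down", "Up"]))
print("accuracy:", accuracy_score(direction_2005, pred_2005))
print("AUC:     ", roc_auc_score(direction_2005 == "Up", probs_2005))
```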

Model 3: LDA (Lag1 & Lag2) (Slide 92)

  • Model: LDA is now performed using the same setup: Lag1 and Lag2 as predictors, trained on the 2001-2004 data.
    • library(MASS)
    • lda.fit <- lda(Direction ~ Lag1 + Lag2, data=Smarket, subset=train)
  • Prediction: Predictions are made on the 2005 test set.
    • lda.pred <- predict(lda.fit, Smarket.2005)
  • Results:
    • Test Error Rate: 0.4404 (or 55.95% accuracy).
    • Confusion Matrix:

      |           | True Down | True Up |
      | :-------- | :-------- | :------ |
      | Pred Down | 77        | 69      |
      | Pred Up   | 35        | 71      |
    • Observation: The confusion matrix and accuracy are identical to the logistic regression model.

Final Comparison (Slide 93)

  • ROC and AUC for LDA: The ROC curve for the LDA model is plotted.
  • AUC Value: 0.5584.
  • Main Conclusion: As highlighted in the green box, “LDA has identical performance as Logistic regression!” In this specific practical example, using these two predictors, both linear methods produce the exact same confusion matrix, the same accuracy (56%), and the same AUC (0.558). This reinforces the theoretical idea that both are fitting a linear boundary.

4.7 R Example on Smarket Data (Continued)

The previous slides showed that Logistic Regression and Linear Discriminant Analysis (LDA) had identical performance on the Smarket dataset (using Lag1 and Lag2), both achieving 56% test accuracy and an AUC of 0.558. The analysis now tests a more flexible method, QDA.

Model 4: QDA (Lag1 & Lag2) (Slides 94-95)

  • Model: A Quadratic Discriminant Analysis (QDA) model is fit on the same training data (2001-2004) using only the Lag1 and Lag2 predictors.
    • qda.fit <- qda(Direction ~ Lag1 + Lag2, data=Smarket, subset=train)
  • Prediction: The model is used to predict the market direction for the 2005 test set.
  • Results:
    • Test Accuracy: The model achieves a test accuracy of 0.5992 (or 60%).
    • AUC: The Area Under the Curve (AUC) for the QDA model is 0.562.
  • Conclusion: As the slide highlights, “QDA has better test performance than LDA and Logistic regression!”

Smarket Example Summary

| Method | Model Type | Test Accuracy | AUC |
| :--- | :--- | :--- | :--- |
| Logistic Regression | Linear | ~56% | 0.558 |
| LDA | Linear | ~56% | 0.558 |
| QDA | Quadratic | ~60% | 0.562 |

This practical example reinforces the lessons from the simulations (Section 4.6). The two linear methods (LDA, Logistic) had identical performance. The more flexible, non-linear QDA model performed better, suggesting that the true decision boundary between “Up” and “Down” (based on Lag1 and Lag2) is not perfectly linear.

4.8 Kernel LDA

This new section introduces an even more advanced non-linear method, Kernel LDA.

The Problem: Linear Inseparability (Slide 97)

The section starts with a clear visual example. A dataset of two concentric circles (a “donut” shape) is linearly inseparable. It is impossible to draw a single straight line to separate the inner (purple) class from the outer (yellow) class.

The Solution: The Kernel Trick (Slides 97, 99)

  1. Nonlinear Transformation: The data is “lifted” into a higher-dimensional feature space using a nonlinear transformation, \(x \mapsto \phi(x)\). In the example on the slide, the 2D data is transformed, and in this new space, the two classes become linearly separable.
  2. The “Kernel Trick”: The main idea (from slide 99) is that we don’t need to explicitly compute this complex transformation \(\phi(x)\). LDA (in Fisher’s formulation) only requires inner products of the data points. The “kernel trick” replaces the inner product in the high-dimensional feature space, \(\phi(x_i)^T \phi(x_j)\), with a simple kernel function \(k(x_i, x_j)\) that is computed in the original, low-dimensional space.
    • An example of such a kernel is the Gaussian (RBF) kernel: \(k(x_i, x_j) \propto e^{-\|x_i - x_j\|^2 / \sigma^2}\).

Academic Foundations (Slide 98)

This method is based on foundational academic papers that generalized linear methods using kernels:

  • Fisher discriminant analysis with kernels (Mika, 1999)
  • Generalized Discriminant Analysis Using a Kernel Approach (Baudat, 2000)
  • Kernel principal component analysis (Schölkopf, 1997)

In short, Kernel LDA is an extension of LDA that uses the kernel trick to find a linear boundary in a high-dimensional feature space, which corresponds to a highly non-linear boundary in the original space.
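To make the lifting idea concrete, here is a minimal sketch assuming scikit-learn's make_circles for the donut data. It uses an explicit feature map \(\phi(x) = (x_1, x_2, x_1^2 + x_2^2)\) rather than the kernelized Fisher discriminant from the papers above, and the small rbf_kernel helper at the end is only there to show what the kernel matrix \(k(x_i, x_j)\) looks like.

```python
# Minimal sketch of the "lifting" idea (not the algorithm from the cited papers).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

# Plain LDA on the 2D "donut": no straight line separates the two circles.
lda = LinearDiscriminantAnalysis().fit(X, y)
print("LDA accuracy, original space:", lda.score(X, y))          # roughly chance level

# Lift the data: in (x1, x2, x1^2 + x2^2) the classes differ in the third coordinate,
# so a linear boundary (a plane) separates them and LDA succeeds.
Phi = np.column_stack([X, (X ** 2).sum(axis=1)])
lda_lifted = LinearDiscriminantAnalysis().fit(Phi, y)
print("LDA accuracy, lifted space:  ", lda_lifted.score(Phi, y))  # close to 1.0

# The kernel trick avoids computing phi explicitly: methods that only need inner
# products can use a kernel such as the Gaussian (RBF) kernel instead.
def rbf_kernel(A, B, sigma=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / sigma ** 2)

K = rbf_kernel(X, X)   # n x n Gram matrix, computed entirely in the original 2D space
print("kernel matrix shape:", K.shape)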

1. The QM9 Dataset’s XYZ Format in Detail

The “XYZ-like” format used by this dataset is an extended, non-standard XYZ format:

| Line(s) | Content | Description |
| :--- | :--- | :--- |
| Line 1 | na | An integer giving the total number of atoms in the molecule. |
| Line 2 | Properties 1-17 | The values of 17 physicochemical properties, separated by tabs or spaces. |
| Lines 3 to na+2 | Element x y z charge | One atom per line: element symbol, x/y/z coordinates (in Å), and Mulliken partial charge (in e). |
| Line na+3 | Frequencies | The molecule's vibrational frequencies (3na-5 or 3na-6 values). |
| Line na+4 | SMILES_GDB9 SMILES_relaxed | The SMILES string from GDB9 and the SMILES string of the relaxed geometry. |
| Line na+5 | InChI_GDB9 InChI_relaxed | The corresponding InChI strings. |

Comparison with the standard XYZ format:

  • The standard format has only line 1 (the atom count), line 2 (a comment), and the subsequent atom lines (element symbol and xyz coordinates only).
  • The QM9 format inserts a large block of property data on line 2, adds a charge column to every atom line, and appends frequencies, SMILES, and InChI information at the end of the file.
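As a companion to the table above, the sketch below reads line 2 of a file and pairs its values with names. The name list is an assumption based on the dataset readme and should be double-checked against the readme shipped with the data before use.

```python
# Illustrative sketch: read line 2 of a QM9 file and label its 17 fields.
# The property names/order below are an assumption (tag, index, A, B, C, mu, alpha,
# homo, lumo, gap, r2, zpve, U0, U, H, G, Cv) and must be verified against the readme.
QM9_PROPERTY_NAMES = [
    "tag", "index", "A", "B", "C", "mu", "alpha", "homo", "lumo",
    "gap", "r2", "zpve", "U0", "U", "H", "G", "Cv",
]

def read_qm9_properties(file_path):
    with open(file_path) as f:
        f.readline()                   # line 1: atom count (not needed here)
        values = f.readline().split()  # line 2: the 17 property fields
    return dict(zip(QM9_PROPERTY_NAMES, values))

# Hypothetical usage:
# props = read_qm9_properties("dsC7O2H10nsd_0001.xyz")
# print(props["homo"], props["lumo"], props["gap"])
```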

2. readme

  1. Core contents of the dataset:
    • It contains quantum-chemistry data for 133,885 small organic molecules composed of H, C, N, O, and F.
    • All molecular geometries were optimized at the DFT/B3LYP/6-31G(2df,p) level.
    • dsC7O2H10nsd.xyz.tar.bz2 is a subset of the dataset containing the 6,095 constitutional isomers of C₇H₁₀O₂, whose energetics were computed at the higher-accuracy G4MP2 level of theory.
  2. File structure and format:
    • Each molecule is stored in its own .xyz file, using the extended, non-standard XYZ format described above.
    • The 17 physicochemical properties recorded on line 2 are listed in detail, including the rotational constants (A, B, C), dipole moment (mu), HOMO/LUMO energies, zero-point vibrational energy (zpve), internal energy (U), enthalpy (H), and Gibbs free energy (G).
  3. Data source and computational methods:
    • The data are derived from the GDB-9 chemical database.
    • Two levels of quantum-chemical theory are used: B3LYP for most properties, and G4MP2 for the energies of the C₇H₁₀O₂ subset.
  4. Citation requirement:
    • The file explicitly asks users of the dataset to cite the paper by Raghunathan Ramakrishnan et al. published in Scientific Data in 2014.
  5. Other information:
    • It describes a few additional files (e.g., validation.txt, uncharacterized.txt).
    • It notes that a small number of molecules were difficult to converge during geometry optimization.

3. Visualization

import ase.io
import nglview as nv
import io

def parse_qm9_xyz(file_path):
    """
    Parses a QM9 extended XYZ file and returns a standard XYZ string.
    """
    with open(file_path, 'r') as f:
        lines = f.readlines()

    # First line is the number of atoms
    num_atoms = int(lines[0].strip())

    # The next line is properties (skip it)
    # The next num_atoms lines are the coordinates
    coord_lines = lines[2:2 + num_atoms]

    # Rebuild a standard XYZ format string in memory
    standard_xyz = f"{num_atoms}\n"
    standard_xyz += "Comment line\n"  # Add a standard comment line
    for line in coord_lines:
        parts = line.split()
        # Keep only the element and the x, y, z coordinates
        standard_xyz += f"{parts[0]} {parts[1]} {parts[2]} {parts[3]}\n"

    return standard_xyz

# Path to your data file
file_path = "/root/QM9/QM9/Data_for_6095_constitutional_isomers_of_C7H10O2.xyz/dsC7O2H10nsd_0001.xyz"

# 1. Parse the special file format into a standard XYZ string
standard_xyz_data = parse_qm9_xyz(file_path)

# 2. ASE reads the standard XYZ data from the string variable
#    We use io.StringIO to make the string behave like a file
atoms = ase.io.read(io.StringIO(standard_xyz_data), format="xyz")

# 3. Create the nglview visualization widget
view = nv.show_ase(atoms)
view.add_ball_and_stick()

# Display the widget in the notebook output
view

  1. Defining the parser parse_qm9_xyz:
    • Purpose: a dedicated helper for the QM9-specific format; the body is clear and easy to reuse.
    • Reading the file: with open(...) opens the file safely, and f.readlines() reads every line into the list lines at once.
    • Extracting the atom count: num_atoms = int(lines[0].strip()) reads the first line (lines[0]), strips any surrounding whitespace (.strip()), and converts it to an integer. This is needed to rebuild a standard XYZ block.
    • Extracting the coordinate lines: coord_lines = lines[2:2+num_atoms]. The coordinates start on line 3 (index 2) and run for num_atoms lines, so this slice picks exactly the atom-coordinate lines and skips the property line (line 2).
    • Building the standard XYZ string:
      • Create a new string named standard_xyz.
      • First write the atom count followed by a newline.
      • Then add a standard comment line ("Comment line"), which the standard XYZ format requires.
      • Finally, iterate over coord_lines. Each line is split with .split() into parts (e.g. ['C', 'x', 'y', 'z', 'charge']); only the first four parts (element symbol and xyz coordinates) are kept and re-joined into a new line, discarding the trailing Mulliken charge.
    • Return value: the function returns a clean string containing standard XYZ data.
  2. Main script flow:
    • Call the parser: standard_xyz_data = parse_qm9_xyz(file_path) converts the file into a standard-format string.
    • Read from memory: ase.io.read(io.StringIO(standard_xyz_data), format="xyz") is efficient because io.StringIO presents the string standard_xyz_data as an in-memory text file, so ase.io.read can consume it directly without first writing the cleaned data to a temporary file, saving disk I/O.
    • Visualization: the remaining code (nv.show_ase, etc.) then works as originally intended, since the atoms object has been created from clean, standard data.

(Rendered molecule: one of the C7O2H10 isomers)

Fusing Sequence and Structural Information for Unified Protein Representation Learning

FusionProt

1. Protein Representation Learning

  • Content: FusionProt uses a learnable fusion token and iterative, bidirectional information exchange to learn sequence and structure jointly and dynamically, rather than by static concatenation.

2. 1D Amino-Acid Sequences and 3D Spatial Structures

  • Single-modality reliance: ProteinBERT and ESM-2 are based on the sequence only.

  • Static-fusion limitation: ESM-GearNet and SaProt combine sequence and structure, but via a “one-way / one-shot” fusion.

Below is a walkthrough of the FusionNetwork model architecture code.

3. Model Overview

(Figure: FusionNetwork fusion architecture)

@R.register("models.FusionNetwork")
class FusionNetwork(nn.Module, core.Configurable):

    def __init__(self, sequence_model, structure_model, fusion="series", cross_dim=None):
        super(FusionNetwork, self).__init__()
        self.sequence_model = sequence_model
        self.structure_model = structure_model
        self.output_dim = sequence_model.output_dim + structure_model.output_dim
        self.inject_step = 5  # (sequence_layers / structure_layers) layers

  • class FusionNetwork(...): defines the model class, which inherits from PyTorch’s base module nn.Module.
  • __init__(...): the constructor takes an already-initialized sequence_model and structure_model as inputs.
  • self.output_dim: the dimensionality of the model’s final output features. Since the two models’ features are concatenated at the end, it is the sum of their output dimensions.
  • self.inject_step = 5: how often information is “injected”, i.e., exchanged. A value of 5 means one exchange after every 5 layers of the sequence model.
# Structure embeddings layer
raw_input_dim = 21 # amino acid tokens
self.structure_embed_linear = nn.Linear(raw_input_dim, structure_model.input_dim)
self.embedding_batch_norm = nn.BatchNorm1d(structure_model.input_dim)
  • self.structure_embed_linear: a linear layer that maps the raw structural input (e.g., one-hot encodings over the 21 amino-acid tokens) to the input dimension expected by the structure model (the GNN).
  • self.embedding_batch_norm: a batch-normalization layer that stabilizes training of the structure embedding.
# Normal Initialization of the 3D structure token
structure_token = nn.Parameter(torch.Tensor(structure_model.input_dim).unsqueeze(0))
nn.init.normal_(structure_token, mean=0.0, std=0.01)
self.structure_token = nn.Parameter(structure_token.squeeze(0))
  • self.structure_token: a learnable vector (nn.Parameter). This “token” does not correspond to any real atom or residue; it is an abstract carrier that, during training, learns to encode the protein’s global 3D structural information. It acts as an information messenger.
# Linear Transformation between structure to sequential spaces
self.structure_linears = nn.ModuleList([...])
self.seq_linears = nn.ModuleList([...])
  • self.structure_linears / self.seq_linears: the sequence model and the structure model may use feature vectors of different dimensionality internally. Whenever the “3D token” is passed between the two models, these linear layers map its representation from one model’s feature space into the other’s.

4. Forward Pass

def forward(self, graph, input, all_loss=None, metric=None):
    # Build a new protein graph with the 3D token (the last node)
    new_graph = self.build_protein_graph_with_3d_token(graph)
  • A helper function first rebuilds the input protein graph: it adds a new node representing the “3D token” and connects this new node to every other node in the graph.

Sequence model initialization
# Sequence (ESM) model initialization
sequence_input = self.sequence_model.mapping[graph.residue_type]
sequence_input[sequence_input == -1] = graph.residue_type[sequence_input == -1]
size = graph.num_residues

# Check if sequence size is not bigger than max seq length
if (size > self.sequence_model.max_input_length).any():
    starts = size.cumsum(0) - size
    size = size.clamp(max=self.sequence_model.max_input_length)
    ends = starts + size
    mask = functional.multi_slice_mask(starts, ends, graph.num_residues)
    sequence_input = sequence_input[mask]
    graph = graph.subresidue(mask)
size_ext = size

# BOS == CLS
if self.sequence_model.alphabet.prepend_bos:
    bos = torch.ones(graph.batch_size, dtype=torch.long, device=self.sequence_model.device) * self.sequence_model.alphabet.cls_idx
    sequence_input, size_ext = functional._extend(bos, torch.ones_like(size_ext), sequence_input, size_ext)

if self.sequence_model.alphabet.append_eos:
    eos = torch.ones(graph.batch_size, dtype=torch.long, device=self.sequence_model.device) * self.sequence_model.alphabet.eos_idx
    sequence_input, size_ext = functional._extend(sequence_input, size_ext, eos, torch.ones_like(size_ext))

# Padding
tokens = functional.variadic_to_padded(sequence_input, size_ext, value=self.sequence_model.alphabet.padding_idx)[0]
repr_layers = [self.sequence_model.repr_layer]
assert tokens.ndim == 2
padding_mask = tokens.eq(self.sequence_model.model.padding_idx)  # B, T
  • Standard preprocessing of the sequence data for a Transformer model such as ESM.
  • This includes adding beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens, and padding all sequences to the same length so they can be batched.

Model initialization and the first fusion
# Sequence embedding layer
x = self.sequence_model.model.embed_scale * self.sequence_model.model.embed_tokens(tokens)

if self.sequence_model.model.token_dropout:
    x.masked_fill_((tokens == self.sequence_model.model.mask_idx).unsqueeze(-1), 0.0)
    # x: B x T x C
    mask_ratio_train = 0.15 * 0.8
    src_lengths = (~padding_mask).sum(-1)
    mask_ratio_observed = (tokens == self.sequence_model.model.mask_idx).sum(-1).to(x.dtype) / src_lengths
    x = x * (1 - mask_ratio_train) / (1 - mask_ratio_observed)[:, None, None]

# Structure model initialization
structure_hiddens = []
batch_size = graph.batch_size
structure_embedding = self.embedding_batch_norm(self.structure_embed_linear(input))
structure_token_batched = self.structure_token.unsqueeze(0).expand(batch_size, -1)
structure_input = torch.cat([structure_embedding.squeeze(1), structure_token_batched], dim=0)

# Add the 3D token representation
structure_token_expanded = self.structure_token.unsqueeze(0).expand(x.size(0), -1).unsqueeze(1)
x = torch.cat((x[:, :-1], structure_token_expanded, x[:, -1:]), dim=1)
padding_mask = torch.cat([padding_mask[:, :-1],
                          torch.zeros(padding_mask.size(0), 1).to(padding_mask),
                          padding_mask[:, -1:]], dim=1)
size_ext += 1

if padding_mask is not None:
    x = x * (1 - padding_mask.unsqueeze(-1).type_as(x))

repr_layers = set(repr_layers)
hidden_representations = {}
if 0 in repr_layers:
    hidden_representations[0] = x

# (B, T, E) => (T, B, E)
x = x.transpose(0, 1)
if not padding_mask.any():
    padding_mask = None
  • Inserting the 3D token into the sequence:
    1. Generate the initial token embeddings x for the sequence.
    2. Insert the initial state of self.structure_token into the sequence embedding x, placed just before the end-of-sequence (EOS) token.
    3. The input seen by the sequence model thus becomes [BOS, residue 1, residue 2, ..., residue N, **3D token**, EOS].

Fusion loop
for seq_layer_idx, seq_layer in enumerate(self.sequence_model.model.layers):
    x, attn = seq_layer(
        x,
        self_attn_padding_mask=padding_mask,
        need_head_weights=False,
    )
    if (seq_layer_idx + 1) in repr_layers:
        hidden_representations[seq_layer_idx + 1] = x.transpose(0, 1)
  • The model then iterates over all layers of the sequence model (e.g., the Transformer encoder layers); x is updated at every layer.
if seq_layer_idx > 0 and seq_layer_idx % self.inject_step == 0:
  • Injection point: whenever the layer index is divisible by inject_step (i.e., 5), an information exchange is triggered.
# 1. Extract the 3D token's current representation from the sequence
if structure_layer_index == 0:
    structure_input = torch.cat((structure_input[:-1 * batch_size], x[-2, :, :]), dim=0)
else:
    structure_input = torch.cat((structure_input[:-1 * batch_size],
                                 self.seq_linears[structure_layer_index](x[-2, :, :])), dim=0)

# 2. Process it with one layer of the structure model
hidden = self.structure_model.layers[structure_layer_index](new_graph, structure_input)
if self.structure_model.short_cut and hidden.shape == structure_input.shape:
    hidden = hidden + structure_input
if self.structure_model.batch_norm:
    hidden = self.structure_model.batch_norms[structure_layer_index](hidden)

structure_hiddens.append(hidden)
structure_input = hidden

# 3. Insert the updated 3D token representation back into the sequence
updated_structure_token = self.structure_linears[...](structure_input[-1 * batch_size:])
x = torch.cat((x[:-2, :, :], updated_structure_token.unsqueeze(0), x[-1:, :, :]), dim=0)
structure_layer_index += 1
  • Information flow:
    1. Sequence to structure: the model extracts the latest vector of the “3D token” from the sequence representation x; by this point it has absorbed contextual information from the preceding sequence layers. After a transform (seq_linears), it is written into the structure model’s input.
    2. Structure processing: one layer of the structure model (a GNN) is run. It updates the representations of all nodes according to the graph connectivity, including the special “3D token” node.
    3. Structure to sequence: the “3D token” vector, now carrying the updated structural information, is extracted from the GNN output, transformed (structure_linears), and inserted back into the sequence representation x, replacing the old version.

This loop repeats as the remaining sequence layers are traversed.
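To summarize the exchange in isolation, here is a self-contained toy in plain PyTorch. It is not the FusionProt code: the "structure model" is replaced by bare linear layers, all dimensions are made up, and the token is appended at the very end of the sequence rather than just before EOS, but the shuttle pattern (run a sequence layer, write the token into the structure input, run one structure layer, write the token back) is the same.

```python
# Self-contained toy of a learnable fusion token shuttling between a sequence
# encoder and a stand-in "structure" encoder. Illustrative sizes and layers only.
import torch
import torch.nn as nn

d_seq, d_struct, num_residues = 64, 32, 10

seq_layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model=d_seq, nhead=4, batch_first=True)
                            for _ in range(4)])
struct_layers = nn.ModuleList([nn.Linear(d_struct, d_struct) for _ in range(2)])  # stand-in for GNN layers
seq_to_struct = nn.Linear(d_seq, d_struct)   # map the token into the structure space
struct_to_seq = nn.Linear(d_struct, d_seq)   # map it back into the sequence space

fusion_token = nn.Parameter(torch.randn(d_seq) * 0.01)   # learnable fusion ("3D") token
x = torch.randn(1, num_residues, d_seq)                  # toy residue embeddings
x = torch.cat([x, fusion_token.view(1, 1, -1)], dim=1)   # append the token to the sequence
struct_state = torch.randn(num_residues + 1, d_struct)   # toy node features (+ token node)

inject_step, struct_idx = 2, 0
for i, layer in enumerate(seq_layers):
    x = layer(x)                                         # sequence layer updates everything, incl. the token
    if i > 0 and i % inject_step == 0:
        struct_state[-1] = seq_to_struct(x[0, -1])       # sequence -> structure: overwrite the token node
        struct_state = torch.relu(struct_layers[struct_idx](struct_state))  # one "structure" layer
        x = torch.cat([x[:, :-1],
                       struct_to_seq(struct_state[-1]).view(1, 1, -1)], dim=1)  # structure -> sequence
        struct_idx += 1

print(x.shape, struct_state.shape)
```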

Output
# Structural Output
if self.structure_model.concat_hidden:
    structure_node_feature = torch.cat(structure_hiddens, dim=-1)[:-1 * batch_size]
else:
    structure_node_feature = structure_hiddens[-1][:-1 * batch_size]

structure_graph_feature = self.structure_model.readout(graph, structure_node_feature)

# Sequence Output
x = self.sequence_model.model.emb_layer_norm_after(x)
x = x.transpose(0, 1)  # (T, B, E) => (B, T, E)

# last hidden representation should have layer norm applied
if (seq_layer_idx + 1) in repr_layers:
    hidden_representations[seq_layer_idx + 1] = x
x = self.sequence_model.model.lm_head(x)

output = {"logits": x, "representations": hidden_representations}

# Sequence (ESM) model outputs
residue_feature = output["representations"][self.sequence_model.repr_layer]
residue_feature = functional.padded_to_variadic(residue_feature, size_ext)
starts = size_ext.cumsum(0) - size_ext
if self.sequence_model.alphabet.prepend_bos:
    starts = starts + 1
ends = starts + size
mask = functional.multi_slice_mask(starts, ends, len(residue_feature))
residue_feature = residue_feature[mask]
graph_feature = self.sequence_model.readout(graph, residue_feature)

# Combine both models outputs
node_feature = torch.cat(...)
graph_feature = torch.cat(...)

return {"graph_feature": graph_feature, "node_feature": node_feature}
  • Extracting outputs: after the loop, the final feature representations are extracted from each of the two models.
  • Readout: a readout function (such as sum or mean) aggregates the node-level features into a graph-level feature that represents the whole protein.
  • Final combination: the node features (node_feature) and graph features (graph_feature) from the sequence model and the structure model are concatenated.
  • Return value: a dictionary containing the combined features, which can be used for downstream tasks (e.g., function prediction, property regression).