7.7 贝叶斯方法和 BIC

同 AIC 类似，贝叶斯信息量准则（Bayesian information criterion，BIC）适用于通过对数似然函数的最大化来进行拟合的模型中。BIC 的一般形式为：

$$\text{BIC} = -2 \cdot \text{loglik} + (\log N) \cdot d\tag{7.35}$$

BIC 统计量（乘以 $1/2$ 后）也被称为 施瓦茨（Schwarz） 准则（Schwarz, 1978）。

在高斯模型中，假设已知方差 $\sigma^2_\varepsilon$，$-2\cdot\text{loglik}$（忽略一个常数后）等于 $\sum_i(y_i-\hat{f}(x_i))^2/\sigma^2_\varepsilon$，即为平方误差损失中的 $N\cdot\overline{\text{err}}/\sigma^2_\varepsilon$。因此可得：

$$\text{BIC} = \frac{N}{\sigma^2_\varepsilon} \left[ \overline{\text{err}} + (\log N) \cdot \frac{d}{N} \sigma^2_\varepsilon \right] \tag{7.36}$$

因此 BIC 与 AIC（$C_p$）成比例，并将因子 2 替换为 $\log N$。假设 $N>e^2\approx7.4$，则 BIC 倾向于加重对复杂模型的惩罚，更偏好于简单模型。同 AIC 一样，在低偏差模型中 $\sigma^2_\varepsilon$ 通常用均方差来估计。在分类问题中，使用多项分布的对数似然函数的 BIC，和使用交叉熵作为误差度量的 AIC 比较相似。然而需要注意在 BIC 中误分类误差度量并不适用，因为在任何概率模型中它都无法对应到样本的对数似然度。

尽管与 AIC 很相似，但 BIC 的出发点却非常不同。如下所述，它来自于使用贝叶斯方法的模型选择。

假设有一些待选模型 $\mathcal{M}_m$，$m=1,\dots,M$，模型的参数为 $\theta_m$，我们想从中选择最优的模型。假设每个模型 $\mathcal{M}_m$ 的先验概率为 $\operatorname{Pr}(\theta_m|\mathcal{M}_m)$，那么一个模型的后验概率为：

$$\begin{align} \operatorname{Pr}(\mathcal{M}_m|\mathbf{Z}) & \propto \operatorname{Pr}(\mathcal{M}_m) \cdot \operatorname{Pr}(\mathbf{Z}|\mathcal{M}_m)\\ & \propto \operatorname{Pr}(\mathcal{M}_m) \cdot \int \operatorname{Pr}(\mathbf{Z}|\theta_m, \mathcal{M}_m) \operatorname{Pr}(\theta_m|\mathcal{M}_m) d \theta_m \end{align}$$ $$\tag{7.37}$$

其中 $\mathbf{Z}$ 代表训练数据 $\{x_i,y_i\}_1^N$。在比较模型 $\mathcal{M}_m$ 和 $\mathcal{M}_\ell$ 时，计算后验几率：

$$\frac{\operatorname{Pr}(\mathcal{M}_m | \mathbf{Z})} {\operatorname{Pr}(\mathcal{M}_\ell | \mathbf{Z})} = \frac{\operatorname{Pr}(\mathcal{M}_m)} {\operatorname{Pr}(\mathcal{M}_\ell)} \cdot \frac{\operatorname{Pr}(\mathbf{Z} | \mathcal{M}_m)} {\operatorname{Pr}(\mathbf{Z} | \mathcal{M}_\ell)} \tag{7.38}$$

若几率大于一则选择模型 $m$，否则选择模型 $\ell$。上式最右侧的比例

$$\operatorname{BF}(\mathbf{Z}) = \frac{\operatorname{Pr}(\mathbf{Z} | \mathcal{M}_m)} {\operatorname{Pr}(\mathbf{Z} | \mathcal{M}_\ell)} \tag{7.39}$$

被称为 贝叶斯因子（Bayes factor），它是数据对后验几率的贡献。

通常假设模型的先验概率是均匀的，所以 $\operatorname{Pr}(\mathcal{M}_m)$ 是一个常数。接下来需要某种近似 $\operatorname{Pr}(\mathbf{Z}|\mathcal{M}_m)$ 的方法。经过对表达式 7.37 的某种简化后，可以对其中的积分进行拉普拉斯（Laplace）近似（Ripley 1996，第 64 页）：

$$\log\operatorname{Pr}(\mathbf{Z} | \mathcal{M}_m) = \log \operatorname{Pr}(\mathbf{Z} | \hat{\theta}_m, \mathcal{M}_m) - \frac{d_m}{2} \cdot \log N + O(1) \tag{7.40}$$

其中 $\hat{\theta}_m$ 为最大似然估计，$d_m$ 为模型 $\mathcal{M}_m$ 中的自由参数个数。若将损失函数定义为：

$$ -2 \log\operatorname{Pr}(\mathbf{Z} | \hat{\theta}_m, \mathcal{M}_m)$$

则这与式 7.35 中的 BIC 是等价的。

因此，通过最小化 BIC 选择模型等价于选择最大（近似）后验概率的模型。不过这个框架可以给出更多信息。如果对 $M$ 个模型计算 BIC 准则，得到 $\text{BIC}_m$，$m=1,2,\dots,M$，那么可以将每个模型 $\mathcal{M}_m$ 的后验概率估计为：

$$\frac{e^{\frac{1}{2} \cdot \text{BIC}_m}} {\sum_{\ell=1}^M e^{\frac{1}{2} \cdot \text{BIC}_\ell}} \tag{7.41}$$

因此我们不仅可以选择最优模型，同时也可以得到待选模型的相对表现度量。

对于模型选择使用 AIC 或 BIC，并没有定论。作为选择准则，BIC 有渐进一致性。这意味着给定包含了真实模型的一个模型族，BIC 会选择到正确模型的概率会随着样本量增加 $N\to\infty$ 而趋近于一。与之不同，AIC 在 $N\to\infty$ 时会趋于选择过复杂的模型。但另一方面，在有限样本中,由于 BIC 对复杂度的惩罚更重，经常会选择过于简单的模型。