10.4 指数损失函数与自适应提升

本节说明了自适应提升 M1（算法 10.1，adaboost.M1）等价于使用以下损失函数的前向分段加性模型（算法 10.2）。

$$L(y, f(x)) = \exp(- y f(x)) \tag{10.8}$$

而这个准则的合理性会在下一节讨论。

在自适应提升中，基函数即为每个分类器 $G_m(x)\in\{-1,1\}$。使用指数损失函数，最小化问题可写为：

$$(\beta_m, G_m) = \underset{\beta,G}{\arg\min} \sum_{i=1}^N \exp[-y_i(f_{m-1}(x_i) + \beta G(x_i))]$$

在每一步要解出添加至模型中的分类器 $G_m$ 和对应的系数 $\beta_m$。上式可写为：

$$(\beta_m, G_m) = \underset{\beta,G}{\arg\min} \sum_{i=1}^N w_i^{(m)}\exp(-\beta y_i G(x_i)) \tag{10.9}$$

其中的 $w_i^{(m)}=\exp(-y_if_{m−1}(x_i))$。由于每个 $w_i^{(m)}$ 既不依赖于 $\beta$ 也不依赖于 $G(x)$，可以将其视为赋予每个样本的权重。这个权重依赖于 $f_{m-1}(x_i)$，所以每一步迭代 $m$ 中样本的权重值会一直变化。

式 10.9 的解可通过两个步骤得出。首先，对任意给定的 $\beta>0$，式 10.9 中 $G_m(x)$ 的解为：

$$G_m = \underset{G}{\arg\min} \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i)) \tag{10.10}$$

也就是对预测 $y$ 的加权误差率最小化的分类器。为证明这一点，可将式 10.9 中的准则重写为：

$$e^{-\beta} \cdot \sum_{y_i = G(x_i)}w_i^{(m)} + e^{\beta} \cdot \sum_{y_i \neq G(x_i)}w_i^{(m)}$$

这又可写为：

$$(e^{-\beta} - e^{\beta}) \cdot \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i)) + e^{-\beta} \cdot \sum_{i=1}^N w_i^{(m)} \tag{10.11}$$

将这个 $G_m$ 带入回表达式 10.9，并对 $\beta$ 求解，可得：

$$\beta_m = \frac{1}{2} \log \frac{1 - \text{err}_m}{\text{err}_m} \tag{10.12}$$

其中的 $\text{err}_m$ 为最小化的加权误差率：

$$\text{err}_m = \frac{\sum_{i=1}^N w_i^{(m)}I(y_i \neq G(x_i))} {\sum_{i=1}^N w_i^{(m)}} \tag{10.13}$$

然后更新模型的函数 $f_m(x)=f_{m−1}(x)+\beta_mG_m(x)$，这也给出了下一个迭代中的权重为：

$$w_i^{(m+1)} = w_i^{(m)} \cdot e^{-\beta_m y_i G_m(x_i)} \tag{10.14}$$

利用关系等式 $-y_iG_m(x_i)=2I(y_i\neq G_m(x_i))-1$，式 10.14 变为：

$$w_i^{(m+1)}= w_i^{(m)} \cdot e^{-\alpha_m I(y_i \neq G_m(x_i))} \cdot e^{-\beta_m} \tag{10.15}$$

其中的 $\alpha_m=2\beta_m$ 就等于自适应提升 M1（算法 10.1）的步骤 2.3 中所定义的数值。所有的权重都与式 10.15 中的同一个因子 $e^{-\beta_m}$ 相乘，所以它对结果没有影响。因此式 10.15 等价于算法 10.1 的步骤（2.4）。

可以将自适应提升 M1 算法的步骤（2.1）视为求解式 10.11，以及式 10.10，中最小化求解的近似方法，这样就证明了自适应提升 M1 的结果是前向分段加性模型中指数损失准则（式 10.8）的最小化解。

**图 10.3**：模拟数据，树桩模型的提升方法：训练集上的误分类误差率以及平均指数损失函数 $(1/N)\sum_{i=1}^N\exp(-y_if(x_i))$。大约 250 次迭代之后，误分类误差率降为零，而指数损失函数仍在继续下降中。

图 10.3 展示了在图 10.2 的模拟数据问题（式 10.2）中的训练集误分类误差率和平均指数损失函数。训练集误分类误差率在大约 250 次迭代后降至零（并保持为零），但指数损失函数仍继续下降。同时注意到在图 10.2 中，测试集误分类误差率在 250 次迭代之后仍在持续改善。很明显，自适应提升不是对训练集误分类误差最优化；指数损失函数对估计的类别概率更加敏感。

本节练习

练习 10.1

Derive expression (10.12) for the update parameter in AdaBoost.

练习 10.4

Write a program implementing AdaBoost with trees.
Redo the computations for the example of Figure 10.2. Plot the training error as well as test error, and discuss its behavior.
Investigate the number of iterations needed to make the test error finally start to rise.
Change the setup of this example as follows: define two classes, with the features in Class 1 being X1 , X2 , . . . , X10 , standard independent Gaussian variates. In Class 2, the features X1 , X2 , . . . , X10 are also standard independent Gaussian, but conditioned on the event $\sum_jX_j^2>12$. Now the classes have significant overlap in feature space. Repeat the AdaBoost experiments as in Figure 10.2 and discuss the results.