7 Model Assessment and Selection

The generalization performance of a learning method refers to its prediction capability on independent test data. Assessing this performance is extremely important in practice, since it guides the choice of learning method or model, and also gives us a measure of the quality of the finally chosen model.

This chapter describes and illustrates the key methods for performance assessment, and shows how they are used to select models. Before that, it begins with a discussion of the interplay between bias, variance and model complexity.


Overview

  • 7.2 Bias, Variance and Model Complexity

    Pages 219–223. This section lays out the general concepts of model selection and assessment. Ideally, in a data-rich setting, the data are split into a training set, a validation set and a test set, used respectively to fit the models, to choose among them, and to assess the final choice. Both selection and assessment are judged by test error, not training error.
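
As a minimal stdlib sketch of such a split (the function name and the 50/25/25 proportions here are illustrative assumptions, not prescriptions from the book):

```python
import random

def train_val_test_split(data, frac_train=0.5, frac_val=0.25, seed=0):
    """Shuffle the data and split it into training, validation and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]               # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * frac_train)
    n_val = int(n * frac_val)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))   # 50 25 25
```

The test set is held back until the very end; peeking at it while selecting the model turns it into a second validation set and biases the final error estimate.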

  • 7.3 The Bias–Variance Decomposition

    Pages 223–228. Theory, figures and worked examples decompose the discrepancy between a fitted model and the true function. Bias and variance are usually traded off against each other: reducing one tends to increase the other. Because the 0–1 loss is discrete, prediction error behaves differently under it than under squared-error loss.
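
The decomposition can be checked by simulation. In this sketch (the sine target, noise level and sample sizes are illustrative assumptions) a k-nearest-neighbour average is refit on many training sets, and its expected squared error at a point splits exactly into bias² + variance:

```python
import math
import random
import statistics

def f(x):
    """The true regression function (assumed for this illustration)."""
    return math.sin(2 * math.pi * x)

def bias_variance_at(k, x0=0.25, n=30, sigma=0.3, reps=2000, seed=0):
    """Refit a k-NN average at x0 over many training sets, then split its
    expected squared error into bias^2 + variance."""
    rng = random.Random(seed)
    fits = []
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]
        ys = [f(x) + rng.gauss(0, sigma) for x in xs]
        # average the responses of the k training points nearest to x0
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        fits.append(sum(ys[i] for i in nearest) / k)
    bias2 = (statistics.mean(fits) - f(x0)) ** 2
    var = statistics.pvariance(fits)
    mse = statistics.mean((fit - f(x0)) ** 2 for fit in fits)
    return bias2, var, mse
```

A flexible fit (k = 1) shows low bias and high variance; a rigid fit (k = 25) reverses the trade, and in both cases mse equals bias² + variance up to rounding.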

  • 7.4 Optimism of the Training Error Rate

    Pages 228–230. Several different notions of model "error" are described in more detail. The training error underestimates the generalization error, and this gap is defined as the "optimism". The cause is not only the difference between out-of-sample and in-sample prediction: even in-sample, the optimism grows with how hard the model fits the data.
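
The in-sample optimism can be observed directly: fit a model, then redraw fresh responses at the same inputs and re-score the old fit. A small sketch for a straight-line fit (the linear target and noise level are illustrative assumptions; for squared error the theory gives average optimism ≈ 2·d·σ²/N, here 2·2·1/20 = 0.2):

```python
import random
import statistics

def ols_fit(xs, ys):
    """Least-squares line y = a + b*x."""
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

def mean_optimism(n=20, sigma=1.0, reps=4000, seed=0):
    """Average gap between in-sample test error (fresh responses at the
    same inputs) and training error, for a straight-line fit (d = 2)."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]          # fixed design points
    gaps = []
    for _ in range(reps):
        ys = [2 + 3 * x + rng.gauss(0, sigma) for x in xs]
        a, b = ols_fit(xs, ys)
        train = statistics.mean((y - (a + b * x)) ** 2
                                for x, y in zip(xs, ys))
        # fresh responses at the SAME inputs -> in-sample test error
        ys_new = [2 + 3 * x + rng.gauss(0, sigma) for x in xs]
        fresh = statistics.mean((y - (a + b * x)) ** 2
                                for x, y in zip(xs, ys_new))
        gaps.append(fresh - train)
    return statistics.mean(gaps)
```

The simulated average gap lands near the theoretical 0.2, and it grows if more parameters are fit to the same N.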

  • 7.5 Estimates of In-Sample Prediction Error

    Pages 230–232. Introduces the AIC statistic for model selection.
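
For squared-error loss the AIC amounts to training error plus a complexity charge, err + 2·d·σ̂²/N. A sketch that uses it to choose a polynomial degree (the data-generating quadratic is an illustrative assumption, and the known noise variance is plugged in for σ̂² to keep things short; the book estimates it from a low-bias fit):

```python
import random
import statistics

def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations
    (Gaussian elimination with partial pivoting; fine for small degrees)."""
    d = degree + 1
    rows = [[x ** j for j in range(d)] for x in xs]
    A = [[sum(r[j] * r[k] for r in rows) for k in range(d)] for j in range(d)]
    b = [sum(r[j] * y for r, y in zip(rows, ys)) for j in range(d)]
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, d):
            m = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= m * A[col][c]
            b[r] -= m * b[col]
    coef = [0.0] * d
    for r in range(d - 1, -1, -1):
        coef[r] = (b[r] - sum(A[r][c] * coef[c]
                              for c in range(r + 1, d))) / A[r][r]
    return coef

def aic(xs, ys, degree, sigma2):
    """AIC for squared-error loss: training error + 2*d*sigma2/N."""
    coef = polyfit(xs, ys, degree)
    preds = [sum(c * x ** j for j, c in enumerate(coef)) for x in xs]
    err = statistics.mean((y - p) ** 2 for y, p in zip(ys, preds))
    return err + 2 * (degree + 1) * sigma2 / len(xs)
```

Scoring degrees 0 through 5 on data from a noisy quadratic, the penalty rules out the underfit degrees while keeping the charge for extra parameters explicit.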

  • 7.6 The Effective Number of Parameters

    Pages 232–233. Extends the notion of the number of parameters in linear regression to more general models, yielding the effective number of parameters, or effective degrees of freedom.
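
For a linear smoother ŷ = Sy, the effective number of parameters is trace(S); for ridge regression this gives Σ dⱼ²/(dⱼ² + λ). An even smaller sketch uses a k-NN smoother (an illustrative example, not the book's): since each point is its own nearest neighbour, every diagonal entry is 1/k and trace(S) = N/k.

```python
def knn_smoother_matrix(xs, k):
    """Row i holds the weights a k-NN average puts on each y_j when
    predicting at x_i: 1/k on the k nearest points, 0 elsewhere."""
    n = len(xs)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted(range(n), key=lambda j: abs(xs[j] - xs[i]))[:k]
        for j in nearest:
            S[i][j] = 1.0 / k
    return S

def effective_df(S):
    """Effective number of parameters of a linear smoother: trace(S)."""
    return sum(S[i][i] for i in range(len(S)))
```

Twenty points smoothed with k = 5 thus spend 20/5 = 4 effective parameters, even though no explicit coefficients are fit.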

  • 7.7 The Bayesian Approach and BIC

    Pages 233–235. Introduces the Bayesian information criterion, its meaning under the Bayesian approach, and its relationship to the AIC.
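
The practical difference from AIC is the size of the complexity charge: per observation, AIC charges 2·d/N while BIC charges log(N)·d/N, so BIC penalizes extra parameters more heavily whenever log N > 2, i.e. for N ≥ 8, and hence tends to select simpler models. A tiny comparison (function names are ours):

```python
import math

def aic_penalty(d, n):
    """Complexity charge per observation under AIC: 2*d/N."""
    return 2 * d / n

def bic_penalty(d, n):
    """Complexity charge per observation under BIC: log(N)*d/N."""
    return math.log(n) * d / n

for n in (5, 8, 100, 10000):
    print(n, aic_penalty(3, n), bic_penalty(3, n))
```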

  • 7.8 Minimum Description Length

    Pages 235–237. The minimum description length principle from coding theory leads to a selection criterion identical in form to the BIC.

  • 7.9 Vapnik–Chervonenkis Dimension 😱

    Pages 237–241. The VC dimension measures the complexity of a class of functions by how much its members can wiggle, rather than by their number of parameters. Many results follow from this definition, among them bounds on prediction error. Except for simple models, however, the VC dimension is usually hard to compute exactly.

  • 7.10 Cross-Validation

    Pages 241–249. Cross-validation estimates prediction error and can be used to choose tuning parameters. Note that every step of a multistep modeling procedure must be refit from scratch within each fold, especially for methods that involve variable selection; otherwise the error can be drastically underestimated.
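
A small simulation in the spirit of the book's "wrong way" example (the sizes, the single-feature screen and the nearest-class-mean classifier here are illustrative choices of ours): on pure-noise data, screening features on the full data before cross-validating makes the data look predictive, while redoing the selection inside each fold keeps the estimate honest near 50%.

```python
import random
import statistics

def best_feature(X, y, rows):
    """Feature with the largest gap between class means over `rows`."""
    def gap(j):
        g0 = [X[i][j] for i in rows if y[i] == 0]
        g1 = [X[i][j] for i in rows if y[i] == 1]
        return abs(statistics.mean(g0) - statistics.mean(g1))
    return max(range(len(X[0])), key=gap)

def fold_error(X, y, train, test, j):
    """Nearest-class-mean classifier using only feature j."""
    m0 = statistics.mean(X[i][j] for i in train if y[i] == 0)
    m1 = statistics.mean(X[i][j] for i in train if y[i] == 1)
    wrong = sum((abs(X[i][j] - m1) < abs(X[i][j] - m0)) != (y[i] == 1)
                for i in test)
    return wrong / len(test)

def cv_error(X, y, select_inside, k=5, seed=0):
    n = len(y)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[f::k] for f in range(k)]
    j_all = best_feature(X, y, list(range(n)))   # screened on ALL the data
    errs = []
    for f in range(k):
        held = set(folds[f])
        train = [i for i in order if i not in held]
        j = best_feature(X, y, train) if select_inside else j_all
        errs.append(fold_error(X, y, train, folds[f], j))
    return statistics.mean(errs)

def experiment(reps=20, n=50, p=200, seed=0):
    """Pure-noise data: the true error of any classifier is 50%."""
    rng = random.Random(seed)
    wrong_way, right_way = [], []
    for _ in range(reps):
        y = [0] * (n // 2) + [1] * (n // 2)
        rng.shuffle(y)
        X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        wrong_way.append(cv_error(X, y, select_inside=False))
        right_way.append(cv_error(X, y, select_inside=True))
    return statistics.mean(wrong_way), statistics.mean(right_way)
```

Averaged over repetitions, the full-data screen reports an error well below 50% on data with no signal at all, which is exactly the leakage the section warns against.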

  • 7.11 Bootstrap Methods

    Pages 249–254. Estimating prediction error with the bootstrap, and methods for correcting the bias of those estimates.
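
The leave-one-out bootstrap scores each point only on bootstrap samples that did not draw it, and the .632 estimator then shrinks that (slightly pessimistic) estimate back toward the training error. A sketch for the simplest model imaginable, predicting every response by the sample mean (the choice of model and sizes are illustrative):

```python
import random
import statistics

def loo_bootstrap_error(ys, B=200, seed=0):
    """Leave-one-out bootstrap estimate Err^(1): each point is scored only
    on bootstrap fits that never saw it."""
    rng = random.Random(seed)
    n = len(ys)
    sq_errs = [[] for _ in range(n)]
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        fit = statistics.mean(ys[i] for i in idx)    # "model" fit on it
        for i in set(range(n)) - set(idx):           # points left out
            sq_errs[i].append((ys[i] - fit) ** 2)
    return statistics.mean(statistics.mean(e) for e in sq_errs if e)

def err_632(ys, B=200, seed=0):
    """The .632 estimator pulls Err^(1) back toward the training error."""
    fit = statistics.mean(ys)
    train_err = statistics.mean((y - fit) ** 2 for y in ys)
    return 0.368 * train_err + 0.632 * loo_bootstrap_error(ys, B, seed)
```

By construction the .632 estimate always lies between the training error and Err^(1); on average a bootstrap sample contains about 0.632·N distinct points, which is where the weights come from.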

  • 7.12 Conditional or Expected Test Error? 😱

    Pages 254–257. Estimating the test error conditional on one particular training set is in general not easy. Cross-validation and related methods do, however, estimate the expected test error reasonably well.


Exercises

References

Key references for cross-validation are Stone (1974), Stone (1977) and Allen (1974). The AIC was proposed by Akaike (1973), while the BIC was introduced by Schwarz (1978). Madigan and Raftery (1994) give an overview of Bayesian model selection. The MDL criterion is due to Rissanen (1983). Cover and Thomas (1991) contains a good description of coding theory and complexity. VC dimension is described in Vapnik (1996). Stone (1977) showed that the AIC and leave-one-out cross-validation are asymptotically equivalent. Generalized cross-validation is described by Golub et al. (1979) and Wahba (1980); a further discussion of the topic may be found in the monograph by Wahba (1990). See also Hastie and Tibshirani (1990), Chapter 3. The bootstrap is due to Efron (1979); see Efron and Tibshirani (1993) for an overview. Efron (1983) proposes a number of bootstrap estimates of prediction error, including the optimism and .632 estimates. Efron (1986) compares CV, GCV and bootstrap estimates of error rates. The use of cross-validation and the bootstrap for model selection is studied by Breiman and Spector (1992), Breiman (1992), Shao (1996), Zhang (1993) and Kohavi (1995). The .632+ estimator was proposed by Efron and Tibshirani (1997). Cherkassky and Ma (2003) published a study on the performance of SRM for model selection in regression, in response to our study of Section 7.9.1. They complained that we had been unfair to SRM because we had not applied it properly. Our response can be found in the same issue of the journal (Hastie et al., 2003).

  • Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle, Second International Symposium on Information Theory, pp. 267–281.
  • Schwarz, G. (1978). Estimating the dimension of a model, Annals of Statistics 6(2): 461–464.
  • Madigan, D. and Raftery, A. (1994). Model selection and accounting for model uncertainty using Occam’s window, Journal of the American Statistical Association 89: 1535–1546.
  • Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length, Annals of Statistics 11: 416–431.
  • Cover, T. and Thomas, J. (1991). Elements of Information Theory, Wiley, New York.
  • Vapnik, V. (1996). The Nature of Statistical Learning Theory, Springer, New York.
  • Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society Series B 36: 111–147.
  • Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion, Journal of the Royal Statistical Society Series B 39: 44–47.
  • Golub, G., Heath, M. and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21: 215–224.
  • Wahba, G. (1980). Spline bases, regularization, and generalized cross-validation for solving approximation problems with large quantities of noisy data, Proceedings of the International Conference on Approximation Theory in Honour of George Lorenz, Academic Press, Austin, Texas, pp. 905–912.
  • Wahba, G. (1990). Spline Models for Observational Data, SIAM, Philadelphia.
  • Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models, Chapman and Hall, London. Chapter 3.
  • Efron, B. (1979). Bootstrap methods: another look at the jackknife, Annals of Statistics 7: 1–26.
  • Efron, B. (1983). Estimating the error rate of a prediction rule: some improvements on cross-validation, Journal of the American Statistical Association 78: 316–331.
  • Efron, B. (1986). How biased is the apparent error rate of a prediction rule?, Journal of the American Statistical Association 81: 461–470.
  • Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap, Chapman and Hall, London.
  • Efron, B. and Tibshirani, R. (1997). Improvements on cross-validation: the .632+ bootstrap method, Journal of the American Statistical Association 92: 548–560.
  • Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression: the X-random case, International Statistical Review 60: 291–319.
  • Breiman, L. (1992). The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error, Journal of the American Statistical Association 87: 738–754.
  • Shao, J. (1996). Bootstrap model selection, Journal of the American Statistical Association 91: 655–665.
  • Zhang, P. (1993). Model selection via multifold cross-validation, Annals of Statistics 21: 299–311.
  • Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection, International Joint Conference on Artificial Intelligence (IJCAI), Morgan Kaufmann, pp. 1137–1143.