Bias–variance tradeoff

In addition, one has to be careful about how model complexity is defined: in particular, the number of parameters used to describe the model is a poor measure of complexity. This is illustrated by an example adapted from Vapnik:<ref>{{cite book |last1=Vapnik |first1=Vladimir |title=The nature of statistical learning theory |date=2000 |publisher=Springer-Verlag |location=New York |isbn=978-1-4757-3264-1 |url=https://dx.doi.org/10.1007/978-1-4757-3264-1}}</ref> the model <math>f_{a,b}(x)=a\sin(bx)</math> has only two parameters (<math>a,b</math>) but it can interpolate any number of points by oscillating with a high enough frequency, resulting in both a high bias and high variance.
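
The point can be checked numerically. The following is a minimal sketch using the classification form of Vapnik's construction; the choice of points <math>x_i = 10^{-i}</math>, the ±1 labels, and the closed-form frequency are assumptions taken from that construction rather than stated in this article. It shows that the single frequency parameter <math>b</math> can reproduce any assignment of labels to the points.

<syntaxhighlight lang="python">
import numpy as np

# Sketch (assumed setup): Vapnik's sine example in its classification form.
# The model sign(sin(b*x)) has a single frequency parameter b, yet it can
# reproduce ANY labelling of the points x_i = 10^{-i}.
rng = np.random.default_rng(0)

n = 8
x = 10.0 ** -np.arange(1, n + 1)      # points x_i = 10^-i, i = 1..n
y = rng.choice([-1, 1], size=n)       # an arbitrary assignment of +/-1 labels

# Closed-form frequency that matches every label:
#   b = pi * (1 + sum_i (1 - y_i)/2 * 10^i)
b = np.pi * (1 + np.sum((1 - y) / 2 * 10.0 ** np.arange(1, n + 1)))

pred = np.sign(np.sin(b * x)).astype(int)
print("labels     :", y)
print("predictions:", pred)
print("all matched:", np.array_equal(pred, y))
</syntaxhighlight>

Running the sketch with any label pattern reports a perfect match, illustrating why counting parameters understates the effective complexity of this family.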
 
An analogy can be made to the relationship between [[Accuracy and precision|accuracy and precision]]. Accuracy is a description of bias and can intuitively be improved by selecting from only [[Sample space|local]] information. A fit built from only local information matches the training observations closely (i.e. has low bias), but it may overfit them: the model agrees less closely with [[Training, validation, and test data sets|test data]] than with the training data, which indicates imprecision and therefore inflated variance. A graphical example would be a high-order polynomial fit to data exhibiting quadratic behavior overall. Precision is a description of variance and generally can only be improved by selecting information from a comparatively larger space. The option to select many data points over a broad sample space is the ideal condition for any analysis. However, intrinsic constraints (whether physical, theoretical, computational, etc.) will always play a limiting role. The limiting case where only a finite number of data points are selected over a broad sample space may result in improved precision and lower variance overall, but may also yield a model too rigid to capture the underlying structure (underfitting). The model again fails to match the test data closely, but in this case the reason is inaccuracy, or high bias. To borrow from the previous example, the graphical representation would appear as a straight-line fit to the same data exhibiting quadratic behavior. Note that error in each case is measured the same way, but the reason ascribed to the error differs depending on the balance between bias and variance.
Intuitively, bias is reduced by using only local information, whereas variance can only be reduced by averaging over multiple observations, which inherently means using information from a larger region. For an enlightening example, see the section on k-nearest neighbors or the figure on the right.
To balance how much information is used from neighboring observations, a model can be [[smoothing|smoothed]] via explicit [[Regularization (mathematics)|regularization]], such as [[shrinkage (statistics)|shrinkage]].
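
A minimal simulation sketch of this comparison follows; the quadratic target, noise level, sample size, and polynomial degrees below are illustrative assumptions, not taken from the article. Repeatedly redrawing the training set and refitting a straight line and a high-order polynomial makes the two error sources visible: the line is dominated by bias, the high-order polynomial by variance.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return x ** 2                         # data exhibiting quadratic behaviour

x_test = np.linspace(-1, 1, 50)           # fixed evaluation grid
n_train, n_trials, noise = 20, 500, 0.3   # assumed illustrative settings

preds = {1: [], 9: []}                    # straight line vs high-order polynomial
for _ in range(n_trials):
    x = rng.uniform(-1, 1, n_train)       # a fresh training sample each trial
    y = true_f(x) + rng.normal(0, noise, n_train)
    for degree in preds:
        coef = np.polyfit(x, y, degree)
        preds[degree].append(np.polyval(coef, x_test))

for degree, p in preds.items():
    p = np.array(p)
    bias2 = np.mean((p.mean(axis=0) - true_f(x_test)) ** 2)  # squared bias
    var = np.mean(p.var(axis=0))                             # variance
    print(f"degree {degree}: bias^2 = {bias2:.4f}  variance = {var:.4f}")
</syntaxhighlight>

Replacing the plain least-squares fit for the high-order polynomial with a penalized (shrinkage) fit would illustrate the regularization remark above: the penalty lowers the variance at the cost of some added bias.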
 
==Bias–variance decomposition of mean squared error==