# Bias vs Variance in Machine Learning
We want to find a function $\hat{f}(x; D)$ that approximates the true function $f(x)$ as well as possible, by means of some learning algorithm trained on a dataset $D$.
Bias is defined as the expected deviation of the estimator from the true value.
$
\operatorname{Bias}_D[\hat{f}(x ; D)]=\mathrm{E}_D[\hat{f}(x ; D)-f(x)]
$
Variance is the expected squared deviation of the estimator from its expected value.
$
\operatorname{Var}_D[\hat{f}(x ; D)]=\mathrm{E}_D\left[\left(\mathrm{E}_D[\hat{f}(x ; D)]-\hat{f}(x ; D)\right)^2\right]
$
Bias measures systematic error - when your model consistently misses the target in the same direction, regardless of which training set you use. "Even if I had infinite training data, would my model still be wrong on average?" A high-bias model makes strong assumptions that prevent it from capturing the true relationship (like using linear regression for a curved relationship). A low-bias model can, in principle, learn the true function if given enough data.
Variance measures how much the estimator changes when trained on different samples - it's about the consistency/stability of your model across different training sets. "If I train the same algorithm on different samples from the same population, how much do the resulting models differ from each other?" A high-variance model might learn perfectly valid patterns, but it's very sensitive to which particular training examples it sees. A low-variance model gives similar results regardless of the specific training samples.
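Both quantities can be approximated numerically. The snippet below is a minimal sketch (the sinusoidal true function, noise level, sample size, and the degree-1 polynomial fit are illustrative assumptions): it draws many training sets $D$, fits $\hat{f}(x; D)$ on each, and replaces the expectations $\mathrm{E}_D[\cdot]$ with averages over the draws.
```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                                   # assumed true function f(x)
    return np.sin(2 * np.pi * x)

def fit_and_predict(x0, degree, n=25, noise=0.3):
    """Draw one training set D, fit a polynomial of the given degree, predict at x0."""
    x = rng.uniform(0, 1, n)
    t = f(x) + rng.normal(0, noise, n)
    return np.polyval(np.polyfit(x, t, degree), x0)

x0 = 0.25                                   # fixed query point x
preds = np.array([fit_and_predict(x0, degree=1) for _ in range(2000)])

bias = preds.mean() - f(x0)                 # E_D[f_hat(x; D)] - f(x)
variance = preds.var()                      # E_D[(f_hat(x; D) - E_D[f_hat(x; D)])^2]
print(f"bias ≈ {bias:.3f}, variance ≈ {variance:.3f}")
```
With a degree-1 fit to a curved target, the estimated bias at $x_0$ is noticeably non-zero while the variance stays small; increasing the degree typically reverses that balance.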
![[bias_vs_variance.png]]
## Bias Variance Tradeoff
![[error_vs_model_complexity.png]]
- As the complexity of the models produced by a learner increases, bias is reduced while variance increases.
- Consequently, as model capacity grows, the generalization error traces a U-shaped curve: it falls while the drop in bias dominates and rises once the growth in variance takes over; the sketch below illustrates this.
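The U-shape can be reproduced with a quick capacity sweep. In the sketch below (the sinusoidal target, noise level, data set sizes, and the list of polynomial degrees are all illustrative assumptions), training error keeps shrinking as the degree grows, while the held-out error first falls and then rises again.
```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(2 * np.pi * x)

# One training set and one large held-out set (sizes and noise are illustrative)
x_tr = rng.uniform(0, 1, 30);   t_tr = f(x_tr) + rng.normal(0, 0.3, 30)
x_te = rng.uniform(0, 1, 1000); t_te = f(x_te) + rng.normal(0, 0.3, 1000)

for degree in [0, 1, 3, 6, 9]:
    coeffs = np.polyfit(x_tr, t_tr, degree)
    train_mse = np.mean((np.polyval(coeffs, x_tr) - t_tr) ** 2)
    test_mse  = np.mean((np.polyval(coeffs, x_te) - t_te) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```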
## Bias Variance Decomposition of Generalization Error
The phenomenon of overfitting is really an unfortunate property of [[Maximum Likelihood Estimation]] and does not arise when we marginalize over parameters in [[Bayesian Estimation]].
Let's consider a regression model $y(\mathbf{x})$ with a squared (quadratic) loss. For a fixed $\mathbf{x}$, the expected loss is given by
$
\begin{align}
\mathbb{E}[L(t,y(\mathbf{x}))]&= \int (t - y(\mathbf{x}))^2 p(t|\mathbf{x})dt \\
\end{align}
$
Our goal is to choose $y(\mathbf{x})$ so as to minimize $\mathbb{E}[L]$. Differentiating with respect to $y(\mathbf{x})$ and setting the result to zero gives $2\int \{y(\mathbf{x}) - t\}\, p(t|\mathbf{x})\, \mathrm{d}t = 0$, i.e.
$
\begin{align}
\int y(\mathbf{x})p(t|\mathbf{x})dt &= \int t p(t|\mathbf{x})dt \\
y(\mathbf{x})&= \mathbb{E}[t|\mathbf{x}] \\
\end{align}
$
This expression is known as the *regression function*: the $y(\mathbf{x})$ that minimizes the expected squared loss is the mean of the conditional distribution $p(t|\mathbf{x})$.
![[regression-function.jpg]]
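As a quick numerical sanity check of this result (the Gaussian form of $p(t\mid\mathbf{x})$ and the candidate predictions below are made-up values for illustration), the expected squared loss over constant predictions at a fixed $\mathbf{x}$ is smallest when $y(\mathbf{x})$ equals the conditional mean, where it reduces to $\operatorname{var}[t\mid\mathbf{x}]$:
```python
import numpy as np

rng = np.random.default_rng(2)

# Samples from an assumed conditional distribution p(t|x) at one fixed x:
# Gaussian with mean E[t|x] = 1.7 and standard deviation 0.5
t_samples = rng.normal(1.7, 0.5, 100_000)

# Expected squared loss for a few candidate predictions y(x)
for y in [1.0, 1.5, 1.7, 2.0]:
    print(f"y = {y:.1f}: E[(y - t)^2] ≈ {np.mean((y - t_samples) ** 2):.3f}")
# Minimum at y = E[t|x] = 1.7, where the loss approaches var[t|x] = 0.25
```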
Adding and subtracting $\mathbb{E}[t \mid \mathbf{x}]$ inside the square and expanding,
$\begin{align}
\{y(\mathbf{x})-t\}^{2} &=\{y(\mathbf{x})-\mathbb{E}[t \mid \mathbf{x}]+\mathbb{E}[t \mid \mathbf{x}]-t\}^{2} \\
&=\{y(\mathbf{x})-\mathbb{E}[t \mid \mathbf{x}]\}^{2}+2\{y(\mathbf{x})-\mathbb{E}[t \mid \mathbf{x}]\}\{\mathbb{E}[t \mid \mathbf{x}]-t\}+\{\mathbb{E}[t \mid \mathbf{x}]-t\}^{2}
\end{align}
$
Substituting this into the expected loss and integrating over $t$, the cross term vanishes (since $\int \{\mathbb{E}[t \mid \mathbf{x}]-t\}\, p(t \mid \mathbf{x})\, \mathrm{d}t=0$), leaving
$
\begin{align}
\mathbb{E}[L]&=\int\{y(\mathbf{x})-\mathbb{E}[t \mid \mathbf{x}]\}^{2} p(\mathbf{x}) \mathrm{d} \mathbf{x}+\iint\{\mathbb{E}[t \mid \mathbf{x}]-t\}^{2} p(\mathbf{x}, t) \mathrm{d} \mathbf{x}\, \mathrm{d} t \\
&=\int\{y(\mathbf{x})-\mathbb{E}[t \mid \mathbf{x}]\}^{2} p(\mathbf{x}) \mathrm{d} \mathbf{x}+\int \operatorname{var}[t \mid \mathbf{x}]\, p(\mathbf{x}) \mathrm{d} \mathbf{x}
\end{align}
$
The first term depends on our choice of $y(\mathbf{x})$ and is the quantity we seek to minimize; the second term is independent of $y(\mathbf{x})$ and represents the intrinsic variability of the target data, so it can be regarded as noise.
Suppose we have a large number of data sets. For any given data set $\mathcal{D}$, we can run our learning algorithm and obtain a prediction function $y(\mathbf{x} ; \mathcal{D})$. Different data sets form an ensemble and give different values of the squared loss, so performance is assessed by averaging over this ensemble of data sets:
$
\mathbb{E}_{\mathcal{D}}\left[\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}[t \mid \mathbf{x}]\right\}^{2}\right]
$
Adding and subtracting the quantity $\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]$ inside the square, we have
$
\begin{align}
&\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]+\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-\mathbb{E}[t \mid \mathbf{x}]\right\}^{2} \\
=&\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}^{2}+\left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-\mathbb{E}[t \mid \mathbf{x}]\right\}^{2} \\
&+2\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}\left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-\mathbb{E}[t \mid \mathbf{x}]\right\}
\end{align}
$
Now taking the expectation over $\mathcal{D}$ (the cross term vanishes because $\mathbb{E}_{\mathcal{D}}\left[y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right]=0$),
$
\mathbb{E}_{\mathcal{D}}\left[\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}[t \mid \mathbf{x}]\}^{2}\right]
= \left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-\mathbb{E}[t \mid \mathbf{x}]\right\}^{2}+\mathbb{E}_{\mathcal{D}}\left[\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}^{2}\right]
$
The first term gives the squared bias and the second term gives the variance.
Substituting this back into $\mathbb{E}[L]$, the expected loss over different data sets decomposes as
$
\text{expected loss}=(\text{bias})^{2}+\text{variance}+\text{noise}
$
where,
$
\begin{aligned}
(\text{bias})^{2} &=\int\left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-\mathbb{E}[t \mid \mathbf{x}]\right\}^{2} p(\mathbf{x}) \mathrm{d} \mathbf{x} \\
\text{variance} &=\int \mathbb{E}_{\mathcal{D}}\left[\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}^{2}\right] p(\mathbf{x}) \mathrm{d} \mathbf{x} \\
\text{noise} &=\iint\{\mathbb{E}[t \mid \mathbf{x}]-t\}^{2} p(\mathbf{x}, t) \mathrm{d} \mathbf{x}\, \mathrm{d} t
\end{aligned}
$
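All three terms can be estimated by Monte Carlo simulation. The sketch below assumes a sinusoidal regression function, Gaussian noise, cubic polynomial fits, and a uniform $p(\mathbf{x})$ approximated by a grid (none of these choices come from the derivation above): it fits an ensemble of models on independently drawn data sets, averages over the ensemble and over $\mathbf{x}$, and checks the sum against a direct estimate of the expected loss.
```python
import numpy as np

rng = np.random.default_rng(3)
noise_std = 0.3                               # assumed intrinsic noise level

def h(x):                                     # the regression function E[t|x]
    return np.sin(2 * np.pi * x)

x_grid = np.linspace(0, 1, 200)               # approximates the integral over p(x) = U(0, 1)

# Ensemble of data sets: fit y(x; D) on each and record its predictions on the grid
preds = []
for _ in range(500):
    x = rng.uniform(0, 1, 25)
    t = h(x) + rng.normal(0, noise_std, 25)
    preds.append(np.polyval(np.polyfit(x, t, 3), x_grid))
preds = np.array(preds)                       # shape: (n_datasets, n_grid_points)

mean_pred = preds.mean(axis=0)                # E_D[y(x; D)] at each grid point
bias_sq  = np.mean((mean_pred - h(x_grid)) ** 2)
variance = np.mean(preds.var(axis=0))
noise    = noise_std ** 2                     # E[{E[t|x] - t}^2] for Gaussian noise

# Direct estimate of the expected loss on fresh targets, for comparison
t_fresh = h(x_grid) + rng.normal(0, noise_std, preds.shape)
expected_loss = np.mean((preds - t_fresh) ** 2)

print(f"bias^2 + variance + noise = {bias_sq + variance + noise:.4f}")
print(f"expected loss (direct)    = {expected_loss:.4f}")
```
Refitting with a higher-degree polynomial shifts weight from the bias term to the variance term, which is exactly the tradeoff summarized below.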
This gives an interesting insight into the model complexity issue from a frequentist perspective. Very flexible models have low bias and high variance, while rigid models have high bias and low variance. The model with the best predictive capability is the one that achieves the best balance between bias and variance.
---
## References
1. Bishop 1.5 and 3.2
2. [_A Few Useful Things to Know about Machine Learning_](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf). Pedro Domingos. 2012. University of Washington.