# Bias vs Variance in Machine Learning

We want to find a function $\hat{f}(x ; D)$ that approximates the true function $f(x)$ as well as possible, by means of some learning algorithm based on a training dataset (sample) $D=\left\{\left(x_1, y_1\right), \ldots,\left(x_n, y_n\right)\right\}$.

Bias is defined as the expected deviation of the estimator from the true value. $ \operatorname{Bias}_D[\hat{f}(x ; D)]=\mathrm{E}_D[\hat{f}(x ; D)-f(x)] $ Variance is the expected squared deviation of the estimator from its expected value. $ \operatorname{Var}_D[\hat{f}(x ; D)]=\mathrm{E}_D\left[\left(\mathrm{E}_D[\hat{f}(x ; D)]-\hat{f}(x ; D)\right)^2\right] $

Intuitively, bias is a learner's tendency to consistently learn the same wrong thing, while variance is the tendency to learn random things unrelated to the real signal.

![[bias_vs_variance.png]]

## Bias Variance Tradeoff

![[error_vs_model_complexity.png]]

- Bias is reduced and variance is increased as the complexity of the models produced by a learner grows.
- As model capacity increases, bias tends to decrease while variance tends to increase, yielding a U-shaped relationship between generalization error and model capacity.

## Bias Variance Decomposition of Generalization Error

The phenomenon of overfitting is really an unfortunate property of [[Maximum Likelihood Estimation]] and does not arise when we marginalize over parameters in [[Bayesian Estimation]].

Consider a regression model with squared loss. For fixed $\mathbf{x}$, the expected loss is given as

$ \mathbb{E}[L(t,y(\mathbf{x}))]= \int (t - y(\mathbf{x}))^2 p(t|\mathbf{x})\, dt $

Our goal is to choose $y(\mathbf{x})$ so as to minimize $\mathbb{E}[L]$. Setting the derivative with respect to $y(\mathbf{x})$ to zero,

$ \begin{align} \int y(\mathbf{x})p(t|\mathbf{x})dt &= \int t p(t|\mathbf{x})dt \\ y(\mathbf{x})&= \mathbb{E}[t|\mathbf{x}] \end{align} $

This expression is known as the *regression function*.
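As a quick numerical check of this result (a hypothetical setup, not from the source): draw samples of $t$ from some $p(t|\mathbf{x})$ at a fixed $\mathbf{x}$, scan candidate predictions $y$, and the minimizer of the average squared loss tracks the sample mean, i.e. the Monte Carlo estimate of $\mathbb{E}[t|\mathbf{x}]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conditional distribution p(t|x) at one fixed x:
# t = sin(x) + Gaussian noise, so E[t|x] = sin(x).
x = 1.0
t = np.sin(x) + rng.normal(0.0, 0.3, size=100_000)

def expected_loss(y):
    """Monte Carlo estimate of E[(y - t)^2 | x] for a candidate prediction y."""
    return np.mean((y - t) ** 2)

# Scan candidate predictions; the minimizer should track E[t|x].
candidates = np.linspace(-1.0, 2.0, 601)
best = candidates[np.argmin([expected_loss(y) for y in candidates])]
print(best, t.mean())  # both close to sin(1) ≈ 0.841
```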
This means that the $y(\mathbf{x})$ that minimizes the expected squared loss is given by the mean of the conditional distribution $p(t|\mathbf{x})$.

![[regression-function.jpg]]

Expanding the squared term by adding and subtracting the conditional expectation,

$ \begin{align} \{y(\mathbf{x})-t\}^{2} &=\{y(\mathbf{x})-\mathbb{E}[t \mid \mathbf{x}]+\mathbb{E}[t \mid \mathbf{x}]-t\}^{2} \\ &=\{y(\mathbf{x})-\mathbb{E}[t \mid \mathbf{x}]\}^{2}+2\{y(\mathbf{x})-\mathbb{E}[t \mid \mathbf{x}]\}\{\mathbb{E}[t \mid \mathbf{x}]-t\}+\{\mathbb{E}[t \mid \mathbf{x}]-t\}^{2} \end{align} $

Substituting this into the loss function and integrating over $t$, the cross term vanishes (since $\int (\mathbb{E}[t \mid \mathbf{x}]-t)\, p(t|\mathbf{x})\, \mathrm{d}t = 0$), leaving

$ \begin{align} \mathbb{E}[L]&=\int\{y(\mathbf{x})-\mathbb{E}[t \mid \mathbf{x}]\}^{2} p(\mathbf{x}) \mathrm{d} \mathbf{x}+\iint\{\mathbb{E}[t \mid \mathbf{x}]-t\}^{2} p(\mathbf{x}, t) \mathrm{d} \mathbf{x}\, \mathrm{d} t \\ &=\int\{y(\mathbf{x})-\mathbb{E}[t \mid \mathbf{x}]\}^{2} p(\mathbf{x}) \mathrm{d} \mathbf{x}+\int \operatorname{var}[t|\mathbf{x}]\, p(\mathbf{x}) \mathrm{d} \mathbf{x} \end{align} $

The first term depends on our choice of $y(\mathbf{x})$, and we seek the $y(\mathbf{x})$ that makes it small. The second term is independent of $y(\mathbf{x})$; it represents the intrinsic variability of the target data and can be regarded as noise.

Suppose we have a large number of datasets. For any given dataset $D$, we can run our learning algorithm and obtain a prediction function. Different datasets form an ensemble, and each gives a different value of the squared loss.
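To make the ensemble idea concrete, here is a minimal simulation sketch (the setup and all numbers are hypothetical): each dataset is a fresh sample, the "learner" is a simple estimator of a known true value, and bias and variance are measured across the ensemble of datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the true value is the mean mu of a Gaussian;
# each dataset D is an independent sample of n draws.
mu, n, n_datasets = 2.0, 10, 20_000
datasets = rng.normal(mu, 1.0, size=(n_datasets, n))

# Two learners: the sample mean, and a version shrunk toward zero.
y_mean = datasets.mean(axis=1)   # unbiased, higher variance
y_shrunk = 0.5 * y_mean          # biased, lower variance

for name, y in [("sample mean", y_mean), ("shrunk", y_shrunk)]:
    bias = y.mean() - mu                     # E_D[y] - true value
    variance = ((y - y.mean()) ** 2).mean()  # E_D[(y - E_D[y])^2]
    print(f"{name}: bias ≈ {bias:+.3f}, variance ≈ {variance:.3f}")
```

Shrinking trades variance for bias, which is the tradeoff the plots above describe.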
The performance is then assessed by taking the average over this ensemble of datasets:

$ \mathbb{E}_{\mathcal{D}}\left[\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}[t|\mathbf{x}]\right\}^{2}\right] $

Adding and subtracting the quantity $\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]$ inside the square, we have

$ \begin{align} &\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]+\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-\mathbb{E}[t|\mathbf{x}]\right\}^{2} \\ =&\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}^{2}+\left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-\mathbb{E}[t|\mathbf{x}]\right\}^{2} \\ &+2\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}\left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-\mathbb{E}[t|\mathbf{x}]\right\} \end{align} $

Taking the expectation over $\mathcal{D}$, the cross term vanishes, giving

$ \mathbb{E}_{\mathcal{D}}\left[\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}[t|\mathbf{x}]\}^{2}\right] = \left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-\mathbb{E}[t|\mathbf{x}]\right\}^{2}+\mathbb{E}_{\mathcal{D}}\left[\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}^{2}\right] $

The first term gives the squared bias and the second term gives the variance.
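When the expectations over $\mathcal{D}$ are replaced by averages over a finite ensemble of predictions, the cross term cancels exactly, so the identity can be checked numerically. A sketch with made-up numbers (the ensemble of predictions and the stand-in for $\mathbb{E}[t|\mathbf{x}]$ are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble: predictions y(x; D) at one fixed x from 1000
# datasets; h stands in for the regression function E[t|x] at that x.
h = 0.5
y = rng.normal(0.8, 0.2, size=1000)

lhs = np.mean((y - h) ** 2)              # E_D[{y(x;D) - E[t|x]}^2]
bias_sq = (y.mean() - h) ** 2            # squared bias
variance = np.mean((y - y.mean()) ** 2)  # variance

# The cross term averages to exactly zero for sample means,
# so the identity holds to machine precision.
print(lhs, bias_sq + variance)
```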
Substituting this back into $\mathbb{E}[L]$, the expected loss over different datasets decomposes as

$ \mathbb{E}_{\mathcal{D}}[\mathbb{E}[L]]=(\text{bias})^{2}+\text{variance}+\text{noise} $

where

$ \begin{aligned} (\text{bias})^{2} &=\int\left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-\mathbb{E}[t|\mathbf{x}]\right\}^{2} p(\mathbf{x}) \mathrm{d} \mathbf{x} \\ \text{variance} &=\int \mathbb{E}_{\mathcal{D}}\left[\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}^{2}\right] p(\mathbf{x}) \mathrm{d} \mathbf{x} \\ \text{noise} &=\iint\{\mathbb{E}[t|\mathbf{x}]-t\}^{2} p(\mathbf{x}, t) \mathrm{d} \mathbf{x}\, \mathrm{d} t \end{aligned} $

This provides an interesting insight into the model complexity issue from the frequentist perspective. Very flexible models have low bias and high variance, while rigid models have high bias and low variance. The model with the best predictive capability is the one that strikes the best balance between bias and variance.

---

## References

1. Bishop, *Pattern Recognition and Machine Learning*, Sections 1.5 and 3.2.
2. [_A Few Useful Things to Know about Machine Learning_](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf). Pedro Domingos. 2012. University of Washington.