# Maximum Likelihood Estimation
The maximum likelihood principle states that the most likely explanation of the data $D$ is given by the parameter vector $\mathbf{w}_{ML}$ that maximizes the likelihood function,
$
\mathbf{w}_{ML} = \underset{\mathbf{w}}{\arg \max }\, p(D \mid \mathbf{w})
$
Let's assume the data is i.i.d. This means the joint probability reduces to the product of the individual PDFs. (For correlated data, e.g. time series, we can't assume this.)
$
p(D \mid \mathbf{w})=p\left(x_{1}, x_{2}, \ldots, x_{N} \mid \mathbf{w}\right)=\prod_{i=1}^{N} p\left(x_{i} \mid \mathbf{w}\right)
$
So, maximum likelihood estimation is given as:
$
\mathbf{w}_{\mathrm{ML}}=\underset{\mathbf{w}}{\arg \max } p(D \mid \mathbf{w})=\underset{\mathbf{w}}{\arg \max } \prod_{i=1}^{N} p\left(x_{i} \mid \mathbf{w}\right)
$
Since the logarithm is a monotonically increasing function of its argument, maximizing the log of a function is equivalent to maximizing the function itself. Using the logarithm also helps prevent numerical underflow, since a product of many small numbers shrinks toward zero very fast. So,
$
\mathbf{w}_{\mathrm{ML}}=\underset{\mathbf{w}}{\arg \max } \sum_{i=1}^{N} \log p\left(x_{i} \mid \mathbf{w}\right)
$
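As a quick numerical illustration of the underflow point, here is a minimal sketch (the probability values are made up, and NumPy is assumed):

```python
import numpy as np

# Hypothetical likelihood: 2000 i.i.d. factors, each a small probability.
rng = np.random.default_rng(0)
probs = rng.uniform(1e-4, 1e-2, size=2000)

# The direct product underflows to 0.0 in double precision ...
print(np.prod(probs))         # 0.0

# ... while the sum of logs stays a perfectly representable (large negative) number.
print(np.sum(np.log(probs)))
```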
We find analytical solutions for the ML estimates of the parameters by taking the derivatives of the log-likelihood with respect to the parameters and setting them to zero.
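When no closed-form solution exists, the same objective can be maximized numerically. The sketch below (assuming SciPy and synthetic Gaussian data with made-up parameters) minimizes the negative log-likelihood with a generic optimizer; it should recover the closed-form Gaussian answers derived in the next section.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic i.i.d. Gaussian data with made-up true parameters.
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=500)

def neg_log_likelihood(w):
    # Parameterize sigma via log_sigma so the optimizer cannot reach sigma <= 0.
    mu, log_sigma = w
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_ml, sigma_ml = res.x[0], np.exp(res.x[1])
print(mu_ml, sigma_ml)  # close to the sample mean and the (biased) sample std
```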
## MLE of Gaussian Distribution
Let's assume the dataset consists of i.i.d. Gaussian-distributed real variables, $p(x_i \mid \mathbf{w})=\mathcal{N}(x_i \mid \mu, \sigma^{2})$. Then the likelihood function is the product of the individual PDFs,
$
p\left(D \mid \mu, \sigma^{2}\right)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{N / 2}} \prod_{i=1}^{N} \exp \left[-\frac{1}{2 \sigma^{2}}\left(x_{i}-\mu\right)^{2}\right]
$
Now, taking the log of the likelihood,
$
\begin{align}
\log p\left(D \mid \mu, \sigma^{2}\right)&=\log \left(2 \pi \sigma^{2}\right)^{-N / 2}+\sum_{i=1}^{N} \log \exp \left[-\frac{1}{2 \sigma^{2}}\left(x_{i}-\mu\right)^{2}\right] \\
&= -\frac{N}{2} \log \left(2 \pi \sigma^{2}\right)+\sum_{i=1}^{N}-\frac{1}{2 \sigma^{2}}\left(x_{i}-\mu\right)^{2}
\end{align}
$
Now, to maximize the likelihood with respect to $\mu$, we take the derivative with respect to $\mu$ and set it to zero,
$
\begin{align}
\frac{\partial}{\partial \mu} \log p\left(D \mid \mu, \sigma^{2}\right) &= 0 \\
\frac{1}{2\sigma^2} \sum_{i=1}^{N}2(x_i - \mu) &= 0 \\
\sum_{i=1}^N(x_i - \mu) &= 0 \\
\sum_{i=1}^N\mu &= \sum_{i=1}^{N}x_i \\
N\mu &= \sum_{i=1}^N x_i \\
\mu &= \frac{1}{N}\sum_{i=1}^N x_i
\end{align}
$
Therefore, $\mu_{ML}$ is equal to the sample mean.
Now, maximizing the likelihood for $\sigma^2$, we take the derivative with respect to $\sigma^2$,
$
\frac{\partial}{\partial \sigma^{2}} \log p\left(D \mid \mu, \sigma^{2}\right) = -\frac{N}{2}\frac{2\pi}{2\pi\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i - \mu)^2 = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i - \mu)^2 = 0
$
Multiplying by $2\sigma^4$,
$
\begin{align}
-N\sigma^2 + \sum_{i=1}^{N}(x_i - \mu)^2 &= 0 \\
\sigma^2_{ML} &= \frac{1}{N}\sum_{i=1}^N (x_i - \mu_{ML})^2
\end{align}
$
Therefore, $\sigma^2_{ML}$ is equal to the sample variance, computed about the sample mean $\mu_{ML}$.
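A minimal numerical check of these two closed-form estimates, assuming NumPy and synthetic data with made-up true parameters:

```python
import numpy as np

# Synthetic i.i.d. Gaussian data.
rng = np.random.default_rng(2)
x = rng.normal(loc=-1.0, scale=2.0, size=10_000)

mu_ml = np.mean(x)                     # sample mean
sigma2_ml = np.mean((x - mu_ml) ** 2)  # sample variance with the 1/N factor

print(mu_ml, sigma2_ml)
print(np.var(x, ddof=0))               # identical to sigma2_ml
```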
## MLE of Binomial Distribution
The likelihood function for observing $m$ successes in $n$ trials, each with success probability $p$, is
$
f(m \mid n, p)=\left(\begin{array}{c}
n \\
m
\end{array}\right) p^{m}(1-p)^{n-m}
$
We find the extremum by setting the derivative of the log-likelihood to zero:
$
\frac{d}{d p} \ln f(m \mid n, p)=\frac{m}{p}-\frac{n-m}{1-p}=0 \Longleftrightarrow m(1-p)=p(n-m) \Longleftrightarrow m=p n \Longleftrightarrow p_{ML}=\frac{m}{n}
$
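A quick sanity check of $p_{ML}=m/n$ against a brute-force scan of the binomial log-likelihood (assuming SciPy; the counts are made up):

```python
import numpy as np
from scipy.stats import binom

n, m = 40, 13  # hypothetical: 13 successes in 40 trials

# Evaluate the log-likelihood on a fine grid of p and pick the maximizer.
p_grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
log_lik = binom.logpmf(m, n, p_grid)

print(p_grid[np.argmax(log_lik)])  # ~0.325
print(m / n)                       # 0.325, the closed-form MLE
```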
## Bias in Maximum Likelihood Estimators
Bias is the difference between the expected value of an estimator and the true value of the quantity it estimates. To check the ML estimators for bias, let's do a sanity check.
Suppose we draw different datasets $D_1, D_2, \ldots$ from the same Gaussian distribution. Since $\mu_{ML}$ and $\sigma^2_{ML}$ are functions of the dataset values, each dataset yields a different estimate, and bias asks whether these estimates are correct on average.
Now, for the case of $\mu_{ML}$,
$
\begin{align}
\mathbb{E}_{D \sim p\left(D \mid \mu, \sigma^{2}\right)}\left[\mu_{M L}\right] &=\mathbb{E}\left[\frac{1}{N} \sum_{i=1}^{N} x_{i}\right] \\
&= \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{D \sim p\left(D \mid \mu, \sigma^{2}\right)}[x_{i}] \\
&= \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{x_i \sim p\left(x_i \mid \mu, \sigma^{2}\right)}[x_{i}] \\
&= \frac{1}{N} \sum_{i=1}^{N} \mu \\
&= \mu
\end{align}
$
Therefore, the bias of the estimator is $\mathbb{E}[\mu_{ML}] - \mu = 0$, i.e. the ML estimate of the mean is unbiased.
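A small Monte Carlo sanity check of this result (NumPy assumed, parameter values made up): average $\mu_{ML}$ over many independently drawn datasets and compare with the true mean.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, sigma_true, N, trials = 5.0, 2.0, 10, 50_000

# Each row is one dataset of N samples; one mu_ML per dataset.
datasets = rng.normal(mu_true, sigma_true, size=(trials, N))
mu_ml = datasets.mean(axis=1)

print(mu_ml.mean())  # very close to mu_true = 5.0
```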
In the case of the variance,
$
\begin{align}
\mathbb{E}_{D \sim p(D \mid \mu, \sigma^{2})}\left[\sigma_{M L}^{2}\right]&=\mathbb{E}\left[\frac{1}{N} \sum_{i=1}^{N}\left(x_{i}-\frac{1}{N} \sum_{n=1}^{N} x_{n}\right)^{2}\right] \\
&=\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left[\left(x_{i}-\frac{1}{N} \sum_{n=1}^{N} x_{n}\right)^{2}\right]\\
&= \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left[x_{i}^{2}-\frac{2 x_{i}}{N} \sum_{n=1}^{N} x_{n}+\frac{1}{N^{2}} \sum_{m=1}^{N} \sum_{n=1}^{N} x_{m} x_{n}\right]\\
&=\frac{1}{N} \sum_{i=1}^{N}\left\{\mathbb{E}\left[x_{i}^{2}\right]-\frac{2}{N} \sum_{n=1}^{N} \mathbb{E}\left[x_{i} x_{n}\right]+\frac{1}{N^{2}} \sum_{m=1}^{N} \sum_{n=1}^{N} \mathbb{E}\left[x_{m} x_{n}\right]\right\}
\end{align}
$
Now, using the definition of covariance and the independence of the samples,
$
\begin{align}
\operatorname{cov}[x_i, x_i] = \mathbb{E}[x_i^2] - \mathbb{E}[x_i]^2 = \sigma^2 \;&\Rightarrow\; \mathbb{E}[x_i^2] = \mu^2 + \sigma^2 \\
\operatorname{cov}[x_i, x_j] = \mathbb{E}[x_i x_j] - \mathbb{E}[x_i]\,\mathbb{E}[x_j] = 0 \;\;(i \neq j) \;&\Rightarrow\; \mathbb{E}[x_i x_j] = \mu^2
\end{align}
$
Using these values in the expression above, and noting that $\sum_{n=1}^{N} \mathbb{E}[x_i x_n] = (N-1)\mu^2 + (\mu^2+\sigma^2) = N\mu^2 + \sigma^2$ and $\sum_{m=1}^{N}\sum_{n=1}^{N} \mathbb{E}[x_m x_n] = N^2\mu^2 + N\sigma^2$,
$
\begin{align}
\mathbb{E}_{D \sim p(D \mid \mu, \sigma^{2})}\left[\sigma_{M L}^{2}\right]&= \frac{1}{N} \sum_{i=1}^{N}\left\{\mu^{2}+\sigma^{2}-\frac{2}{N}\left(N \mu^{2}+\sigma^{2}\right)+\frac{1}{N^{2}}\left(N^{2} \mu^{2}+N \sigma^{2}\right)\right\} \\
&= \frac{N-1}{N}\sigma^2
\end{align}
$
Therefore, the bias of the estimator is $\mathbb{E}[\sigma^2_{ML}] - \sigma^2 = -\frac{\sigma^2}{N}$: on average, $\sigma^2_{ML}$ underestimates the true variance by a factor of $\frac{N-1}{N}$.
This means the ML estimate of the variance is noticeably biased when $N$ is small, but as $N \to \infty$ the bias vanishes. The bias can also be removed by multiplying $\sigma^2_{ML}$ by $\frac{N}{N-1}$ (Bessel's correction), which gives the unbiased sample variance.
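A minimal Monte Carlo sketch of the $\frac{N-1}{N}$ factor and of Bessel's correction (NumPy assumed, parameter values made up):

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma2_true, N, trials = 0.0, 4.0, 5, 200_000

# Each row is one dataset; compute the biased (1/N) variance estimate per dataset.
datasets = rng.normal(mu_true, np.sqrt(sigma2_true), size=(trials, N))
sigma2_ml = ((datasets - datasets.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

print(sigma2_ml.mean())                  # ~ (N-1)/N * sigma2_true = 3.2
print((N / (N - 1)) * sigma2_ml.mean())  # ~ 4.0 after Bessel's correction
```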