# Maximum Likelihood Estimation
The maximum likelihood principle states that the most likely explanation of the data $D$ is given by the parameter vector $\mathbf{w}_{ML}$ that maximizes the likelihood function,
$
\mathbf{w}_{ML} = \underset{\mathbf{w}}{\arg \max }\, p(D \mid \mathbf{w})
$
Let's assume the data is i.i.d. This means the joint probability reduces to the product of the individual PDFs. (For correlated data, e.g. time series, we can't assume this.)
$
p(D \mid \mathbf{w})=p\left(x_{1}, x_{2}, \ldots, x_{N} \mid \mathbf{w}\right)=\prod_{i=1}^{N} p\left(x_{i} \mid \mathbf{w}\right)
$
So, maximum likelihood estimation is given as:
$
\mathbf{w}_{\mathrm{ML}}=\underset{\mathbf{w}}{\arg \max } p(D \mid \mathbf{w})=\underset{\mathbf{w}}{\arg \max } \prod_{i=1}^{N} p\left(x_{i} \mid \mathbf{w}\right)
$
Since the logarithm is a monotonically increasing function of its argument, maximizing the log of a function is equivalent to maximizing the function itself. Using the logarithm also helps prevent numerical underflow, since a product of many small numbers shrinks toward zero very fast. So,
$
\mathbf{w}_{\mathrm{ML}}=\underset{\mathbf{w}}{\arg \max } \sum_{i=1}^{N} \log p\left(x_{i} \mid \mathbf{w}\right)
$
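As a quick numerical illustration of the underflow point, here is a minimal sketch (the probability values are made up, and NumPy is assumed):

```python
import numpy as np

# Hypothetical likelihood: 2000 i.i.d. factors, each a small probability.
rng = np.random.default_rng(0)
probs = rng.uniform(1e-4, 1e-2, size=2000)

# The direct product underflows to 0.0 in double precision ...
print(np.prod(probs))         # 0.0

# ... while the sum of logs stays a perfectly representable (large negative) number.
print(np.sum(np.log(probs)))
```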
We find analytical solutions for the ML estimates of the parameters by taking the derivatives of the log-likelihood with respect to the parameters and setting them to zero.
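When no closed-form solution exists, the same objective can be maximized numerically. The sketch below (assuming SciPy and synthetic Gaussian data with made-up parameters) minimizes the negative log-likelihood with a generic optimizer; it should recover the closed-form Gaussian answers derived in the next section.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic i.i.d. Gaussian data with made-up true parameters.
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=500)

def neg_log_likelihood(w):
    # Parameterize sigma via log_sigma so the optimizer cannot reach sigma <= 0.
    mu, log_sigma = w
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_ml, sigma_ml = res.x[0], np.exp(res.x[1])
print(mu_ml, sigma_ml)  # close to the sample mean and the (biased) sample std
```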
## MLE of Gaussian Distribution
Let's assume the dataset consists of i.i.d. Gaussian-distributed real variables, $p(x_i \mid \mathbf{w})=\mathcal{N}(x_i \mid \mu, \sigma^{2})$. Then the likelihood function is the product of the individual PDFs,
$
p\left(D \mid \mu, \sigma^{2}\right)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{N / 2}} \prod_{i=1}^{N} \exp \left[-\frac{1}{2 \sigma^{2}}\left(x_{i}-\mu\right)^{2}\right]
$
Now, taking the log of the likelihood,
$
\begin{align}
\log p\left(D \mid \mu, \sigma^{2}\right)&=\log \left(2 \pi \sigma^{2}\right)^{-N / 2}+\sum_{i=1}^{N} \log \exp \left[-\frac{1}{2 \sigma^{2}}\left(x_{i}-\mu\right)^{2}\right] \\
&= -\frac{N}{2} \log \left(2 \pi \sigma^{2}\right)+\sum_{i=1}^{N}-\frac{1}{2 \sigma^{2}}\left(x_{i}-\mu\right)^{2}
\end{align}
$
Now, to maximize the likelihood with respect to $\mu$, we take the derivative with respect to $\mu$ and set it to zero,
$
\begin{align}
\frac{\partial}{\partial \mu} \log p\left(D \mid \mu, \sigma^{2}\right) &= 0 \\
\frac{1}{2\sigma^2} \sum_{i=1}^{N}2(x_i - \mu) &= 0 \\
\sum_{i=1}^N(x_i - \mu) &= 0 \\
\sum_{i=1}^N\mu &= \sum_{i=1}^{N}x_i \\
N\mu &= \sum_{i=1}^N x_i \\
\mu &= \frac{1}{N}\sum_{i=1}^N x_i
\end{align}
$
Therefore, $\mu_{ML}$ is equal to the sample mean.
Now, maximizing the likelihood for $\sigma^2$, we take the derivative with respect to $\sigma^2$,
$
\frac{\partial}{\partial \sigma^{2}} \log p\left(D \mid \mu, \sigma^{2}\right) = -\frac{N}{2}\frac{2\pi}{2\pi\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i - \mu)^2 = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(x_i - \mu)^2 = 0
$
Multiplying by $2\sigma^4$,
$
\begin{align}
-N\sigma^2 + \sum_{i=1}^{N}(x_i - \mu)^2 &= 0 \\
\sigma^2_{ML} &= \frac{1}{N}\sum_{i=1}^N (x_i - \mu_{ML})^2
\end{align}
$
Therefore, $\sigma^2_{ML}$ is equal to the sample variance, computed about the sample mean $\mu_{ML}$.
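A minimal numerical check of these two closed-form estimates, assuming NumPy and synthetic data with made-up true parameters:

```python
import numpy as np

# Synthetic i.i.d. Gaussian data.
rng = np.random.default_rng(2)
x = rng.normal(loc=-1.0, scale=2.0, size=10_000)

mu_ml = np.mean(x)                     # sample mean
sigma2_ml = np.mean((x - mu_ml) ** 2)  # sample variance with the 1/N factor

print(mu_ml, sigma2_ml)
print(np.var(x, ddof=0))               # identical to sigma2_ml
```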
## MLE of Binomial Distribution
The likelihood function for observing $m$ successes in $n$ trials, each with success probability $p$, is
$
f(m \mid n, p)=\left(\begin{array}{c}
n \\
m
\end{array}\right) p^{m}(1-p)^{n-m}
$
We find the extremum by setting the derivative of the log-likelihood to zero:
$
\frac{d}{d p} \ln f(m \mid n, p)=\frac{m}{p}-\frac{n-m}{1-p}=0 \Longleftrightarrow m(1-p)=p(n-m) \Longleftrightarrow m=p n \Longleftrightarrow p_{ML}=\frac{m}{n}
$
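A quick sanity check of $p_{ML}=m/n$ against a brute-force scan of the binomial log-likelihood (assuming SciPy; the counts are made up):

```python
import numpy as np
from scipy.stats import binom

n, m = 40, 13  # hypothetical: 13 successes in 40 trials

# Evaluate the log-likelihood on a fine grid of p and pick the maximizer.
p_grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
log_lik = binom.logpmf(m, n, p_grid)

print(p_grid[np.argmax(log_lik)])  # ~0.325
print(m / n)                       # 0.325, the closed-form MLE
```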
## Bias in Maximum Likelihood Estimators
Bias is the difference between the expected value of an estimator and the true value of the quantity it estimates. To check the ML estimators for bias, let's do a sanity check.
Suppose we draw different datasets $D_1, D_2, \ldots$ from the same Gaussian distribution. Since $\mu_{ML}$ and $\sigma^2_{ML}$ are functions of the dataset values, each dataset yields a different estimate, and bias asks whether these estimates are correct on average.
Now, for the case of $\mu_{ML}$,
$
\begin{align}
\mathbb{E}_{D \sim p\left(D \mid \mu, \sigma^{2}\right)}\left[\mu_{M L}\right] &=\mathbb{E}\left[\frac{1}{N} \sum_{i=1}^{N} x_{i}\right] \\
&= \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{D \sim p\left(D \mid \mu, \sigma^{2}\right)}[x_{i}] \\
&= \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{x_i \sim p\left(x_i \mid \mu, \sigma^{2}\right)}[x_{i}] \\
&= \frac{1}{N} \sum_{i=1}^{N} \mu \\
&= \mu
\end{align}
$
Therefore, the bias of the estimator is $\mathbb{E}[\mu_{ML}] - \mu = 0$, i.e. the ML estimate of the mean is unbiased.
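A small Monte Carlo sanity check of this result (NumPy assumed, parameter values made up): average $\mu_{ML}$ over many independently drawn datasets and compare with the true mean.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, sigma_true, N, trials = 5.0, 2.0, 10, 50_000

# Each row is one dataset of N samples; one mu_ML per dataset.
datasets = rng.normal(mu_true, sigma_true, size=(trials, N))
mu_ml = datasets.mean(axis=1)

print(mu_ml.mean())  # very close to mu_true = 5.0
```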
In the case of the variance,
$
\begin{align}
\mathbb{E}_{D \sim p(D \mid \mu, \sigma^{2})}\left[\sigma_{M L}^{2}\right]&=\mathbb{E}\left[\frac{1}{N} \sum_{i=1}^{N}\left(x_{i}-\frac{1}{N} \sum_{n=1}^{N} x_{n}\right)^{2}\right] \\
&=\frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left[\left(x_{i}-\frac{1}{N} \sum_{n=1}^{N} x_{n}\right)^{2}\right]\\
&= \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\left[x_{i}^{2}-\frac{2 x_{i}}{N} \sum_{n=1}^{N} x_{n}+\frac{1}{N^{2}} \sum_{m=1}^{N} \sum_{n=1}^{N} x_{m} x_{n}\right]\\
&=\frac{1}{N} \sum_{i=1}^{N}\left\{\mathbb{E}\left[x_{i}^{2}\right]-\frac{2}{N} \sum_{n=1}^{N} \mathbb{E}\left[x_{i} x_{n}\right]+\frac{1}{N^{2}} \sum_{m=1}^{N} \sum_{n=1}^{N} \mathbb{E}\left[x_{m} x_{n}\right]\right\}
\end{align}
$
Now, using the definition of covariance and the independence of the samples,
$
\begin{align}
\operatorname{cov}[x_i, x_i] = \mathbb{E}[x_i^2] - \mathbb{E}[x_i]^2 = \sigma^2 \;&\Rightarrow\; \mathbb{E}[x_i^2] = \mu^2 + \sigma^2 \\
\operatorname{cov}[x_i, x_j] = \mathbb{E}[x_i x_j] - \mathbb{E}[x_i]\,\mathbb{E}[x_j] = 0 \;\;(i \neq j) \;&\Rightarrow\; \mathbb{E}[x_i x_j] = \mu^2
\end{align}
$
Using these values in the expression above, and noting that $\sum_{n=1}^{N} \mathbb{E}[x_i x_n] = (N-1)\mu^2 + (\mu^2+\sigma^2) = N\mu^2 + \sigma^2$ and $\sum_{m=1}^{N}\sum_{n=1}^{N} \mathbb{E}[x_m x_n] = N^2\mu^2 + N\sigma^2$,
$
\begin{align}
\mathbb{E}_{D \sim p(D \mid \mu, \sigma^{2})}\left[\sigma_{M L}^{2}\right]&= \frac{1}{N} \sum_{i=1}^{N}\left\{\mu^{2}+\sigma^{2}-\frac{2}{N}\left(N \mu^{2}+\sigma^{2}\right)+\frac{1}{N^{2}}\left(N^{2} \mu^{2}+N \sigma^{2}\right)\right\} \\
&= \frac{N-1}{N}\sigma^2
\end{align}
$
Therefore, the bias of the estimator is $\mathbb{E}[\sigma^2_{ML}] - \sigma^2 = -\frac{\sigma^2}{N}$: on average, $\sigma^2_{ML}$ underestimates the true variance by a factor of $\frac{N-1}{N}$.
This means the ML estimate of the variance is noticeably biased when $N$ is small, but as $N \to \infty$ the bias vanishes. The bias can also be removed by multiplying $\sigma^2_{ML}$ by $\frac{N}{N-1}$ (Bessel's correction), which gives the unbiased sample variance.
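A minimal Monte Carlo sketch of the $\frac{N-1}{N}$ factor and of Bessel's correction (NumPy assumed, parameter values made up):

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma2_true, N, trials = 0.0, 4.0, 5, 200_000

# Each row is one dataset; compute the biased (1/N) variance estimate per dataset.
datasets = rng.normal(mu_true, np.sqrt(sigma2_true), size=(trials, N))
sigma2_ml = ((datasets - datasets.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

print(sigma2_ml.mean())                  # ~ (N-1)/N * sigma2_true = 3.2
print((N / (N - 1)) * sigma2_ml.mean())  # ~ 4.0 after Bessel's correction
```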