# Monte-Carlo Estimation
## Motivation
We are often interested in computing quantities (statistics) of random variables. For example:
- the average response to a drug
- the probability of a particular sum when throwing two dice
- the average reconstruction of a VAE given an input.
These statistics are often intractable to compute:
- Cannot derive a perfect drug response model (too complex)
- Cannot enumerate all possible dice combinations (too lazy)
- Cannot compute the VAE reconstruction average analytically (intractable integrals)
## Monte Carlo Integration
Use random sampling instead of analytical computation. A single random sample might not be enough, but many random samples can give us a reliable quantification. E.g., by throwing the dice many times we can obtain a histogram of probabilities for each possible sum. If we throw the dice once, the histogram will be very wrong (just a single bar), but if we repeat hundreds of times and average, we will get very close (law of large numbers).
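As a quick sanity check, here is a minimal Python sketch of this dice experiment (the sample count and variable names are purely illustrative):
```python
import numpy as np

# Estimate the probability of each possible sum of two dice by repeated random throws.
rng = np.random.default_rng(0)

n_throws = 100_000
throws = rng.integers(1, 7, size=(n_throws, 2)).sum(axis=1)  # sums in 2..12

# Empirical histogram of probabilities (one bin per possible sum).
sums, counts = np.unique(throws, return_counts=True)
probs = counts / n_throws

for s, p in zip(sums, probs):
    print(f"P(sum = {s:2d}) ~ {p:.3f}")
# e.g. P(sum = 7) should approach 6/36 = 0.167 as n_throws grows.
```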
More formally, in MC integration we treat the inputs $x$ as random variables with pdf $p(x)$. Our desired statistic $y$ is obtained by integrating a function $f$ over all possible $x$
$
y=\int_{x} f(x) p(x) d x
$
This integral is equivalent to an expectation
$
y = \int_{x} f(x) p(x) d x = \mathbb{E}_{x \sim p(x)}[f(x)]
$
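For instance, the average sum of two fair dice fits this template, with the integral becoming a sum over the 36 equally likely outcomes, $p(x)=\frac{1}{36}$, and $f(x)$ the sum of the two faces
$
y=\mathbb{E}_{x \sim p(x)}[f(x)]=\sum_{x} f(x)\, p(x)=\frac{1}{36}(2+3+3+\dots+12)=7
$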
We can approximate this expectation by random sampling and summation
$y=\mathbb{E}_{x \sim p(x)}[f(x)] \approx \frac{1}{n} \sum_{i} f\left(x_{i}\right)=\hat{y}$ where $x_{i}$ is sampled from $p(x)$.
Note that $\hat{y}$ is called an estimator because it only approximates the true value of $y$.
If we want to compute a quantity $y$ that we can express as an integral of a function $f$ over a probability space $x$ with a known and easy-to-sample pdf $p(x)$, we can replace the exact but intractable computation with a tractable MC estimator
$
y=\mathbb{E}_{x \sim p(x)}[f(x)] \approx \frac{1}{n} \sum_{i} f\left(x_{i}\right), x_{i} \sim p(x)
$
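A minimal sketch of this recipe in Python, assuming purely for illustration that $p(x)=\mathcal{N}(0,1)$ and $f(x)=x^{2}$, so the true value is $y=\mathbb{E}[x^2]=1$:
```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sample, n):
    """MC estimator: average f over n samples drawn from p(x)."""
    x = sample(n)
    return f(x).mean()

# Illustrative choice: p(x) = N(0, 1), f(x) = x^2, so the true y = E[x^2] = 1.
f = lambda x: x ** 2
sample = lambda n: rng.standard_normal(n)

for n in (10, 1_000, 100_000):
    print(n, mc_estimate(f, sample, n))
# The estimate fluctuates around the true value 1 and tightens as n grows.
```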
If we can't write the quantity as such an integral, we can't estimate it this way. For instance, we cannot use MC on the following, because neither $\log p(\boldsymbol{x} \mid \mathbf{z})$ nor $\nabla_{\varphi} q_{\varphi}(\mathbf{z} \mid \boldsymbol{x})$ is a probability density, so the integrand cannot be read as a function times a pdf we can sample from:
$
\nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(\mathbf{z} \mid x)}[\log p(\boldsymbol{x} \mid \mathbf{z})]=\int_{Z} \log p(\boldsymbol{x} \mid \mathbf{z}) \nabla_{\varphi} q_{\varphi}(\mathbf{z} \mid \boldsymbol{x}) d \mathbf{z}
$
## Estimator mean and variance
Our estimator $\hat{y}$ is itself a random variable, so it has its own mean $\mu_{\hat{y}}=\mathbb{E}[\hat{y}]$ and variance $\operatorname{Var}[\hat{y}]=\mathbb{E}\left[\left(\hat{y}-\mu_{\hat{y}}\right)^{2}\right]$.
The higher the variance, the more the estimate fluctuates from experiment to experiment.
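We can inspect both quantities empirically by repeating the whole estimation many times and looking at the spread of the resulting $\hat{y}$ values; a sketch with the same illustrative setup as above ($p=\mathcal{N}(0,1)$, $f(x)=x^{2}$):
```python
import numpy as np

rng = np.random.default_rng(0)

# Repeat the MC estimation many times to inspect the estimator's own
# mean and variance (illustrative setup: p = N(0,1), f(x) = x^2, true y = 1).
def estimate(n):
    x = rng.standard_normal(n)
    return (x ** 2).mean()

n_samples = 100      # samples per estimate
n_repeats = 10_000   # independent repetitions of the experiment

estimates = np.array([estimate(n_samples) for _ in range(n_repeats)])
print("mean of y_hat:    ", estimates.mean())  # close to the true y = 1
print("variance of y_hat:", estimates.var())   # shrinks if n_samples grows
```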
### Estimator bias
An estimator is unbiased if its expected value equals the true parameter; for example, the sample mean $\bar{x}$ is an unbiased estimator of the population mean $\mu$ because $\mathbb{E}[\bar{x}]=\mu$. In our notation, $\hat{y}$ is unbiased if
$
\mathbb{E}[\hat{y}]=y
$
Otherwise, it is biased, with bias
$
\text { bias }=\mathbb{E}[\hat{y}]-y
$
It is better to have unbiased estimators, but in some cases a bit of bias is acceptable: we trade some accuracy for tractability.
MC estimators are unbiased due to linearity of expectation.
For estimating $\mathbb{E}[f(X)]$ using $\hat{y} = \frac{1}{n}\sum_{i=1}^n f(X_i)$ with samples $X_i \sim p(x)$:
$\mathbb{E}[\hat{y}] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n f(X_i)\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[f(X_i)] = \mathbb{E}[f(X)]$
The expectation operator "pushes through" the summation and averaging operations, giving us the true value. This holds for any sample size $n$, making the estimator unbiased regardless of how many samples we use.
This is why MC methods are so powerful: you get this theoretical guarantee basically for free, regardless of how complex your function $f(X)$ is or what distribution you're sampling from (though of course it can take a lot of samples for good accuracy).
Note: The Law of Large Numbers ensures the estimator is also consistent (converges to the true value as $n \to \infty$), but unbiasedness is a separate property.
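A quick empirical check of the unbiasedness claim, again with the illustrative $\mathcal{N}(0,1)$, $f(x)=x^{2}$ setup and a deliberately tiny $n$:
```python
import numpy as np

rng = np.random.default_rng(0)

# Unbiasedness check: even with n = 2 samples per estimate, the average of
# many independent estimates matches the true value E[x^2] = 1 for x ~ N(0,1).
n = 2
n_repeats = 200_000

estimates = [(rng.standard_normal(n) ** 2).mean() for _ in range(n_repeats)]
print(np.mean(estimates))  # close to 1.0, even though each estimate is very noisy
```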
## Standard error of MC estimator
The MC estimator is a sample mean
$
\mathbb{E}_{x \sim p(x)}[f(x)] \approx \frac{1}{n} \sum_{i} f\left(x_{i}\right)
$
The standard error of a sample mean is
$
\sigma_{\hat{f}}=\frac{\sigma}{\sqrt{n}}
$
where $\sigma$ is the standard deviation of $f(x)$ under $p(x)$ and $n$ is the number of samples.
The more samples we take, the less the estimator deviates. But the deviation shrinks only as $1/\sqrt{n}$: with $4\times$ more samples we only halve the error.
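A small sketch of this scaling with the same illustrative setup as before: quadrupling the sample count should roughly halve the empirical standard error.
```python
import numpy as np

rng = np.random.default_rng(0)

def std_error(n, n_repeats=5_000):
    """Empirical standard deviation of the MC estimator y_hat for a given n."""
    estimates = [(rng.standard_normal(n) ** 2).mean() for _ in range(n_repeats)]
    return np.std(estimates)

se_n = std_error(1_000)
se_4n = std_error(4_000)
print(se_n, se_4n, se_n / se_4n)  # ratio should come out roughly 2
```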