# Monte-Carlo Estimation
## Motivation
We are often interested in computing quantities (statistics) of random variables. For example:
- the average response to a drug
- the probability of a particular sum when throwing two dice
- the average reconstruction of a VAE given an input.
These statistics are often intractable to compute:
- Cannot derive a perfect drug response model (too complex)
- Cannot enumerate all possible dice combinations (too lazy)
- Cannot compute the VAE reconstruction average analytically (intractable integrals)
## Monte Carlo Integration
Use random sampling instead of analytical computation. A single random sample might not be enough, but many random samples can give us a reliable quantification. E.g., by throwing the dice many times we can obtain a histogram of probabilities for each possible sum. If we throw the dice once, the histogram will be very wrong (just a single bar), but if we repeat hundreds of times and average, we will get very close (law of large numbers).
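As a quick sanity check, here is a minimal Python sketch of this dice experiment (the sample count and variable names are purely illustrative):
```python
import numpy as np

# Estimate the probability of each possible sum of two dice by repeated random throws.
rng = np.random.default_rng(0)

n_throws = 100_000
throws = rng.integers(1, 7, size=(n_throws, 2)).sum(axis=1)  # sums in 2..12

# Empirical histogram of probabilities (one bin per possible sum).
sums, counts = np.unique(throws, return_counts=True)
probs = counts / n_throws

for s, p in zip(sums, probs):
    print(f"P(sum = {s:2d}) ~ {p:.3f}")
# e.g. P(sum = 7) should approach 6/36 = 0.167 as n_throws grows.
```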
More formally, in MC integration we treat the inputs $x$ as random variables with pdf $p(x)$. Our desired statistic $y$ is obtained by integrating a function $f$ over all possible $x$
$
y=\int_{x} f(x) p(x) d x
$
This integral is equivalent to an expectation
$
y = \int_{x} f(x) p(x) d x = \mathbb{E}_{x \sim p(x)}[f(x)]
$
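For instance, the average sum of two fair dice fits this template, with the integral becoming a sum over the 36 equally likely outcomes, $p(x)=\frac{1}{36}$, and $f(x)$ the sum of the two faces
$
y=\mathbb{E}_{x \sim p(x)}[f(x)]=\sum_{x} f(x)\, p(x)=\frac{1}{36}(2+3+3+\dots+12)=7
$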
We can approximate this expectation by random sampling and summation
$y=\mathbb{E}_{x \sim p(x)}[f(x)] \approx \frac{1}{n} \sum_{i} f\left(x_{i}\right)=\hat{y}$ where $x_{i}$ is sampled from $p(x)$.
Note that $\hat{y}$ is called an estimator because it only approximates the true value of $y$.
If we want to compute a quantity $y$ that we can express as an integral of a function $f$ over a probability space $x$ with a known and easy-to-sample pdf $p(x)$, we can replace the exact but intractable computation with a tractable MC estimator
$
y=\mathbb{E}_{x \sim p(x)}[f(x)] \approx \frac{1}{n} \sum_{i} f\left(x_{i}\right), x_{i} \sim p(x)
$
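A minimal sketch of this recipe in Python, assuming purely for illustration that $p(x)=\mathcal{N}(0,1)$ and $f(x)=x^{2}$, so the true value is $y=\mathbb{E}[x^2]=1$:
```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sample, n):
    """MC estimator: average f over n samples drawn from p(x)."""
    x = sample(n)
    return f(x).mean()

# Illustrative choice: p(x) = N(0, 1), f(x) = x^2, so the true y = E[x^2] = 1.
f = lambda x: x ** 2
sample = lambda n: rng.standard_normal(n)

for n in (10, 1_000, 100_000):
    print(n, mc_estimate(f, sample, n))
# The estimate fluctuates around the true value 1 and tightens as n grows.
```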
If we can't write the quantity as such an integral, we can't estimate it this way. For instance, we cannot use MC on the following, because neither $\log p(\boldsymbol{x} \mid \mathbf{z})$ nor $\nabla_{\varphi} q_{\varphi}(\mathbf{z} \mid \boldsymbol{x})$ is a probability density, so the integrand cannot be read as a function times a pdf we can sample from:
$
\nabla_{\varphi} \mathbb{E}_{\mathbf{z} \sim q_{\varphi}(\mathbf{z} \mid x)}[\log p(\boldsymbol{x} \mid \mathbf{z})]=\int_{Z} \log p(\boldsymbol{x} \mid \mathbf{z}) \nabla_{\varphi} q_{\varphi}(\mathbf{z} \mid \boldsymbol{x}) d \mathbf{z}
$
## Estimator mean and variance
Our estimator $\hat{y}$ is itself a random variable, so it has its own mean $\mu_{\hat{y}}=\mathbb{E}[\hat{y}]$ and variance $\operatorname{Var}[\hat{y}]=\mathbb{E}\left[\left(\hat{y}-\mu_{\hat{y}}\right)^{2}\right]$.
The higher the variance, the more the estimate fluctuates from experiment to experiment.
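We can inspect both quantities empirically by repeating the whole estimation many times and looking at the spread of the resulting $\hat{y}$ values; a sketch with the same illustrative setup as above ($p=\mathcal{N}(0,1)$, $f(x)=x^{2}$):
```python
import numpy as np

rng = np.random.default_rng(0)

# Repeat the MC estimation many times to inspect the estimator's own
# mean and variance (illustrative setup: p = N(0,1), f(x) = x^2, true y = 1).
def estimate(n):
    x = rng.standard_normal(n)
    return (x ** 2).mean()

n_samples = 100      # samples per estimate
n_repeats = 10_000   # independent repetitions of the experiment

estimates = np.array([estimate(n_samples) for _ in range(n_repeats)])
print("mean of y_hat:    ", estimates.mean())  # close to the true y = 1
print("variance of y_hat:", estimates.var())   # shrinks if n_samples grows
```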
### Estimator bias
An estimator is unbiased if its expected value equals the true parameter; for example, the sample mean $\bar{x}$ is an unbiased estimator of the population mean $\mu$ because $\mathbb{E}[\bar{x}]=\mu$. In our notation, $\hat{y}$ is unbiased if
$
\mathbb{E}[\hat{y}]=y
$
Otherwise, it is biased, with bias
$
\text { bias }=\mathbb{E}[\hat{y}]-y
$
It is better to have unbiased estimators, but in some cases a bit of bias is acceptable: we trade some accuracy for tractability.
MC estimators are unbiased due to linearity of expectation.
For estimating $\mathbb{E}[f(X)]$ using $\hat{y} = \frac{1}{n}\sum_{i=1}^n f(X_i)$ with samples $X_i \sim p(x)$:
$\mathbb{E}[\hat{y}] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n f(X_i)\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[f(X_i)] = \mathbb{E}[f(X)]$
The expectation operator "pushes through" the summation and averaging operations, giving us the true value. This holds for any sample size $n$, making the estimator unbiased regardless of how many samples we use.
This is why MC methods are so powerful: you get this theoretical guarantee basically for free, regardless of how complex your function $f(X)$ is or what distribution you're sampling from (though of course it can take a lot of samples for good accuracy).
Note: The Law of Large Numbers ensures the estimator is also consistent (converges to the true value as $n \to \infty$), but unbiasedness is a separate property.
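A quick empirical check of the unbiasedness claim, again with the illustrative $\mathcal{N}(0,1)$, $f(x)=x^{2}$ setup and a deliberately tiny $n$:
```python
import numpy as np

rng = np.random.default_rng(0)

# Unbiasedness check: even with n = 2 samples per estimate, the average of
# many independent estimates matches the true value E[x^2] = 1 for x ~ N(0,1).
n = 2
n_repeats = 200_000

estimates = [(rng.standard_normal(n) ** 2).mean() for _ in range(n_repeats)]
print(np.mean(estimates))  # close to 1.0, even though each estimate is very noisy
```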
## Standard error of MC estimator
The MC estimator is a sample mean
$
\mathbb{E}_{x \sim p(x)}[f(x)] \approx \frac{1}{n} \sum_{i} f\left(x_{i}\right)
$
The standard error of a sample mean is
$
\sigma_{\hat{f}}=\frac{\sigma}{\sqrt{n}}
$
where $\sigma$ is the standard deviation of $f(x)$ under $p(x)$ and $n$ is the number of samples.
The more samples we take, the less the estimator deviates. But the deviation shrinks only as $1/\sqrt{n}$: with $4\times$ more samples we only halve the error.
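A small sketch of this scaling with the same illustrative setup as before: quadrupling the sample count should roughly halve the empirical standard error.
```python
import numpy as np

rng = np.random.default_rng(0)

def std_error(n, n_repeats=5_000):
    """Empirical standard deviation of the MC estimator y_hat for a given n."""
    estimates = [(rng.standard_normal(n) ** 2).mean() for _ in range(n_repeats)]
    return np.std(estimates)

se_n = std_error(1_000)
se_4n = std_error(4_000)
print(se_n, se_4n, se_n / se_4n)  # ratio should come out roughly 2
```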