# Importance Sampling
Importance sampling is a mathematical technique for estimating properties of one probability distribution using only samples drawn from a different distribution.
Suppose you want to estimate the expected value of a function $f(x)$ under a probability distribution $p(x)$, but you only have samples from a different distribution $q(x)$. Importance sampling can be used to provide an unbiased estimate of this expected value.
Mathematically, the expected value of $f(x)$ under $p(x)$ is given by:
$$
\mathbb{E}_{p}[f(x)] = \int f(x)p(x)dx
$$
However, you only have samples from $q(x)$. The importance sampling estimator uses a weighted average of the function values at the sampled points, with weights given by the ratio of the probabilities under the two distributions. The estimator is given by:
$$
\hat{\mathbb{E}}_{p}[f(x)] = \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)} f(x_i)
$$
where $x_i$ are samples drawn from $q(x)$, and $N$ is the total number of samples.
The ratio $\frac{p(x_i)}{q(x_i)}$ is known as the importance weight and corrects for the fact that the samples are drawn from $q(x)$ rather than $p(x)$. This ratio adjusts the contribution of each sample to reflect how likely it would be under the distribution of interest, $p(x)$.
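As a concrete illustration, here is a minimal NumPy/SciPy sketch of this estimator. The choices of $f(x) = x^2$, target $p = \mathcal{N}(0, 1)$, and proposal $q = \mathcal{N}(1, 2^2)$ are illustrative assumptions, not fixed by anything above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Target p(x) = N(0, 1), proposal q(x) = N(1, 2^2), f(x) = x^2 (illustrative choices).
p = norm(loc=0.0, scale=1.0)
q = norm(loc=1.0, scale=2.0)

def f(x):
    return x ** 2

N = 100_000
x = q.rvs(size=N, random_state=rng)      # samples come from q, not p
weights = p.pdf(x) / q.pdf(x)            # importance weights p(x_i) / q(x_i)
estimate = np.mean(weights * f(x))       # unbiased estimate of E_p[f(x)]

print(estimate)   # should be close to E_p[x^2] = 1.0
```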
An important aspect to consider in importance sampling is the choice of $q(x)$. Ideally, $q(x)$ should be close to $p(x)$ and should have tails at least as heavy as those of $p(x)$ (and of $|f(x)|p(x)$), so that the importance weights do not blow up in the tails. If $q(x)$ is very different from $p(x)$, the importance weights can vary greatly, leading to high variance in the estimate. This is a common challenge in implementing importance sampling effectively.
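One way to see this in practice is to look at the spread of the weights, for example via the Kish effective sample size $N_{\mathrm{eff}} = (\sum_i w_i)^2 / \sum_i w_i^2$. The sketch below (again with illustrative distributions of my own choosing) compares a proposal that roughly matches $p$ against one that does not:

```python
import numpy as np
from scipy.stats import norm

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return weights.sum() ** 2 / np.sum(weights ** 2)

rng = np.random.default_rng(0)
p = norm(0.0, 1.0)        # target distribution
N = 100_000

# A well-matched proposal vs. a badly-matched one (illustrative choices).
for q in (norm(0.0, 1.5), norm(4.0, 0.5)):
    x = q.rvs(size=N, random_state=rng)
    w = p.pdf(x) / q.pdf(x)
    print(f"q = N({q.mean()}, {q.std()**2:.2f}): "
          f"weight variance = {w.var():.3g}, "
          f"ESS = {effective_sample_size(w):.0f} of {N}")
```

The badly-matched proposal concentrates almost all the weight on a handful of samples, which is exactly the high-variance failure mode described above.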
## Application Examples
**Off-policy RL:** Evaluate a target policy π using trajectories collected under a behavior policy μ. The return $G_t$ is reweighted by the product of the per-step ratios along the trajectory:
$V^π(s) = \mathbb{E}_μ\left[\rho_{t:T-1}\, G_t \mid S_t = s\right], \qquad \rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{π(A_k|S_k)}{μ(A_k|S_k)}$
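A self-contained toy sketch of this estimator (the one-state episodic setup, rewards, and policies below are illustrative assumptions of mine, not a standard benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy episodic setup: a single state, two actions, a fixed reward per action,
# episodes of length T, discount gamma = 1 (all illustrative choices).
r = np.array([1.0, 0.0])            # reward for action 0 and action 1
T = 3                               # episode length
pi = np.array([0.8, 0.2])           # target policy  π(a|s)
mu = np.array([0.5, 0.5])           # behavior policy μ(a|s) used to collect the data

n_episodes = 100_000
actions = rng.choice(2, size=(n_episodes, T), p=mu)    # trajectories collected under μ
returns = r[actions].sum(axis=1)                       # return G_t of each episode (t = 0)
rho = np.prod(pi[actions] / mu[actions], axis=1)       # product of per-step ratios

print("IS estimate of V^π:", np.mean(rho * returns))   # ≈ T * Σ_a π(a) r(a) = 2.4
print("true value:        ", T * np.sum(pi * r))
```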
**Policy optimization methods:**
[[TRPO - Trust-Region Policy Optimization]] optimizes surrogate objective with KL constraint:
$L(\theta) = \mathbb{E}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A(s,a)\right]$
$\text{s.t. } \mathbb{E}[D_{KL}(\pi_{\theta_{old}}(\cdot|s) || \pi_\theta(\cdot|s))] \leq \delta$
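A small NumPy sketch of how the surrogate and the KL constraint are typically estimated from a batch collected under $\pi_{\theta_{old}}$. The probabilities and advantages here are random placeholders standing in for policy-network outputs and an advantage estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder batch: probabilities each policy assigns to the actions actually taken,
# plus advantage estimates for those (state, action) pairs.
batch_size = 1024
pi_old = rng.uniform(0.1, 0.9, size=batch_size)   # pi_theta_old(a_i | s_i)
pi_new = np.clip(pi_old * rng.lognormal(0.0, 0.1, size=batch_size), 1e-6, 1.0)  # pi_theta(a_i | s_i)
advantages = rng.normal(size=batch_size)          # A(s_i, a_i)

ratio = pi_new / pi_old                           # importance sampling ratio
surrogate = np.mean(ratio * advantages)           # L(theta), estimated on pi_theta_old's data

# Sample-based estimate of E_s[ D_KL(pi_theta_old(.|s) || pi_theta(.|s)) ],
# using actions drawn from pi_theta_old (noisy for small batches).
kl_estimate = np.mean(np.log(pi_old) - np.log(pi_new))

delta = 0.01
print(f"surrogate = {surrogate:.4f}, KL ≈ {kl_estimate:.4f}, "
      f"constraint satisfied: {kl_estimate <= delta}")
```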
[[PPO - Proximal Policy Optimization]] uses clipped importance sampling ratio:
$L(\theta) = \mathbb{E}\left[\min\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A(s,a), \text{clip}\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}, 1-\epsilon, 1+\epsilon\right) A(s,a)\right)\right]$
where
- **π_θ**: The *current* policy being updated (parameterized by θ)
- **π_θ_old**: The *previous* policy from the last iteration
- **π*** (optimal policy): What we're ultimately trying to converge to
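A minimal NumPy sketch of the clipped objective itself (the function name and example numbers are mine; ε = 0.2 is just a commonly used default):

```python
import numpy as np

def ppo_clipped_objective(pi_new, pi_old, advantages, epsilon=0.2):
    """Clipped surrogate objective L(theta) from the formula above (to be maximized)."""
    ratio = pi_new / pi_old                                   # π_θ(a|s) / π_θ_old(a|s)
    clipped_ratio = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))

# The clip removes the incentive to push the ratio far outside [1 - ε, 1 + ε]:
print(ppo_clipped_objective(np.array([3.0]), np.array([1.0]), np.array([2.0])))   # ≈ 2.4, not 6.0
print(ppo_clipped_objective(np.array([0.2]), np.array([1.0]), np.array([-1.0])))  # ≈ -0.8, not -0.2
```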
The process works iteratively:
1. Collect data using π_θ_old
2. Use importance sampling to estimate how π_θ would perform on that data
3. Update θ to improve the policy
4. Repeat: π_θ becomes the new π_θ_old
The importance sampling ratio π_θ(a|s)/π_θ_old(a|s) lets us reuse old data to evaluate the new policy, avoiding the need to collect fresh data at every step (which would be expensive).
So π_θ is our current best guess at the optimal policy, which we keep improving until hopefully π_θ ≈ π*.
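To make the loop concrete, here is a compact end-to-end sketch on a toy 3-armed bandit, assuming PyTorch is available; the bandit, hyperparameters, and the simple baseline used for the advantage are all illustrative choices, not part of the PPO paper:

```python
import torch

torch.manual_seed(0)

# Toy 3-armed bandit: the "environment" is just a fixed reward per action (illustrative).
rewards = torch.tensor([1.0, 0.5, 0.0])
logits = torch.zeros(3, requires_grad=True)        # parameters θ of the current policy
optimizer = torch.optim.Adam([logits], lr=0.05)
epsilon, batch = 0.2, 512

for iteration in range(20):
    # 1. Collect data with π_θ_old (a frozen snapshot of the current policy).
    with torch.no_grad():
        probs_old = torch.softmax(logits, dim=0)
        actions = torch.multinomial(probs_old, batch, replacement=True)
        advantages = rewards[actions] - (probs_old * rewards).sum()   # simple baseline
        logp_old = torch.log(probs_old[actions])

    # 2–3. Several updates of π_θ reusing the same batch via the clipped IS ratio.
    for _ in range(5):
        logp_new = torch.log_softmax(logits, dim=0)[actions]
        ratio = torch.exp(logp_new - logp_old)
        clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
        loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # 4. Repeat: the next iteration's π_θ_old is the policy we just updated.

print(torch.softmax(logits, dim=0))   # most probability mass should end up on action 0
```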