# Importance Sampling

Importance sampling is a mathematical technique used to estimate properties of a particular probability distribution while only having samples from a different distribution. Suppose you want to estimate the expected value of a function $f(x)$ under a probability distribution $p(x)$, but you only have samples from a different distribution $q(x)$. Importance sampling provides an unbiased estimate of this expected value (provided $q(x) > 0$ wherever $p(x)f(x) \neq 0$).

Mathematically, the expected value of $f(x)$ under $p(x)$ is given by:

$\mathbb{E}_{p}[f(x)] = \int f(x)\,p(x)\,dx$

However, you only have samples from $q(x)$. The importance sampling estimator therefore takes a weighted average of the function values at the sampled points, with weights given by the ratio of the probabilities under the two distributions:

$\hat{\mathbb{E}}_{p}[f(x)] = \frac{1}{N} \sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)} f(x_i)$

where $x_i$ are samples drawn from $q(x)$ and $N$ is the total number of samples. The ratio $\frac{p(x_i)}{q(x_i)}$ is known as the importance weight; it corrects for the fact that the samples are drawn from $q(x)$ rather than $p(x)$ by adjusting the contribution of each sample to reflect how likely it would be under the distribution of interest $p(x)$.

An important consideration is the choice of $q(x)$. Ideally, $q(x)$ should be close to $p(x)$ (more precisely, roughly proportional to $p(x)|f(x)|$), and its tails should be no lighter than those of $p(x)$ so the weights stay bounded. If $q(x)$ is very different from $p(x)$, the importance weights can vary greatly, leading to high variance in the estimate. This is a common challenge in applying importance sampling effectively. (A minimal numerical sketch of the estimator appears at the end of this note.)

## Application Examples

**Off-policy RL:** Evaluate a target policy $\pi$ using data collected from a behavior policy $\mu$:

$V^{\pi}(s) = \mathbb{E}_{\mu}\left[\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} G_t \mid S_t = s\right]$

**Policy optimization methods:** [[TRPO - Trust-Region Policy Optimization]] optimizes a surrogate objective subject to a KL constraint:

$L(\theta) = \mathbb{E}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A(s,a)\right]$

$\text{s.t. } \mathbb{E}[D_{KL}(\pi_{\theta_{old}}(\cdot|s) || \pi_\theta(\cdot|s))] \leq \delta$

[[PPO - Proximal Policy Optimization]] uses a clipped importance sampling ratio:

$L(\theta) = \mathbb{E}\left[\min\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A(s,a), \text{clip}\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}, 1-\epsilon, 1+\epsilon\right) A(s,a)\right)\right]$

where

- **π_θ**: the *current* policy being updated (parameterized by θ)
- **π_θ_old**: the *previous* policy from the last iteration
- **π*** (the optimal policy): what we are ultimately trying to converge to

The process works iteratively:

1. Collect data using π_θ_old.
2. Use importance sampling to estimate how π_θ would perform on that data.
3. Update θ to improve the policy.
4. Repeat: π_θ becomes the new π_θ_old.

The importance sampling ratio π_θ(a|s)/π_θ_old(a|s) lets us reuse old data to evaluate the new policy, avoiding the need to collect fresh data at every step (which would be expensive). So π_θ is our current best guess at the optimal policy, which we keep improving until, hopefully, π_θ ≈ π*.
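As a concrete sanity check of the basic estimator above, here is a minimal sketch (an illustrative example of mine, using only NumPy; none of the names come from a specific codebase). It estimates $\mathbb{E}_p[x^2]$ for $p = \mathcal{N}(0,1)$ using samples drawn only from a wider proposal $q = \mathcal{N}(0,2)$; the true value is 1.

```python
import numpy as np

# Illustrative sketch (assumed setup, not from any particular library or paper):
# estimate E_p[f(X)] with f(x) = x^2, target p = N(0, 1), proposal q = N(0, 2).
def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
N = 100_000

x = rng.normal(loc=0.0, scale=2.0, size=N)             # x_i ~ q
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # importance weights p(x_i)/q(x_i)
f = x ** 2

estimate = np.mean(w * f)                               # (1/N) * sum_i w_i * f(x_i)
print(f"IS estimate of E_p[x^2]: {estimate:.4f} (true value: 1.0)")
```

Because $q$ here has heavier tails than $p$, the weights stay bounded; rerunning the sketch with a narrower proposal (e.g. scale 0.5) makes the weights, and hence the variance of the estimate, blow up, which is exactly the failure mode described above.

The PPO clipped surrogate can be expressed in the same style. The helper below is a sketch with hypothetical array names (`logp_new`, `logp_old`, `advantages`) standing in for per-sample log-probabilities and advantage estimates computed elsewhere; it is not the API of any specific RL library.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate L(theta) from the PPO formula above (sketch, hypothetical names)."""
    ratio = np.exp(logp_new - logp_old)                   # pi_theta(a|s) / pi_theta_old(a|s)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Elementwise min of the unclipped and clipped terms, averaged over samples.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```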