# Maximum Mean Discrepancy (MMD)
Maximum Mean Discrepancy (MMD) is a statistical measure used to determine whether two probability distributions are different by comparing their mean embeddings in a reproducing kernel Hilbert space (RKHS).
Essentially, MMD maps data points from both distributions into a high-dimensional feature space using a kernel function (such as the RBF/Gaussian kernel), then computes the distance between the mean embeddings of the two distributions in that space. If the distributions are the same, their mean embeddings are identical; with a characteristic kernel (such as the Gaussian), the converse also holds, so MMD is zero if and only if the distributions coincide.
The MMD between distributions P and Q is defined as:
$\operatorname{MMD}^2(P, Q)=\left\|\mu_P-\mu_Q\right\|_{\mathcal{H}}^2$
where the mean embeddings in the RKHS $\mathcal{H}$ are:
$
\begin{aligned}
& \mu_P=\mathbb{E}_{x \sim P}[K(\cdot, x)]=\int_{\mathcal{X}} K(\cdot, x)\, dP(x) \\
& \mu_Q=\mathbb{E}_{y \sim Q}[K(\cdot, y)]=\int_{\mathcal{X}} K(\cdot, y)\, dQ(y)
\end{aligned}
$
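Expanding the squared RKHS norm via the kernel trick gives an equivalent form written purely in terms of kernel evaluations, which is what makes MMD estimable directly from samples:
$\operatorname{MMD}^2(P, Q)=\mathbb{E}_{x, x^{\prime} \sim P}\left[K\left(x, x^{\prime}\right)\right]-2\, \mathbb{E}_{x \sim P,\, y \sim Q}[K(x, y)]+\mathbb{E}_{y, y^{\prime} \sim Q}\left[K\left(y, y^{\prime}\right)\right]$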
MMD is particularly useful when you need a computationally tractable distance measure between empirical distributions without requiring density estimation.
A traditional approach might require:
- Large datasets to reliably estimate $P(f(X) \mid Z=Z_1)$ vs. $P(f(X) \mid Z=Z_2)$
- Parametric assumptions about distributions
- Complex statistical testing procedures
The MMD approach:
- Works even with just a few samples per group (see the code sketch after this list)
- Directly measures what we care about: distributional similarity
- Integrates seamlessly into modern deep learning pipelines
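The plug-in estimator replaces the kernel expectations above with sample averages. Below is a minimal NumPy sketch of the biased (V-statistic) estimate of $\operatorname{MMD}^2$ with an RBF kernel; the function names, the fixed bandwidth `sigma`, and the toy Gaussian data are illustrative assumptions, not something prescribed by the text above.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Pairwise RBF kernel matrix: K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    K_xx = rbf_kernel(X, X, sigma)
    K_yy = rbf_kernel(Y, Y, sigma)
    K_xy = rbf_kernel(X, Y, sigma)
    return K_xx.mean() - 2.0 * K_xy.mean() + K_yy.mean()

# Toy example: samples from two Gaussians with different means
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
Y = rng.normal(loc=1.0, scale=1.0, size=(100, 2))
print(mmd2_biased(X, Y, sigma=1.0))  # clearly > 0: distributions differ
print(mmd2_biased(X, X, sigma=1.0))  # exactly 0: same sample
```

Note that building the three kernel matrices is where the $O(n^2)$ cost mentioned below comes from: every pair of points within and across the two samples is evaluated once.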
## Key Differences from Other Measures
Two widely used divergence/distance measures that MMD is often compared against are [[KL Divergence]] and [[Wasserstein Distance]].
**Computational Properties:**
- **MMD**: Easy to compute from samples using kernel trick, O(n²) complexity
- **KL Divergence**: Requires explicit density estimation, which can be challenging in high dimensions
- **Wasserstein**: Requires solving optimal transport problem, computationally expensive (though approximations exist)
**Symmetry:**
- **MMD**: Symmetric - MMD(P,Q) = MMD(Q,P)
- **KL Divergence**: Asymmetric - KL(P||Q) ≠ KL(Q||P) in general
- **Wasserstein**: Symmetric - W(P,Q) = W(Q,P)
**Theoretical Properties:**
- **MMD**: Depends on kernel choice; with characteristic kernels (like Gaussian), it's a proper metric
- **KL Divergence**: Information-theoretic measure; infinite when distributions have non-overlapping supports
- **Wasserstein**: True metric that considers geometric structure of the space
**Practical Applications:**
- **MMD**: Popular in generative modeling (e.g., MMD GANs), two-sample testing (a permutation-test sketch follows at the end of this section), domain adaptation
- **KL Divergence**: Variational inference, model selection, information theory
- **Wasserstein**: Generative models (Wasserstein GANs), optimal transport problems
**Sensitivity:**
- **MMD**: Less sensitive to outliers due to kernel averaging
- **KL Divergence**: Very sensitive to tail behavior and zero-probability regions
- **Wasserstein**: Accounts for the geometric distance between points, so it remains informative even when the distributions have little or no overlap in support
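
As a usage illustration of the two-sample-testing application mentioned above, here is a minimal permutation-test sketch built on the `mmd2_biased` function from the earlier code block; the permutation count, significance interpretation, and reuse of the toy data `X`, `Y` are assumptions made for illustration only.

```python
# (assumes numpy as np, mmd2_biased, X, and Y from the sketch above are in scope)

def mmd_permutation_test(X, Y, sigma=1.0, n_permutations=500, rng=None):
    """Permutation two-sample test: p-value for H0 that X and Y come from the same distribution."""
    rng = np.random.default_rng() if rng is None else rng
    observed = mmd2_biased(X, Y, sigma)
    pooled = np.concatenate([X, Y], axis=0)
    n = len(X)
    count = 0
    for _ in range(n_permutations):
        # Randomly reassign pooled points to the two groups and recompute MMD^2
        perm = rng.permutation(len(pooled))
        X_perm, Y_perm = pooled[perm[:n]], pooled[perm[n:]]
        if mmd2_biased(X_perm, Y_perm, sigma) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive
    return (count + 1) / (n_permutations + 1)

# Small p-value when the distributions differ; roughly uniform under H0
print(mmd_permutation_test(X, Y, sigma=1.0))
```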