# Challenges of GAN

## Vanishing Gradients

$$
\begin{aligned}
J_{D} &= -\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}} \log D(x) - \frac{1}{2} \mathbb{E}_{z} \log\big(1 - D(G(z))\big) \\
J_{G} &= -\frac{1}{2} \mathbb{E}_{z} \log D(G(z))
\end{aligned}
$$

- If the discriminator is very bad, the generator gets confused and receives no reasonable gradients.
- If the discriminator is perfect, the gradients go to 0, so there is no learning anymore.

This is especially unfortunate early in training, and it happens easily because training the discriminator to classify is much easier than training the generator to produce good samples.

## Low-dimensional supports

Data lie on low-dimensional manifolds, but the manifold is not known. During training $p_{g}$ is not perfect either, especially at the start. So the supports of $p_{r}$ and $p_{g}$ are disjoint, which is bad for KL/JS divergences: it becomes very easy for the discriminator to find a separating decision boundary. A possible solution is a distance measure that stays meaningful even when the supports do not overlap, as in [[Wassertein GAN]].

## Batch Normalization does not work right away

[[Normalization#Batch normalization]] causes strong intra-batch correlation: with batchnorm, the activations for one input depend on the other inputs through the batch mean and standard deviation. This means the generations also depend on the other inputs in the batch, and as a result they look smooth but wonky.

### Reference batch normalization

Train with two mini-batches:

- A fixed reference mini-batch is used to compute $\mu_{bn}^{\text{ref}}, \sigma_{bn}^{\text{ref}}$.
- A second mini-batch $x_{\text{batch}}$ is used for training.
- Training is otherwise the same, but $\mu_{bn}^{\text{ref}}, \sigma_{bn}^{\text{ref}}$ are used to normalize $x_{\text{batch}}$.

Problem: overfitting to the reference mini-batch.

### Virtual batch normalization

- Append the reference batch to the regular mini-batch, so the normalization statistics also include the reference examples.
- GPU memory is a potential issue.

## Balancing generator and discriminator

Usually the discriminator wins. This is fine, as the theoretical justification assumes a perfect discriminator. Usually the discriminator network is bigger than the generator, and sometimes running the discriminator more often than the generator works better; however, there is no real consensus. Do not limit the discriminator to keep it from becoming "too smart": making learning easier will not necessarily make generation better. It is better to use theoretically sound techniques such as the non-saturating cost and label smoothing.

## Convergence

Optimization is tricky and unstable: GAN training looks for a saddle point, and finding a saddle point does not imply a global minimum. A saddle point is also sensitive to disturbances, and an equilibrium might not even be reached. Mode collapse is the most severe form of non-convergence.

## Mode collapse

- The discriminator converges to the correct distribution.
- The generator, however, places all its mass on the most likely point.
- All other modes are ignored.
- The variance is underestimated, so there is low diversity in the generated samples.

![[gan-mode-collapse.png]]

The generator learns to produce only a particular output that was good at fooling D. D then learns to always reject that output and may overfit to the current degenerate G, which we observe as an increase in the loss. The next iteration of G can then easily find a new output that the overfitted D accepts, which we observe as a decrease in the loss. This cycle repeats.

### Minibatch features

A solution is to classify each sample by comparing it to the other examples in the mini-batch. If the samples are too similar to each other, the collapsed mode is penalized.
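A rough sketch of this idea, not from the source: it uses the simpler minibatch standard-deviation statistic rather than the full minibatch-discrimination layer of Salimans et al. (2016), and the module and variable names are illustrative. The across-batch standard deviation of the features is appended as one extra input to the discriminator, so a collapsed generator, whose samples are nearly identical, produces a tell-tale value near zero.

```python
import torch
import torch.nn as nn

class MinibatchStdDev(nn.Module):
    """Append the mean across-batch standard deviation as one extra feature.

    If the generator collapses, its samples (and hence their features) become
    nearly identical, the per-feature std across the batch shrinks towards
    zero, and the discriminator can use this signal to penalize the mode.
    """

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch_size, num_features)
        std = features.std(dim=0)                 # std of each feature over the batch
        stat = std.mean().view(1, 1)              # single scalar summary
        stat = stat.expand(features.size(0), 1)   # repeat it for every sample
        return torch.cat([features, stat], dim=1) # (batch_size, num_features + 1)


# Usage: place the layer right before the discriminator's final classification layer.
feats = torch.randn(64, 128)        # stand-in for penultimate discriminator features
out = MinibatchStdDev()(feats)      # shape: (64, 129)
```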
## Lack of a proper evaluation metric

Despite the nice images, how do we evaluate them? It would be nice to evaluate the model quantitatively, but for GANs it is hard to even estimate the likelihood. In the absence of a precise evaluation metric, do GANs produce truly good generations, or generations that merely appeal to (and fool) the human eye? Can we trust the generations for critical applications, like medical tasks? Are humans a good discriminator for the converged generator?

## Beyond images

The generator must be differentiable. This rules out tasks with discrete outputs (like text); modifications are necessary to flow gradients through discrete variables. We can overcome this by letting G output the parameters of a distribution over the discrete values, sampling some $\epsilon$, and using the [[Stochastic gradients#Pathwise gradient estimator]] (or reparameterization trick), so that the sampled value is differentiable with respect to the output parameters. Additionally, we might have to use [[Normalizing Flows#Variational Dequantization]] to obtain a differentiable distribution. Other types of structured data, like graphs, are also an open problem.

## Improvement: Feature matching

Instead of training the generator directly on image samples, train it to match features of the data:

$$
J_{G}=\left\| \mathbb{E}_{x \sim p_{\text{data}}} f(x) - \mathbb{E}_{z \sim p(z)} f(G(z)) \right\|_{2}^{2}
$$

$f$ can be any statistic of the data, like the mean or the median.

## Use labels if possible

- Learning a conditional model $p(y \mid x)$ often generates better samples (Denton et al., 2015).
- Even learning $p(x, y)$ makes samples look more realistic (Salimans et al., 2016).
- Conditional GANs are a great addition for learning with labels.

## One-sided label smoothing

One-sided label smoothing:

`cross_entropy(0.9, discriminator(data)) + cross_entropy(0., discriminator(samples))`

Do not smooth the negative labels, i.e. keep `beta = 0` in:

`cross_entropy(1. - alpha, discriminator(data)) + cross_entropy(beta, discriminator(samples))`

Benefits:

- Maximum likelihood is often overconfident: it might return accurate predictions, but with too-high probabilities.
- Good regularizer (Szegedy et al., 2015).
- Does not reduce classification accuracy, only confidence.
- Specifically for GANs:
    - Prevents the discriminator from giving very large gradient signals to the generator.
    - Prevents extrapolating to encourage extreme samples.
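A minimal PyTorch sketch of a discriminator loss with one-sided label smoothing, matching the pseudocode above (the names `disc`, `real_images`, and `fake_images` are placeholders, and `disc` is assumed to output raw logits): real targets are smoothed to 0.9 while fake targets stay at exactly 0.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real_images, fake_images, smooth=0.9):
    """BCE discriminator loss with one-sided label smoothing.

    Real targets are smoothed to `smooth` (e.g. 0.9); fake targets stay at
    exactly 0, since smoothing the negative labels is what we want to avoid.
    """
    real_logits = disc(real_images)            # raw scores, shape (batch, 1)
    fake_logits = disc(fake_images.detach())   # detach: this loss only updates D

    real_targets = torch.full_like(real_logits, smooth)  # smoothed positive labels
    fake_targets = torch.zeros_like(fake_logits)         # unsmoothed negative labels

    loss_real = F.binary_cross_entropy_with_logits(real_logits, real_targets)
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, fake_targets)
    return loss_real + loss_fake
```

The generator loss is left untouched; only the discriminator's positive targets are smoothed.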