# Generative Adversarial Networks

- Generative: we can sample novel inputs, e.g. literally "create" images that never existed.
- Adversarial: our generative model $G$ learns adversarially, by fooling a discriminative oracle model $D$.
- Network: typically implemented as a (deep) neural network, making it easy to incorporate new modules and easy to learn via backpropagation.

## Architecture

The GAN comprises two neural networks.

Generator network:

$$
x = G(z; \theta_G)
$$

Discriminator network:

$$
y = D(x; \theta_D) =
\begin{cases}
1, & \text{if } x \text{ is predicted 'real'} \\
0, & \text{if } x \text{ is predicted 'fake'}
\end{cases}
$$

![[gan-arch.jpg]]

Note: there is no 'encoder'. We cannot learn a representation for an image $x$, and we cannot compute the likelihood of a specific data point. At test time we can only generate new data points.

### Generator network

$$
x = G(z; \theta_G)
$$

- Can be any differentiable neural network
- No invertibility requirement, allowing more flexible modelling
- Trainable for any size of $z$
- Various density functions possible for the noise variable $z$

### Discriminator network

$$
y = D(x; \theta_D)
$$

- Can be any differentiable neural network
- Receives as input either real images from the training set or generated images from the generator, usually a mix of both in mini-batches
- Must recognize the real from the fake inputs

### Pipeline

![[gan-pipeline.jpg]]

## Learning objectives

- It is not obvious how to use [[Maximum Likelihood Estimation]]
- If we take the output of the generator, how do we train the discriminator?
- Even then, how do we know if a completely new $x$ is likely or not? Remember, we have no encoder, so no target to compare against.

| Symbol | Meaning | Notes |
| --- | --- | --- |
| $p_z$ | Data distribution over the noise input $z$ | Usually just uniform |
| $p_g$ | The generator's distribution over data $x$ | |
| $p_r$ | Data distribution over real samples $x$ | |

### Minimax Game

For the simple case of a zero-sum game,

$$
J_G = -J_D
$$

The lower the generator loss, the higher the discriminator loss (the definitions are symmetric). Our learning objective then becomes the value

$$
V = -J_D(\theta_D, \theta_G)
$$

which the discriminator tries to maximize and the generator tries to minimize.

- $D(x) = 1$: the discriminator believes that $x$ is a true image
- $D(G(z)) = 1$: the discriminator believes that $G(z)$ is a true image

Learning stops after a while: as training iterations increase, the discriminator improves and $\frac{\partial J_D}{\partial \theta_D} \rightarrow 0$; the generator, which sits before the discriminator in the computational graph, then suffers from vanishing gradients.

- The equilibrium is a saddle point of the discriminator loss
- The final loss resembles the Jensen-Shannon divergence
- This allows for easier theoretical analysis

### Heuristic non-saturating game

This is the most widely used objective.

Discriminator loss:

$$
J_D = -\frac{1}{2} \mathbb{E}_{x \sim p_{\text{data}}} \log D(x) - \frac{1}{2} \mathbb{E}_{z \sim p_z} \log (1 - D(G(z)))
$$

Generator loss:

$$
J_G = -\frac{1}{2} \mathbb{E}_{z \sim p_z} \log D(G(z))
$$

The equilibrium is no longer describable by a single loss.

- The discriminator maximizes the log-likelihood of correctly classifying real samples, $\log D(x)$, and fake samples, $\log (1 - D(G(z)))$
- The generator maximizes the log-likelihood of the discriminator being wrong on generated samples, $\log D(G(z))$ (a code sketch follows below)
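As a concrete illustration of these two losses, here is a minimal single-training-step sketch, assuming PyTorch; the toy fully-connected networks, noise dimension, and learning rates are illustrative assumptions, not taken from the notes above.

```python
import torch
import torch.nn as nn

# Toy fully-connected generator and discriminator (illustrative only).
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())

opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()  # binary cross-entropy supplies the log-likelihood terms above


def train_step(x_real):
    batch = x_real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: J_D = -1/2 E[log D(x)] - 1/2 E[log(1 - D(G(z)))]
    z = torch.randn(batch, 64)   # noise z ~ p_z
    x_fake = G(z).detach()       # detach so this step does not update G
    loss_D = 0.5 * (bce(D(x_real), ones) + bce(D(x_fake), zeros))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator (non-saturating): J_G = -1/2 E[log D(G(z))]
    z = torch.randn(batch, 64)
    loss_G = 0.5 * bce(D(G(z)), ones)  # "fool D": push D(G(z)) towards 1
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

    return loss_D.item(), loss_G.item()
```

The only change relative to the minimax game is the generator's target: maximizing $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$ keeps the generator's gradients alive early in training, when the discriminator rejects fakes easily.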
The generator does not care whether $D$ gets confused on real samples. This objective is heuristically motivated: the generator can still learn even when the discriminator successfully rejects all generated samples.

$$
\min_G \max_D V(D, G) = \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log (1 - D(G(z)))]
$$

There are two terms in the above GAN training objective. The first term maximizes the log-probability of the discriminator classifying real data as real. The second term maximizes the log-probability of the discriminator classifying the generator's samples as fake. The generator, on the other hand, minimizes the log-probability of the discriminator being correct.

![[gan-schematic.jpeg]]
[Image Credit](https://jonathan-hui.medium.com/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b)

### Maximum likelihood cost

We can modify the game for maximum likelihood by keeping the discriminator loss the same as above and changing the generator loss to act on the logit $\sigma^{-1}(D(G(z)))$ of the discriminator output:

$$
J_G = -\frac{1}{2} \mathbb{E}_{z \sim p_z} \exp\left(\sigma^{-1}(D(G(z)))\right)
$$

In this case, when the discriminator is optimal ($\frac{\partial J_D}{\partial \theta_D} \rightarrow 0$), the generator gradient matches that of maximum likelihood.

### Comparison of generator losses

![[generator-losses.jpg]]

## Optimal discriminator

The optimal $D(x)$ for any $p_{\text{data}}(x)$ and $p_{\text{model}}(x)$ is always

$$
D(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}
$$

Estimating this ratio with supervised learning (the discriminator) is the key idea.

Why is this the optimal discriminator?

$$
L(D, G) = \int_x p_r(x) \log D(x) + p_g(x) \log (1 - D(x)) \, dx
$$

- Maximize $L(D, G)$ w.r.t. $D$: set $\frac{dL}{dD} = 0$ pointwise, ignoring the integral (optimize the integrand for each $x$)
- The function $y \mapsto a \log y + b \log (1 - y)$ attains its maximum on $[0, 1]$ at $y = \frac{a}{a+b}$

The optimal discriminator is therefore

$$
D^*(x) = \frac{p_r(x)}{p_r(x) + p_g(x)}
$$

When the generator is also optimal, $p_g(x) \rightarrow p_r(x)$, thus

$$
\begin{aligned}
& D^*(x) = \frac{1}{2} \\
& L(G^*, D^*) = -2 \log 2
\end{aligned}
$$

## GAN and Jensen-Shannon divergence

Expanding the [[Jensen–Shannon Divergence]] for the optimal discriminator $D^*(x) = \frac{p_r(x)}{p_r(x) + p_g(x)}$:

$$
\begin{aligned}
D_{JS}(p_r \| p_g) &= \frac{1}{2} D_{KL}\left(p_r \,\Big\|\, \frac{p_r + p_g}{2}\right) + \frac{1}{2} D_{KL}\left(p_g \,\Big\|\, \frac{p_r + p_g}{2}\right) \\
&= \frac{1}{2}\left(\log 2 + \int_x p_r(x) \log \frac{p_r(x)}{p_r(x) + p_g(x)} dx + \log 2 + \int_x p_g(x) \log \frac{p_g(x)}{p_r(x) + p_g(x)} dx\right) \\
&= \frac{1}{2}\left(\log 4 + L(G, D^*)\right)
\end{aligned}
$$

So it is interesting to see that $L(G, D^*) = 2 D_{JS}(p_r \| p_g) - 2 \log 2$, and at the optimum $L(G^*, D^*) = -2 \log 2 \Rightarrow D_{JS}(p_r \| p_g) = 0$. GANs are therefore optimizing a rescaled version of the JS divergence.
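As a quick numeric sanity check of this identity, one can evaluate $L(G, D^*)$ and $D_{JS}(p_r \| p_g)$ for a pair of toy discrete distributions; the distributions below are assumptions chosen only for illustration.

```python
import numpy as np

# Two assumed toy distributions over a small discrete support.
p_r = np.array([0.1, 0.4, 0.3, 0.2])  # "real" data distribution
p_g = np.array([0.3, 0.3, 0.2, 0.2])  # generator distribution

# Optimal discriminator D*(x) = p_r / (p_r + p_g)
d_star = p_r / (p_r + p_g)

# L(G, D*) = E_{p_r}[log D*] + E_{p_g}[log(1 - D*)]
L = np.sum(p_r * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))


def kl(p, q):
    return np.sum(p * np.log(p / q))


# Jensen-Shannon divergence via its two KL terms against the mixture m.
m = 0.5 * (p_r + p_g)
jsd = 0.5 * kl(p_r, m) + 0.5 * kl(p_g, m)

print(np.isclose(L, 2.0 * jsd - 2.0 * np.log(2.0)))  # True: L(G, D*) = 2 D_JS - 2 log 2
print(np.isclose(L, -2.0 * np.log(2.0)))             # False here, since p_g != p_r
```

Only when $p_g = p_r$ does the value collapse to $-2 \log 2$, with $D_{JS} = 0$.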
Some believe (Huszár, 2015) that one reason behind GANs' big success is switching the loss function from the asymmetric [[KL Divergence]] of the traditional maximum-likelihood approach to the symmetric [[Jensen–Shannon Divergence]]. How?

- $D_{KL}\left(p(x) \| q^*(x)\right)$ (forward KL): the learned model must put high probability everywhere the data occurs
- $D_{KL}\left(q^*(x) \| p(x)\right)$ (backward KL): the learned model must put low probability wherever the data does not occur

[[KL Divergence#Forward and backward KL|Backward KL]] is 'zero forcing': it makes the learned model "conservative" and avoids areas where $p(x) = 0$.

![[kl-backward-forward.jpg]]

## Other GAN cost functions

![[gan_cost functions.jpeg]]

---

## References

1. NeurIPS GAN Workshop, 2014
2. Lecture 10.2, UvA Deep Learning course, 2020
3. Lilian Weng's post on GANs: https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html#what-is-the-global-optimal
4. Why is it so hard to train GANs, by Jonathan Hui: https://jonathan-hui.medium.com/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b
5. Ways to improve GAN performance, by Jonathan Hui: https://towardsdatascience.com/gan-ways-to-improve-gan-performance-acf37f9f59b