# Generative Adversarial Networks
Generative - You can sample novel data points, e.g., literally "create" images that never existed.
Adversarial - Our generative model $G$ learns adversarially, by fooling a discriminative oracle model $D$.
Network - Typically implemented as a (deep) neural network, which makes it easy to incorporate new modules and to learn via backpropagation.
## Architecture
A GAN comprises two neural networks:
Generator network $x=G\left(z ; \theta_{G}\right)$
Discriminator network $y=D\left(x ; \theta_{D}\right)=\begin{cases}+1, & \text{if } x \text{ is predicted 'real'} \\ 0, & \text{if } x \text{ is predicted 'fake'}\end{cases}$
![[gan-arch.jpg]]
Note: there is no 'encoder', so we cannot learn a representation for an image $x$, nor compute the likelihood of a specific data point. At test time we can only generate new data points.
### Generator network
$
x=G\left(z ; \theta_{G}\right)
$
- Can be any differentiable neural network
- No invertibility requirement, allowing more flexible modelling
- Trainable for any size of $z$
- Various density functions for the noise variable $z$
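A minimal sketch of such a generator (PyTorch; the MLP architecture, layer widths and data dimensions are illustrative assumptions, not a prescribed design):

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """x = G(z; theta_G): maps a noise vector z to a (flattened) image."""
    def __init__(self, z_dim: int = 64, x_dim: int = 784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 128), nn.ReLU(),
            nn.Linear(128, x_dim), nn.Tanh(),  # pixel values scaled to [-1, 1]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

# Sampling: draw z from a simple prior (e.g. a standard Gaussian) and push it through G.
z = torch.randn(16, 64)
x_fake = Generator()(z)  # shape (16, 784)
```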
### Discriminator network
$
\boldsymbol{y}=D\left(\boldsymbol{x} ; \boldsymbol{\theta}_{\mathrm{D}}\right)
$
- Can be any differentiable neural network
- Receives as inputs either real images from the training set or generated images from the generator, usually a mix of both in mini-batches
- The discriminator must recognize the real from the fake inputs
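A matching sketch of the discriminator (PyTorch; again, the architecture and dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """y = D(x; theta_D) in (0, 1): close to 1 for 'real', close to 0 for 'fake'."""
    def __init__(self, x_dim: int = 784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# A mini-batch usually mixes real images and generated ones.
x = torch.randn(16, 784)     # stand-in for a batch of real or generated images
scores = Discriminator()(x)  # shape (16, 1): predicted probability of 'real'
```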
### Pipeline
![[gan-pipeline.jpg]]
## Learning objectives
- Not obvious how to use [[Maximum Likelihood Estimation]]
- If we take the output of the generator, how do we train the discriminator?
- Even then, how do we know if a completely new $x$ is likely or not? Remember, we have no encoder, so no target to compare against.
| Symbol | Meaning | Notes |
| --- | --- | --- |
| $p_{z}$ | Data distribution over noise input $z$ | Usually just uniform. |
| $p_{g}$ | The generator's distribution over data $x$ | |
| $p_{r}$ | Data distribution over real samples $x$ | |
### Minimax Game
For the simple case of a zero-sum game,
$
J_{G}=-J_{D}
$
The lower the generator loss, the higher the discriminator loss; the two definitions are symmetric.
Our learning objective then becomes
$
V=-J_{D}\left(\boldsymbol{\theta}_{\mathrm{D}}, \boldsymbol{\theta}_{\mathrm{G}}\right)
$
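Writing $-J_{D}$ out with the cross-entropy discriminator loss used below (and dropping the constant $\frac{1}{2}$ factors) gives the familiar value function that $D$ maximizes and $G$ minimizes:
$
V\left(\boldsymbol{\theta}_{\mathrm{D}}, \boldsymbol{\theta}_{\mathrm{G}}\right)=\mathbb{E}_{x \sim p_{r}}[\log D(x)]+\mathbb{E}_{z \sim p_{z}}[\log (1-D(G(z)))]
$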
$D(x)=1$ -> The discriminator believes that $x$ is a true image
$D(G(z))=1$ -> The discriminator believes that $G(z)$ is a true image
Learning stops after a while: as training iterations increase, the discriminator improves and $\frac{d J_{D}}{d \theta_{\mathrm{D}}} \rightarrow 0$. The generator, which sits before the discriminator in the computation graph, then receives vanishing gradients.
- Equilibrium is a saddle point of the discriminator loss
- Final loss resembles the Jensen-Shannon divergence
- This allows for easier theoretical analysis
### Heuristic non-saturating game
This is the most widely used objective.
Discriminator loss
$
J_{D}=-\frac{1}{2} \mathbb{E}_{x \sim p_{\text {data}}} \log D(x)-\frac{1}{2} \mathbb{E}_{z \sim p_{z}} \log (1-D(G(z)))
$
Generator loss
$
J_{G}=-\frac{1}{2} \mathbb{E}_{z \sim p_{z}} \log D(G(z))
$
The equilibrium is no longer describable by a single loss
- The discriminator maximizes the log-likelihood of correctly identifying real samples, $\log D(x)$, and fake samples, $\log (1-D(G(z)))$
- The generator maximizes the log-likelihood $\log D(G(z))$ of the discriminator being wrong; it doesn't care whether $D$ gets confused on real samples.
Heuristically motivated; generator can still learn even when discriminator successfully rejects all generator samples.
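To see why the generator keeps learning, write $D=\sigma(a)$, where $a$ is the discriminator's pre-sigmoid logit on a generated sample. Then
$
\frac{\partial}{\partial a} \log (1-\sigma(a))=-\sigma(a), \qquad \frac{\partial}{\partial a} \log \sigma(a)=1-\sigma(a)
$
When the discriminator confidently rejects fakes, $D(G(z))=\sigma(a) \approx 0$: the minimax generator gradient $-\sigma(a)$ vanishes, while the non-saturating gradient $1-\sigma(a) \approx 1$ stays large.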
$
\min _{G} \max _{D} V(D, G)=\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)]+\mathbb{E}_{z \sim p_{z}(z)}[\log (1-D(G(z)))]
$
There are two terms in the above GAN training objective, and the discriminator maximizes both: the first term is the log-probability of classifying real data as real, the second the log-probability of classifying generated data as fake.
The generator, on the other hand, minimizes the log-probability of the discriminator being correct.
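A minimal single-step sketch of this training loop (PyTorch; the architectures, sizes and hyperparameters are illustrative assumptions, and random tensors stand in for a real mini-batch):

```python
import torch
import torch.nn as nn

# Toy generator and discriminator (assumed MLPs on flattened 28x28 images).
z_dim, batch_size = 64, 32
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

x_real = torch.randn(batch_size, 784)  # stand-in for a batch of real images
ones, zeros = torch.ones(batch_size, 1), torch.zeros(batch_size, 1)

# Discriminator step: minimize J_D = -1/2 [log D(x) + log(1 - D(G(z)))]
z = torch.randn(batch_size, z_dim)
x_fake = G(z).detach()  # do not backpropagate into G on this step
loss_D = 0.5 * (bce(D(x_real), ones) + bce(D(x_fake), zeros))
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# Generator step (non-saturating): minimize J_G = -1/2 log D(G(z))
z = torch.randn(batch_size, z_dim)
loss_G = 0.5 * bce(D(G(z)), ones)  # BCE against 'real' labels gives -log D(G(z))
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```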
![[gan-schematic.jpeg]]
[Image Credit](https://jonathan-hui.medium.com/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b)
### Maximum likelihood cost
We can modify the game for maximum likelihood by keeping the discriminator loss the same as above and exponentiating the inverse sigmoid (logit) of the discriminator's output in the generator cost:
$
J_{G}=-\frac{1}{2} \mathbb{E}_{z \sim p_{z}} \exp \left(\sigma^{-1}(D(G(z)))\right)
$
In this case, when the discriminator is optimal ($\frac{d J_{D}}{d \theta_{D}} \rightarrow 0$), the generator gradient matches that of maximum likelihood.
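Since $\sigma^{-1}(D)=\log \frac{D}{1-D}$, this cost can also be read as a density-ratio estimate (using the optimal discriminator derived in the next section):
$
\exp \left(\sigma^{-1}(D(G(z)))\right)=\frac{D(G(z))}{1-D(G(z))} \approx \frac{p_{r}(G(z))}{p_{g}(G(z))} \quad \text{when } D \approx D^{*}
$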
### Comparison of generator losses
![[generator-losses.jpg]]
## Optimal discriminator
Optimal $D(x)$ for any $p_{\text {data}}(x)$ and $p_{\text {model}}(x)$ is always
$
D(x)=\frac{p_{\text {data}}(\boldsymbol{x})}{p_{\text {data}}(\boldsymbol{x})+p_{\text {model}}(\boldsymbol{x})}
$
Estimating this ratio with supervised learning (discriminator) is the key.
Why is this the optimal discriminator?
$L(D, G)=\int_{x} p_{r}(x) \log D(x)+p_{g}(x) \log (1-D(x)) d x$
- Maximize $L(D, G)$ w.r.t. $D$: set $\frac{d L}{d D}=0$ pointwise and ignore the integral (we sample over $x$); see the derivation below
- The function $x \rightarrow a \log x+b \log (1-x)$ attains its maximum on $[0,1]$ at $x=\frac{a}{a+b}$
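Setting the pointwise derivative to zero (with $a=p_{r}(x)$ and $b=p_{g}(x)$):
$
\frac{d}{d D}\left(a \log D+b \log (1-D)\right)=\frac{a}{D}-\frac{b}{1-D}=0 \Rightarrow D=\frac{a}{a+b}
$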
The optimal discriminator
$
D^{*}(x)=\frac{p_{r}(x)}{p_{r}(x)+p_{g}(x)}
$
And at optimality $p_{g}(\boldsymbol{x}) \rightarrow p_{r}(\boldsymbol{x})$, thus
$
\begin{aligned}
& D^{*}(\boldsymbol{x})=\frac{1}{2} \\
& L\left(G^{*}, D^{*}\right)=-2 \log 2
\end{aligned}
$
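The last line follows by plugging $D^{*}(x)=\frac{1}{2}$ back into $L(D, G)$:
$
L\left(G^{*}, D^{*}\right)=\int_{x} p_{r}(x) \log \frac{1}{2}+p_{g}(x) \log \frac{1}{2} \, d x=-\log 2-\log 2=-2 \log 2
$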
## GAN and Jensen-Shannon divergence
Expanding the [[Jensen–Shannon Divergence]] for the optimal discriminator $D^{*}(\boldsymbol{x})=\frac{p_{r}(\boldsymbol{x})}{p_{r}(\boldsymbol{x})+p_{g}(\boldsymbol{x})}$,
$
\begin{aligned}
D_{J S}\left(p_{r} \| p_{g}\right) &= \frac{1}{2} D_{K L}\left(p_{r} \Big\| \frac{p_{r}+p_{g}}{2}\right)+\frac{1}{2} D_{K L}\left(p_{g} \Big\| \frac{p_{r}+p_{g}}{2}\right) \\
&= \frac{1}{2}\left(\log 2+\int_{x} p_{r}(x) \log \frac{p_{r}(x)}{p_{r}(x)+p_{g}(x)} d x+\log 2+\int_{x} p_{g}(x) \log \frac{p_{g}(x)}{p_{r}(x)+p_{g}(x)} d x\right) \\
&= \frac{1}{2}\left(\log 4+L\left(G, D^{*}\right)\right)
\end{aligned}
$
So it is interesting to see that $L\left(G, D^{*}\right)=2 D_{J S}\left(p_{r} \| p_{g}\right)-2 \log 2$; at the optimum $L\left(G^{*}, D^{*}\right)=-2 \log 2$, which gives $D_{J S}\left(p_{r} \| p_{g}\right)=0$.
So GANs are optimizing a rescaled version of the JS divergence.
Some believe (Huszar, 2015) that one reason behind GANs’ big success is switching the loss function from asymmetric [[KL Divergence]] in traditional maximum-likelihood approach to symmetric [[Jensen–Shannon Divergence]]. How?
$D_{K L}\left(p(x) \| q^{*}(x)\right)$ -> the model must put high probability everywhere the data occurs
$D_{K L}\left(q^{*}(x) \| p(x)\right)$ -> the model must put low probability wherever the data does not occur
[[KL Divergence#Forward and backward KL|Backward KL]] is 'zero forcing': it makes the learned model "conservative", avoiding areas where $p(x)=0$.
![[kl-backward-forward.jpg]]
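A tiny numerical illustration of this asymmetry (assumed toy distributions: a "bimodal" $p$ plays the data, a unimodal $q$ the model):

```python
import numpy as np

def kl(a, b):
    """D_KL(a || b) for discrete distributions with matching support."""
    return float(np.sum(a * np.log(a / b)))

# p is "bimodal" data; q is a model that covers only the first mode.
p = np.array([0.49, 0.02, 0.49])
q = np.array([0.90, 0.09, 0.01])

print("Forward KL D_KL(p||q):", kl(p, q))  # ~1.58: q is punished hard for missing a mode of p
print("Reverse KL D_KL(q||p):", kl(q, p))  # ~0.64: the mode-dropping q is penalized much less
```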
## Other GAN cost functions
![[gan_cost functions.jpeg]]
---
## References
1. NeurIPS GAN Workshop, 2014
2. Lecture 10.2, UvA DL course 2020
3. Lilian Weng's post on GANs https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html#what-is-the-global-optimal
4. Why is it so hard to train GANs by Jonathan Hui https://jonathan-hui.medium.com/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b
5. Ways to improve GAN performance by Jonathan Hui https://towardsdatascience.com/gan-ways-to-improve-gan-performance-acf37f9f59b