# PixelRNN
Decompose the data likelihood of an $n \times n$ image (pixels taken in raster-scan order) as $p(x)=\prod_{i=1}^{n^{2}} p\left(x_{i} \mid x_{<i}\right)$
Each pixel is a triplet of color values (R, G, B) -> further decompose each pixel conditional per color channel (same chain rule as above):
$
p\left(x_{i} \mid x_{<i}\right)=p\left(x_{i, R} \mid x_{<i}\right) \cdot p\left(x_{i, G} \mid x_{<i}, x_{i, R}\right) \cdot p\left(x_{i, B} \mid x_{<i}, x_{i, R}, x_{i, G}\right)
$
Model the conditionals $p\left(x_{i, R} \mid x_{<i}\right), \ldots$ with a 12-layer convolutional RNN. The MLP from [[Neural Autoregressive Density Estimation|NADE]] does not scale easily to images and does not share statistics across pixel positions.
Model each output as a categorical distribution over the 256 intensity values, i.e. a 256-way softmax per color channel.
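A minimal sketch (PyTorch; the tensor layout is an assumption, not the paper's code) of the 256-way softmax output: per pixel and per channel the network emits 256 logits, and training minimizes the negative log-likelihood with cross-entropy.

```python
import torch
import torch.nn.functional as F

def pixel_nll(logits, targets):
    """logits: (B, 3*256, H, W) raw scores; targets: (B, 3, H, W) integer values in [0, 255]."""
    B, _, H, W = logits.shape
    logits = logits.view(B, 3, 256, H, W).permute(0, 2, 1, 3, 4)  # (B, 256, 3, H, W)
    # sum of per-pixel, per-channel categorical NLL (in nats), averaged over the batch
    return F.cross_entropy(logits, targets, reduction='sum') / B
```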
PixelRNN uses two variants of the [[LSTM]]. Why not a regular LSTM? It requires sequential, pixel-wise computation, which means less parallelization and slower training.
With the Row LSTM and the Diagonal BiLSTM, a whole row (or diagonal) is processed at a time, so the computation within it can be parallelized.
![[pixelrnn-lstms 1.jpg]]
## Row LSTM
Row LSTM with 'causal' triangular receptive field.
- Per new pixel (row $i$), use a 1-D convolution (kernel size 3) to aggregate the pixels of the row above ($i-1$); see the sketch after this list
- The effective receptive field spans a triangle above the pixel
- The convolution only touches 'past' pixels (row $i-1$), not 'future' pixels -> causal
- Loses some context because of the triangular shape of the receptive field
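A rough sketch of one Row LSTM step (PyTorch; the class name, shapes, and simplifications are assumptions). Both the input-to-state and state-to-state components are size-3 1-D convolutions along the width, so the gates for every pixel in row $i$ are computed in parallel from row $i-1$'s state; the masking the paper applies to the input-to-state convolution of the first layer is omitted here.

```python
import torch
import torch.nn as nn

class RowLSTMStep(nn.Module):
    """Sketch: one LSTM step processes a whole image row at once."""
    def __init__(self, in_ch, hidden):
        super().__init__()
        # 4 * hidden channels -> input, forget, output gates and candidate cell
        self.input_to_state = nn.Conv1d(in_ch, 4 * hidden, kernel_size=3, padding=1)
        self.state_to_state = nn.Conv1d(hidden, 4 * hidden, kernel_size=3, padding=1)

    def forward(self, x_row, h_prev, c_prev):
        # x_row: (B, in_ch, W); h_prev, c_prev: (B, hidden, W) from the row above
        gates = self.input_to_state(x_row) + self.state_to_state(h_prev)
        i, f, o, g = gates.chunk(4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```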
## Diagonal BiLSTM
Proposed to address the lost context. Key idea: run two LSTMs moving along opposite diagonals.
First diagonal: the convolutional 'past' of pixel $(i, j)$ is $(i-1, j)$ and $(i, j-1)$
By combining the two LSTMs, the recursion captures the entirety of the past context.
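The Diagonal BiLSTM is commonly implemented with a 'skew' trick: offset row $r$ by $r$ positions so that diagonals become columns, run a column-wise LSTM, then undo the offset. A minimal sketch (function names are my own):

```python
import torch

def skew(x):
    """(B, C, H, W) -> (B, C, H, W + H - 1): row r shifted right by r, so diagonals align as columns."""
    B, C, H, W = x.shape
    out = x.new_zeros(B, C, H, W + H - 1)
    for r in range(H):
        out[:, :, r, r:r + W] = x[:, :, r, :]
    return out

def unskew(x, W):
    """Inverse of skew: pick the shifted window back out of each row."""
    return torch.stack([x[:, :, r, r:r + W] for r in range(x.shape[2])], dim=2)
```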
## Architecture
- Use 12 layers of LSTMs
- Add residual connections to speed up learning (sketch below the figure)
- Models $p(x)$ well and generates nice images, but training and generation are slow because of the LSTMs
![[pixelrnn-layers.jpg]]
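A minimal sketch of the residual connection around each layer (assuming the wrapped layer preserves the feature-map shape):

```python
import torch.nn as nn

class Residual(nn.Module):
    """Sketch: add the layer's input back to its output so gradients can skip the layer."""
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, x):
        return x + self.layer(x)
```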
## Generations
No collapse to a single mode: lots of variation when completing the same occluded images.
![[pixelrnn-generation.jpg]]
## PixelCNN
Replace the LSTMs with a fully convolutional network of 15 layers; no pooling layers, to preserve spatial resolution.
Use masks to hide future pixels in the convolutions; otherwise 'access to the future' would break the autoregressive property.
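A minimal sketch of such a masked convolution (PyTorch; the per-channel R/G/B masking is left out for brevity). Mask type 'A' (first layer) also hides the current pixel; type 'B' (later layers) allows it.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Sketch: zero the kernel weights that would read 'future' pixels in raster order."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        _, _, kH, kW = self.weight.shape
        mask = torch.ones_like(self.weight)
        mask[:, :, kH // 2, kW // 2 + (mask_type == 'B'):] = 0  # right of centre in the centre row
        mask[:, :, kH // 2 + 1:, :] = 0                         # everything below the centre row
        self.register_buffer('mask', mask)

    def forward(self, x):
        self.weight.data *= self.mask  # re-apply before every pass so optimizer updates stay masked
        return super().forward(x)
```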
Faster training since no recurrent steps are required -> better parallelization. But pixel generation is still sequential and thus slow.
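Generation stays sequential because each pixel's distribution depends on the pixels already sampled. A sketch of the sampling loop (it assumes the model returns per-channel 256-way logits of shape (B, 3, 256, H, W) and takes inputs scaled to [0, 1]):

```python
import torch

@torch.no_grad()
def sample(model, B, H, W):
    x = torch.zeros(B, 3, H, W)
    for i in range(H):
        for j in range(W):
            for c in range(3):                    # R, then G, then B
                logits = model(x)                 # assumed shape (B, 3, 256, H, W)
                probs = logits[:, c, :, i, j].softmax(dim=-1)
                x[:, c, i, j] = torch.multinomial(probs, 1).squeeze(-1) / 255.0
    return x
```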
### Advantages and disadvantages
- Faster training
- Performance is worse than PixelRNN as context is discarded
- The cascaded convolutions create a 'blind spot', use Gated PixelCNN to fix
- No latent space
---
## References