# PixelRNN

Decompose the data likelihood of an $n \times n$ image:

$$p(x)=\prod_{i=1}^{n^{2}} p\left(x_{i} \mid x_{<i}\right)$$

Each pixel conditional corresponds to a triplet of colors -> further decompose per color with the same chain rule:

$$p\left(x_{i} \mid x_{<i}\right)=p\left(x_{i, R} \mid x_{<i}\right) \cdot p\left(x_{i, G} \mid x_{<i}, x_{i, R}\right) \cdot p\left(x_{i, B} \mid x_{<i}, x_{i, R}, x_{i, G}\right)$$

Model the conditionals $p\left(x_{i, R} \mid x_{<i}\right), \ldots$ with a 12-layer convolutional RNN. The MLP from [[Neural Autoregressive Density Estimation|NADE]] cannot easily scale to images, and its statistics are not shared across positions. Model each output as a categorical distribution via a 256-way softmax (one class per intensity level); see the sampling sketch at the end of this note.

PixelRNN uses two variants of the [[LSTM]]. Why not a regular LSTM? It requires sequential, pixel-wise computation, which means less parallelization and slower training. With the Row LSTM and the Diagonal BiLSTM we process one row at a time, so parallelization within a row is possible.

![[pixelrnn-lstms 1.jpg]]

## Row LSTM

Row LSTM with a 'causal' triangular receptive field:

- Per new pixel in row $i$, use a 1-d convolution (size 3) to aggregate pixels from the row above ($i-1$).
- The effective receptive field spans a triangle above the pixel.
- The convolution covers only 'past' pixels (row $i-1$), not 'future' pixels -> causal.
- Loses some context because of the triangular shape of the receptive field.

(A minimal Row LSTM sketch is at the end of this note.)

## Diagonal BiLSTM

Proposed to address the lost context. Key idea: have two LSTMs moving along opposite diagonals.

For the first diagonal, the convolved 'past' of pixel $(i, j)$ is $(i-1, j)$ and $(i, j-1)$. By combining the two LSTMs, the entirety of the past context is captured recursively.

## Architecture

- Use 12 layers of LSTMs.
- Add residual connections to speed up learning.
- Good modelling of $p(x)$ and nice image generation, but training and generation are slow because of the LSTMs.

![[pixelrnn-layers.jpg]]

## Generations

No collapse to a single mode: lots of variation in the completions of the same occluded images.

![[pixelrnn-generation.jpg]]

## PixelCNN

Replace the LSTMs with a fully convolutional network of 15 layers, with no pooling layers, to preserve spatial resolution. Use masks to hide future pixels in the convolutions; otherwise 'access to the future' would break the autoregressive property (see the masked-convolution sketch at the end of this note). Training is faster because no recurrent steps are required -> better parallelization. But pixel generation is still sequential and thus slow.

### Advantages and disadvantages

- Faster training.
- Performance is worse than PixelRNN because context is discarded.
- The cascaded convolutions create a 'blind spot'; Gated PixelCNN fixes this.
- No latent space.
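## Code sketches

Generation follows directly from the factorization above: each pixel is sampled from its 256-way softmax given everything generated so far, which is why it is slow. A minimal sketch, assuming PyTorch, a greyscale image for simplicity, and a hypothetical `model` that maps an image batch to per-pixel 256-way logits of shape `(B, 256, H, W)`:

```python
import torch

@torch.no_grad()
def sample(model, batch=1, h=32, w=32):
    """Sample images pixel by pixel in raster-scan order."""
    x = torch.zeros(batch, 1, h, w)               # start from an empty canvas
    for i in range(h):                            # one full forward pass per pixel:
        for j in range(w):                        # h * w passes in total -> slow
            logits = model(x)                     # (B, 256, H, W)
            probs = torch.softmax(logits[:, :, i, j], dim=-1)
            pixel = torch.multinomial(probs, 1)   # draw an intensity in {0, ..., 255}
            x[:, 0, i, j] = pixel.squeeze(-1).float() / 255.0
    return x
```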
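The Row LSTM processes rows sequentially but all pixels within a row in parallel. A rough sketch, assuming PyTorch and following this note's simplified description (the input-to-state convolution reads the row above; the paper's exact masking scheme is omitted); `RowLSTM` and its parameters are illustrative names:

```python
import torch
import torch.nn as nn

class RowLSTM(nn.Module):
    """One Row LSTM layer: size-3 1-d convolutions applied along each row."""
    def __init__(self, in_ch, hidden):
        super().__init__()
        self.in_ch, self.hidden = in_ch, hidden
        # input-to-state: aggregates a row of the input with a size-3 conv
        self.i2s = nn.Conv1d(in_ch, 4 * hidden, kernel_size=3, padding=1)
        # state-to-state: aggregates the previous row's hidden state
        self.s2s = nn.Conv1d(hidden, 4 * hidden, kernel_size=3, padding=1)

    def forward(self, x):                          # x: (B, C, H, W)
        B, _, H, W = x.shape
        h = x.new_zeros(B, self.hidden, W)
        c = x.new_zeros(B, self.hidden, W)
        rows = []
        for r in range(H):                         # rows are sequential ...
            above = x[:, :, r - 1] if r > 0 else x.new_zeros(B, self.in_ch, W)
            gates = self.i2s(above) + self.s2s(h)  # ... but each row is one parallel conv
            i, f, o, g = gates.chunk(4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            rows.append(h)
        return torch.stack(rows, dim=2)            # (B, hidden, H, W)
```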
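PixelCNN's masking of future pixels can be implemented by zeroing part of the convolution kernel before applying it. A minimal sketch, assuming PyTorch; it ignores the R/G/B ordering within a pixel, which the full model also masks. Mask type 'A' (first layer) also hides the centre pixel, type 'B' (later layers) keeps it:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed so each output pixel sees only 'past' pixels."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        # hide pixels to the right of the centre (and the centre itself for type 'A')
        mask[kh // 2, kw // 2 + (mask_type == "B"):] = 0
        # hide all rows below the centre row
        mask[kh // 2 + 1:, :] = 0
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        # mask the weights, then run a standard convolution
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)


# usage sketch: the first layer uses mask 'A', later layers use mask 'B'
layer = MaskedConv2d("A", in_channels=1, out_channels=64, kernel_size=7, padding=3)
```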