# Autoregressive Models
Data often comes with a natural order, or we can impose one. From a generation point of view, the dimensions of a data point depend on each other. Autoregressive models are generative models without latent variables; instead, they assume an order over the data dimensions. The likelihood is the product of conditionals $p(x)=\prod_{k=1}^{D} p\left(x_{k} \mid x_{<k}\right)$.
How are autoregressive models different from sequence models like RNNs?
Autoregressive models typically have separate weights for each position in the sequence, so they operate on fixed-length inputs, while RNNs share the same weights across all time steps, which lets them handle arbitrarily long sequences. Moreover, autoregressive models are generative by construction, whereas RNNs compress all previous inputs into a single hidden state and are not necessarily generative.
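To make the weight-sharing contrast concrete, here is a minimal sketch (in PyTorch, purely for illustration): a fixed-length autoregressive model with a separate weight vector for every position, next to an RNN that reuses a single cell at every time step. The class names and the choice of binary Bernoulli conditionals are assumptions for the example, not taken from any specific reference.

```python
import torch
import torch.nn as nn

class FixedLengthAR(nn.Module):
    """One set of weights per position: models p(x_k | x_<k) for binary x of length D."""
    def __init__(self, D):
        super().__init__()
        # weights[k] only sees the k previous dimensions (position-specific parameters)
        self.weights = nn.ParameterList([nn.Parameter(torch.zeros(k)) for k in range(D)])
        self.biases = nn.Parameter(torch.zeros(D))

    def conditionals(self, x):
        # x: (batch, D) binary; returns the Bernoulli mean of each conditional
        probs = []
        for k in range(x.shape[1]):
            logits = x[:, :k] @ self.weights[k] + self.biases[k]
            probs.append(torch.sigmoid(logits))
        return torch.stack(probs, dim=1)

class SharedWeightRNN(nn.Module):
    """The same RNN cell (same weights) is applied at every time step."""
    def __init__(self, hidden=16):
        super().__init__()
        self.cell = nn.RNNCell(input_size=1, hidden_size=hidden)
        self.readout = nn.Linear(hidden, 1)

    def conditionals(self, x):
        h = torch.zeros(x.shape[0], self.cell.hidden_size)
        probs = []
        for k in range(x.shape[1]):           # works for any sequence length
            probs.append(torch.sigmoid(self.readout(h)).squeeze(-1))
            h = self.cell(x[:, k:k+1], h)     # reuse the same cell parameters
        return torch.stack(probs, dim=1)
```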
## Decomposing likelihood of sequential data
If $x=\left[x_{1}, x_{2}, \ldots, x_{d}\right]$ is sequential, $p(x)$ decomposes via the chain rule of probability:
$
p(x)=p\left(x_{1}\right) \cdot p\left(x_{2} \mid x_{1}\right) \cdot p\left(x_{3} \mid x_{1}, x_{2}\right) \cdot \ldots \cdot p\left(x_{d} \mid x_{1}, \ldots, x_{d-1}\right)=\prod_{i=1}^{d} p\left(x_{i} \mid x_{<i}\right)
$
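As a concrete instance of this decomposition, the toy sketch below evaluates $\log p(x)$ for a three-dimensional binary $x$ as a sum of hand-coded log conditionals; all the probability values are invented purely for illustration.

```python
import numpy as np

# Toy example: x = [x1, x2, x3] with binary entries.
# Each conditional p(x_i | x_<i) is a Bernoulli whose mean depends on the prefix.
def p_x1(x1):
    return 0.7 if x1 == 1 else 0.3

def p_x2_given_x1(x2, x1):
    mean = 0.9 if x1 == 1 else 0.2            # made-up conditional means
    return mean if x2 == 1 else 1.0 - mean

def p_x3_given_x12(x3, x1, x2):
    mean = 0.5 + 0.2 * x1 - 0.1 * x2
    return mean if x3 == 1 else 1.0 - mean

x = [1, 0, 1]
log_p = (np.log(p_x1(x[0]))
         + np.log(p_x2_given_x1(x[1], x[0]))
         + np.log(p_x3_given_x12(x[2], x[0], x[1])))
print(log_p)   # log p(x) = sum of the log conditionals
```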
If $x$ is not sequential, we can impose an artificial order, e.g., the order in which pixels are generated to form an image (such as a raster scan). This can introduce artificial bias, however.
## Deep networks to model conditional likelihoods
Model the conditional likelihoods with deep neural networks
- Logistic regression (Frey et al., 1996), neural nets (Bengio and Bengio, 2000)
- E.g., learn a deep net to generate one pixel at a time given past pixels
The learning objective is to maximize the log-likelihood $\log p(\boldsymbol{x})$
- If each conditional is tractable, $\log p(x)$ is tractable
- Model the conditional probabilities directly, with no partition function $Z$ (a minimal training sketch follows this list)
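A minimal training sketch of this objective, assuming binary data and position-specific logistic-regression conditionals (in the spirit of the logistic-regression models cited above): the loss is the summed negative log of the conditionals, so minimizing it maximizes $\log p(x)$. The data, hyperparameters, and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

D = 8                                            # dimensionality of binary x
x = (torch.rand(256, D) > 0.5).float()           # illustrative random "dataset"

# Position-specific logistic regressions: p(x_k = 1 | x_<k) = sigmoid(w_k . x_<k + b_k)
weights = nn.ParameterList([nn.Parameter(torch.zeros(k)) for k in range(D)])
biases = nn.Parameter(torch.zeros(D))
opt = torch.optim.Adam(list(weights) + [biases], lr=1e-2)

for step in range(200):
    nll = 0.0
    for k in range(D):
        logits = x[:, :k] @ weights[k] + biases[k]               # conditional logits
        nll = nll + nn.functional.binary_cross_entropy_with_logits(
            logits, x[:, k], reduction="mean")                   # -log p(x_k | x_<k)
    opt.zero_grad()
    nll.backward()          # maximizing log p(x) == minimizing the summed NLL
    opt.step()

print("average negative log-likelihood per dimension:", (nll / D).item())
```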
## Advantages and disadvantages
- Among the best models for density estimation: the likelihood is tractable and can be evaluated exactly
- They take into account complex dependencies between data dimensions
- Potentially better generations and more accurate likelihoods
- Autoregressive models are not necessarily latent-variable models: they do not necessarily have an encoder, nor do they necessarily learn representations
- Slow in learning, inference, and generation: computations are sequential (one dimension at a time), so parallelism is limited. E.g., to generate the next word we must first generate all past words (see the sampling sketch after this list)
- They may introduce artificial bias when an assumed order is imposed
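To see why generation is sequential, here is a minimal ancestral-sampling sketch for the logistic-conditional model from the earlier training sketch: each dimension can only be sampled after all preceding dimensions have been produced. The parameters here are untrained placeholders, purely for illustration.

```python
import torch

D = 8
# Assume position-specific logistic conditionals as in the training sketch above
# (freshly initialized, untrained parameters, just to show the sampling loop).
weights = [torch.zeros(k) for k in range(D)]
biases = torch.zeros(D)

x = torch.zeros(D)
for k in range(D):                                             # strictly one dimension at a time
    p_k = torch.sigmoid(x[:k] @ weights[k] + biases[k])        # p(x_k = 1 | x_<k)
    x[k] = torch.bernoulli(p_k)                                 # sample, then move to the next dimension

print(x)
```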
## References