# World Models
Paper: https://arxiv.org/abs/1803.10122
Blog: https://worldmodels.github.io/
Talk: https://www.youtube.com/watch?v=HzA8LRqhujk&feature=youtu.be
Train [[Variational Autoencoders]] for a spatial representation of each frame and [[Recurrent Neural Networks (RNN)]] for a temporal representation, then evolve a simple linear controller that maps the concatenated spatial and temporal representations to the agent's action.
![[world-models.png]]
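A minimal sketch of V's encoder, assuming PyTorch (the paper's code is TensorFlow) and the ConvVAE layer sizes described in the paper; `ConvVAEEncoder` and the usage line are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class ConvVAEEncoder(nn.Module):
    """Compress a 64x64 RGB frame into a 32-dim Gaussian latent z."""
    def __init__(self, z_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # 64x64 -> 31x31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # 31x31 -> 14x14
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),  # 14x14 -> 6x6
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(), # 6x6   -> 2x2
        )
        self.fc_mu = nn.Linear(2 * 2 * 256, z_dim)
        self.fc_logvar = nn.Linear(2 * 2 * 256, z_dim)

    def forward(self, x):                                # x: (batch, 3, 64, 64) in [0, 1]
        h = self.conv(x).flatten(start_dim=1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return z, mu, logvar

# z, mu, logvar = ConvVAEEncoder()(torch.rand(1, 3, 64, 64))
```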
1. Collect 10,000 rollouts from a random policy.
2. Train VAE (V) to encode frames into $z \in \mathbb{R}^{32}$.
3. Train MDN-RNN (M) to model $P\left(z_{t+1} \mid a_{t}, z_{t}, h_{t}\right)$.
4. Evolve linear controller (C) to maximize the expected cumulative reward of a rollout: $a_{t} = W_{c}\left[z_{t} \; h_{t}\right] + b_{c}$ (a sketch of C follows this list).
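C is deliberately tiny so evolution strategies remain practical. A NumPy sketch of the controller under CarRacing-sized assumptions (32-dim $z$, 256-dim LSTM hidden state $h$, 3-dim action); the tanh squashing is a simplification of how the paper maps outputs into each action's valid range:

```python
import numpy as np

Z_DIM, H_DIM, A_DIM = 32, 256, 3   # CarRacing sizes: latent, LSTM hidden state, action

def controller_action(params, z, h):
    """a_t = W_c [z_t h_t] + b_c -- a single linear map, 867 parameters in total."""
    n_w = A_DIM * (Z_DIM + H_DIM)
    W_c = params[:n_w].reshape(A_DIM, Z_DIM + H_DIM)
    b_c = params[n_w:]
    return np.tanh(W_c @ np.concatenate([z, h]) + b_c)  # squash into a bounded action range

n_params = A_DIM * (Z_DIM + H_DIM) + A_DIM               # = 867
a_t = controller_action(np.zeros(n_params), np.zeros(Z_DIM), np.zeros(H_DIM))
```

Because C has only ~867 parameters, it can be evolved directly with CMA-ES; all the heavy lifting lives in V and M.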
Results
- Beats DQN, A3C in CarRacing-v0
The best part: the learned world model can generate its own rollouts (M predicts the next latent, which V can decode back into frames), so the controller can be trained entirely inside this hallucinated "dream" environment.
![[worldmodels2.jpg]]
1. Collect 10,000 rollouts from a random policy.
2. Train VAE (V) to encode frames into $z \in \mathbb{R}^{64}$.
3. Train MDN-RNN (M) to model $P\left(z_{t+1}, d_{t+1} \mid a_{t}, z_{t}, h_{t}\right)$.
4. Evolve linear controller (C) to maximize the expected cumulative reward of a rollout inside the dream environment (one dream step is sketched after this list).
5. Deploy learned policy from (4) on the actual environment.
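Because M outputs a full mixture density over $z_{t+1}$ (plus a done flag), rollouts can be hallucinated without touching the real environment. A sketch of one dream-step sample, assuming the RNN head emits per-dimension mixture logits, means, and log-stddevs; the temperature $\tau$ is the paper's knob for making the dream noisier and harder to exploit, applied here by flattening the mixture weights and widening each Gaussian:

```python
import numpy as np

def sample_dream_z(logpi, mu, log_sigma, tau=1.0):
    """Sample z_{t+1} from M's mixture-density head.
    logpi, mu, log_sigma: arrays of shape (z_dim, n_mixtures) from the RNN output layer.
    tau: temperature; tau > 1 makes the hallucinated environment more stochastic.
    """
    z_dim, n_mix = mu.shape
    logits = logpi / tau
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)
    z_next = np.empty(z_dim)
    for i in range(z_dim):
        k = np.random.choice(n_mix, p=pi[i])            # pick a mixture component
        sigma = np.exp(log_sigma[i, k]) * np.sqrt(tau)  # widen it with temperature
        z_next[i] = mu[i, k] + sigma * np.random.randn()
    return z_next

# One dream step: run (z_t, a_t) through the RNN to get (logpi, mu, log_sigma, done_logit, h_{t+1}),
# then z_{t+1} = sample_dream_z(logpi, mu, log_sigma, tau); repeat until the predicted done flag fires.
```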
This assumes the initial random policy collects a dataset that covers most of the environment's state space, which isn't really true for any environment that requires exploration.
Solution: an iterative training procedure
1. Initialize M, C with random model parameters.
2. Roll out in the actual environment $N$ times. Save all actions $a_{t}$ and observations $x_{t}$ during rollouts to storage.
3. Train M to model $P\left(x_{t+1}, r_{t+1}, a_{t+1}, d_{t+1} \mid x_{t}, a_{t}, h_{t}\right)$ and train C to optimize expected rewards inside of M.
4. Go back to (2) if task has not been completed.
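A hypothetical skeleton of that loop, just to make the data flow explicit; the callables are placeholders, not any existing API:

```python
from typing import Callable, List, Tuple

def iterative_training(
    rollout_real: Callable[[], List[Tuple]],          # one real-env rollout of (x_t, a_t, r_t, d_t) steps
    fit_world_model: Callable[[List[Tuple]], None],   # step 3a: train M on P(x_{t+1}, r_{t+1}, a_{t+1}, d_{t+1} | x_t, a_t, h_t)
    evolve_controller_in_dream: Callable[[], None],   # step 3b: optimize C's expected reward inside M
    task_completed: Callable[[], bool],
    n_rollouts: int = 100,
) -> None:
    storage: List[Tuple] = []
    while not task_completed():             # step 4: repeat until the real task is solved
        for _ in range(n_rollouts):         # step 2: gather real experience with the current C
            storage.extend(rollout_real())
        # M also predicts a_{t+1}, so over iterations it models the agent's behaviour as well as the world
        fit_world_model(storage)
        evolve_controller_in_dream()
```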