# World Models

Paper: https://arxiv.org/abs/1803.10122
Blog: https://worldmodels.github.io/
Talk: https://www.youtube.com/watch?v=HzA8LRqhujk&feature=youtu.be

Train [[Variational Autoencoders]] for spatial representation and [[Recurrent Neural Networks (RNN)]] for temporal representation. Then evolve a simple linear model that maps the concatenated spatial and temporal representations to the agent's action.

![[world-models.png]]

1. Collect 10,000 rollouts from a random policy.
2. Train the VAE (V) to encode frames into $z \in \mathbb{R}^{32}$.
3. Train the MDN-RNN (M) to model $P\left(z_{t+1} \mid a_{t}, z_{t}, h_{t}\right)$.
4. Evolve the linear controller (C) to maximize the expected cumulative reward of a rollout (sketched at the end of this note):

$a_{t} = W_{c}\left[z_{t} \; h_{t}\right] + b_{c}$

Results:
- Beats DQN and A3C on CarRacing-v0.

The best part, though, is that the learned world model can generate new training rollouts ("dreams"), so the controller can be trained entirely inside the model and improved further:

![[worldmodels2.jpg]]

1. Collect 10,000 rollouts from a random policy.
2. Train the VAE (V) to encode frames into $z \in \mathbb{R}^{64}$.
3. Train the MDN-RNN (M) to model $P\left(z_{t+1}, d_{t+1} \mid a_{t}, z_{t}, h_{t}\right)$.
4. Evolve the linear controller (C) to maximize the expected cumulative reward of a rollout inside the dream environment.
5. Deploy the learned policy from (4) in the actual environment.

This assumes the initial random policy is enough to collect a dataset that covers most of the environment's state space, which isn't really true for any environment that requires exploration.

Solution: an iterative training procedure (sketched below).

1. Initialize M and C with random model parameters.
2. Roll out in the actual environment $N$ times. Save all actions $a_{t}$ and observations $x_{t}$ during rollouts to storage.
3. Train M to model $P\left(x_{t+1}, r_{t+1}, a_{t+1}, d_{t+1} \mid x_{t}, a_{t}, h_{t}\right)$ and train C to optimize expected rewards inside of M.
4. Go back to (2) if the task has not been completed.
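A structural sketch of the iterative training loop above. Every function name here (`init_world_model`, `collect_rollouts`, `train_world_model`, `train_controller_in_dream`, `task_solved`) is a hypothetical placeholder standing in for the real components, not an actual API:

```python
# Sketch of the iterative training procedure; all helper functions are
# hypothetical placeholders for the real VAE/MDN-RNN training and CMA-ES steps.

def iterative_training(env, n_rollouts=100, max_iters=50):
    M = init_world_model()   # (1) random initial world model
    C = init_controller()    # (1) random initial controller
    dataset = []

    for _ in range(max_iters):
        # (2) Roll out C in the real environment, storing (x_t, a_t, r_t, d_t).
        dataset += collect_rollouts(env, C, n_rollouts)

        # (3) Fit M to predict the next observation, reward, and done flag,
        #     then optimize C purely inside M's imagined rollouts.
        M = train_world_model(M, dataset)
        C = train_controller_in_dream(C, M)

        # (4) Stop once the task is solved in the real environment.
        if task_solved(env, C):
            break
    return M, C
```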
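A minimal sketch of the linear controller $a_{t} = W_{c}\left[z_{t} \; h_{t}\right] + b_{c}$, assuming NumPy and the CarRacing-v0 setup (32-dim $z$, 256 LSTM hidden units, 3-dim action); the tanh squashing is an assumption about how actions are kept in range:

```python
import numpy as np

Z_DIM, H_DIM, A_DIM = 32, 256, 3  # latent size, LSTM hidden size, action size

def controller_action(params, z, h):
    """a_t = W_c [z_t h_t] + b_c, with params = flattened (W_c, b_c)."""
    n_w = A_DIM * (Z_DIM + H_DIM)
    W_c = params[:n_w].reshape(A_DIM, Z_DIM + H_DIM)
    b_c = params[n_w:]
    a = W_c @ np.concatenate([z, h]) + b_c
    return np.tanh(a)  # squash into a valid action range (assumption)

n_params = A_DIM * (Z_DIM + H_DIM) + A_DIM  # only 867 parameters to evolve
```

Because the controller has so few parameters, it can be optimized with an evolution strategy (the paper uses CMA-ES) rather than backpropagation.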
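And a sketch of how $z_{t+1}$ is sampled from the MDN-RNN's output when rolling out inside the dream. The argument names and shapes here are assumptions: per latent dimension, the MDN head on top of $h_t$ outputs mixture log-weights, means, and log-stddevs, and a temperature parameter controls how stochastic the dream is:

```python
import numpy as np

def sample_next_z(logpi, mu, logsigma, temperature=1.0, rng=np.random):
    """Sample z_{t+1} from the per-dimension Gaussian mixture output by M.

    logpi, mu, logsigma: arrays of shape (z_dim, n_mixtures) from the MDN head.
    """
    # Temperature sharpens/flattens the mixture weights and scales the noise.
    logpi = logpi / temperature
    pi = np.exp(logpi - logpi.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

    z_dim, n_mix = mu.shape
    z_next = np.empty(z_dim)
    for i in range(z_dim):
        k = rng.choice(n_mix, p=pi[i])                        # pick a component
        sigma = np.exp(logsigma[i, k]) * np.sqrt(temperature)  # scale its stddev
        z_next[i] = mu[i, k] + sigma * rng.standard_normal()   # sample from it
    return z_next
```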