# MuZero

Code: https://github.com/werner-duvaud/muzero-general

## Resources

Paper: https://arxiv.org/pdf/1911.08265.pdf

PyTorch implementation: https://github.com/koulanurag/muzero-pytorch

Videos:
- Yannic Kilcher https://www.youtube.com/watch?v=We20YSAJZSE
- Henry AI Labs https://www.youtube.com/watch?v=szbvm8aNDxw
- DeepMind's Advanced Deep RL https://www.youtube.com/watch?v=iOh7QUZGyiU&list=PLqYmG7hTraZDNJre23vqCGIVpfZ_K2RZs

Blog posts:
- How to build your own MuZero https://medium.com/applied-data-science/how-to-build-your-own-muzero-in-python-f77d5718061a

## Insights

- Models that look better in terms of predictive accuracy may not always be better in terms of decision making.

## Future work

- [[Graph Convolutional Networks (GCN)]] for representing states/boards
- Contrastive Learning of Structured World Models, ICLR 2020 https://arxiv.org/abs/1911.12247
- Generalization to imperfect-information games (ReBeL)
- Effect of different inductive biases on the learned abstract model
- Generalization to multi-agent settings
  - LIIR, NeurIPS 2019 (Dacheng Tao) https://papers.nips.cc/paper/2019/file/07a9d3fed4c5ea6b17e80258dee231fa-Paper.pdf
- Different models for environments (GameGAN)

## Paper

MuZero builds upon AlphaZero's powerful search and search-based policy iteration algorithms, but incorporates a learned model into the training procedure. It does not require any knowledge of the game rules or environment dynamics, potentially paving the way towards applying powerful learning and planning methods to a host of real-world domains for which there exists no perfect simulator.

Precursors:
- [[The Predictron]], ICLR 2017 - learns and uses an abstract model at the same time to make predictions. The value-equivalent model only predicts cumulative reward; it doesn't generate actions.
- [[Value Prediction Network]], NeurIPS 2017 - closest precursor of MuZero
- [[TreeQN]], ICLR 2018 - optimizes value, generates actions
- [[AlphaGo]], Nature 2016 - predecessor to AlphaZero
- [[AlphaZero]], Science 2018 - uses knowledge of the rules of the environment

Related:
- [[World Models]], NeurIPS 2018 - a different approach to a similar problem. Uses a VAE and an RNN to learn representations and evolves a linear controller to predict the optimal action. Learns a pixel-to-pixel model of the world instead of an abstract model.

## Model

model = representation + dynamics + prediction

$
\left.\begin{array}{rl}
s^{0} & =h_{\theta}\left(o_{1}, \ldots, o_{t}\right) \\
r^{k}, s^{k} & =g_{\theta}\left(s^{k-1}, a^{k}\right) \\
\mathbf{p}^{k}, v^{k} & =f_{\theta}\left(s^{k}\right)
\end{array}\right\} \quad \mathbf{p}^{k}, v^{k}, r^{k}=\mu_{\theta}\left(o_{1}, \ldots, o_{t}, a^{1}, \ldots, a^{k}\right)
$

The model predicts three future quantities, the things essential to reinforcement learning and nothing else:

- policy: $\mathbf{p}_{t}^{k} \approx \pi\left(a_{t+k+1} \mid o_{1}, \ldots, o_{t}, a_{t+1}, \ldots, a_{t+k}\right)$
- value function: $v_{t}^{k} \approx \mathbb{E}\left[u_{t+k+1}+\gamma u_{t+k+2}+\ldots \mid o_{1}, \ldots, o_{t}, a_{t+1}, \ldots, a_{t+k}\right]$
- immediate reward: $r_{t}^{k} \approx u_{t+k}$

dynamics function
$
r^{k}, s^{k}=g_{\theta}\left(s^{k-1}, a^{k}\right)
$
- recurrent process
- mirrors an MDP model that computes the expected reward and state transition for a given state and action
- but differs in that $s^k$ has no semantics of environment state attached to it; it is simply a hidden state
- deterministic (stochastic transitions left as future work)
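As a concrete illustration of the dynamics function, here is a minimal sketch of $g_\theta$ as a small fully connected PyTorch module. The hidden-state size, one-hot action encoding and layer widths are assumptions made for illustration, not the architecture from the paper or the linked implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicsNetwork(nn.Module):
    """Sketch of g_theta: (s^{k-1}, a^k) -> (r^k, s^k).

    Sizes and the one-hot action encoding are illustrative assumptions,
    not the architecture used in the paper.
    """

    def __init__(self, state_dim: int = 64, num_actions: int = 4):
        super().__init__()
        self.num_actions = num_actions
        self.transition = nn.Sequential(
            nn.Linear(state_dim + num_actions, 128),
            nn.ReLU(),
            nn.Linear(128, state_dim),
        )
        self.reward_head = nn.Linear(state_dim, 1)

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        # Encode the action and concatenate it with the current hidden state.
        action_onehot = F.one_hot(action, self.num_actions).float()
        next_state = self.transition(torch.cat([state, action_onehot], dim=-1))
        reward = self.reward_head(next_state)   # r^k
        return reward, next_state               # r^k, s^k
```

The representation function $h_\theta$ and prediction function $f_\theta$ (described next) can be sketched in the same style; $g_\theta$ is the piece that is applied recurrently during search and during the unrolled training loss.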
prediction function
$
\mathbf{p}^{k}, v^{k}=f_{\theta}\left(s^{k}\right)
$
- policy and value computed from the internal state
- similar to AlphaZero's policy and value network

representation function
$
s^{0}=h_{\theta}\left(o_{1}, \ldots, o_{t}\right)
$
- initializes the root state $s^0$ from the observations
- no semantics tied to the real world

## Search

$
\begin{aligned}
\nu_{t}, \pi_{t} &=\operatorname{MCTS}\left(s_{t}^{0}, \mu_{\theta}\right) \\
a_{t} & \sim \pi_{t}
\end{aligned}
$

- uses MCTS like AlphaZero
- at each internal node, it uses the policy, value and reward outputs from the model
- outputs a recommended policy $\pi_{t}$ and an estimated value $\nu_{t}$. An action $a_{t+1} \sim \pi_{t}$ is then selected.

## Planning

![[muzero-howitworks.jpg]]

$t$ is the index of environment observations; $k$ is the index within the search tree.

For each observation from the environment at time step $t$:

1. Use the representation function $h$ to produce the root state $s^0$ from observations $o_1, \ldots, o_t$.
2. The root $s^0$ is used by the prediction function to predict the policy $p^0$ and value $v^0$.
3. Use this policy to generate an action $a^1$.
4. Pass this action $a^1$ and state $s^0$ to the dynamics function $g$ to predict the next state $s^1$ and reward $r^1$.
5. Use this new state with the prediction function to predict the policy $p^1$ and value $v^1$.
6. Repeat steps 2 to 5 recursively up to a certain depth $k$.
7. Obtain the value of each leaf node and produce the final action.
8. Store the trajectory data in the replay buffer.
9. Apply the action in the environment, obtain the immediate reward and new observations for time step $t+1$. Repeat.

## Training

The networks are jointly trained end-to-end with backpropagation through time. The goal is to align the neural network's policy, value and reward predictions with the MCTS policy and the observed value and reward.

$
\mathbf{p}_{t}^{k}, v_{t}^{k}, r_{t}^{k}=\mu_{\theta}\left(o_{1}, \ldots, o_{t}, a_{t+1}, \ldots, a_{t+k}\right)
$

$
z_{t}=\left\{\begin{array}{lr}
u_{T} & \text { for games } \\
u_{t+1}+\gamma u_{t+2}+\ldots+\gamma^{n-1} u_{t+n}+\gamma^{n} \nu_{t+n} & \text { for general MDPs }
\end{array}\right.
$

- all parameters are trained jointly
- first objective: minimize the error between the predicted policy $\mathbf{p}_{t}^{k}$ and the search policy $\pi_{t+k}$
- second objective: minimize the error between the predicted value $v_{t}^{k}$ and the value target $z_{t+k}$
- third objective: minimize the error between the predicted reward $r_{t}^{k}$ and the observed reward $u_{t+k}$

$
l_{t}(\theta)=\sum_{k=0}^{K} l^{r}\left(u_{t+k}, r_{t}^{k}\right)+l^{v}\left(z_{t+k}, v_{t}^{k}\right)+l^{p}\left(\pi_{t+k}, \mathbf{p}_{t}^{k}\right)+c\|\theta\|^{2}
$

$
\begin{aligned}
l^{r}(u, r) &=\left\{\begin{array}{lr}
0 & \text { for games } \\
\phi(u)^{T} \log \mathbf{r} & \text { for general MDPs }
\end{array}\right.\\
l^{v}(z, q)&=\left\{\begin{array}{lr}
(z-q)^{2} & \text { for games } \\
\phi(z)^{T} \log \mathbf{q} & \text { for general MDPs }
\end{array}\right.\\
l^{p}(\pi, p)&=\pi^{T} \log \mathbf{p}
\end{aligned}
$
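To make the unrolled loss $l_t(\theta)$ concrete, below is a minimal sketch of one loss computation in PyTorch, assuming `representation`, `dynamics` and `prediction` networks with the interfaces used above. It uses plain MSE for the value and reward terms (the "for games" case of $l^v$, and MSE in place of $l^r$) rather than the categorical transform $\phi$; all names, shapes and the weight-decay constant are illustrative assumptions, not taken from the referenced implementations.

```python
import torch.nn.functional as F

def policy_loss(target_pi, policy_logits):
    # l^p(pi, p): cross-entropy between the MCTS visit distribution and the predicted policy
    return -(target_pi * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()

def muzero_loss(representation, dynamics, prediction,
                observations, actions,
                target_policies, target_values, target_rewards,
                weight_decay=1e-4):
    """Unrolled loss l_t(theta) for one batch (sketch with scalar targets).

    `actions` and `target_rewards` hold K unroll steps; `target_policies`
    and `target_values` hold K+1 entries, including the k=0 root targets.
    """
    state = representation(observations)          # s^0 = h_theta(o_1, ..., o_t)
    policy_logits, value = prediction(state)      # p^0, v^0 = f_theta(s^0)

    # k = 0: policy and value terms only (no reward is predicted at the root).
    loss = policy_loss(target_policies[0], policy_logits) \
         + F.mse_loss(value, target_values[0])

    for k in range(1, len(actions) + 1):
        reward, state = dynamics(state, actions[k - 1])  # r^k, s^k = g_theta(s^{k-1}, a^k)
        policy_logits, value = prediction(state)         # p^k, v^k = f_theta(s^k)
        loss = loss \
             + F.mse_loss(reward, target_rewards[k - 1]) \
             + F.mse_loss(value, target_values[k]) \
             + policy_loss(target_policies[k], policy_logits)

    # c * ||theta||^2 regularization over all jointly trained parameters.
    params = (list(representation.parameters())
              + list(dynamics.parameters())
              + list(prediction.parameters()))
    return loss + weight_decay * sum((p ** 2).sum() for p in params)
```

A full implementation also has to handle batching details, gradient scaling over the unroll, and the categorical value/reward heads with the transform $\phi$; those are omitted here for brevity.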