# The Predictron

Code: https://github.com/zhongwen/predictron (TensorFlow)
Venue: ICLR 2017

## The problem

- The refinement of value estimates based on imagined trajectories is often referred to as planning.
- The model in a [[Model Based Reinforcement Learning]] algorithm is typically trained independently of its use within the planner. As a result, the model is not well-matched with the overall objective of the agent.
- Prior deep reinforcement learning methods have successfully constructed models that can unroll near pixel-perfect reconstructions, but they have yet to surpass state-of-the-art model-free methods in challenging RL domains with raw inputs.
- An ideal model could generalise to many different prediction tasks, rather than overfitting to a single task, and could learn from a rich variety of feedback signals, not just a single extrinsic reward.

## The solution

- *The predictron* jointly builds and uses a model for planning, learning an *abstract* model whose only requirement is that it be useful for planning. It doesn't have to match the real dynamics of the environment and can discard irrelevant information.
- *The predictron* integrates learning and planning into one end-to-end training procedure. At every step, a model is applied to an internal state to produce a next state, reward, discount, and value estimate. This model is completely abstract and its only goal is to facilitate accurate value prediction.
- All the predictron requires of its model is that trajectories through the abstract model produce scores that are consistent with trajectories through the real environment. This is achieved by training the predictron end-to-end, so as to make its value estimates as accurate as possible.
- Compared to, for example, [[World Models]], the predictron's model may not be interpretable or accurately represent the real world.
- The predictron provides a way to compose iterative computation, which makes it easier to solve combinatorial problems.

## The details

Overview of the predictron architecture:

1. An abstract environment model is learned
2. The model is used to simulate forward several steps
3. Imagined states are evaluated
4. Values are combined

- Works on [[Markov Reward Processes]], not full MDPs, i.e. with a fixed policy. The MRP is defined by a function $s^{\prime}, r, \gamma = p(s, \alpha)$, where $s^{\prime}$ is the next state, $r$ is the reward, and $\gamma$ is the discount factor, which can for instance represent the non-termination probability for this transition. The process may be stochastic, given IID noise $\alpha$.
- Main components:
    - First, a state representation $\mathbf{s} = f(s)$ that encodes the raw input $s$ (this could be a history of observations).
    - Second, a model $\mathbf{s}^{\prime}, \mathbf{r}, \boldsymbol{\gamma} = m(\mathbf{s}, \beta)$ that maps from internal state $\mathbf{s}$ to subsequent internal state $\mathbf{s}^{\prime}$, internal rewards $\mathbf{r}$, and internal discounts $\boldsymbol{\gamma}$.
    - Third, a value function $v$ that outputs internal values $\mathbf{v} = v(\mathbf{s})$ representing the remaining internal return from internal state $\mathbf{s}$ onwards.
    - Finally, these internal rewards, discounts, and values are combined by an accumulator into an overall estimate of value $\mathbf{g}$.

![[predictron.jpg]]

- Accumulators
    - The *k-step predictron* rolls its internal model forward $k$ steps. The $k$-step predictron return (preturn) $\mathbf{g}^{k}$ is the internal return obtained by accumulating $k$ model steps, plus a discounted final value $\mathbf{v}^{k}$ from the $k$th step: $\mathbf{g}^{k}=\mathbf{r}^{1}+\boldsymbol{\gamma}^{1}\left(\mathbf{r}^{2}+\boldsymbol{\gamma}^{2}\left(\ldots+\boldsymbol{\gamma}^{k-1}\left(\mathbf{r}^{k}+\boldsymbol{\gamma}^{k} \mathbf{v}^{k}\right) \ldots\right)\right)$
    - The *λ-predictron* combines together many $k$-step preturns. Specifically, it computes a diagonal weight matrix $\boldsymbol{\lambda}^{k}$ from each internal state $\mathbf{s}^{k}$. The accumulator uses weights $\boldsymbol{\lambda}^{0}, \ldots, \boldsymbol{\lambda}^{K}$ to aggregate over the $k$-step preturns $\mathbf{g}^{0}, \ldots, \mathbf{g}^{K}$ and outputs a combined value called the λ-preturn: $\mathbf{g}^{\lambda}=\sum_{k=0}^{K} \boldsymbol{w}^{k} \mathbf{g}^{k}$, where $\boldsymbol{w}^{k}=\left(\mathbf{1}-\boldsymbol{\lambda}^{k}\right) \prod_{j=0}^{k-1} \boldsymbol{\lambda}^{j}$ if $k<K$, and $\boldsymbol{w}^{K}=\prod_{j=0}^{K-1} \boldsymbol{\lambda}^{j}$ otherwise.
    - The individual $\boldsymbol{\lambda}^{k}$ weights may depend on the corresponding abstract state $\mathbf{s}^{k}$ and can differ per prediction. This enables the predictron to compute to an adaptive depth (Graves, 2016), depending on the internal state and learning dynamics of the network.
- Summary of the architecture (a code sketch follows this list):
    - zero step (model-free): $x \rightarrow \mathbf{s}^{0}$, value $\mathbf{v}^{0}$; $\mathbf{g}^{0} = \mathbf{v}^{0}$
    - one step: $x \rightarrow \mathbf{s}^{0} \rightarrow \mathbf{s}^{1}$; $\mathbf{g}^{1} = \mathbf{r}^{1} + \boldsymbol{\gamma}^{1}\mathbf{v}^{1}$
    - two step: $x \rightarrow \mathbf{s}^{0} \rightarrow \mathbf{s}^{1} \rightarrow \mathbf{s}^{2}$; $\mathbf{g}^{2} = \mathbf{r}^{1} + \boldsymbol{\gamma}^{1}\left(\mathbf{r}^{2} + \boldsymbol{\gamma}^{2}\mathbf{v}^{2}\right)$
    - for the final prediction, use the weighted ensemble $\mathbf{g}^{\lambda}(x) = \sum_{k} \boldsymbol{w}^{k}\mathbf{g}^{k}$
    - the whole pipeline learns end to end
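To make the rollout and the accumulator equations concrete, here is a minimal NumPy sketch. Everything in it is a hypothetical stand-in: `rollout`, `preturns`, and `lambda_preturn` are names invented for this note, and the random affine maps replace the deep networks that the paper trains end-to-end (the linked repo implements the real thing in TensorFlow).

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 16, 4  # internal-state size, number of model steps

# Hypothetical stand-ins for the learned components (the paper uses deep
# networks for all of these; random affine maps are enough to show shapes).
W_f = rng.normal(size=(d, d)) / np.sqrt(d)  # state representation f
W_m = rng.normal(size=(d, d)) / np.sqrt(d)  # transition part of the model m
w_r = rng.normal(size=d)                    # internal-reward head of m
w_g = rng.normal(size=d)                    # internal-discount head of m
w_v = rng.normal(size=d)                    # value function v
w_l = rng.normal(size=d)                    # lambda network

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rollout(x):
    """Unroll the abstract model K steps; collect r^k, gamma^k, v^k, lambda^k."""
    s = np.tanh(W_f @ x)                        # s^0 = f(x)
    rs, gs = [], []
    vs, lams = [w_v @ s], [sigmoid(w_l @ s)]    # v^0, lambda^0
    for _ in range(K):
        rs.append(w_r @ s)                      # internal reward r^{k+1}
        gs.append(sigmoid(w_g @ s))             # internal discount gamma^{k+1}
        s = np.tanh(W_m @ s)                    # next internal state s^{k+1}
        vs.append(w_v @ s)                      # internal value v^{k+1}
        lams.append(sigmoid(w_l @ s))           # lambda^{k+1}
    return np.array(rs), np.array(gs), np.array(vs), np.array(lams)

def preturns(rs, gs, vs):
    """All k-step preturns g^0..g^K, accumulated forward:
    g^k = r^1 + gamma^1 (r^2 + ... + gamma^{k-1} (r^k + gamma^k v^k))."""
    out, acc, disc = [vs[0]], 0.0, 1.0          # g^0 = v^0
    for k in range(len(rs)):
        acc += disc * rs[k]                     # discounted reward sum
        disc *= gs[k]                           # running discount product
        out.append(acc + disc * vs[k + 1])      # g^{k+1}
    return np.array(out)

def lambda_preturn(g, lams):
    """g^lambda = sum_k w^k g^k, with w^k = (1 - lambda^k) prod_{j<k} lambda^j
    for k < K, and w^K = prod_{j<K} lambda^j."""
    K = len(g) - 1
    w, prod = np.empty(K + 1), 1.0
    for k in range(K):
        w[k] = (1.0 - lams[k]) * prod
        prod *= lams[k]
    w[K] = prod                                 # weights sum to one
    return w @ g

# Example: preturns for one random "observation".
x = rng.normal(size=d)
rs, gs, vs, lams = rollout(x)
g = preturns(rs, gs, vs)
print("preturns g^0..g^K:", g)
print("lambda-preturn:", lambda_preturn(g, lams))
```

The scalar internal rewards and discounts correspond to the single-task case; the paper predicts a vector of them, one per prediction task, with each $\boldsymbol{\lambda}^{k}$ a diagonal matrix.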
## The results

All components of the predictron improve performance:

- Beats RNN and ResNet baselines.
- Can learn from unlabelled data (see the sketch after the references).

---

## References

1. ICLR 2017 paper: https://arxiv.org/pdf/1612.08810.pdf
2. Talk: https://vimeo.com/238243832
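On the unlabelled-data point above: the paper's consistency updates regress each $k$-step preturn towards the λ-preturn, treated as a fixed target, so no ground-truth return is needed. A rough sketch of that objective, reusing the hypothetical functions from the earlier sketch (with autodiff, as in the TensorFlow repo, gradients would be stopped through the target; plain NumPy can only show the shape of the loss):

```python
def consistency_loss(x):
    """Self-consistency objective usable on unlabelled inputs: pull every
    k-step preturn g^k towards the lambda-preturn g^lambda (held fixed)."""
    rs, gs, vs, lams = rollout(x)
    g = preturns(rs, gs, vs)
    target = lambda_preturn(g, lams)  # treated as a constant target
    return np.mean((g - target) ** 2)

print("consistency loss:", consistency_loss(rng.normal(size=d)))
```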