# Value Prediction Network
- Code: https://github.com/koulanurag/vpn (PyTorch)
- Original version: https://github.com/junhyukoh/value-prediction-network (TensorFlow)
- Venue: NeurIPS 2017
The Value Prediction Network (VPN) integrates model-free and model-based RL methods into a single neural network.
## The problem
- [[Model Based Reinforcement Learning]] approaches attempt to learn a model that predicts future observations conditioned on actions, which can then be used to simulate the real environment and perform multi-step lookaheads for planning. Such models are called *observation-prediction models*.
- Learning such a model is often very challenging when the observation space is large (e.g., high-dimensional pixel-level image frames), and even more difficult when the environment is stochastic.
- Raw observations may also contain information unnecessary for planning, such as dynamically changing backgrounds in visual observations that are irrelevant to the value/utility of the state.
- **What planning truly requires is the ability to predict the rewards and values of future states.**
- What if we could predict future rewards and values directly, without predicting future observations? Such a model could be easier to learn for complex domains and more flexible for dealing with stochasticity; it is referred to as a *value-prediction model*.
## The solution
- A VPN not only learns an option-value function $Q_{\theta}\left(\mathbf{x}_{t}, \mathbf{o}_{t}\right)$ through a neural network parameterized by $\theta$, as in model-free RL, but also learns the dynamics of rewards/values in order to perform planning.
- VPN is model-based in the sense that it learns an abstract-state transition function sufficient to predict rewards/discounts/values. Meanwhile, VPN can also be viewed as model-free in the sense that it learns to directly estimate the value of the abstract-state.
![[VPN.jpg]]
## The details
- Works on [[Semi-Markov Decision Processes]]
- Let $\mathbf{x}_t$ be the observation (or a history of observations, for a partially observable MDP) and let $\mathbf{o}_t$ be the [[Semi-Markov Decision Processes#Option]] at time $t$. Each option maps observations to primitive actions, and the following Bellman equation holds for all policies $\pi$: $Q^{\pi}\left(\mathbf{x}_{t}, \mathbf{o}_{t}\right)=\mathbb{E}\left[\sum_{i=0}^{k-1} \gamma^{i} r_{t+i}+\gamma^{k} V^{\pi}\left(\mathbf{x}_{t+k}\right)\right]$, where $k$ is the number of time steps taken by the option.
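- For an option that takes a single primitive step ($k=1$), this reduces to the familiar one-step Bellman equation $Q^{\pi}\left(\mathbf{x}_{t}, \mathbf{o}_{t}\right)=\mathbb{E}\left[r_{t}+\gamma V^{\pi}\left(\mathbf{x}_{t+1}\right)\right]$.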
### Components
- A VPN consists of $\theta=\left\{\theta^{\text {enc}}, \theta^{\text {value}}, \theta^{\text {out}}, \theta^{\text {trans}}\right\}$
- *Encoding module* ($f_{\theta}^{\text{enc}}$): maps an observation $\mathbf{x}$ to an abstract-state representation $\mathbf{s}$, which is learned by the network (and is not an environment state or even an approximation to one).
- *Value module* ($f_{\theta}^{\text{value}}$): estimates the value of the abstract-state, $V_{\theta}(\mathbf{s})$.
- *Outcome module* ($f_{\theta}^{\text{out}}$): predicts the option-reward for executing the option $\mathbf{o}$ at the abstract-state, as well as the option-discount induced by the number of steps taken by the option.
- *Transition module* ($f_{\theta}^{\text{trans}}$): transforms the abstract-state into the next abstract-state in an option-conditional manner.
- *Core module*: combines the modules above; given an abstract-state and an option, it makes separate option-conditional predictions of the option-reward, the option-discount, the next abstract-state, and the value of that next abstract-state.
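A minimal PyTorch sketch of these modules. The layer sizes, the per-option linear heads, and the names `VPNModules`, `transition`, `outcome_pred` are illustrative assumptions, not the paper's architecture (the paper's transition/outcome modules are convolutional, as described under Learning below):

```python
import torch
import torch.nn as nn


class VPNModules(nn.Module):
    """Illustrative VPN modules; shapes and layer sizes are assumptions."""

    def __init__(self, in_channels=3, state_dim=64, num_options=4):
        super().__init__()
        self.state_dim, self.num_options = state_dim, num_options
        # Encoding module f_enc: observation x -> abstract-state s
        self.encode = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(state_dim), nn.ReLU(),
        )
        # Value module f_value: s -> V(s)
        self.value = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        # Outcome module f_out: (s, o) -> (option-reward, option-discount), one head per option
        self.outcome = nn.Linear(state_dim, num_options * 2)
        # Transition module f_trans: (s, o) -> s', one residual head per option
        self.trans = nn.Linear(state_dim, num_options * state_dim)

    def transition(self, s, o):
        # Option-conditional next abstract-state with a residual connection.
        nxt = self.trans(s).view(-1, self.num_options, self.state_dim)
        return s + nxt[torch.arange(s.size(0), device=s.device), o]

    def outcome_pred(self, s, o):
        # Select the reward/discount head of the executed option.
        out = self.outcome(s).view(-1, self.num_options, 2)
        out = out[torch.arange(s.size(0), device=s.device), o]
        return out[:, 0], torch.sigmoid(out[:, 1])  # reward, discount in (0, 1)
```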
### Planning
![[VPN-planning.jpg]]
- Methods like [[Monte-Carlo Tree Search]] could be applied, but a simple rollout up to a depth $d$ (the planning depth) is used.
- Q-value from d-step planning is defined as:
$
Q_{\theta}^{d}(\mathbf{s}, \mathbf{o})=r+\gamma V_{\theta}^{d}\left(\mathbf{s}^{\prime}\right) \quad V_{\theta}^{d}(\mathbf{s})=\left\{\begin{array}{ll}
V_{\theta}(\mathbf{s}) & \text { if } d=1 \\
\frac{1}{d} V_{\theta}(\mathbf{s})+\frac{d-1}{d} \max _{\mathbf{o}} Q_{\theta}^{d-1}(\mathbf{s}, \mathbf{o}) & \text { if } d>1
\end{array}\right.
$
- where $\mathbf{s}^{\prime}=f_{\theta}^{\text{trans}}(\mathbf{s}, \mathbf{o})$, $V_{\theta}(\mathbf{s})=f_{\theta}^{\text{value}}(\mathbf{s})$, and $r, \gamma=f_{\theta}^{\text{out}}(\mathbf{s}, \mathbf{o})$.
- Expansion step - recursively simulate options up to a depth of $d$ by unrolling the core module.
- Backup step - compute the weighted average of the direct value estimate $V_{\theta}(\mathbf{s})$ and $\max_{\mathbf{o}} Q_{\theta}^{d-1}(\mathbf{s}, \mathbf{o})$ to obtain $V_{\theta}^{d}(\mathbf{s})$.
- To reduce the computational cost, only the $b$-best options (according to $Q^{1}(\mathbf{s}, \mathbf{o})$) are simulated at each expansion step.
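The recursion above can be sketched against the hypothetical `VPNModules` interface from the earlier snippet. The `plan_q`/`plan_v` names, the batched option handling, and the default $b=3$ are assumptions, not the authors' implementation:

```python
import torch


def _broadcast(o, s):
    # Hypothetical helper: repeat a scalar option index over the batch.
    return torch.full((s.size(0),), o, dtype=torch.long, device=s.device)


def plan_v(m, s, d, b=3):
    """V^d(s): weighted average of the direct estimate and the best simulated Q^{d-1}(s, o)."""
    v = m.value(s).squeeze(-1)
    if d == 1:
        return v
    # Q^1 for every option, used to pick the b-best options to expand further.
    q1 = torch.stack([plan_q(m, s, _broadcast(o, s), 1) for o in range(m.num_options)], dim=-1)
    top = q1.topk(min(b, m.num_options), dim=-1).indices
    q_best = torch.stack(
        [plan_q(m, s, top[:, i], d - 1) for i in range(top.size(1))], dim=-1
    ).max(dim=-1).values
    return v / d + (d - 1) / d * q_best


def plan_q(m, s, o, d):
    """Q^d(s, o) = r + gamma * V^d(s'), with r, gamma, s' from the outcome/transition modules."""
    r, gamma = m.outcome_pred(s, o)
    s_next = m.transition(s, o)
    return r + gamma * plan_v(m, s_next, d)
```

Acting with $d$-step planning then amounts to encoding the current observation and picking the option with the highest `plan_q(m, m.encode(x), o, d)`.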
### Learning
![[VPN-learning.jpg]]
- VPN can be trained with any existing value-based RL algorithm for the value predictions, combined with supervised learning for the reward and discount predictions. In the paper, a modification of n-step Q-learning is used.
- Given an n-step trajectory $\mathbf{x}_{1}, \mathbf{o}_{1}, r_{1}, \gamma_{1}, \mathbf{x}_{2}, \mathbf{o}_{2}, r_{2}, \gamma_{2}, \ldots, \mathbf{x}_{n+1}$ generated by the $\epsilon$-greedy policy, $k$-step predictions are defined as follows:
$
\mathbf{s}_{t}^{k}=\left\{\begin{array}{ll}
f_{\theta}^{\text{enc}}\left(\mathbf{x}_{t}\right) & \text{if } k=0 \\
f_{\theta}^{\text{trans}}\left(\mathbf{s}_{t-1}^{k-1}, \mathbf{o}_{t-1}\right) & \text{if } k>0
\end{array}\right. \quad v_{t}^{k}=f_{\theta}^{\text{value}}\left(\mathbf{s}_{t}^{k}\right) \quad r_{t}^{k}, \gamma_{t}^{k}=f_{\theta}^{\text{out}}\left(\mathbf{s}_{t}^{k-1}, \mathbf{o}_{t}\right)
$
- Intuitively, $\mathbf{s}_{t}^{k}$ is the VPN's $k$-step prediction of the abstract-state at time $t$, predicted from $\mathbf{x}_{t-k}$ by following options $\mathbf{o}_{t-k}, \ldots, \mathbf{o}_{t-1}$ in the trajectory.
- The $k$-step prediction loss at step $t$ is defined as:
$
\mathcal{L}_{t}=\sum_{l=1}^{k}\left[\left(R_{t}-v_{t}^{l}\right)^{2}+\left(r_{t}-r_{t}^{l}\right)^{2}+\left(\log _{\gamma} \gamma_{t}-\log _{\gamma} \gamma_{t}^{l}\right)^{2}\right]
$
where
$
R_{t}=\left\{\begin{array}{ll}
r_{t}+\gamma_{t} R_{t+1} & \text { if } t \leq n \\
\max _{\mathbf{o}} Q_{\theta^{-}}^{d}\left(\mathbf{s}_{n+1}, \mathbf{o}\right) & \text { if } t=n+1
\end{array}\right.
$
and $Q_{\theta^{-}}^{d}\left(\mathbf{s}_{n+1}, \mathbf{o}\right)$ is the Q-value computed by the $d$-step planning method described above, using the target network parameters $\theta^{-}$.
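A sketch of the resulting training loss, assuming the hypothetical `VPNModules` interface above and n-step returns `Rs` precomputed as in the $R_t$ definition (bootstrapping with the target network's $d$-step planning value). The loop reorganizes $\sum_t \mathcal{L}_t$ into one unroll per start time and simplifies trajectory-boundary handling:

```python
import math

import torch


def vpn_loss(m, xs, opts, rs, gammas, Rs, k, base=0.98):
    """Accumulate value / reward / discount prediction errors up to depth k.

    xs         : (n+1, C, H, W) observations x_1..x_{n+1}
    opts       : (n,) long tensor of option indices o_1..o_n
    rs, gammas : (n,) observed option-rewards and option-discounts
    Rs         : (n+1,) n-step returns R_t (last entry from target-network planning)
    base       : assumed per-step discount used as the log base in the discount loss
    """
    n = opts.size(0)
    loss = xs.new_zeros(())
    for i in range(n):
        s = m.encode(xs[i:i + 1])                    # s_i^0 = f_enc(x_i)
        for j in range(min(k, n - i)):
            o = opts[i + j:i + j + 1]
            r_hat, g_hat = m.outcome_pred(s, o)      # prediction of r_{i+j}, gamma_{i+j}
            s = m.transition(s, o)                   # s_{i+j+1}^{j+1}
            v_hat = m.value(s).squeeze(-1)           # prediction of R_{i+j+1}
            loss = loss + (Rs[i + j + 1] - v_hat).pow(2).sum()
            loss = loss + (rs[i + j] - r_hat).pow(2).sum()
            # Discount error measured in "number of steps" via the log base gamma.
            loss = loss + ((torch.log(gammas[i + j]) - torch.log(g_hat))
                           / math.log(base)).pow(2).sum()
    return loss
```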
- The encoding module is a CNN. The transition module consists of one option-conditional convolution layer, which uses different weights depending on the option, followed by a few more convolution layers.
- The outcome module is similar to the transition module, except that it has no residual connection and uses two fully-connected layers to produce the reward and discount. The value module consists of two fully-connected layers.
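A minimal sketch of the option-conditional convolution idea, i.e., a convolution whose kernel is selected by the option being simulated (channel counts, initialization, and the per-sample loop are assumptions made for clarity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OptionConditionalConv(nn.Module):
    """A convolution with one kernel/bias per option, selected at run time."""

    def __init__(self, num_options, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.weight = nn.Parameter(
            0.01 * torch.randn(num_options, out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(num_options, out_ch))
        self.padding = kernel_size // 2

    def forward(self, x, o):
        # x: (batch, in_ch, H, W); o: (batch,) long tensor of option indices.
        outs = [F.conv2d(xi.unsqueeze(0), self.weight[oi], self.bias[oi],
                         padding=self.padding)
                for xi, oi in zip(x, o)]
        return torch.cat(outs, dim=0)
```

The per-sample loop keeps the option-dependence explicit; a grouped/batched convolution would be the more efficient equivalent.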
## The results
- VPNs outperform the DQN and VPN(1) (planning depth 1, effectively model-free) baselines by a large margin.
- Compared to observation-prediction networks (OPNs), which perform planning based on predicted observations, VPNs perform slightly better or equally well in the deterministic environments. In the stochastic Collect domain, VPNs significantly outperform OPNs.
- VPN is much more robust to unseen environments than the model-free baselines. The model-free baselines perform worse than the greedy algorithm on unseen environments, whereas VPN still performs well.
- VPN generalizes as well as OPN which can learn a near-perfect model in the deterministic setting, and VPN significantly outperforms OPN in the stochastic setting.
---
## References
1. Oh, J., Singh, S., and Lee, H. *Value Prediction Network*. NeurIPS 2017. https://arxiv.org/pdf/1707.03497.pdf