# Value Prediction Network
- Code: https://github.com/koulanurag/vpn (PyTorch)
- Original version: https://github.com/junhyukoh/value-prediction-network (TensorFlow)
- Venue: NeurIPS 2017
The Value Prediction Network (VPN) integrates model-free and model-based RL methods into a single neural network.
## The problem
- [[Model Based Reinforcement Learning]] approaches attempt to learn a model that predicts future observations conditioned on actions, which can then be used to simulate the real environment and perform multi-step lookaheads for planning. Such models are called *observation-prediction models*.
- Learning such a model is often very challenging when the observation space is large (e.g., high-dimensional pixel-level image frames), and even more difficult when the environment is stochastic.
- Raw observations may also contain information unnecessary for planning, such as dynamically changing backgrounds in visual observations that are irrelevant to the value/utility of the state.
- **What planning truly requires is the ability to predict the rewards and values of future states.**
- What if we could predict future rewards and values directly, without predicting future observations? Such a model could be easier to learn for complex domains and more flexible for dealing with stochasticity; it is referred to as a *value-prediction model*.
## The solution
- A VPN not only learns an option-value function $Q_{\theta}\left(\mathbf{x}_{t}, \mathbf{o}_{t}\right)$ through a neural network parameterized by $\theta$, as in model-free RL, but also learns the dynamics of rewards/values in order to perform planning.
- VPN is model-based in the sense that it learns an abstract-state transition function sufficient to predict rewards/discounts/values. Meanwhile, VPN can also be viewed as model-free in the sense that it learns to directly estimate the value of the abstract-state.
![[VPN.jpg]]
## The details
- Works on [[Semi-Markov Decision Processes]]
- Let $\mathbf{x}_t$ be the observation (or a history of observations, for a partially observable MDP) and let $\mathbf{o}_t$ be the [[Semi-Markov Decision Processes#Option]] at time $t$. Each option maps observations to primitive actions, and the following Bellman equation holds for all policies $\pi$: $Q^{\pi}\left(\mathbf{x}_{t}, \mathbf{o}_{t}\right)=\mathbb{E}\left[\sum_{i=0}^{k-1} \gamma^{i} r_{t+i}+\gamma^{k} V^{\pi}\left(\mathbf{x}_{t+k}\right)\right]$, where $k$ is the number of time steps taken by the option.
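- For an option that takes a single primitive step ($k=1$), this reduces to the familiar one-step Bellman equation $Q^{\pi}\left(\mathbf{x}_{t}, \mathbf{o}_{t}\right)=\mathbb{E}\left[r_{t}+\gamma V^{\pi}\left(\mathbf{x}_{t+1}\right)\right]$.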
### Components
- A VPN consists of $\theta=\left\{\theta^{\text {enc}}, \theta^{\text {value}}, \theta^{\text {out}}, \theta^{\text {trans}}\right\}$
- *Encoding module* ($f_{\theta}^{\text{enc}}$): maps an observation $\mathbf{x}$ to an abstract-state representation $\mathbf{s}$, which is learned by the network (and is not an environment state or even an approximation to one).
- *Value module* ($f_{\theta}^{\text{value}}$): estimates the value of the abstract-state, $V_{\theta}(\mathbf{s})$.
- *Outcome module* ($f_{\theta}^{\text{out}}$): predicts the option-reward for executing the option $\mathbf{o}$ at the abstract-state, as well as the option-discount induced by the number of steps taken by the option.
- *Transition module* ($f_{\theta}^{\text{trans}}$): transforms the abstract-state into the next abstract-state in an option-conditional manner.
- *Core module*: combines the modules above; given an abstract-state and an option, it makes separate option-conditional predictions of the option-reward, the option-discount, the next abstract-state, and the value of that next abstract-state.
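A minimal PyTorch sketch of these modules. The layer sizes, the per-option linear heads, and the names `VPNModules`, `transition`, `outcome_pred` are illustrative assumptions, not the paper's architecture (the paper's transition/outcome modules are convolutional, as described under Learning below):

```python
import torch
import torch.nn as nn


class VPNModules(nn.Module):
    """Illustrative VPN modules; shapes and layer sizes are assumptions."""

    def __init__(self, in_channels=3, state_dim=64, num_options=4):
        super().__init__()
        self.state_dim, self.num_options = state_dim, num_options
        # Encoding module f_enc: observation x -> abstract-state s
        self.encode = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(state_dim), nn.ReLU(),
        )
        # Value module f_value: s -> V(s)
        self.value = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        # Outcome module f_out: (s, o) -> (option-reward, option-discount), one head per option
        self.outcome = nn.Linear(state_dim, num_options * 2)
        # Transition module f_trans: (s, o) -> s', one residual head per option
        self.trans = nn.Linear(state_dim, num_options * state_dim)

    def transition(self, s, o):
        # Option-conditional next abstract-state with a residual connection.
        nxt = self.trans(s).view(-1, self.num_options, self.state_dim)
        return s + nxt[torch.arange(s.size(0), device=s.device), o]

    def outcome_pred(self, s, o):
        # Select the reward/discount head of the executed option.
        out = self.outcome(s).view(-1, self.num_options, 2)
        out = out[torch.arange(s.size(0), device=s.device), o]
        return out[:, 0], torch.sigmoid(out[:, 1])  # reward, discount in (0, 1)
```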
### Planning
![[VPN-planning.jpg]]
- Methods like [[Monte-Carlo Tree Search]] could be applied, but a simple rollout up to a depth $d$ (the planning depth) is used.
- Q-value from d-step planning is defined as:
$
Q_{\theta}^{d}(\mathbf{s}, \mathbf{o})=r+\gamma V_{\theta}^{d}\left(\mathbf{s}^{\prime}\right) \quad V_{\theta}^{d}(\mathbf{s})=\left\{\begin{array}{ll}
V_{\theta}(\mathbf{s}) & \text { if } d=1 \\
\frac{1}{d} V_{\theta}(\mathbf{s})+\frac{d-1}{d} \max _{\mathbf{o}} Q_{\theta}^{d-1}(\mathbf{s}, \mathbf{o}) & \text { if } d>1
\end{array}\right.
$
- where $\mathbf{s}^{\prime}=f_{\theta}^{\text{trans}}(\mathbf{s}, \mathbf{o})$, $V_{\theta}(\mathbf{s})=f_{\theta}^{\text{value}}(\mathbf{s})$, and $r, \gamma=f_{\theta}^{\text{out}}(\mathbf{s}, \mathbf{o})$.
- Expansion step - recursively simulate options up to a depth of $d$ by unrolling the core module.
- Backup step - compute the weighted average of the direct value estimate $V_{\theta}(\mathbf{s})$ and $\max_{\mathbf{o}} Q_{\theta}^{d-1}(\mathbf{s}, \mathbf{o})$ to obtain $V_{\theta}^{d}(\mathbf{s})$.
- To reduce the computational cost, only the $b$-best options (according to $Q^{1}(\mathbf{s}, \mathbf{o})$) are simulated at each expansion step.
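The recursion above can be sketched against the hypothetical `VPNModules` interface from the earlier snippet. The `plan_q`/`plan_v` names, the batched option handling, and the default $b=3$ are assumptions, not the authors' implementation:

```python
import torch


def _broadcast(o, s):
    # Hypothetical helper: repeat a scalar option index over the batch.
    return torch.full((s.size(0),), o, dtype=torch.long, device=s.device)


def plan_v(m, s, d, b=3):
    """V^d(s): weighted average of the direct estimate and the best simulated Q^{d-1}(s, o)."""
    v = m.value(s).squeeze(-1)
    if d == 1:
        return v
    # Q^1 for every option, used to pick the b-best options to expand further.
    q1 = torch.stack([plan_q(m, s, _broadcast(o, s), 1) for o in range(m.num_options)], dim=-1)
    top = q1.topk(min(b, m.num_options), dim=-1).indices
    q_best = torch.stack(
        [plan_q(m, s, top[:, i], d - 1) for i in range(top.size(1))], dim=-1
    ).max(dim=-1).values
    return v / d + (d - 1) / d * q_best


def plan_q(m, s, o, d):
    """Q^d(s, o) = r + gamma * V^d(s'), with r, gamma, s' from the outcome/transition modules."""
    r, gamma = m.outcome_pred(s, o)
    s_next = m.transition(s, o)
    return r + gamma * plan_v(m, s_next, d)
```

Acting with $d$-step planning then amounts to encoding the current observation and picking the option with the highest `plan_q(m, m.encode(x), o, d)`.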
### Learning
![[VPN-learning.jpg]]
- VPN can be trained with any existing value-based RL algorithm for the value predictions, combined with supervised learning for the reward and discount predictions. In the paper, a modification of n-step Q-learning is used.
- Given an n-step trajectory $\mathbf{x}_{1}, \mathbf{o}_{1}, r_{1}, \gamma_{1}, \mathbf{x}_{2}, \mathbf{o}_{2}, r_{2}, \gamma_{2}, \ldots, \mathbf{x}_{n+1}$ generated by the $\epsilon$-greedy policy, $k$-step predictions are defined as follows:
$
\mathbf{s}_{t}^{k}=\left\{\begin{array}{ll}
f_{\theta}^{\text{enc}}\left(\mathbf{x}_{t}\right) & \text{if } k=0 \\
f_{\theta}^{\text{trans}}\left(\mathbf{s}_{t-1}^{k-1}, \mathbf{o}_{t-1}\right) & \text{if } k>0
\end{array}\right. \quad v_{t}^{k}=f_{\theta}^{\text{value}}\left(\mathbf{s}_{t}^{k}\right) \quad r_{t}^{k}, \gamma_{t}^{k}=f_{\theta}^{\text{out}}\left(\mathbf{s}_{t}^{k-1}, \mathbf{o}_{t}\right)
$
- Intuitively, $\mathbf{s}_{t}^{k}$ is the VPN's $k$-step prediction of the abstract-state at time $t$, predicted from $\mathbf{x}_{t-k}$ by following options $\mathbf{o}_{t-k}, \ldots, \mathbf{o}_{t-1}$ in the trajectory.
- The $k$-step prediction loss at step $t$ is defined as:
$
\mathcal{L}_{t}=\sum_{l=1}^{k}\left[\left(R_{t}-v_{t}^{l}\right)^{2}+\left(r_{t}-r_{t}^{l}\right)^{2}+\left(\log _{\gamma} \gamma_{t}-\log _{\gamma} \gamma_{t}^{l}\right)^{2}\right]
$
where
$
R_{t}=\left\{\begin{array}{ll}
r_{t}+\gamma_{t} R_{t+1} & \text { if } t \leq n \\
\max _{\mathbf{o}} Q_{\theta^{-}}^{d}\left(\mathbf{s}_{n+1}, \mathbf{o}\right) & \text { if } t=n+1
\end{array}\right.
$
and $Q_{\theta^{-}}^{d}\left(\mathbf{s}_{n+1}, \mathbf{o}\right)$ is the Q-value computed by the $d$-step planning method described above, using the target network parameters $\theta^{-}$.
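A sketch of the resulting training loss, assuming the hypothetical `VPNModules` interface above and n-step returns `Rs` precomputed as in the $R_t$ definition (bootstrapping with the target network's $d$-step planning value). The loop reorganizes $\sum_t \mathcal{L}_t$ into one unroll per start time and simplifies trajectory-boundary handling:

```python
import math

import torch


def vpn_loss(m, xs, opts, rs, gammas, Rs, k, base=0.98):
    """Accumulate value / reward / discount prediction errors up to depth k.

    xs         : (n+1, C, H, W) observations x_1..x_{n+1}
    opts       : (n,) long tensor of option indices o_1..o_n
    rs, gammas : (n,) observed option-rewards and option-discounts
    Rs         : (n+1,) n-step returns R_t (last entry from target-network planning)
    base       : assumed per-step discount used as the log base in the discount loss
    """
    n = opts.size(0)
    loss = xs.new_zeros(())
    for i in range(n):
        s = m.encode(xs[i:i + 1])                    # s_i^0 = f_enc(x_i)
        for j in range(min(k, n - i)):
            o = opts[i + j:i + j + 1]
            r_hat, g_hat = m.outcome_pred(s, o)      # prediction of r_{i+j}, gamma_{i+j}
            s = m.transition(s, o)                   # s_{i+j+1}^{j+1}
            v_hat = m.value(s).squeeze(-1)           # prediction of R_{i+j+1}
            loss = loss + (Rs[i + j + 1] - v_hat).pow(2).sum()
            loss = loss + (rs[i + j] - r_hat).pow(2).sum()
            # Discount error measured in "number of steps" via the log base gamma.
            loss = loss + ((torch.log(gammas[i + j]) - torch.log(g_hat))
                           / math.log(base)).pow(2).sum()
    return loss
```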
- The encoding module is a CNN. The transition module consists of one option-conditional convolution layer, which uses different weights depending on the option, followed by a few more convolution layers.
- The outcome module is similar to the transition module, except that it has no residual connection and uses two fully-connected layers to produce the reward and discount. The value module consists of two fully-connected layers.
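A minimal sketch of the option-conditional convolution idea, i.e., a convolution whose kernel is selected by the option being simulated (channel counts, initialization, and the per-sample loop are assumptions made for clarity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OptionConditionalConv(nn.Module):
    """A convolution with one kernel/bias per option, selected at run time."""

    def __init__(self, num_options, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.weight = nn.Parameter(
            0.01 * torch.randn(num_options, out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(num_options, out_ch))
        self.padding = kernel_size // 2

    def forward(self, x, o):
        # x: (batch, in_ch, H, W); o: (batch,) long tensor of option indices.
        outs = [F.conv2d(xi.unsqueeze(0), self.weight[oi], self.bias[oi],
                         padding=self.padding)
                for xi, oi in zip(x, o)]
        return torch.cat(outs, dim=0)
```

The per-sample loop keeps the option-dependence explicit; a grouped/batched convolution would be the more efficient equivalent.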
## The results
- VPNs outperform the DQN and VPN(1) (planning depth 1, effectively model-free) baselines by a large margin.
- Compared to observation-prediction networks (OPNs), which perform planning based on predicted observations, VPNs perform slightly better or equally well in the deterministic environments. In the stochastic Collect domain, VPNs significantly outperform OPNs.
- VPN is much more robust to unseen environments than the model-free baselines. The model-free baselines perform worse than the greedy algorithm on unseen environments, whereas VPN still performs well.
- VPN generalizes as well as OPN which can learn a near-perfect model in the deterministic setting, and VPN significantly outperforms OPN in the stochastic setting.
---
## References
1. Oh, J., Singh, S., and Lee, H. *Value Prediction Network*. NeurIPS 2017. https://arxiv.org/pdf/1707.03497.pdf