# Bellman Equation and Value Functions

Value function under policy $\pi$:

$$
v_{\pi}(s) \doteq \mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s\right]=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_{t}=s\right], \text{ for all } s \in \mathcal{S}
$$

Action-value function under policy $\pi$:

$$
q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s, A_{t}=a\right]=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_{t}=s, A_{t}=a\right]
$$

Equation for $v_{\pi}$ in terms of $q_{\pi}$ and $\pi$:

$$
v_{\pi}(s) \doteq \sum_{a} \pi(a \mid s)\, q_{\pi}(s, a)
$$

Equation for $q_{\pi}$ in terms of $v_{\pi}$ and the four-argument $p$:

$$
q_{\pi}(s, a)=\sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_{\pi}\left(s^{\prime}\right)\right]
$$

## Bellman Equation

A fundamental property of value functions used throughout reinforcement learning and dynamic programming is that they satisfy recursive relationships:

$$
\begin{aligned}
v_{\pi}(s) & \doteq \mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s\right] \\
&=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma G_{t+1} \mid S_{t}=s\right] \\
&=\sum_{a} \pi(a \mid s) \sum_{s^{\prime}} \sum_{r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma \mathbb{E}_{\pi}\left[G_{t+1} \mid S_{t+1}=s^{\prime}\right]\right] \\
&=\sum_{a} \pi(a \mid s) \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_{\pi}\left(s^{\prime}\right)\right], \quad \text{for all } s \in \mathcal{S}
\end{aligned}
$$

## Bellman Optimality Equation

Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:

$$
v_{*}(s) =\max_{a} \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_{*}\left(s^{\prime}\right)\right]
$$

And for the action-value function:

$$
q_{*}(s,a) =\sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma \max_{a^{\prime}} q_{*}\left(s^{\prime}, a^{\prime}\right)\right]
$$

(A small numerical sketch of solving both equations appears at the end of this note.)

Explicitly solving the Bellman optimality equation provides one route to finding an optimal policy, and thus to solving the reinforcement learning problem. This solution relies on at least three assumptions that are rarely true in practice:

1. The dynamics of the environment are accurately known.
2. Computational resources are sufficient to complete the calculation.
3. The states have the [[Markov Decision Processes#Markov Decision Processes#Markov Property|Markov Property]].

---

## References

1. Chapter 4.4, RL:AI, Sutton and Barto, 2nd Edition
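
## Numerical Sketch

The Bellman expectation equation can be turned directly into an iterative update: repeatedly replace $v_\pi(s)$ by the right-hand side until the values stop changing (iterative policy evaluation). The sketch below is not from the book; the two-state MDP, its dynamics `p`, and the uniform policy `pi` are made-up placeholders used only to illustrate the equations above.

```python
import numpy as np

# Hypothetical two-state, two-action MDP (illustration only).
# p[(s, a)] lists (prob, next_state, reward) triples, i.e. the
# four-argument dynamics p(s', r | s, a) flattened over (s', r).
p = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 0): [(1.0, 0, 2.0)],
    (1, 1): [(1.0, 1, 0.5)],
}
states, actions = [0, 1], [0, 1]
gamma = 0.9

# An arbitrary stochastic policy pi(a | s) for the example.
pi = {0: [0.5, 0.5], 1: [0.5, 0.5]}

def policy_evaluation(pi, theta=1e-10):
    """Sweep the Bellman expectation equation until v stops changing."""
    v = np.zeros(len(states))
    while True:
        delta = 0.0
        for s in states:
            # v_pi(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v_pi(s')]
            new_v = sum(
                pi[s][a] * sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def q_from_v(v):
    """q_pi(s, a) = sum_{s',r} p(s',r|s,a) [r + gamma v_pi(s')]."""
    return {
        (s, a): sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])
        for s in states for a in actions
    }

v_pi = policy_evaluation(pi)
print("v_pi:", v_pi)
print("q_pi:", q_from_v(v_pi))
```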
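
The Bellman optimality equation can be solved the same way, with the sum over the policy replaced by a max over actions (value iteration). The sketch below reuses the same hypothetical two-state MDP; `value_iteration` and `greedy_policy` are illustrative names, not anything defined in the reference.

```python
import numpy as np

# Same hypothetical two-state, two-action MDP as in the previous sketch.
p = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 0): [(1.0, 0, 2.0)],
    (1, 1): [(1.0, 1, 0.5)],
}
states, actions = [0, 1], [0, 1]
gamma = 0.9

def value_iteration(theta=1e-10):
    """Iterate v_*(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma v_*(s')]."""
    v = np.zeros(len(states))
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def greedy_policy(v):
    """Extract a deterministic policy by acting greedily with respect to v_*."""
    return {
        s: max(actions,
               key=lambda a: sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)]))
        for s in states
    }

v_star = value_iteration()
print("v_*:", v_star)
print("greedy policy:", greedy_policy(v_star))
```

Both sketches assume the three conditions listed above (known dynamics, enough compute, Markov states); when they fail, sample-based methods are used instead of solving the equations directly.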