# Bellman Equation and Value Functions

State-value function under a policy $\pi$:

$$
v_{\pi}(s) \doteq \mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s\right]=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_{t}=s\right], \text { for all } s \in \mathcal{S}
$$

Action-value function under a policy $\pi$:

$$
q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s, A_{t}=a\right]=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \mid S_{t}=s, A_{t}=a\right]
$$

Equation for $v_{\pi}$ in terms of $q_{\pi}$ and $\pi$:

$$
v_{\pi}(s) \doteq \sum_{a} \pi(a \mid s) q_{\pi}(s, a)
$$

Equation for $q_{\pi}$ in terms of $v_{\pi}$ and the four-argument $p$:

$$
q_{\pi}(s, a)=\sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_{\pi}\left(s^{\prime}\right)\right]
$$

## Bellman Equation

A fundamental property of value functions, used throughout reinforcement learning and dynamic programming, is that they satisfy a recursive relationship between the value of a state and the values of its successor states:

$$
\begin{aligned}
v_{\pi}(s) & \doteq \mathbb{E}_{\pi}\left[G_{t} \mid S_{t}=s\right] \\
&=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma G_{t+1} \mid S_{t}=s\right] \\
&=\sum_{a} \pi(a \mid s) \sum_{s^{\prime}} \sum_{r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma \mathbb{E}_{\pi}\left[G_{t+1} \mid S_{t+1}=s^{\prime}\right]\right] \\
&=\sum_{a} \pi(a \mid s) \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_{\pi}\left(s^{\prime}\right)\right], \quad \text { for all } s \in \mathcal{S}
\end{aligned}
$$

Treating this backup as an update rule gives iterative policy evaluation (see the policy-evaluation sketch at the end of this note).

## Bellman Optimality Equation

Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:

$$
v_{*}(s) =\max _{a} \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_{*}\left(s^{\prime}\right)\right]
$$

The corresponding equation for the optimal action-value function is

$$
q_{*}(s,a) =\sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma \max _{a^{\prime}} q_{*}\left(s^{\prime}, a^{\prime}\right)\right]
$$

These equations are recursive self-consistency conditions that the optimal values must satisfy: the optimal value of a state $s$ is defined in terms of the optimal values of its successor states $s^{\prime}$. In other words, the optimal values are a fixed point of the equations: plugging the true optimal values into the right-hand side gives back the same values on the left-hand side. This fixed-point property is what enables iterative solution methods to converge to the optimal values (see the value-iteration sketch at the end of this note).

The circularity is not a problem because it captures a fundamental truth about optimality: an optimal decision at any state must account for making optimal decisions in all future states. The equation essentially says that if you are truly optimal everywhere, then the value of your current state must equal the expected immediate reward plus the discounted optimal value of wherever you end up next. This self-referential definition uniquely characterizes the optimal value function, even though the values themselves are unknown at the outset.

Solving the Bellman optimality equation exactly relies on at least three assumptions that are rarely true in practice:

1. The dynamics of the environment are accurately known.
2. Computational resources are sufficient to complete the calculation.
3. The states have the [[Markov Decision Processes#Markov Decision Processes#Markov Property|Markov Property]].

## References

1. Sutton and Barto, *Reinforcement Learning: An Introduction*, 2nd Edition, Chapter 4.4
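## Sketch: Iterative Policy Evaluation

To make the Bellman equation for $v_{\pi}$ concrete, here is a minimal Python sketch of iterative policy evaluation on a made-up two-state MDP. The states `s0`/`s1`, actions `stay`/`go`, the dynamics dictionary `p`, the discount `GAMMA = 0.9`, and the equiprobable policy are all hypothetical choices for illustration, not from the book.

```python
# Minimal sketch of iterative policy evaluation on a hypothetical two-state MDP.
# The dynamics p(s', r | s, a) and the policy pi(a | s) below are made up
# purely for illustration.

GAMMA = 0.9

# p[(s, a)] is a list of (probability, next_state, reward) triples,
# i.e. the four-argument dynamics p(s', r | s, a).
p = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}
states = ["s0", "s1"]
actions = ["stay", "go"]

# A fixed stochastic policy pi(a | s): equiprobable over both actions.
pi = {s: {"stay": 0.5, "go": 0.5} for s in states}

def q_from_v(s, a, v):
    """q_pi(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]."""
    return sum(prob * (r + GAMMA * v[s2]) for prob, s2, r in p[(s, a)])

def policy_evaluation(theta=1e-10):
    """Repeat the Bellman expectation backup until v stops changing."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v_pi(s) = sum_a pi(a | s) q_pi(s, a)
            new_v = sum(pi[s][a] * q_from_v(s, a, v) for a in actions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

v_pi = policy_evaluation()
print({s: round(val, 3) for s, val in v_pi.items()})
```

The two sums implement $q_{\pi}(s,a)=\sum_{s',r} p(s',r\mid s,a)\left[r+\gamma v_{\pi}(s')\right]$ and $v_{\pi}(s)=\sum_a \pi(a\mid s)\,q_{\pi}(s,a)$ directly; because $\gamma<1$, the repeated backup is a contraction and the loop terminates.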
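## Sketch: Value Iteration as a Fixed-Point Computation

The fixed-point reading of the Bellman optimality equation can be checked numerically by treating the equation as an update rule (value iteration). The sketch below reuses the same hypothetical two-state MDP as the previous sketch; the tolerance `theta` and all names are again illustrative assumptions.

```python
# Minimal sketch of value iteration: iterate the Bellman optimality backup
# until it reaches its fixed point v_*. Same hypothetical two-state MDP
# as in the policy-evaluation sketch above.

GAMMA = 0.9

p = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}
states = ["s0", "s1"]
actions = ["stay", "go"]

def backup(s, a, v):
    """sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]."""
    return sum(prob * (r + GAMMA * v[s2]) for prob, s2, r in p[(s, a)])

def value_iteration(theta=1e-10):
    """Repeatedly apply v(s) <- max_a backup(s, a, v) until convergence."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(backup(s, a, v) for a in actions)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

v_star = value_iteration()

# Fixed-point check: plugging v_* back into the right-hand side of the
# Bellman optimality equation reproduces v_* (up to the tolerance theta).
rhs = {s: max(backup(s, a, v_star) for a in actions) for s in states}
print({s: round(v_star[s], 3) for s in states})
print({s: round(rhs[s], 3) for s in states})

# A policy that is greedy with respect to v_* is optimal.
pi_star = {s: max(actions, key=lambda a: backup(s, a, v_star)) for s in states}
print(pi_star)
```

The final check re-applies the right-hand side of the optimality equation to the converged values and gets the same numbers back, which is exactly the self-consistency described above; acting greedily with respect to $v_{*}$ then yields an optimal policy.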