# Dynamic Programming
The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies.
## Policy Improvement Theorem
Let $\pi$ and $\pi^{\prime}$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$,
$
q_{\pi}\left(s, \pi^{\prime}(s)\right) \geq v_{\pi}(s)
$
Then the policy $\pi^{\prime}$ must be as good as, or better than, $\pi$. That is, it must obtain greater or equal expected return from all states $s \in \mathcal{S}:$
$
v_{\pi^{\prime}}(s) \geq v_{\pi}(s)
$
Moreover, if the first inequality is strict at some state, then the second inequality must also be strict at that state.
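The intuition behind the proof: start from the assumed inequality and keep expanding $q_{\pi}$ with one step of the dynamics, reapplying the assumption at each step (a standard sketch, written in expectation notation):
$
\begin{aligned}
v_{\pi}(s) &\leq q_{\pi}\left(s, \pi^{\prime}(s)\right) \\
&=\mathbb{E}\left[R_{t+1}+\gamma v_{\pi}\left(S_{t+1}\right) \mid S_{t}=s, A_{t}=\pi^{\prime}(s)\right] \\
&\leq \mathbb{E}\left[R_{t+1}+\gamma q_{\pi}\left(S_{t+1}, \pi^{\prime}\left(S_{t+1}\right)\right) \mid S_{t}=s, A_{t}=\pi^{\prime}(s)\right] \\
&\leq \cdots \\
&\leq \mathbb{E}_{\pi^{\prime}}\left[R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\cdots \mid S_{t}=s\right]=v_{\pi^{\prime}}(s)
\end{aligned}
$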
## Policy Iteration
1. Initialization
$V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$
2. Policy Evaluation
Repeat until $\Delta<\theta$ (a small positive number)
$\Delta \leftarrow 0$
For each $s \in \mathcal{S}:$
$
\begin{aligned}
&v \leftarrow V(s) \\
&V(s) \leftarrow \sum_{s^{\prime}} \mathcal{P}_{s s^{\prime}}^{\pi(s)}\left[\mathcal{R}_{s s^{\prime}}^{\pi(s)}+\gamma V\left(s^{\prime}\right)\right] \\
&\Delta \leftarrow \max (\Delta,|v-V(s)|)
\end{aligned}
$
3. Policy Improvement
policy-stable $\leftarrow$ true
For each $s \in \mathcal{S}$ :
$
\begin{aligned}
&b \leftarrow \pi(s) \\
&\pi(s) \leftarrow \arg \max _{a} \sum_{s^{\prime}} \mathcal{P}_{s s^{\prime}}^{a}\left[\mathcal{R}_{s s^{\prime}}^{a}+\gamma V\left(s^{\prime}\right)\right] \\
&\text{If } b \neq \pi(s), \text{then policy-stable} \leftarrow \text{false}
\end{aligned}
$
If policy-stable, then stop; else go to 2
Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.
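A minimal sketch of this algorithm in Python, assuming the dynamics are given as NumPy arrays `P[s, a, s']` (transition probabilities) and `R[s, a, s']` (expected rewards); these array names and shapes are illustrative assumptions, not part of the original pseudocode:
```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    """Tabular policy iteration.

    P[s, a, s'] -- transition probabilities, shape (nS, nA, nS)
    R[s, a, s'] -- expected rewards, same shape
    """
    nS, nA, _ = P.shape
    V = np.zeros(nS)              # 1. Initialization: arbitrary value function
    pi = np.zeros(nS, dtype=int)  #    and arbitrary deterministic policy

    while True:
        # 2. Policy evaluation: sweep until the largest change is below theta
        while True:
            delta = 0.0
            for s in range(nS):
                v = V[s]
                a = pi[s]
                V[s] = np.sum(P[s, a] * (R[s, a] + gamma * V))
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break

        # 3. Policy improvement: act greedily with respect to V
        policy_stable = True
        for s in range(nS):
            b = pi[s]
            q = [np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(nA)]
            pi[s] = int(np.argmax(q))
            if b != pi[s]:
                policy_stable = False

        if policy_stable:
            return V, pi
```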
## Value Iteration
One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set.
In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration. When policy evaluation is stopped after just one sweep (one backup of each state), the resulting algorithm is called value iteration.
Initialize $V$ arbitrarily, e.g., $V(s)=0$, for all $s \in \mathcal{S}^{+}$
1. Repeat until $\Delta<\theta$ (a small positive number)
$\Delta \leftarrow 0$
For each $s \in \mathcal{S}$ :
$v \leftarrow V(s)$
$V(s) \leftarrow \max _{a} \sum_{s^{\prime}} \mathcal{P}_{s s^{\prime}}^{a}\left[\mathcal{R}_{s s^{\prime}}^{a}+\gamma V\left(s^{\prime}\right)\right]$
$\Delta \leftarrow \max (\Delta,|v-V(s)|)$
2. Output a deterministic policy, $\pi$, such that
$\pi(s)=\arg \max _{a} \sum_{s^{\prime}} \mathcal{P}_{s s^{\prime}}^{a}\left[\mathcal{R}_{s s^{\prime}}^{a}+\gamma V\left(s^{\prime}\right)\right]$
Value iteration is obtained simply by turning the [[Bellman Equation and Value Functions#Bellman Optimality Equation|Bellman Optimality Equation]] into an update rule.
Note how the value iteration backup is identical to the policy evaluation backup except that it requires the maximum to be taken over all actions.
Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement.
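Under the same assumed `P[s, a, s']` / `R[s, a, s']` array convention as the policy iteration sketch above, a corresponding sketch of value iteration:
```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Tabular value iteration.

    P[s, a, s'] -- transition probabilities, shape (nS, nA, nS)
    R[s, a, s'] -- expected rewards, same shape
    """
    nS, nA, _ = P.shape
    V = np.zeros(nS)  # arbitrary initialization, e.g. all zeros

    # 1. Sweep until the largest change in V is below theta
    while True:
        delta = 0.0
        for s in range(nS):
            v = V[s]
            # Bellman optimality backup: max over actions instead of following pi
            V[s] = max(np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(nA))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break

    # 2. Output the deterministic policy that is greedy with respect to V
    pi = np.array([
        np.argmax([np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(nA)])
        for s in range(nS)
    ])
    return V, pi
```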
## Value iteration vs Policy iteration
Each iteration of value iteration is cheaper: it performs only a single sweep of the state space, rather than running policy evaluation to convergence (which may itself take many sweeps) before improving the policy.
Policy iteration typically needs fewer iterations to converge, because the policy often stabilizes before the value function does, but each of its iterations is more expensive than an iteration of value iteration.
## Generalized Policy Iteration
GPI is the general idea of two interacting processes revolving around an approximate policy and an approximate value function. One process takes the policy as given and performs some form of policy evaluation, changing the value function to be more like the true value function for the policy. The other process takes the value function as given and performs some form of policy improvement, changing the policy to make it better, assuming that the value function is its value function.
Although each process changes the basis for the other, overall they work together to find a joint solution: a policy and value function that are unchanged by either process and, consequently, are optimal.
In some cases, GPI can be proved to converge, most notably for the classical DP methods above. In other cases convergence has not been proved, but still the idea of GPI improves our understanding of the methods.
---
## References
1. Chapter 4, RL:AI 2nd Edition, Sutton and Barto