# Temporal Difference Learning
Whereas Monte Carlo methods must wait until the end of the episode to determine the target for the update ($G_t$), TD methods need to wait only until the next time step, effectively _bootstrapping_ off of existing estimates, as DP methods do.
## Advantages of TD methods
1. No model of the environment is required, an advantage over DP methods.
2. Naturally implemented in an online, fully incremental way, an advantage over MC methods.
3. Even though they bootstrap and use no model, convergence guarantees still hold.
4. Typically converge faster than constant-α MC methods on stochastic tasks.
## TD(0)
The simplest temporal difference update is known as TD(0) or one-step TD.
$
V\left(S_{t}\right) \leftarrow V\left(S_{t}\right)+\alpha\left[R_{t+1}+\gamma V\left(S_{t+1}\right)-V\left(S_{t}\right)\right]
$
Under batch updating, $\mathrm{TD}(0)$ converges deterministically to a single answer independent of the step-size parameter, $\alpha$, as long as $\alpha$ is chosen to be sufficiently small.
Batch MC methods always find the estimates that minimize mean-squared error on the training set, whereas batch TD(0) finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process (see [[Maximum Likelihood Estimation]]).
This helps explain why TD methods converge more quickly than MC methods: batch TD(0) computes the true _certainty-equivalence estimate_, i.e. the estimate of the value function that would be exactly correct if the model were exactly correct.
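As a concrete sketch, the TD(0) update can be written as a small Python loop. The environment interface below (`env.reset()` and `env.step(action)` returning `(next_state, reward, done)`) and the fixed `policy` callable are assumptions for illustration, not part of the source.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Estimate the state-value function of a fixed policy with TD(0)."""
    V = defaultdict(float)  # V(s) initialized to 0 for every state
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0): bootstrap off the current estimate of the next state's value
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```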
## Sarsa
Sarsa is an on-policy TD control method, defined as:
$
Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma Q\left(S_{t+1}, A_{t+1}\right)-Q\left(S_{t}, A_{t}\right)\right]
$
Sarsa converges with probability 1 to an optimal policy and action-value function as long as all state–action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (which can be arranged, for example, with $\epsilon$-greedy policies by setting $\epsilon$ = 1/t).
![[Sarsa.png]]
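A minimal Sarsa control sketch, assuming the same environment interface as above plus an `env.actions` list of available actions. The key point is that the update bootstraps off $Q(S_{t+1}, A_{t+1})$ for the action the ε-greedy behaviour policy actually selects next.

```python
import random
from collections import defaultdict

def sarsa(env, num_episodes=1000, alpha=0.5, gamma=1.0, epsilon=0.1):
    """On-policy TD control: the update uses the action actually taken next."""
    Q = defaultdict(float)  # Q[(state, action)] initialized to 0

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)
            # Sarsa target: Q(S', A') for the action the policy will actually take
            target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```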
## Q-Learning
Q-Learning is an off-policy TD control algorithm, defined as:
$
Q\left(S_{t}, A_{t}\right) \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \max _{a} Q\left(S_{t+1}, a\right)-Q\left(S_{t}, A_{t}\right)\right]
$
In this case, the learned action-value function, $Q$, directly approximates $q_{*}$, the optimal action-value function, independent of the policy being followed. The policy still has an effect in that it determines which state-action pairs are visited and updated.
However, all that is required for correct convergence is that all pairs continue to be updated.
![[Q-Learning.png]]
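Relative to the Sarsa sketch above, only the target changes: Q-learning bootstraps off the greedy action at the next state, whatever the behaviour policy goes on to do. A fragment of that inner-loop step, reusing the names assumed in the Sarsa sketch:

```python
# Inside the same episode loop as the Sarsa sketch (assumed interface):
next_state, reward, done = env.step(action)
# Q-learning target: maximize over actions at the next state (greedy target),
# independent of the epsilon-greedy action that will actually be selected
greedy_value = max(Q[(next_state, a)] for a in env.actions)
target = reward + gamma * greedy_value * (not done)
Q[(state, action)] += alpha * (target - Q[(state, action)])
state = next_state
action = epsilon_greedy(state)  # behaviour policy still explores
```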
Why is Q-Learning an off-policy algorithm while Sarsa is on-policy?
- Q-learning is off-policy because it updates its Q-values using the Q-value of the next state and the greedy action. In other words, it estimates the return for state–action pairs assuming a greedy policy is followed, even though it is not actually following a greedy policy. Sarsa is on-policy because it updates its Q-values using the Q-value of the next state and the action selected by the current policy; it estimates the return assuming the current policy continues to be followed.
Suppose action selection is greedy. Is Q-learning then exactly the same algorithm as Sarsa? Will they make the same action selections and weight updates?
- They appear to be the same method in general, but they can differ in occasional situations. Consider a step where $S=S^{\prime}$. During its update, Sarsa uses the greedy action $A^{\prime}$ that was already chosen from $S^{\prime}$ and advances with $Q\left(S^{\prime}, A^{\prime}\right)$, whereas Q-learning uses the freshly updated $Q$ to re-choose the greedy action $A^{*}$. Because $A^{\prime}$ may no longer be the action maximizing $Q\left(S^{\prime}, \cdot\right)$ after the update, the two methods can make different updates and action selections.
## Expected Sarsa
Expected Sarsa is similar to Q-learning but, instead of taking the maximum over next state–action pairs, it uses the expected value, taking into account how likely each action is under the current policy:
$
\begin{aligned}
Q\left(S_{t}, A_{t}\right) & \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \mathbb{E}_{\pi}\left[Q\left(S_{t+1}, A_{t+1}\right) \mid S_{t+1}\right]-Q\left(S_{t}, A_{t}\right)\right] \\
& \leftarrow Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \sum_{a} \pi\left(a \mid S_{t+1}\right) Q\left(S_{t+1}, a\right)-Q\left(S_{t}, A_{t}\right)\right]
\end{aligned}
$
Expected Sarsa is more complex computationally than Sarsa but, in return, it eliminates the variance due to the random selection of $A_{t+1}$. Given the same amount of experience we might expect it to perform slightly better than Sarsa, and indeed it generally does.
Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa. Except for the small additional computational cost, Expected Sarsa may completely dominate both of these better-known TD control algorithms.
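A sketch of how the Expected Sarsa target could be computed for an ε-greedy target policy, assuming the dict-style `Q[(state, action)]` convention from the earlier sketches. With `epsilon=0` the expectation collapses onto the greedy action and the target reduces to Q-learning's, which is the sense in which Expected Sarsa subsumes it.

```python
def expected_sarsa_target(Q, next_state, reward, done, actions,
                          gamma=1.0, epsilon=0.1):
    """Expected Sarsa target under an epsilon-greedy target policy derived from Q."""
    if done:
        return reward
    q_next = [Q[(next_state, a)] for a in actions]
    greedy_idx = max(range(len(actions)), key=lambda i: q_next[i])
    # epsilon-greedy probabilities: epsilon spread uniformly over all actions,
    # with the remaining (1 - epsilon) mass on the greedy action
    probs = [epsilon / len(actions)] * len(actions)
    probs[greedy_idx] += 1.0 - epsilon
    expected_q = sum(p * q for p, q in zip(probs, q_next))
    return reward + gamma * expected_q
```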
## Double Q-Learning
In Q-Learning, there will be a positive maximization bias if we use the maximum of the estimates as an estimate of the maximum of the true values. One way to view the problem is that it is due to using the same samples (plays) both to determine the maximizing action and to estimate its value. Suppose we divided the plays in two sets and used them to learn two independent estimates, call them $Q_{1}(a)$ and $Q_{2}(a)$, each an estimate of the true value $q(a)$, for all $a \in \mathcal{A}$. We could then use one estimate, say $Q_{1}$, to determine the maximizing action $A^{*}=\arg \max _{a} Q_{1}(a)$, and the other, $Q_{2}$, to provide the estimate of its value, $Q_{2}\left(A^{*}\right)=Q_{2}\left(\operatorname{argmax}_{a} Q_{1}(a)\right)$. This estimate will then be unbiased in the sense that $\mathbb{E}\left[Q_{2}\left(A^{*}\right)\right]=q\left(A^{*}\right)$. We can also repeat the process with the role of the two estimates reversed to yield a second unbiased estimate $Q_{1}\left(\operatorname{argmax}_{a} Q_{2}(a)\right)$. This is the idea of double learning:
$
Q_{1}\left(S_{t}, A_{t}\right) \leftarrow Q_{1}\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma Q_{2}\left(S_{t+1}, \operatorname{argmax}_{a} Q_{1}\left(S_{t+1}, a\right)\right)-Q_{1}\left(S_{t}, A_{t}\right)\right]
$
Note that although we learn two estimates, only one estimate is updated on each play; double learning doubles the memory requirements, but does not increase the amount of computation per step.
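A sketch of a single double Q-learning update, assuming two dict-style tables `Q1` and `Q2`; a coin flip decides which table selects the maximizing action and which evaluates it. For control, the behaviour policy can be ε-greedy with respect to the sum (or average) of the two tables.

```python
import random

def double_q_update(Q1, Q2, state, action, reward, next_state, done,
                    actions, alpha=0.1, gamma=1.0):
    """One double Q-learning update; the roles of Q1 and Q2 swap at random."""
    if random.random() < 0.5:
        Q_sel, Q_eval = Q1, Q2   # Q1 selects the maximizing action, Q2 evaluates it
    else:
        Q_sel, Q_eval = Q2, Q1
    if done:
        target = reward
    else:
        best = max(actions, key=lambda a: Q_sel[(next_state, a)])
        target = reward + gamma * Q_eval[(next_state, best)]
    Q_sel[(state, action)] += alpha * (target - Q_sel[(state, action)])
```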
## n-step TD
n-step TD methods are a best-of-both-worlds generalization of TD(0) and MC methods, spanning the spectrum between single-step bootstrapping and full-episode returns.
$
V_{t+n}\left(S_{t}\right) \doteq V_{t+n-1}\left(S_{t}\right)+\alpha\left[G_{t: t+n}-V_{t+n-1}\left(S_{t}\right)\right], \quad 0 \leq t<T
$
where the n-step return is defined as
$
G_{t: t+n} \doteq R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{n-1} R_{t+n}+\gamma^{n} V_{t+n-1}\left(S_{t+n}\right)
$
The methods that use n-step updates are still TD methods because they still change an earlier estimate based on how it differs from a later estimate. Now the later estimate is not one step later, but n steps later.
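A sketch of the n-step return computation, assuming a dict-style `V` and that the caller has buffered the n rewards observed after time t (fewer if the episode ended first):

```python
def n_step_return(rewards, V, final_state, done, gamma=1.0):
    """Compute G_{t:t+n} from rewards R_{t+1}..R_{t+n} and the bootstrap V(S_{t+n}).

    If the episode terminated within these steps (`done` is True),
    no bootstrap term is added and this is just the ordinary return.
    """
    G = 0.0
    for k, r in enumerate(rewards):                     # r = R_{t+1+k}
        G += (gamma ** k) * r
    if not done:
        G += (gamma ** len(rewards)) * V[final_state]   # gamma^n * V(S_{t+n})
    return G
```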
## n-step Sarsa
We redefine n-step returns (update targets) in terms of estimated action values:
$
G_{t: t+n} \doteq R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{n-1} R_{t+n}+\gamma^{n} Q_{t+n-1}\left(S_{t+n}, A_{t+n}\right), \quad n \geq 1,0 \leq t<T-n
$
with $G_{t: t+n} \doteq G_{t}$ if $t+n \geq T$. The natural algorithm is then
$
Q_{t+n}\left(S_{t}, A_{t}\right) \doteq Q_{t+n-1}\left(S_{t}, A_{t}\right)+\alpha\left[G_{t: t+n}-Q_{t+n-1}\left(S_{t}, A_{t}\right)\right], \quad 0 \leq t<T
$
while the values of all other states remain unchanged: $Q_{t+n}(s, a)=Q_{t+n-1}(s, a)$, for all $s, a$ such that $s \neq S_{t}$ or $a \neq A_{t}$. This is the algorithm we call $n$-step Sarsa.
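The n-step Sarsa target is the same computation with an action-value bootstrap; a brief sketch using the conventions above:

```python
def n_step_sarsa_return(rewards, Q, final_state, final_action, done, gamma=1.0):
    """G_{t:t+n} with an action-value bootstrap: gamma^n * Q(S_{t+n}, A_{t+n})."""
    G = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if not done:
        G += (gamma ** len(rewards)) * Q[(final_state, final_action)]
    return G
```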
## n-step Off-policy Learning
For the off-policy version, we weight the update by the relative probability, under the two policies, of taking the actions that were actually taken:
$
V_{t+n}\left(S_{t}\right) \doteq V_{t+n-1}\left(S_{t}\right)+\alpha \rho_{t: t+n-1}\left[G_{t: t+n}-V_{t+n-1}\left(S_{t}\right)\right], \quad 0 \leq t<T
$
where $\rho_{t: t+n-1}$, called the importance sampling ratio, is the relative probability under the two policies of taking the $n$ actions from $A_{t}$ to $A_{t+n-1}$:
$
\rho_{t: h} \doteq \prod_{k=t}^{\min (h, T-1)} \frac{\pi\left(A_{k} \mid S_{k}\right)}{b\left(A_{k} \mid S_{k}\right)}
$
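A sketch of the ratio computation, assuming `pi` and `b` are callables returning the probability of an action in a state, and `trajectory` is a buffered list of (state, action) pairs indexed by time step:

```python
def importance_ratio(trajectory, pi, b, t, h, T):
    """rho_{t:h}: product over k = t .. min(h, T-1) of pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        state, action = trajectory[k]
        rho *= pi(action, state) / b(action, state)
    return rho
```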
---
## References
1. Chapters 6 and 7, _Reinforcement Learning: An Introduction_ (2nd Edition), Sutton and Barto