# On-policy learning with approximation
## The Prediction Objective (VE)
In approximation methods, by assumption we have far more states than weights, so making one state's estimate more accurate invariably means making others' less accurate. We are then obligated to say which states we care most about. We must specify a state distribution $\mu(s) \geq 0, \sum_{s} \mu(s)=1$, representing how much we care about the error in each state $s$. The mean squared value error ($\overline{\mathrm{VE}}$) is then given as
$
\overline{\mathrm{VE}}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s)\left[v_{\pi}(s)-\hat{v}(s, \mathbf{w})\right]^{2}
$
Generally, $\mu(s)$ is chosen to be the fraction of time spent in state $s$ under the policy being evaluated (the on-policy distribution).
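As a tiny numerical sketch (the values of $\mu$, $v_\pi$, and $\hat{v}$ below are made up for illustration), the $\overline{\mathrm{VE}}$ is just a $\mu$-weighted sum of squared value errors:
```python
import numpy as np

# Hypothetical 4-state example: on-policy distribution mu, true values v_pi,
# and current approximate values v_hat.
mu    = np.array([0.4, 0.3, 0.2, 0.1])   # fraction of time spent in each state
v_pi  = np.array([1.0, 2.0, 0.5, -1.0])  # true state values under pi
v_hat = np.array([0.8, 2.2, 0.4, -0.5])  # approximate values v_hat(s, w)

ve = np.sum(mu * (v_pi - v_hat) ** 2)    # VE(w)
print(ve)                                # ≈ 0.055
```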
## Stochastic-gradient and Semi-gradient Methods
The SGD method for minimizing the prediction objective $\overline{\mathrm{VE}}$ is
$
\begin{aligned}
\mathbf{w}_{t+1} & \doteq \mathbf{w}_{t}-\frac{1}{2} \alpha \nabla\left[v_{\pi}\left(S_{t}\right)-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)\right]^{2} \\
&=\mathbf{w}_{t}+\alpha\left[v_{\pi}\left(S_{t}\right)-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)\right] \nabla \hat{v}\left(S_{t}, \mathbf{w}_{t}\right)
\end{aligned}
$
Because $v_{\pi}\left(S_{t}\right)$ is unknown, we approximate it by substituting a target $U_t$ in its place. This yields the general SGD method for state-value prediction:
$
\mathbf{w}_{t+1} \doteq \mathbf{w}_{t}+\alpha\left[U_{t}-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)\right] \nabla \hat{v}\left(S_{t}, \mathbf{w}_{t}\right)
$
If $U_{t}$ is an unbiased estimate, that is, if $\mathbb{E}\left[U_{t} \mid S_{t}=s\right]=v_{\pi}(s)$, for each $t$, then $\mathbf{w}_{t}$ is guaranteed to converge to a local optimum under the usual stochastic approximation conditions for decreasing $\alpha$.
Because the true value of a state is the expected value of the return following it, the Monte Carlo target $U_t \doteq G_t$ is by definition an unbiased estimate of $v_{\pi}(S_t)$. With this choice, the general SGD method above converges to a locally optimal approximation to $v_{\pi}(S_t)$. Thus, the gradient-descent version of Monte Carlo state-value prediction is guaranteed to find a locally optimal solution.
![[Gradient Monte Carlo to estimate state-value.png]]
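A minimal Python sketch of gradient Monte Carlo prediction is given below, assuming linear function approximation and a hypothetical environment/policy/feature interface (`env.reset`, `env.step`, `policy`, `x`) that is not specified in the text:
```python
import numpy as np

def gradient_mc_prediction(env, policy, x, d, alpha=2e-5, gamma=1.0, num_episodes=1000):
    """Gradient Monte Carlo prediction with linear v_hat(s, w) = w . x(s).

    Assumed (hypothetical) interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done), policy(state) -> action,
    x(state) -> length-d feature vector.
    """
    w = np.zeros(d)
    for _ in range(num_episodes):
        # Generate one episode following the policy.
        states, rewards = [], []
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            states.append(s)
            rewards.append(r)
            s = s_next
        # Work backwards to form the return G_t, then apply the SGD update.
        G = 0.0
        for t in reversed(range(len(states))):
            G = gamma * G + rewards[t]
            features = x(states[t])                  # gradient of v_hat in the linear case
            w += alpha * (G - w @ features) * features
    return w
```
Because the target $G_t$ does not depend on $\mathbf{w}$, this is a true SGD method.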
Bootstrapping methods are not in fact instances of true gradient descent (Barnard, 1993). They take into account the effect of changing the weight vector $\mathbf{w}$ on the estimate, but ignore its effect on the target. They include only a part of the gradient, which is why they are called semi-gradient methods.
Although semi-gradient (bootstrapping) methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case. Moreover, they offer important advantages:
- They typically enable significantly faster learning.
- They enable learning to be continual and online, without waiting for the end of an episode. This enables them to be used on continuing problems and provides computational advantages.
![[Semi-gradient TD for estimating state value.png]]
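A matching sketch of semi-gradient TD(0) under the same assumed interface as above; the target $R_{t+1}+\gamma\hat{v}(S_{t+1},\mathbf{w})$ is treated as a constant, so only $\nabla \hat{v}(S_t,\mathbf{w})=\mathbf{x}(S_t)$ enters the update:
```python
import numpy as np

def semi_gradient_td0(env, policy, x, d, alpha=0.05, gamma=1.0, num_episodes=1000):
    """Semi-gradient TD(0) prediction with linear v_hat(s, w) = w . x(s)."""
    w = np.zeros(d)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # Bootstrapped target; the value of a terminal state is defined to be 0.
            target = r + (0.0 if done else gamma * (w @ x(s_next)))
            w += alpha * (target - w @ x(s)) * x(s)   # semi-gradient update
            s = s_next
    return w
```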
## Linear Methods
Linear methods approximate the state-value function by the inner product between the weight vector $\mathbf{w}$ and a feature vector $\mathbf{x}(s)$:
$
\hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^{\top} \mathbf{x}(s) \doteq \sum_{i=1}^{d} w_{i} x_{i}(s)
$
The gradient of the approximate value function with respect to $\mathbf{w}$ is then simply the feature vector:
$
\nabla \hat{v}(s, \mathbf{w})=\mathbf{x}(s)
$
Therefore, the general SGD update reduces to a particularly simple form:
$
\mathbf{w}_{t+1} \doteq \mathbf{w}_{t}+\alpha\left[U_{t}-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)\right] \mathbf{x}\left(S_{t}\right)
$
In particular, in the linear case there is only one optimum (or, in degenerate cases, one set of equally good optima), and thus any method that is guaranteed to converge to or near a local optimum is automatically guaranteed to converge to or near the global optimum.
The gradient Monte Carlo algorithm presented above converges to the global optimum of the $\overline{\mathrm{VE}}$ under linear function approximation if the step-size parameter $\alpha$ is reduced over time according to the usual conditions.
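As a concrete example of a linear feature construction, state aggregation maps each state to a one-hot indicator of its group, so each weight $w_i$ is the shared value estimate for one group (the 1000-state / 10-group split below mirrors the random-walk example in the book; the helper name is my own):
```python
import numpy as np

def state_aggregation_features(num_states, num_groups):
    """Return a feature map x(s) that one-hot encodes the group containing state s."""
    group_size = num_states // num_groups
    def x(s):
        features = np.zeros(num_groups)
        features[min(s // group_size, num_groups - 1)] = 1.0
        return features
    return x

# With these features the linear SGD update touches only the weight of the visited group.
x = state_aggregation_features(1000, 10)   # e.g. 1000 states aggregated into 10 groups
```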
The semi-gradient TD(0) algorithm also converges under linear function approximation, but this does not follow from general results on SGD; a separate theorem is necessary. The weight vector converged to is also not the global optimum, but rather a point near the local optimum:
$
\mathbf{w}_{\mathrm{TD}} \doteq \mathbf{A}^{-1} \mathbf{b}
$
where, writing $\mathbf{x}_t \doteq \mathbf{x}(S_t)$,
$
\mathbf{b} \doteq \mathbb{E}\left[R_{t+1} \mathbf{x}_{t}\right] \in \mathbb{R}^{d} \quad \text { and } \quad \mathbf{A} \doteq \mathbb{E}\left[\mathbf{x}_{t}\left(\mathbf{x}_{t}-\gamma \mathbf{x}_{t+1}\right)^{\top}\right] \in \mathbb{R}^{d \times d}
$
This quantity is called the TD fixed point. In fact linear semi-gradient TD(0) converges to this point.
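The TD fixed point is not the minimum of the $\overline{\mathrm{VE}}$, but its error is bounded relative to the best achievable:
$
\overline{\mathrm{VE}}\left(\mathbf{w}_{\mathrm{TD}}\right) \leq \frac{1}{1-\gamma} \min _{\mathbf{w}} \overline{\mathrm{VE}}(\mathbf{w})
$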
## Least-Squares TD
Least-Squares TD (LSTD) forms sample estimates of $\mathbf{A}$ and $\mathbf{b}$ and computes the TD fixed point directly:
$
\widehat{\mathbf{A}}_{t} \doteq \sum_{k=0}^{t-1} \mathbf{x}_{k}\left(\mathbf{x}_{k}-\gamma \mathbf{x}_{k+1}\right)^{\top}+\varepsilon \mathbf{I} \quad \text { and } \quad \widehat{\mathbf{b}}_{t} \doteq \sum_{k=0}^{t-1} R_{k+1} \mathbf{x}_{k}
$
$
\mathbf{w}_{t} \doteq \widehat{\mathbf{A}}_{t}^{-1} \widehat{\mathbf{b}}_{t}
$
This algorithm is the most data-efficient form of linear $\mathrm{TD}(0)$, but it is also more expensive computationally: maintaining $\widehat{\mathbf{A}}_{t}^{-1}$ incrementally costs $O(d^2)$ per step, whereas semi-gradient $\mathrm{TD}(0)$ requires memory and per-step computation that is only $O(d)$.
![[LSTD.png]]
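A sketch of LSTD that maintains $\widehat{\mathbf{A}}_{t}^{-1}$ incrementally with the Sherman-Morrison formula (so each step costs $O(d^2)$ rather than $O(d^3)$); the environment/feature interface is assumed as in the earlier sketches:
```python
import numpy as np

def lstd(env, policy, x, d, epsilon=1.0, gamma=1.0, num_episodes=100):
    """LSTD: estimate the TD fixed point w = A^{-1} b directly from data."""
    A_inv = np.eye(d) / epsilon          # inverse of eps*I, updated incrementally
    b = np.zeros(d)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            phi = x(s)
            phi_next = np.zeros(d) if done else x(s_next)
            u = phi - gamma * phi_next
            # Sherman-Morrison rank-one update for A <- A + phi u^T.
            v = A_inv @ phi
            uA = A_inv.T @ u             # (u^T A_inv)^T
            A_inv -= np.outer(v, uA) / (1.0 + u @ v)
            b += r * phi
            s = s_next
    return A_inv @ b                     # w_t = A_t^{-1} b_t
```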
## Episodic Semi-gradient Sarsa
Semi-gradient prediction extends naturally to action values and to on-policy control. The update for episodic semi-gradient one-step Sarsa is:
$
\mathbf{w}_{t+1} \doteq \mathbf{w}_{t}+\alpha\left[R_{t+1}+\gamma \hat{q}\left(S_{t+1}, A_{t+1}, \mathbf{w}_{t}\right)-\hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t}\right)\right] \nabla \hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t}\right)
$
![[Semi-gradient SARSA.png]]
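A sketch of episodic semi-gradient one-step Sarsa with linear action-value features and an $\varepsilon$-greedy policy; the feature map `x(s, a)` and the environment interface are assumptions:
```python
import numpy as np

def semi_gradient_sarsa(env, x, d, num_actions, alpha=0.05, gamma=1.0,
                        epsilon=0.1, num_episodes=1000):
    """Episodic semi-gradient one-step Sarsa with linear q_hat(s, a, w) = w . x(s, a)."""
    w = np.zeros(d)

    def q_hat(s, a):
        return w @ x(s, a)

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax([q_hat(s, a) for a in range(num_actions)]))

    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            if done:
                # Terminal transition: no bootstrap term.
                w += alpha * (r - q_hat(s, a)) * x(s, a)
                break
            a_next = epsilon_greedy(s_next)
            target = r + gamma * q_hat(s_next, a_next)
            w += alpha * (target - q_hat(s, a)) * x(s, a)   # semi-gradient Sarsa update
            s, a = s_next, a_next
    return w
```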
For the $n$-step variant, the $n$-step return is
$
G_{t: t+n} \doteq R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{n-1} R_{t+n}+\gamma^{n} \hat{q}\left(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1}\right), \quad t+n<T
$
with $G_{t: t+n} \doteq G_{t}$ if $t+n \geq T$, as usual. The $n$-step update equation is
$
\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1}+\alpha\left[G_{t: t+n}-\hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t+n-1}\right)\right] \nabla \hat{q}\left(S_{t}, A_{t}, \mathbf{w}_{t+n-1}\right), \quad 0 \leq t<T
$
![[Semi gradient n-step Sarsa.png]]
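A sketch of the episodic $n$-step version under the same assumed interface; episode data is stored so that the update at time $\tau = t - n + 1$ can use the $n$-step return $G_{\tau:\tau+n}$:
```python
import numpy as np

def n_step_semi_gradient_sarsa(env, x, d, num_actions, n=4, alpha=0.05,
                               gamma=1.0, epsilon=0.1, num_episodes=1000):
    """Episodic n-step semi-gradient Sarsa with linear q_hat(s, a, w) = w . x(s, a)."""
    w = np.zeros(d)

    def q_hat(s, a):
        return w @ x(s, a)

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax([q_hat(s, a) for a in range(num_actions)]))

    for _ in range(num_episodes):
        states = [env.reset()]
        actions = [epsilon_greedy(states[0])]
        rewards = [0.0]                        # placeholder so rewards[i] is R_i
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(actions[t])
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    actions.append(epsilon_greedy(s_next))
            tau = t - n + 1                    # time whose estimate is being updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * q_hat(states[tau + n], actions[tau + n])
                features = x(states[tau], actions[tau])
                w += alpha * (G - w @ features) * features
            if tau == T - 1:
                break
            t += 1
    return w
```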
---
## References
1. Sutton, R. S., and Barto, A. G., *Reinforcement Learning: An Introduction*, 2nd Edition, Chapters 9 & 10.