# Off-policy learning with approximation

The extension to function approximation turns out to be significantly different and harder for off-policy learning than it is for on-policy learning. The tabular off-policy methods readily extend to semi-gradient algorithms, but these algorithms do not converge as robustly as they do under on-policy training.

Recall that in off-policy learning we seek to learn a value function for a target policy $\pi$, given data due to a different behavior policy $b$. In the control case, action values are learned, and both policies typically change during learning: $\pi$ being the greedy policy with respect to $\hat{q}$, and $b$ being something more exploratory such as the $\varepsilon$-greedy policy with respect to $\hat{q}$.

The challenge of off-policy learning comes in two parts:

- The target of the update, which arises already in the tabular case. [[Importance Sampling]] is used to deal with this.
- The distribution of the updates, which arises only with function approximation.

Dealing with the first challenge, we have semi-gradient off-policy TD(0):

$$
\mathbf{w}_{t+1} \doteq \mathbf{w}_{t}+\alpha \rho_{t} \delta_{t} \nabla \hat{v}\left(S_{t}, \mathbf{w}_{t}\right)
$$

where

$$
\rho_{t} \doteq \rho_{t:t}=\frac{\pi\left(A_{t} \mid S_{t}\right)}{b\left(A_{t} \mid S_{t}\right)}
$$

and $\delta_{t}$ is defined appropriately depending on whether the problem is episodic and discounted, or continuing and undiscounted using average reward:

$$
\delta_{t} \doteq R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right)-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right) \qquad \text{(episodic, discounted)}
$$

$$
\delta_{t} \doteq R_{t+1}-\bar{R}_{t}+\hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right)-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right) \qquad \text{(continuing, average reward)}
$$
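As a concrete illustration of the update, here is a minimal Python sketch of semi-gradient off-policy TD(0) with linear features. The toy environment, feature map, step size, and both policies (`pi`, `b`) are illustrative assumptions, not from the source; only the update rule itself follows the equations above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP for illustration only: states 0..4 on a line, two actions
# (0 = left, 1 = right); reaching state 4 ends the episode with reward 1.
n_states, n_actions, n_features = 5, 2, 4
gamma, alpha = 0.9, 0.01

# Linear value function v_hat(s, w) = x(s)^T w with a fixed random feature map x.
X = rng.normal(size=(n_states, n_features))

# Target policy pi (mostly "right") and behavior policy b (uniform exploration),
# stored as state-by-action probability tables.
pi = np.tile([0.1, 0.9], (n_states, 1))
b = np.tile([0.5, 0.5], (n_states, 1))

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next == n_states - 1
    return s_next, reward, done

w = np.zeros(n_features)
for _ in range(2000):                               # episodes
    s, done = 0, False
    while not done:
        a = rng.choice(n_actions, p=b[s])           # act with behavior policy b
        s_next, reward, done = step(s, a)
        rho = pi[s, a] / b[s, a]                    # importance sampling ratio rho_t
        v_next = 0.0 if done else X[s_next] @ w     # bootstrap value (0 at terminal)
        delta = reward + gamma * v_next - X[s] @ w  # episodic, discounted TD error
        w += alpha * rho * delta * X[s]             # gradient of v_hat w.r.t. w is x(s)
        s = s_next

print("v_hat under the target policy:", X @ w)
```

Note that even with the importance sampling correction, such semi-gradient updates are not guaranteed to converge under off-policy training, consistent with the caveat at the top of this note.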