# Off-policy learning with approximation
The extension to function approximation turns out to be significantly different and harder for off-policy learning than it is for on-policy learning.
The tabular off-policy methods readily extend to semi-gradient algorithms, but these algorithms do not converge as robustly as they do under on-policy training.
Recall that in off-policy learning we seek to learn a value function for a target policy $\pi$, given data due to a different behavior policy $b$.
In the control case, action values are learned, and both policies typically change during learning, with $\pi$ being the greedy policy with respect to $\hat{q}$ and $b$ being something more exploratory, such as the $\varepsilon$-greedy policy with respect to $\hat{q}$.
The challenge of off-policy learning has two parts:
- The target of the update, which arises already in the tabular case. [[Importance Sampling]] is used to deal with this.
- The distribution of the updates, which arises only with function approximation: the updates are no longer made according to the on-policy distribution.
To deal with the first challenge, we have semi-gradient off-policy TD(0):
$
\mathbf{w}_{t+1} \doteq \mathbf{w}_{t}+\alpha \rho_{t} \delta_{t} \nabla \hat{v}\left(S_{t}, \mathbf{w}_{t}\right)
$
where,
$
\rho_{t} \doteq \rho_{t: t}=\frac{\pi\left(A_{t} \mid S_{t}\right)}{b\left(A_{t} \mid S_{t}\right)}
$
and $\delta_{t}$ is the TD error, defined according to whether the problem is episodic and discounted, or continuing and undiscounted with average reward:
$\delta_{t} \doteq R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right)-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)$ (episodic, discounted)
$\delta_{t} \doteq R_{t+1}-\bar{R}_{t}+\hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right)-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)$ (continuing, average reward)
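
To make the update concrete, here is a minimal sketch of semi-gradient off-policy TD(0) for the episodic, discounted case with a linear value function $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$, so that $\nabla \hat{v}(S_t, \mathbf{w}_t) = \mathbf{x}(S_t)$. The `env`, `pi`, `b`, and `feature_fn` interfaces below are assumptions for illustration, not something specified in the text.

```python
import numpy as np

def semi_gradient_off_policy_td0(env, pi, b, feature_fn, num_features,
                                 alpha=0.01, gamma=0.99, num_episodes=1000):
    """Semi-gradient off-policy TD(0) with a linear value function,
    v_hat(s, w) = w . x(s), for the episodic, discounted setting.

    Assumed (hypothetical) interfaces:
      env        -- gym-style: reset() -> state, step(a) -> (state, reward, done, info)
      pi(a, s)   -- target-policy probability pi(a | s)
      b(a, s)    -- behavior-policy probability b(a | s); actions are sampled from it
      feature_fn -- maps a state to a feature vector x(s) of length num_features
    """
    w = np.zeros(num_features)
    actions = list(range(env.action_space.n))

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Act according to the behavior policy b.
            probs = [b(a, s) for a in actions]
            a = np.random.choice(actions, p=probs)
            s_next, r, done, _ = env.step(a)

            # Per-step importance sampling ratio: rho_t = pi(A_t|S_t) / b(A_t|S_t).
            rho = pi(a, s) / b(a, s)

            x = feature_fn(s)
            x_next = np.zeros(num_features) if done else feature_fn(s_next)

            # TD error for the episodic, discounted case.
            delta = r + gamma * np.dot(w, x_next) - np.dot(w, x)

            # Semi-gradient update: for linear v_hat, the gradient is just x(S_t).
            w += alpha * rho * delta * x

            s = s_next
    return w
```

Note that $\rho_t$ only rescales the update along the same semi-gradient direction; when the action taken is impossible under $\pi$, the ratio is zero and the sample is ignored.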
---
## References