# Incremental Implementation of Estimating Action Values

When calculating averages, the naive approach would be to maintain a record of all rewards and perform the computation whenever an estimated value is needed. This doesn't scale with the number of rewards seen. It is simple to devise incremental formulas instead.

Given $Q_{n}$, the average of the first $n-1$ rewards, and the $n$th reward, $R_{n}$, the new average of all $n$ rewards can be computed by

$$
\begin{aligned}
Q_{n+1} &=\frac{1}{n} \sum_{i=1}^{n} R_{i} \\
&=\frac{1}{n}\left(R_{n}+\sum_{i=1}^{n-1} R_{i}\right) \\
&=\frac{1}{n}\left(R_{n}+(n-1) \frac{1}{n-1} \sum_{i=1}^{n-1} R_{i}\right) \\
&=\frac{1}{n}\left(R_{n}+(n-1) Q_{n}\right) \\
&=\frac{1}{n}\left(R_{n}+n Q_{n}-Q_{n}\right) \\
&=Q_{n}+\frac{1}{n}\left[R_{n}-Q_{n}\right]
\end{aligned}
$$

This update is of a form that occurs frequently in [[Reinforcement Learning]] problems. The general form is

$$
\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right]
$$

The expression $[\text{Target} - \text{OldEstimate}]$ is an *error* in the estimate; the step size controls how far the estimate moves toward the target. A small code sketch of the sample-average case follows the references below.

---

## References

1. Chapter 2.4, RL:AI, Sutton and Barto 2nd Edition
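
---

## Code sketch

A minimal sketch (not from the book) of the incremental update specialized to the sample-average case, where the step size is $1/n$. The class name `IncrementalAverage` is just illustrative.

```python
class IncrementalAverage:
    """Running average of rewards, stored as a single estimate instead of a list."""

    def __init__(self):
        self.q = 0.0  # current estimate Q_n
        self.n = 0    # number of rewards seen so far

    def update(self, reward):
        """Incorporate the n-th reward R_n and return the new estimate Q_{n+1}."""
        self.n += 1
        # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
        self.q += (reward - self.q) / self.n
        return self.q


if __name__ == "__main__":
    avg = IncrementalAverage()
    rewards = [1.0, 0.0, 2.0, 1.0]
    for r in rewards:
        avg.update(r)
    print(avg.q)                         # 1.0
    print(sum(rewards) / len(rewards))   # matches the naive full-record average
```

Memory and per-update cost stay constant regardless of how many rewards have been seen, which is the point of the incremental form.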