# Multi-Armed Bandits
The objective is to maximize total reward over a given number of time steps. Examples: choosing which slot-machine levers to pull, or a doctor choosing among experimental treatments.
## Reward distribution
The reward distribution can be stationary or non-stationary. Does the reward for pulling lever 2 change over time? For example, if you keep playing the same move, an opponent will use that against you.
## Estimating value
Through repeated actions you maximize your winnings by pulling the best levers. But how do you find the best levers?
Estimate the value of each action (below: the sample-average method):
$
Q_{t}(a) \doteq \frac{\text { sum of rewards when } a \text { taken prior to } t}{\text { number of times } a \text { taken prior to } t}=\frac{\sum_{i=1}^{t-1} R_{i} \cdot \mathbb{1}_{A_{i}=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_{i}=a}}
$
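A minimal Python sketch of the sample-average estimate, assuming the action–reward history is stored as a list of `(action, reward)` pairs (the names `history` and `sample_average` are illustrative, not from the source):

```python
from typing import List, Tuple

def sample_average(history: List[Tuple[int, float]], a: int) -> float:
    """Sample-average estimate Q_t(a): mean of the rewards received when action a was taken."""
    rewards = [r for (action, r) in history if action == a]
    # Assumed convention when a has never been taken: return a default estimate (0.0 here).
    return sum(rewards) / len(rewards) if rewards else 0.0
```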
## Action selection
Greedy action selection: $A_{t} \doteq \underset{a}{\operatorname{argmax}} Q_{t}(a)$. With noisier or non-stationary rewards, more exploration is required, so use epsilon-greedy: take the greedy action with probability $1-\epsilon$ and a random action with probability $\epsilon$.
## Efficient sample averaging
[[Incremental Implementation of Estimating Action Values]]
New Estimate $\leftarrow$ Old Estimate $+$ StepSize $[$ Target $-$ Old Estimate $]$
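The same rule as a one-line Python sketch (names assumed, not from the source); the step size can be $\frac{1}{N(a)}$ for sample averages or a constant $\alpha$ for non-stationary problems:

```python
def incremental_update(old_estimate: float, target: float, step_size: float) -> float:
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)
```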
## Simple Bandit Algorithm
Initialize, for $a=1$ to $k:$
$Q(a) \leftarrow 0$
$N(a) \leftarrow 0$
Loop forever:
- $A \leftarrow\left\{\begin{array}{ll}\arg\max_{a} Q(a) & \text{with probability } 1-\epsilon \\ \text{a random action} & \text{with probability } \epsilon\end{array}\right.$
- $R \leftarrow$ bandit $(A)$
- $N(A) \leftarrow N(A)+1$
- $Q(A) \leftarrow Q(A)+\frac{1}{N(A)}[R-Q(A)]$
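A runnable Python sketch of the pseudocode above; the 10-armed Gaussian testbed used to generate rewards is an assumption for illustration, not part of the algorithm:

```python
import numpy as np

def simple_bandit(k: int = 10, epsilon: float = 0.1, steps: int = 1000, seed: int = 0):
    """Epsilon-greedy bandit with incremental sample-average value estimates."""
    rng = np.random.default_rng(seed)
    true_values = rng.normal(0.0, 1.0, size=k)  # assumed testbed: arm means drawn from N(0, 1)
    Q = np.zeros(k)                             # value estimates Q(a)
    N = np.zeros(k, dtype=int)                  # action counts N(a)

    def bandit(a: int) -> float:
        # Assumed reward model: the arm's true value plus unit-variance Gaussian noise.
        return rng.normal(true_values[a], 1.0)

    for _ in range(steps):
        if rng.random() < epsilon:
            A = int(rng.integers(k))            # explore: random action with probability epsilon
        else:
            A = int(np.argmax(Q))               # exploit: greedy action
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]               # incremental sample-average update
    return Q, N
```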
## Exponential Recency-weighted Average
$
\begin{aligned}
Q_{n+1}=& Q_{n}+\alpha\left[R_{n}-Q_{n}\right] \\
=& \alpha R_{n}+(1-\alpha) Q_{n} \\
=& \alpha R_{n}+(1-\alpha)\left[\alpha R_{n-1}+(1-\alpha) Q_{n-1}\right] \\
=& \alpha R_{n}+(1-\alpha) \alpha R_{n-1}+(1-\alpha)^{2} Q_{n-1} \\
=& \alpha R_{n}+(1-\alpha) \alpha R_{n-1}+(1-\alpha)^{2} \alpha R_{n-2}+ \cdots+(1-\alpha)^{n-1} \alpha R_{1}+(1-\alpha)^{n} Q_{1} \\
=&(1-\alpha)^{n} Q_{1}+\sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_{i}
\end{aligned}
$
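A sketch showing that the closed form above matches iterating the constant step-size update $Q \leftarrow Q + \alpha(R - Q)$; both function names are illustrative, not from the source:

```python
def recency_weighted_closed_form(rewards, alpha, Q1=0.0):
    """(1 - alpha)^n * Q1 + sum_{i=1..n} alpha * (1 - alpha)^(n - i) * R_i."""
    n = len(rewards)
    total = (1 - alpha) ** n * Q1
    for i, R in enumerate(rewards, start=1):
        total += alpha * (1 - alpha) ** (n - i) * R
    return total

def recency_weighted_incremental(rewards, alpha, Q1=0.0):
    """Iterate Q <- Q + alpha * (R - Q); agrees with the closed form up to float error."""
    Q = Q1
    for R in rewards:
        Q += alpha * (R - Q)
    return Q
```

Because the weight $(1-\alpha)^{n-i}$ decays geometrically, older rewards count for exponentially less, which is what makes the constant step size suitable for non-stationary problems.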
## Upper-Confidence-Bound Action Selection (UCB)
How should we select among the non-greedy actions? It is better to weight actions by how many times they have been tried and by their previous returns:
$
A_{t} \doteq \underset{a}{\operatorname{argmax}}\left[Q_{t}(a)+c \sqrt{\frac{\ln t}{N_{t}(a)}}\right]
$
UCB is more difficult than epsilon-greedy to extend beyond bandits to more general RL problems (nonstationary problems, large state spaces).
The idea of upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value. The quantity being maximized over is thus a sort of upper bound on the possible true value of action $a$, with the $c > 0$ parameter determining the confidence level. Each time $a$ is selected the uncertainty is presumably reduced: $N_t(a)$ increments, and since it appears in the denominator of the uncertainty term, the term decreases. On the other hand, each time an action other than $a$ is selected, $t$ increases; since it appears in the numerator, the uncertainty estimate increases. The use of the natural logarithm means the increase gets smaller over time but is unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected more often, will be selected with decreasing frequency.
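A Python sketch of UCB selection given current estimates `Q`, counts `N`, and time step `t`; treating an untried action ($N_t(a) = 0$) as maximally uncertain and selecting it first is the usual convention and is an assumption here:

```python
import math

def ucb_action(Q, N, t: int, c: float = 2.0) -> int:
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]."""
    best_action, best_score = 0, float("-inf")
    for a in range(len(Q)):
        if N[a] == 0:
            return a  # untried action: maximally uncertain, select immediately
        score = Q[a] + c * math.sqrt(math.log(t) / N[a])
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```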
---
## References
1. Sutton, R. S. and Barto, A. G., *Reinforcement Learning: An Introduction*, 2nd Edition, Chapter 2.