# Multi-Armed Bandits
The objective is to maximize total reward over a given number of time steps. Examples: choosing which slot-machine levers to pull, or a doctor choosing among experimental treatments.
## Reward distribution
The reward distribution can be stationary or non-stationary. Does the reward for pulling lever 2 change over time? For example, if you keep playing the same move, an opponent will use that against you.
## Estimating value
Through repeated actions you maximize your winnings by pulling the best levers. But how do you find the best levers?
Estimate the value of each action (below: the sample-average method):
$
Q_{t}(a) \doteq \frac{\text { sum of rewards when } a \text { taken prior to } t}{\text { number of times } a \text { taken prior to } t}=\frac{\sum_{i=1}^{t-1} R_{i} \cdot \mathbb{1}_{A_{i}=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_{i}=a}}
$
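A minimal Python sketch of the sample-average estimate, assuming the action–reward history is stored as a list of `(action, reward)` pairs (the names `history` and `sample_average` are illustrative, not from the source):

```python
from typing import List, Tuple

def sample_average(history: List[Tuple[int, float]], a: int) -> float:
    """Sample-average estimate Q_t(a): mean of the rewards received when action a was taken."""
    rewards = [r for (action, r) in history if action == a]
    # Assumed convention when a has never been taken: return a default estimate (0.0 here).
    return sum(rewards) / len(rewards) if rewards else 0.0
```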
## Action selection
Greedy action selection: $A_{t} \doteq \underset{a}{\operatorname{argmax}} Q_{t}(a)$. With noisier or non-stationary rewards, more exploration is required, so use epsilon-greedy: take the greedy action with probability $1-\epsilon$ and a random action with probability $\epsilon$.
## Efficient sample averaging
[[Incremental Implementation of Estimating Action Values]]
New Estimate $\leftarrow$ Old Estimate $+$ StepSize $[$ Target $-$ Old Estimate $]$
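The same rule as a one-line Python sketch (names assumed, not from the source); the step size can be $\frac{1}{N(a)}$ for sample averages or a constant $\alpha$ for non-stationary problems:

```python
def incremental_update(old_estimate: float, target: float, step_size: float) -> float:
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)
```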
## Simple Bandit Algorithm
Initialize, for $a=1$ to $k:$
$Q(a) \leftarrow 0$
$N(a) \leftarrow 0$
Loop forever:
- $A \leftarrow\left\{\begin{array}{ll}\arg\max_{a} Q(a) & \text{with probability } 1-\epsilon \\ \text{a random action} & \text{with probability } \epsilon\end{array}\right.$
- $R \leftarrow$ bandit $(A)$
- $N(A) \leftarrow N(A)+1$
- $Q(A) \leftarrow Q(A)+\frac{1}{N(A)}[R-Q(A)]$
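A runnable Python sketch of the pseudocode above; the 10-armed Gaussian testbed used to generate rewards is an assumption for illustration, not part of the algorithm:

```python
import numpy as np

def simple_bandit(k: int = 10, epsilon: float = 0.1, steps: int = 1000, seed: int = 0):
    """Epsilon-greedy bandit with incremental sample-average value estimates."""
    rng = np.random.default_rng(seed)
    true_values = rng.normal(0.0, 1.0, size=k)  # assumed testbed: arm means drawn from N(0, 1)
    Q = np.zeros(k)                             # value estimates Q(a)
    N = np.zeros(k, dtype=int)                  # action counts N(a)

    def bandit(a: int) -> float:
        # Assumed reward model: the arm's true value plus unit-variance Gaussian noise.
        return rng.normal(true_values[a], 1.0)

    for _ in range(steps):
        if rng.random() < epsilon:
            A = int(rng.integers(k))            # explore: random action with probability epsilon
        else:
            A = int(np.argmax(Q))               # exploit: greedy action
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]               # incremental sample-average update
    return Q, N
```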
## Exponential Recency-weighted Average
$
\begin{aligned}
Q_{n+1}=& Q_{n}+\alpha\left[R_{n}-Q_{n}\right] \\
=& \alpha R_{n}+(1-\alpha) Q_{n} \\
=& \alpha R_{n}+(1-\alpha)\left[\alpha R_{n-1}+(1-\alpha) Q_{n-1}\right] \\
=& \alpha R_{n}+(1-\alpha) \alpha R_{n-1}+(1-\alpha)^{2} Q_{n-1} \\
=& \alpha R_{n}+(1-\alpha) \alpha R_{n-1}+(1-\alpha)^{2} \alpha R_{n-2}+ \cdots+(1-\alpha)^{n-1} \alpha R_{1}+(1-\alpha)^{n} Q_{1} \\
=&(1-\alpha)^{n} Q_{1}+\sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_{i}
\end{aligned}
$
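A sketch showing that the closed form above matches iterating the constant step-size update $Q \leftarrow Q + \alpha(R - Q)$; both function names are illustrative, not from the source:

```python
def recency_weighted_closed_form(rewards, alpha, Q1=0.0):
    """(1 - alpha)^n * Q1 + sum_{i=1..n} alpha * (1 - alpha)^(n - i) * R_i."""
    n = len(rewards)
    total = (1 - alpha) ** n * Q1
    for i, R in enumerate(rewards, start=1):
        total += alpha * (1 - alpha) ** (n - i) * R
    return total

def recency_weighted_incremental(rewards, alpha, Q1=0.0):
    """Iterate Q <- Q + alpha * (R - Q); agrees with the closed form up to float error."""
    Q = Q1
    for R in rewards:
        Q += alpha * (R - Q)
    return Q
```

Because the weight $(1-\alpha)^{n-i}$ decays geometrically, older rewards count for exponentially less, which is what makes the constant step size suitable for non-stationary problems.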
## Upper-Confidence-Bound Action Selection (UCB)
How should we select among the non-greedy actions? It is better to weight actions by how many times they have been tried and by their previous returns:
$
A_{t} \doteq \underset{a}{\operatorname{argmax}}\left[Q_{t}(a)+c \sqrt{\frac{\ln t}{N_{t}(a)}}\right]
$
UCB is more difficult than epsilon-greedy to extend beyond bandits to more general RL problems (nonstationary problems, large state spaces).
The idea of upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value. The quantity being maximized over is thus a sort of upper bound on the possible true value of action $a$, with the $c > 0$ parameter determining the confidence level. Each time $a$ is selected the uncertainty is presumably reduced: $N_t(a)$ increments, and since it appears in the denominator of the uncertainty term, the term decreases. On the other hand, each time an action other than $a$ is selected, $t$ increases; since it appears in the numerator, the uncertainty estimate increases. The use of the natural logarithm means the increase gets smaller over time but is unbounded; all actions will eventually be selected, but actions with lower value estimates, or that have already been selected more often, will be selected with decreasing frequency.
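A Python sketch of UCB selection given current estimates `Q`, counts `N`, and time step `t`; treating an untried action ($N_t(a) = 0$) as maximally uncertain and selecting it first is the usual convention and is an assumption here:

```python
import math

def ucb_action(Q, N, t: int, c: float = 2.0) -> int:
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]."""
    best_action, best_score = 0, float("-inf")
    for a in range(len(Q)):
        if N[a] == 0:
            return a  # untried action: maximally uncertain, select immediately
        score = Q[a] + c * math.sqrt(math.log(t) / N[a])
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```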
---
## References
1. Sutton, R. S. and Barto, A. G., *Reinforcement Learning: An Introduction*, 2nd Edition, Chapter 2.