# AlphaGo

## The problem

- Perfect-information games contain approximately $b^d$ possible sequences of moves, where $b$ is the breadth (legal moves per position) and $d$ is the depth (game length).
- In large games, such as chess (b ≈ 35, d ≈ 80) [1] and especially Go (b ≈ 250, d ≈ 150) [1], exhaustive search is infeasible.
- This means the search space is enormous and it is extremely difficult to evaluate board positions and moves.

## The solution

- This enormous search space can be reduced by using neural networks.
- The depth of the search may be reduced by position evaluation: truncating the search tree at state $s$ and replacing the subtree below $s$ by an approximate value function $v(s) \approx v^*(s)$ that predicts the outcome from state $s$. This is done by the *value network*.
- The breadth of the search may be reduced by sampling actions from a policy $p(a \mid s)$, a probability distribution over possible moves $a$ in position $s$. This sampling is done with the *policy network*.
- Train a supervised learning (SL) policy network, $p_\sigma$, directly from expert human moves as a classification problem. Also train a fast rollout policy $p_\pi$ (using a simpler model and features) that can rapidly sample actions during rollouts.
- Next, train a reinforcement learning (RL) policy network $p_\rho$ that improves the SL policy network through self-play using [[Policy Gradient]]. These networks output a probability distribution $p_{\sigma}(a \mid s)$ or $p_{\rho}(a \mid s)$ over legal moves $a$, represented by a probability map over the board. The self-play also generates a new synthetic dataset which is used to train the value network.
- Finally, train a value network $v_\theta(s)$ by regression on the self-play dataset; it predicts the scalar expected outcome of the game from position $s$.
- The policy and value networks are combined in a [[Monte-Carlo Tree Search]] algorithm, which selects actions by lookahead search.

## The details

### Supervised learning of Policy Networks

- The SL policy network $p_{\sigma}(a \mid s)$ is a 13-layer convolutional network. The input is a simple representation of the board state; the output is a softmax over all legal moves $a$.
- Trained on randomly sampled $(s, a)$ pairs using stochastic gradient ascent to maximize the likelihood of the human move $a$ selected in state $s$:
$$
\Delta \sigma \propto \frac{\partial \log p_{\sigma}(a \mid s)}{\partial \sigma}
$$
- Trained on 30 million positions from the KGS Go Server, it achieves an accuracy of 57.0% on a held-out test set using all input features (the state of the art at the time was 44.4%), and 55.7% using only the raw board position and move history as inputs.
- The faster rollout policy $p_{\pi}(a \mid s)$ is a linear softmax of small pattern features and reaches only 24.2% accuracy. However, it selects an action in 2 µs, rather than 3 ms for the SL policy network.

### Reinforcement learning of Policy Networks

- The RL policy network $p_\rho$ is initialized with the weights of the SL network.
- Self-play takes place between the current policy network $p_\rho$ and a randomly selected previous iteration of the policy, which stabilizes training by preventing overfitting to the current policy.
- The reward function is zero at all non-terminal time-steps; the terminal reward is +1 for winning and −1 for losing.
- Weights are updated at each time-step $t$ by stochastic gradient ascent in the direction that maximizes the expected outcome $z_t$ (a minimal sketch of this update follows the section):
$$
\Delta \rho \propto \frac{\partial \log p_{\rho}\left(a_{t} \mid s_{t}\right)}{\partial \rho} z_{t}
$$
- The RL policy network wins more than 80% of games against the SL policy network.
- Using no search at all, it won 85% of games against Pachi, a state-of-the-art open-source MCTS Go program that executes 100,000 simulations per move.
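Below is a minimal PyTorch sketch of this REINFORCE-style self-play update, to make the gradient-ascent rule concrete. The `GoEnv`-style environment (`reset`/`legal_mask`/`step`), the tiny network, and all names here are hypothetical stand-ins rather than the paper's implementation; the opponent pool of previous checkpoints and mini-batching over many games are omitted.

```python
# Hedged sketch: REINFORCE-style self-play update for the RL policy network.
# The environment and network below are hypothetical stand-ins, not AlphaGo's code.
import torch
import torch.nn as nn


class TinyPolicyNet(nn.Module):
    """Toy stand-in for the 13-layer convolutional policy network."""

    def __init__(self, in_planes: int = 48):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1),          # one plane of move logits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_planes, 19, 19) -> (batch, 361) logits over board points
        return self.trunk(x).flatten(1)


def self_play_update(policy: nn.Module, env, optimizer) -> float:
    """Play one game with the current policy and apply one policy-gradient step.

    The opponent (a randomly selected previous checkpoint, per the paper) is
    assumed to reply inside `env.step`. Reward is 0 at non-terminal steps and
    z = +1 / -1 at the end of the game.
    """
    state = env.reset()                               # (48, 19, 19) feature planes
    log_probs = []                                    # log p_rho(a_t | s_t) for our moves
    done, z = False, 0.0
    while not done:
        logits = policy(state.unsqueeze(0))[0]
        logits = logits.masked_fill(~env.legal_mask(), float("-inf"))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, z, done = env.step(action)
    # Gradient *ascent* on z * sum_t log p_rho(a_t | s_t): minimize the negative.
    loss = -z * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return z
```

The returned outcome `z` scales every per-move log-probability gradient, matching $\Delta \rho \propto \frac{\partial \log p_{\rho}(a_t \mid s_t)}{\partial \rho} z_t$ above.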
### Reinforcement learning of Value Networks

- Approximate the value function with a value network $v_{\theta}(s)$ with weights $\theta$: $v_{\theta}(s) \approx v^{p_{\rho}}(s) \approx v^{*}(s)$. The network has a similar architecture to the policy network, but outputs a single scalar.
- Trained on state-outcome pairs $(s, z)$ using stochastic gradient descent to minimize the **MSE** between the predicted value and the corresponding outcome:
$$
\Delta \theta \propto \frac{\partial v_{\theta}(s)}{\partial \theta}\left(z-v_{\theta}(s)\right)
$$
- Predicting game outcomes from data consisting of complete games (the KGS dataset) led to overfitting: successive positions are strongly correlated, differing by just one stone, yet the regression target is shared for the entire game, so the value network memorizes game outcomes rather than generalizing.
- To tackle this, a self-play dataset of 30 million distinct positions was generated, each sampled from a separate game, leading to minimal overfitting.
- The value network was consistently more accurate at position evaluation than Monte-Carlo rollouts using the fast rollout policy $p_\pi$. A single evaluation of the value network approached the accuracy of Monte-Carlo rollouts using the RL policy network $p_\rho$, while using 15,000 times less computation.

### Searching with Policy and Value Networks

- The policy and value networks are combined in a [[Monte-Carlo Tree Search]] algorithm that selects actions by lookahead search (a simplified sketch of the search loop follows this section).

![[mcts-alphago.jpg]]

- Each edge $(s, a)$ of the search tree stores $\left\{P(s, a), N_{v}(s, a), N_{r}(s, a), W_{v}(s, a), W_{r}(s, a), Q(s, a)\right\}$, where $P(s, a)$ is the prior probability, $W_{v}(s, a)$ and $W_{r}(s, a)$ are Monte-Carlo estimates of the total action-value accumulated over $N_{v}(s, a)$ leaf evaluations and $N_{r}(s, a)$ rollout rewards respectively, and $Q(s, a)$ is the combined mean action-value for that edge.
- The tree is traversed by simulation (i.e. by descending the tree in complete games without backup), starting from the root state. At each time-step $t$ of each simulation, an action $a_t$ is selected from state $s_t$:
$$
a_{t}=\underset{a}{\operatorname{argmax}}\left(Q\left(s_{t}, a\right)+u\left(s_{t}, a\right)\right)
$$
where $u(s, a) \propto \frac{P(s, a)}{1+N(s, a)}$ is a bonus proportional to the prior probability that decays with repeated visits, encouraging exploration.
- When a leaf node $s_L$ is reached, it may be expanded. It is processed by the SL policy network (just once), and the output probabilities are stored as the prior probabilities $P(s, a)$.
- The leaf node is then evaluated in two ways: by the value network, $v_\theta(s_L)$, and by the outcome $z_L$ of a random rollout played out to the end of the game using the fast rollout policy $p_\pi$. These evaluations are combined using a linear combination with mixing parameter $\lambda$:
$$
V\left(s_{L}\right)=(1-\lambda) v_{\theta}\left(s_{L}\right)+\lambda z_{L}
$$
- At the end of simulation $n$, the action-values and visit counts of all traversed edges are updated. Each edge accumulates the visit count and mean evaluation of all simulations passing through it:
$$
\begin{array}{l}
N(s, a)=\sum_{i=1}^{n} \mathbf{1}(s, a, i) \\
Q(s, a)=\frac{1}{N(s, a)} \sum_{i=1}^{n} \mathbf{1}(s, a, i) V\left(s_{L}^{i}\right)
\end{array}
$$
where $s_{L}^{i}$ is the leaf node from the $i$th simulation, and $\mathbf{1}(s, a, i)$ indicates whether edge $(s, a)$ was traversed during the $i$th simulation.
- Once the search is complete, the algorithm chooses the most visited move from the root position.
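The bullets above map fairly directly onto code. The following single-threaded Python sketch shows selection, expansion, mixed evaluation, and backup under assumed interfaces: `sl_policy`, `value_net`, `rollout_policy`, and a `state.play / state.terminal / state.outcome` game API are all hypothetical. It also merges the separate rollout and value statistics ($N_r, W_r$ vs $N_v, W_v$) into a single count and omits virtual loss, the expansion threshold, and the asynchronous parallelism of the real system.

```python
# Hedged sketch of the search loop described above; interfaces are assumptions,
# not the paper's implementation.
import math

C_PUCT = 5.0        # exploration constant (illustrative value)
LAMBDA = 0.5        # mixing weight between value network and rollout outcome


class Node:
    def __init__(self, prior: float):
        self.P, self.N, self.W = prior, 0, 0.0        # prior, visit count, total value
        self.children = {}                            # action -> Node

    @property
    def Q(self) -> float:                             # mean action-value
        return self.W / self.N if self.N else 0.0


def select_child(node: Node):
    """a_t = argmax_a ( Q(s,a) + u(s,a) ), with u(s,a) proportional to P(s,a)/(1+N(s,a))."""
    sqrt_total = math.sqrt(sum(c.N for c in node.children.values()) + 1)
    def score(child: Node) -> float:
        return child.Q + C_PUCT * child.P * sqrt_total / (1 + child.N)
    return max(node.children.items(), key=lambda kv: score(kv[1]))


def simulate(root: Node, state, sl_policy, value_net, rollout_policy) -> None:
    """One simulation: descend to a leaf, expand with SL priors, evaluate, back up."""
    path, node = [root], root
    while node.children:                              # in-tree phase
        action, node = select_child(node)
        state = state.play(action)
        path.append(node)
    if not state.terminal():                          # expand leaf once with p_sigma priors
        for action, p in sl_policy(state).items():
            node.children[action] = Node(prior=p)
    z = rollout(state, rollout_policy)                # fast rollout with p_pi to game end
    v = value_net(state)                              # value-network evaluation v_theta(s_L)
    leaf_value = (1 - LAMBDA) * v + LAMBDA * z        # V(s_L) = (1 - lambda) v + lambda z
    for n in path:                                    # backup: N += 1, W += V along the path
        n.N += 1
        n.W += leaf_value                             # (per-player sign handling omitted)


def rollout(state, rollout_policy) -> float:
    while not state.terminal():
        state = state.play(rollout_policy(state))
    return state.outcome()                            # +1 win / -1 loss


def best_move(root: Node):
    """After the simulation budget is spent, play the most-visited move at the root."""
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]
```

Running `simulate` repeatedly from the root and then calling `best_move(root)` mirrors the final step of the search: the chosen action is the most visited one, not the one with the highest value.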
## The results

- The SL policy network $p_{\sigma}$ performed better in AlphaGo than the stronger RL policy network $p_{\rho}$, presumably because humans select a diverse beam of promising moves, whereas RL optimizes for the single best move. However, the value function $v_{\theta}(s) \approx v^{p_{\rho}}(s)$ derived from the stronger RL policy network performed better in AlphaGo than a value function $v_{\theta}(s) \approx v^{p_{\sigma}}(s)$ derived from the SL policy network.
- Even without rollouts, AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte-Carlo evaluation in Go.
- However, the mixed evaluation (λ = 0.5) performed best, winning ≥ 95% of games against other variants of AlphaGo.
- Defeated the world champion Lee Sedol 4–1.

---

## References

1. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016). https://www.researchgate.net/publication/292074166_Mastering_the_game_of_Go_with_deep_neural_networks_and_tree_search