# AlphaZero AlphaZero generalizes [[AlphaGo]] algorithm to many challenging games starting from random play and no domain knowledge except the game rules. ## The problem - Programs that play Chess and other games at super-human level are highly tuned to their domain, and cannot be generalized to other games without substantial human effort, whereas general game-playing systems remain comparatively weak. - [[AlphaGo]] was specifically designed to play Go and trained with human gameplay dataset, which makes it less generalizable. ## The solution - Instead of a handcrafted evaluation function and move ordering heuristics, AlphaZero uses a deep neural network $(\mathbf{p}, v)=f_{\theta}(s)$ with parameters $\theta .$ This neural network $f_{\theta}(s)$ takes the board position $s$ as an input and outputs a vector of move probabilities $\mathbf{p}$ with components $p_{a}=\operatorname{Pr}(a \mid s)$ for each action $a$, and a scalar value $v$ estimating the expected outcome $z$ of the game from position $s, v \approx \mathbb{E}[z \mid s]$. - AlphaZero learns these move probabilities and value estimates entirely from self-play; these are then used to guide its search in future games. - It uses general purpose [[Monte-Carlo Tree Search]] which returns a vector $\pi$ representing a probability distribution over moves, $\pi_a = Pr(s|s_{root})$. - The parameters θ of the deep neural network in AlphaZero are trained by reinforcement learning from self-play games, starting from randomly initialized parameters θ. - Parameters are updated to minimize the error between the predicted outcome $v_t$ and the game outcome $z$ (MSE), and to maximize the similarity of the policy vector $p_t$ to the search probabilities $\pi_t$ (cross entropy), with $L_2$ regularization. $ l=(z-v)^{2}-\pi^{\top} \log \mathbf{p}+c|| \theta \|^{2} $ ## The details ### Differences with AlphaGo - Go games have a binary win and loss outcome - AlphaGo Zero estimated and optimized the probability of winning, exploiting the fact that Go games have a binary win or loss outcome. - However, both chess and shogi may end in drawn outcomes. AlphaZero instead estimates and optimizes the expected outcome. - The rules of Go are invariant to rotation and reflection - This fact was exploited in AlphaGo by 1. training data were augmented by generating eight symmetries for each position. 2. during MCTS, board positions were transformed by using a randomly selected rotation or reflection before being evaluated by the neural network - AlphaZero does not assume symmetry and does not augment the training data and does not transform the board position during MCTS. - The rules of Go are well-suited to convolutional architecture - AlphaGo exploits the translationally invariant property of go matching the weight sharing structure of convolutional networks. - By contrast, the rules of chess and shogi are position-dependent and include long-range interactions. Despite these differences, AlphaZero uses the same convolutional network architecture as AlphaGo Zero for chess, shogi and Go - AlphaGo had separate policy and value networks, AlphaZero has a single neural network with the same "body", but different "heads" for policy and value. - In AlphaGo Zero, self-play games were generated by the best player from all previous iterations. AlphaZero simply maintains a single neural network and self-play games are always generated by using the latest parameters for this neural network. ### Domain knowledge - AlphaZero was provided with the following domain knowledge about each game: - The input features describing the position, and the output features describing the move, are structured as a set of planes; i.e. the neural network architecture is matched to the grid-structure of the board. - AlphaZero is provided with perfect knowledge of the game rules. These are used during MCTS, to simulate the positions resulting from a sequence of moves, to determine game termination, and to score any simulations that reach a terminal state. - Knowledge of the rules is also used to encode the input planes (i.e. castling, repetition, no-progress) and output planes (how pieces move, promotions, and piece drops in shogi). - The typical number of legal moves is used to scale the exploration noise. - Chess and shogi games exceeding 512 steps were terminated and assigned a drawn outcome; Go games exceeding 722 steps were terminated and scored with Tromp-Taylor rules. - AlphaZero does not use human-play database or domain-specific heuristics. ## The results - In chess, AlphaZero first outperformed Stockfish after just 4 hours (300,000 steps); in shogi, AlphaZero first outperformed Elmo after 2 hours (110,000 steps); and in Go, AlphaZero first outperformed AlphaGo Lee after 30 hours (74,000 steps). ![[alphazero performance.jpg]] - AlphaZero searches just 60,000 positions per second in chess and shogi, compared with 60 million for Stockfish and 25 million for Elmo. AlphaZero may compensate for the lower number of evaluations by using its deep neural network to focus much more selectively on the most promising variations. --- ## References 1. Original paper - https://arxiv.org/abs/1712.01815 2. Original blogpost by Deepmind - https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go 3. A Simple Alpha(Go) Zero Tutorial https://web.stanford.edu/~surag/posts/alphazero.html