# Monte-Carlo Tree Search

![[mcts_steps.jpg]]

MCTS avoids having to construct the computationally intractable full expectiminimax tree by filling in a search tree incrementally. It estimates the optimal value of interior nodes by a double approximation, $V^{n}(s) \approx v^{P^{n}}(s) \approx v^{*}(s)$. The first approximation, $V^{n}(s) \approx v^{P^{n}}(s)$, uses $n$ Monte-Carlo simulations to estimate the value function of a simulation policy $P^{n}$. The second approximation, $v^{P^{n}}(s) \approx v^{*}(s)$, uses the simulation policy $P^{n}$ in place of minimax-optimal actions.

Simulating an episode involves two phases (in-tree and out-of-tree):

- Tree policy: we have been here before, so what do we do? Pick actions at tree nodes to maximize $Q(S, A)$, e.g. via [[Monte-Carlo Tree Search#Upper Confidence Tree Search]].
- Rollout policy: we have never reached this node, so what do we do? E.g. pick actions uniformly at random, or follow some other default policy.

To evaluate a tree node $i$ corresponding to state-action pair $(s, a)$, average the returns received from that node onwards over all simulated episodes in which the node was reached:

$$
Q(i)=\frac{1}{N(i)} \sum_{k=1}^{K} \mathbb{1}\left(i \in \text{epi}_{k}\right) G_{k}(i) \stackrel{P}{\rightarrow} q(s, a)
$$

where $N(i)$ is the visit count of node $i$, $K$ is the number of simulated episodes, and $G_{k}(i)$ is the return accumulated from node $i$ onwards in episode $k$.

After search is finished, select the current (real) action with maximum value in the search tree:

$$
a_{t}=\underset{a \in A}{\operatorname{argmax}}\; Q\left(s_{t}, a\right)
$$

As more simulations are executed and the search tree grows deeper, the simulation policy becomes informed by increasingly accurate statistics. In the limit, both approximations become exact and MCTS (e.g. with UCT) converges to the optimal value function:

$$
\lim_{n \rightarrow \infty} V^{n}(s)=\lim_{n \rightarrow \infty} v^{P^{n}}(s)=v^{*}(s)
$$

## Upper Confidence Tree Search

How should actions be selected during a simulated episode? UCT borrows an idea from the bandit literature and treats each node where actions can be selected as a [[Multi-Armed Bandits#Upper-Confidence-Bound Action Selection UCB]] problem. Maintain an upper confidence bound on the value of each arm:

$$
Q(s, a, i)=\frac{1}{N(s, a, i)} \sum_{k=1}^{K} \mathbb{1}\left(i \in \text{epi}_{k}\right) G_{k}(s, a, i)+c \sqrt{\frac{\ln n(s)}{n(s, a)}}
$$

For simplicity, each state node can be treated as a separate multi-armed bandit. In simulated episode $k$, at node $i$, select the action (arm) with the highest upper confidence bound to simulate and expand (or evaluate) in the tree:

$$
a_{i k}=\underset{a}{\operatorname{argmax}}\; Q(s, a, i)
$$

This implies that the policy used to simulate episodes (and to expand and update the tree) can change from one episode to the next. A code sketch of the full simulation loop is given at the end of this note.

## Advantages

- Highly selective best-first search
- Evaluates states dynamically (unlike e.g. [[Dynamic Programming (RL)]])
- Uses sampling to break the curse of dimensionality
- Works with "black-box" models (only requires samples)
- Computationally efficient, anytime, and parallelizable

---

## References

1. Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 16, https://www.youtube.com/watch?v=vDF1BYWhqL8
2. [[AlphaGo]]
3. mcts.ai, https://web.archive.org/web/20180623055344/http://mcts.ai/about/index.html
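---

For concreteness, here is a minimal Python sketch of the loop described above: UCT selection inside the tree, expansion of one new node per simulation, a random rollout outside the tree, and backing up the return from each node onwards. The environment interface (`actions`, `step`, `is_terminal`), the exploration constant `c`, and the assumption of deterministic, undiscounted transitions are choices made for this sketch, not something prescribed by the references.

```python
import math
import random

class Node:
    """A state node in the search tree; per-action statistics live on the node."""
    def __init__(self, state, actions):
        self.state = state
        self.untried = list(actions)  # actions not yet expanded from this node
        self.children = {}            # action -> child Node
        self.N = {}                   # action -> visit count n(s, a)
        self.Q = {}                   # action -> mean return Q(s, a)

def uct_action(node, c=1.4):
    """Pick the action maximizing Q(s, a) + c * sqrt(ln n(s) / n(s, a))."""
    log_n = math.log(sum(node.N.values()))
    return max(node.N, key=lambda a: node.Q[a] + c * math.sqrt(log_n / node.N[a]))

def mcts(root_state, actions, step, is_terminal, n_simulations=1000):
    """step(state, action) -> (next_state, reward); transitions assumed deterministic,
    returns undiscounted. Returns the greedy root action after search."""
    root = Node(root_state, actions)
    for _ in range(n_simulations):
        node, path, rewards = root, [], []
        # 1. Selection: descend with UCT while nodes are fully expanded.
        while not node.untried and node.children and not is_terminal(node.state):
            a = uct_action(node)
            _, r = step(node.state, a)
            path.append((node, a)); rewards.append(r)
            node = node.children[a]
        # 2. Expansion: add one child for a previously untried action.
        if node.untried and not is_terminal(node.state):
            a = node.untried.pop()
            next_state, r = step(node.state, a)
            child = Node(next_state, actions)
            node.children[a] = child
            node.N[a], node.Q[a] = 0, 0.0
            path.append((node, a)); rewards.append(r)
            node = child
        # 3. Rollout: random default policy until the episode terminates.
        state, G = node.state, 0.0
        while not is_terminal(state):
            state, r = step(state, random.choice(actions))
            G += r
        # 4. Backup: each visited (node, action) receives the return from there onwards.
        for (n_, a), r in zip(reversed(path), reversed(rewards)):
            G += r
            n_.N[a] += 1
            n_.Q[a] += (G - n_.Q[a]) / n_.N[a]
    # Real action: argmax over Q at the root.
    return max(root.Q, key=root.Q.get)
```

Note that this sketch keeps one child per (state, action) edge, which is only valid for deterministic dynamics; a stochastic model would need chance nodes or child nodes keyed by sampled next state.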