# Monte-Carlo RL Methods

- Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.
- Monte Carlo methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
- Learning from actual experience is striking because it requires no prior knowledge of the environment's dynamics, yet can still attain optimal behavior.
- Monte Carlo methods sample and average returns for each state–action pair much like the [[Multi-Armed Bandits]] sample and average rewards for each action. The main difference is that now there are multiple states, each acting like a different bandit problem (like an associative-search or contextual bandit), and the different bandit problems are interrelated.
- Benefits over [[Dynamic Programming (RL)]] methods:
    - DP methods require the full environment dynamics (transition probabilities), which must be known before DP can be applied, whereas generating the sample episodes required by MC methods is easy.
    - Estimates for each state are independent. In other words, Monte Carlo methods do not bootstrap, unlike DP.
    - When one requires the value of only one state or a subset of states, MC can average returns starting from just those states.

## First and Every visit MC for estimating state value

- The **first-visit MC** method estimates $v_\pi(s)$ as the average of the returns following first visits to $s$, whereas the **every-visit MC** method averages the returns following all visits to $s$.
- ![[First Visit MC prediction.png]]
- Both first-visit MC and every-visit MC converge to $v_\pi(s)$ as the number of visits (or first visits) to $s$ goes to infinity.
- In the first-visit MC case, each return is an independent, identically distributed estimate of $v_{\pi}(s)$ with finite variance. By the law of large numbers the sequence of averages of these estimates converges to their expected value. Each average is itself an unbiased estimate, and the standard deviation of its error falls as $1 / \sqrt{n}$, where $n$ is the number of returns averaged.
- Every-visit MC is less straightforward, but its estimates also converge quadratically to $v_\pi(s)$.

## Monte Carlo Exploring Starts for estimating policy

- The problem with estimating action values with Monte Carlo is that many state–action pairs may never be visited: under a deterministic policy, only one action is ever observed from each state.
- One way to solve this is to specify that episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start. This is called **exploring starts**.
- In this case we have an action-value function, and therefore no model is needed to construct the greedy policy: $\pi(s) \doteq \arg \max _{a} q(s, a)$.
- Then the [[Dynamic Programming (RL)#Policy Improvement Theorem]] ensures $q_{\pi_{k}}\left(s, \pi_{k+1}(s)\right)\geq v_{\pi_{k}}(s)$, so the overall process converges to the optimal policy.
- We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for the Monte Carlo method:
    - Episodes have exploring starts.
    - Policy evaluation is done with an infinite number of episodes.
- One solution is to give up trying to complete policy evaluation before returning to policy improvement. On each evaluation step we move the value function toward $q_{\pi_{k}}$, but we do not expect to actually get close except over many steps. This is like [[Dynamic Programming (RL)#Value Iteration]]. A minimal code sketch of this loop is given below, before the book's pseudocode box.
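A minimal Python sketch of this Monte Carlo ES loop with truncated (one-episode) evaluation. The two-state toy MDP and the `step`/`generate_episode` interface are made up for illustration; the backward return-averaging loop is the first-visit update described above. This is a sketch of the idea, not the book's pseudocode.

```python
# Sketch of Monte Carlo ES (exploring starts) with first-visit averaging.
# The two-state MDP and the `step` function below are made up for illustration.
import random
from collections import defaultdict

STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 1.0

def step(state, action):
    """Toy dynamics: returns (next_state, reward, done)."""
    if state == 0:
        # Action 1 pays 1.0 on the way to state 1; action 0 pays nothing.
        return (1, 1.0, False) if action == 1 else (1, 0.0, False)
    # State 1 is terminal after one step; action 0 pays 2.0.
    return (None, 2.0 if action == 0 else 0.0, True)

def generate_episode(policy, start_state, start_action):
    """Exploring start: begin from a chosen (s, a), then follow the policy."""
    episode, s, a = [], start_state, start_action
    while True:
        s_next, r, done = step(s, a)
        episode.append((s, a, r))
        if done:
            return episode
        s, a = s_next, policy[s_next]

def mc_es(num_episodes=10_000):
    Q = defaultdict(float)            # action-value estimates
    counts = defaultdict(int)         # number of first visits per (s, a)
    policy = {s: random.choice(ACTIONS) for s in STATES}

    for _ in range(num_episodes):
        # Exploring start: every (s, a) pair has nonzero probability.
        s0, a0 = random.choice(STATES), random.choice(ACTIONS)
        episode = generate_episode(policy, s0, a0)

        # Index of the first visit to each (s, a) pair in this episode.
        first = {}
        for t, (s, a, _) in enumerate(episode):
            first.setdefault((s, a), t)

        # Work backwards, accumulating the return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = GAMMA * G + r
            if first[(s, a)] == t:                        # first-visit update
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
                # Greedy policy improvement at the visited state.
                policy[s] = max(ACTIONS, key=lambda act: Q[(s, act)])
    return policy, Q

if __name__ == "__main__":
    pi, Q = mc_es()
    print("greedy policy:", pi)   # expected {0: 1, 1: 0} for this toy MDP
```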
- The final algorithm is given as: ![[Monte Carlo ES.png]]

## MC without exploring starts

- Two possibilities:
    - Change the policy update to move "towards" the greedy policy (e.g. an ε-soft policy), but keep exploring. This still satisfies [[Dynamic Programming (RL)#Generalized Policy Iteration]]. We use data from the policy we are updating: **on-policy**. A minimal ε-soft sketch is given at the end of this note.
    - Use two policies: a non-greedy behaviour policy and a greedy target policy. We use data from a different policy than the one we are updating: **off-policy**.

## Off-policy prediction via importance sampling

---

## References

1. Chapter 5, *Reinforcement Learning: An Introduction*, 2nd Edition, Sutton and Barto.
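As a companion to the on-policy bullet above, a minimal sketch of on-policy first-visit MC control with an ε-greedy (ε-soft) policy, reusing the same made-up toy MDP as the Monte Carlo ES sketch. Exploration now comes from ε-greedy action selection rather than exploring starts, and the improved policy is implicit in how actions are chosen from Q. This is an illustration only, not the book's pseudocode.

```python
# Sketch of on-policy first-visit MC control with an eps-greedy (eps-soft) policy.
# Same made-up two-state MDP as in the Monte Carlo ES sketch above.
import random
from collections import defaultdict

STATES, ACTIONS = [0, 1], [0, 1]
GAMMA, EPS = 1.0, 0.1

def step(state, action):
    """Toy dynamics: returns (next_state, reward, done)."""
    if state == 0:
        return (1, 1.0, False) if action == 1 else (1, 0.0, False)
    return (None, 2.0 if action == 0 else 0.0, True)

def eps_greedy(Q, s):
    """With probability EPS explore uniformly, otherwise act greedily w.r.t. Q."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def on_policy_mc_control(num_episodes=10_000):
    Q = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(num_episodes):
        # Generate an episode under the current eps-greedy policy
        # (fixed start state, no exploring starts needed).
        episode, s, done = [], 0, False
        while not done:
            a = eps_greedy(Q, s)
            s_next, r, done = step(s, a)
            episode.append((s, a, r))
            s = s_next
        # First-visit return averaging, scanning backwards.
        first = {}
        for t, (s, a, _) in enumerate(episode):
            first.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = GAMMA * G + r
            if first[(s, a)] == t:
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
    return Q

if __name__ == "__main__":
    Q = on_policy_mc_control()
    print({sa: round(v, 2) for sa, v in Q.items()})
```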