# Dyna-Q - Planning and Learning

## Models

Some models produce a description of all possibilities and their probabilities; these we call *distribution models*. Other models produce just one of the possibilities, sampled according to the probabilities; these we call *sample models*. Given a starting state and a policy, a sample model could produce an entire episode, and a distribution model could generate all possible episodes and their probabilities. In either case, we say the model is used to *simulate* the environment and produce *simulated experience*.

$\text{model} \longrightarrow \text{planning} \longrightarrow \text{policy}$

Two basic ideas:

1. All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy.
2. They compute value functions by updates or backup operations applied to simulated experience.

$\text{model} \longrightarrow \text{simulated experience} \stackrel{\text{updates}}{\longrightarrow} \text{values} \longrightarrow \text{policy}$

## Dyna-Q

When planning is done online, while interacting with the environment, a number of interesting issues arise:

- New information gained from the interaction may change the model and thereby interact with planning.
- It may be desirable to customize the planning process in some way to the states or decisions currently under consideration, or expected in the near future.
- If decision making and model learning are both computation-intensive processes, then the available computational resources may need to be divided between them.

Dyna-Q is a simple architecture integrating the major functions needed in an online planning agent. Within a planning agent, there are at least two roles for real experience:

1. It can be used to improve the model (to make it more accurately match the real environment). This is called *model-learning*.
2. It can be used to directly improve the value function and policy using reinforcement learning methods of the usual kinds. This is called *direct reinforcement learning*.

![[Planning and learning.png]]

Dyna-Q includes all of the processes shown in the diagram above (planning, acting, model-learning, and direct RL), all occurring continually. The model-learning method is table-based and assumes the environment is deterministic: if the model is queried with a state–action pair that has been experienced before, it simply returns the last-observed next state and next reward as its prediction. The direct RL method is one-step tabular Q-learning, and the planning method is random-sample one-step tabular Q-planning.

![[Q-Planning.png]]

Learning and planning are deeply integrated in the sense that they share almost all the same machinery, differing only in the source of their experience.

![[Dyna-Q.png]]

The tabular Dyna-Q algorithm is shown below:

![[Tabular Dyna-Q.png]]
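
To make that integration concrete, here is a minimal Python sketch of the tabular Dyna-Q loop: direct one-step Q-learning on each real transition, a deterministic table model, and `n_planning` random-sample one-step Q-planning updates per real step. The `env` interface (`reset()`, `step(state, action)`, `actions`) and all names and hyperparameters are illustrative assumptions for this sketch, not part of any particular library.

```python
import random
from collections import defaultdict

def tabular_dyna_q(env, num_episodes=50, n_planning=10,
                   alpha=0.1, gamma=0.95, epsilon=0.1):
    """Sketch of tabular Dyna-Q.

    Assumes `env.reset() -> state`, `env.step(state, action) -> (next_state,
    reward, done)`, and `env.actions` (a list of discrete actions).
    """
    Q = defaultdict(float)   # Q[(s, a)] action values, default 0.0
    model = {}               # model[(s, a)] = (r, s') for a deterministic environment

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(s, a)

            # Direct RL: one-step tabular Q-learning on the real transition.
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # Model-learning: remember the last observed outcome for (s, a).
            model[(s, a)] = (r, s_next)

            # Planning: n random-sample one-step tabular Q-planning updates.
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                p_target = pr + gamma * max(Q[(ps_next, b)] for b in env.actions)
                Q[(ps, pa)] += alpha * (p_target - Q[(ps, pa)])

            s = s_next
    return Q
```

Setting `n_planning = 0` reduces the agent to plain one-step Q-learning, which makes visible how planning and direct RL share the same update machinery and differ only in where their experience comes from.
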
## Dyna-Q+

The Dyna-Q+ agent keeps track, for each state–action pair, of how many time steps have elapsed since the pair was last tried in a real interaction with the environment. The more time that has elapsed, the greater (we might presume) the chance that the dynamics of this pair have changed and that the model of it is incorrect. To encourage behavior that tests long-untried actions, a special "bonus reward" is given on simulated experiences involving these actions. In particular, if the modeled reward for a transition is $r$, and the transition has not been tried in $\tau$ time steps, then planning updates are done as if that transition produced a reward of $r + \kappa \sqrt{\tau}$, for some small $\kappa$. This encourages the agent to keep testing all accessible state transitions and even to find long sequences of actions in order to carry out such tests. Of course, all this testing has its cost, but in many cases, as in the shortcut maze, this kind of computational curiosity is well worth the extra exploration.
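
As a rough illustration of the bonus, here is a sketch of a single Dyna-Q+ planning update. It assumes `Q` is a `defaultdict(float)` keyed by `(state, action)`, `model[(s, a)] = (reward, next_state)` is the deterministic table model, and `last_visit` records the real time step at which each pair was last tried; all of these names are assumptions for the sketch, not from the original pseudocode.

```python
import math
import random

def dyna_q_plus_planning_update(Q, model, last_visit, current_time, actions,
                                alpha=0.1, gamma=0.95, kappa=1e-3):
    """One random-sample planning update with the Dyna-Q+ exploration bonus."""
    (s, a), (r, s_next) = random.choice(list(model.items()))

    # tau = number of real time steps since (s, a) was last tried.
    tau = current_time - last_visit.get((s, a), 0)

    # Plan as if the simulated transition produced reward r + kappa * sqrt(tau).
    bonus_reward = r + kappa * math.sqrt(tau)

    target = bonus_reward + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

With `kappa = 0` this is an ordinary Dyna-Q planning update; a larger `kappa` makes long-untried transitions look temporarily more rewarding during planning, which is what pushes the agent back out to re-test them.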