# Model Based Reinforcement Learning

Model-based reinforcement learning (MBRL) is widely seen as having the potential to be significantly more sample efficient than model-free RL (MFRL), whose high sample complexity largely limits it to simulated domains.

Advantages:
- The model can be learned efficiently with supervised learning methods.
- Model uncertainty can be reasoned about explicitly (e.g., in upper-confidence-bound style methods), which helps with the exploration/exploitation trade-off.
- Generalization - if the dynamics (or reward) of the environment change, the learned model can be reused to replan.

Disadvantages:
- First learn a model, then construct a value function from it -> two sources of approximation error.

However, there are significant challenges.

## Challenges of MBRL
These are the challenges identified by [[Benchmarking Model-Based Reinforcement Learning]]:

### Dynamics bottleneck
- Performance does not keep improving as more data is collected.
- Agents with learned dynamics get stuck at performance local minima significantly worse than agents using ground-truth dynamics.
- Prediction error accumulates over time, and MBRL inevitably involves predicting from unseen states.
- The policy and the learning of the dynamics are coupled, which makes agents more prone to performance local minima.
- Exploration and off-policy learning are barely addressed in current model-based approaches.

### Planning horizon dilemma
- While increasing the planning horizon provides a more accurate reward estimate, it can also cause performance to drop.
- A planning horizon between 20 and 40 works best, both for models using ground-truth dynamics and for those using learned dynamics.
- This can be attributed to insufficient planning in a search space that grows exponentially with planning depth, i.e., the curse of dimensionality.

### Early termination dilemma
- Early termination, i.e., ending the episode before the full horizon is reached, is a standard technique in MFRL algorithms to keep the agent away from unpromising states or from states that would damage a real robot.
- MBRL could correspondingly apply early termination in the planned trajectories, or generate early-terminated imaginary data, but this is hard to integrate into existing model-based algorithms.
- In practice, early termination decreases performance for MBRL algorithms of different types.
- Efficient learning in complex environments such as Humanoid makes early termination almost necessary, so this is an important area for research.

## Types of model based techniques
The categories below roughly follow the taxonomy in [1].

### Analytic gradient based
Backpropagate through the learned dynamics model to compute policy gradients directly.

### Sampling-based planning
Use the model to simulate candidate action sequences and execute the best one, typically inside a model-predictive-control loop (see the sketch after the references).

### Model-based data generation
Use the model to generate imagined transitions that augment the training data of a model-free learner, as in Dyna or MBPO [1] (see the sketch after the references).

### Value-equivalence prediction
Learn a model that only has to predict quantities relevant to planning (rewards, values, policies) rather than reconstruct future observations.

---
## References
1. https://bair.berkeley.edu/blog/2019/12/12/mbpo/
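
As a minimal illustration of the sampling-based planning category, the sketch below runs random-shooting MPC on top of a learned model. The `dynamics_model` and `reward_fn` callables, the default `horizon`, and the candidate count are assumptions made for the example, not part of any particular benchmarked algorithm.

```python
import numpy as np

def random_shooting_plan(state, dynamics_model, reward_fn, action_dim,
                         horizon=20, n_candidates=500,
                         action_low=-1.0, action_high=1.0):
    """Return the first action of the best of `n_candidates` random action sequences.

    `dynamics_model(state, action) -> next_state` and
    `reward_fn(state, action) -> float` are assumed interfaces for a learned
    dynamics model and a (known or learned) reward function.
    """
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        # Sample one open-loop action sequence of length `horizon`.
        actions = np.random.uniform(action_low, action_high, size=(horizon, action_dim))
        s, total_reward = state, 0.0
        for a in actions:
            total_reward += reward_fn(s, a)
            s = dynamics_model(s, a)  # prediction error accumulates step by step
        if total_reward > best_return:
            best_return, best_first_action = total_reward, actions[0]
    # Model-predictive control: execute only the first action, then replan
    # from the next state actually observed in the environment.
    return best_first_action
```

The `horizon` argument here is exactly the knob behind the planning horizon dilemma: longer rollouts give a fuller return estimate, but model error compounds and the number of candidate sequences needed to cover the search space grows quickly with depth.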
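
For the model-based data generation category, a Dyna-style loop is the simplest instance: real transitions train the dynamics model by supervised learning, and short imagined rollouts augment the data seen by a model-free learner. Everything here (`env`, `model.fit`/`model.predict`, `agent.act`/`agent.update`, and the default lengths) is a hypothetical interface used only to show the control flow.

```python
import random

def dyna_style_training(env, model, agent, n_iterations=100,
                        real_steps_per_iter=1000,
                        model_rollouts_per_iter=400, rollout_length=5):
    """Alternate between real data collection, supervised model fitting,
    and short imagined rollouts that feed a model-free learner.

    `env`, `model` (fit/predict) and `agent` (act/update) are hypothetical
    interfaces used only to illustrate the structure of the loop.
    """
    real_buffer, model_buffer = [], []
    for _ in range(n_iterations):
        # 1. Collect real transitions with the current policy.
        state = env.reset()
        for _ in range(real_steps_per_iter):
            action = agent.act(state)
            next_state, reward, done, _ = env.step(action)
            real_buffer.append((state, action, reward, next_state, done))
            state = env.reset() if done else next_state

        # 2. Fit the dynamics model by supervised learning on real data.
        model.fit(real_buffer)

        # 3. Branch short imagined rollouts from previously visited real states;
        #    keeping them short limits the compounding prediction error.
        for _ in range(model_rollouts_per_iter):
            s = random.choice(real_buffer)[0]
            for _ in range(rollout_length):
                a = agent.act(s)
                s_next, r = model.predict(s, a)
                model_buffer.append((s, a, r, s_next, False))
                s = s_next

        # 4. Update the model-free agent on both real and imagined data.
        agent.update(real_buffer + model_buffer)
```

Keeping `rollout_length` small and branching rollouts from real states is how approaches such as MBPO [1] limit the compounding prediction error described under the dynamics bottleneck.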