# Benchmarking Model-Based Reinforcement Learning
Paper: https://arxiv.org/pdf/1907.02057.pdf (Survey Paper, ICML 2019)
Website: http://www.cs.toronto.edu/~tingwuwang/mbrl.html
Model-bias:
- Modelling errors cripple the effectiveness of MBRL algorithms: the learned policies exploit the deficiencies of the models, a phenomenon known as model-bias [[#^91649c| [1] ]]
- Recent work alleviates the model-bias problem by characterizing the uncertainty of the learned models by means of probabilistic models and ensembles. [[#^1199c6|[2] ]][[#^7579c6| [3] ]][[#^05604e| [4] ]]
Dynamics Learning:
- MBRL algorithms are characterized by learning a model of the environment.
- After repeated interactions with the environment, the experienced transitions are stored in a dataset $\mathcal{D}=\left\{\left(s_{t}, a_{t}, s_{t+1}\right)\right\}$ which is then used to learn a dynamics function $\tilde{f}_{\phi}$
- if the ground-truth dynamics are deterministic, the learned dynamics function directly predicts the next state.
- if they are stochastic, it is common to represent the dynamics with a Gaussian distribution $p\left(s_{t+1} \mid s_{t}, a_{t}\right)=\mathcal{N}\left(\mu\left(s_{t}, a_{t}\right), \Sigma\left(s_{t}, a_{t}\right)\right)$ (a training sketch follows this list)
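A minimal training sketch for such a probabilistic dynamics model, assuming PyTorch and a diagonal Gaussian parameterization (all class and function names here are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """Predicts p(s_{t+1} | s_t, a_t) as a diagonal Gaussian N(mu, sigma^2)."""

    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden, state_dim)       # mean of next state
        self.log_std_head = nn.Linear(hidden, state_dim)  # log std (aleatoric noise)

    def forward(self, s, a):
        h = self.net(torch.cat([s, a], dim=-1))
        mu = self.mu_head(h)
        log_std = self.log_std_head(h).clamp(-5.0, 2.0)   # keep the variance in a sane range
        return torch.distributions.Normal(mu, log_std.exp())

def nll_loss(model, s, a, s_next):
    """Negative log-likelihood of observed transitions (s, a, s') from the dataset D."""
    dist = model(s, a)
    return -dist.log_prob(s_next).sum(dim=-1).mean()
```

For deterministic ground-truth dynamics, the same network with only the mean head and a squared L2 loss on the next state (or on the state difference $s_{t+1}-s_{t}$) is the common choice.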
## Dyna-style algorithms
- learn policies by running model-free algorithms on rich imaginary experience generated by the learned dynamics, without further interaction with the real environment (a rollout sketch follows this list)
- Model-Ensemble Trust-Region Policy Optimization (ME-TRPO) [[#^05604e| [4] ]]
- uses an ensemble of neural networks to model the dynamics, which effectively combats model-bias
- In the policy improvement step, the policy is updated using [[TRPO - Trust-Region Policy Optimization]], on experience generated by the learned dynamics models.
- Stochastic Lower Bound Optimization (SLBO) [[#^757c77| [5] ]]
- variant of ME-TRPO with theoretical guarantees of monotonic improvement.
- instead of using single-step squared L2 loss, SLBO uses a multi-step L2-norm loss to train the dynamics.
- Model-Based Meta-Policy-Optimization (MB-MPO) [[#^7579c6| [3] ]]
- forgoes the reliance on accurate models by meta-learning a policy that is able to adapt to different dynamics.
- learns an ensemble of NNs, but each model in the ensemble is treated as a different task to meta-train on
- the learned policy adapts to any of the different dynamics in the ensemble, which makes it more robust against model-bias
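The rollout sketch referenced above: a minimal, illustrative version of how Dyna-style methods generate imaginary experience, assuming an ME-TRPO-like ensemble where each imagined step queries a randomly chosen ensemble member (the `policy`, `ensemble`, and `reward_fn` interfaces are assumptions for illustration):

```python
import random

def imaginary_rollouts(policy, ensemble, start_states, reward_fn, horizon=100):
    """Generate Dyna-style imagined trajectories from an ensemble of learned
    dynamics models. Sampling a different ensemble member at every step keeps
    the policy from exploiting the errors of any single model.

    Assumed interfaces: policy(s) -> a, model(s, a) -> s_next, reward_fn(s, a) -> r.
    """
    trajectories = []
    for s in start_states:
        traj = []
        for _ in range(horizon):
            a = policy(s)
            model = random.choice(ensemble)   # pick one dynamics model per step
            s_next = model(s, a)
            traj.append((s, a, reward_fn(s, a), s_next))
            s = s_next
        trajectories.append(traj)
    return trajectories
```

The policy-improvement step then runs a model-free update (TRPO in ME-TRPO) on these imagined trajectories; ME-TRPO additionally monitors the policy's return across the ensemble members and stops the improvement step once it no longer improves on most of the models.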
### Results
- MB-MPO surpasses the performance of ME-TRPO in most of the environments and achieves the best performance in domains like HalfCheetah
- Dyna-style algorithms perform the best when the task horizon is short
- SLBO can solve MountainCar and Reacher very efficiently; more interestingly, in complex environments it achieves better performance than ME-TRPO and MB-MPO, except in SlimHumanoid.
- This category of algorithms is not efficient at solving long-horizon, complex domains due to the compounding-error effect.
- MB-MPO proves to be very robust against noise.
## Policy search with BPTT
- policy search with backpropagation through time exploits the model derivatives
- these methods compute the analytic gradient of the RL objective with respect to the policy parameters and improve the policy accordingly
- Probabilistic Inference for Learning Control (PILCO) [[#^91649c| [1] ]]
- [[Gaussian Processes]] are used to model the dynamics of the environment.
- The training process iterates between collecting data using the current policy and improving the policy.
- Inference in GPs does not scale.
- Stochastic Value Gradients (SVG) [[#^868439| [6] ]]
- SVG tackles the problem of compounding model errors by using observations from the real environment, instead of the imagined one.
- dynamics models in SVG are probabilistic
- the policy is improved by computing the analytic gradient of the return along real trajectories with respect to the policy
- the re-parametrization trick is used to permit backpropagation through the stochastic dynamics (see the sketch after this list)
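The sketch referenced above is a simplified illustration of the re-parametrization idea: rolling a policy through a differentiable Gaussian dynamics model so that the imagined return can be backpropagated into the policy parameters. This is not the full SVG algorithm (which re-uses real observations and infers the noise variables from data); PyTorch and the interfaces below are assumptions for illustration:

```python
import torch

def bptt_policy_loss(policy, dynamics, reward_fn, s0, horizon=15):
    """Differentiable imagined return: gradients flow from the rewards back
    through the reparameterized dynamics samples into the policy parameters.

    Assumed interfaces: policy(s) -> a (differentiable),
    dynamics(s, a) -> torch.distributions.Normal over s',
    reward_fn(s, a) -> differentiable reward.
    """
    s, total_return = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        total_return = total_return + reward_fn(s, a).sum()
        # rsample() applies the re-parametrization trick: s' = mu + sigma * eps
        s = dynamics(s, a).rsample()
    return -total_return  # minimize the negative return

# Usage sketch: loss = bptt_policy_loss(policy, dynamics, reward_fn, s0)
#               loss.backward(); optimizer.step()
```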
### Results
- For the majority of the tasks SVG does not have the best sample efficiency, but in the Humanoid environments it is very effective compared with other MBRL algorithms
- Because SVG uses real observations and a value function to look into future returns, it is able to surpass other MBRL algorithms in these high-dimensional domains.
- PILCO fails to solve most of the other environments with longer episode lengths and larger observation sizes, and is unstable.
## Shooting Algorithms
- provide a way to approximately solve the receding horizon problem
- popularity has increased with the use of neural networks for modelling dynamics
- Random Shooting (RS) [[#^ad8982| [8] ]] [[#^6ed903| [9] ]]
- RS optimizes the action sequence $\boldsymbol{a}_{t: t+\tau}$ to maximize the expected planning reward under the learned dynamics model, i.e., $\max _{\boldsymbol{a}_{t: t+\tau}} \mathbb{E}_{s_{t^{\prime}} \sim \tilde{f}_{\phi}}\left[\sum_{t^{\prime}=t}^{t+\tau} r\left(s_{t^{\prime}}, a_{t^{\prime}}\right)\right]$.
- agent generates K candidate random sequences of actions from a uniform distribution, and evaluates each candidate using the learned dynamics
- The optimal action sequence is approximated as the one with the highest return.
- The agent applies only the first action from the optimal sequence and re-plans at every time-step.
- Model-Free Model-Based (MB-MF) [[#^c75bc6| [7] ]]
- Random shooting has worse asymptotic performance when compared with model-free algorithms
- MB-MF first trains an RS controller $\pi_{RS}$, and then distills the controller into a neural network policy $\pi_{\theta}$ using DAgger, which minimizes $D_{KL}\left(\pi_{\theta}\left(s_{t}\right), \pi_{RS}\right)$
- the policy is then fine-tuned using standard model-free algorithms ([[TRPO - Trust-Region Policy Optimization]] is used)
- Probabilistic Ensembles with Trajectory Sampling (PETS-RS and PETS-CEM) [[#^1199c6| [2] ]]
- Dynamics are modelled by an ensemble of probabilistic neural networks models, which captures both epistemic uncertainty from limited data and network capacity, and aleatoric uncertainty from the stochasticity of the ground-truth dynamics
- Except for the difference in modeling the dynamics, PETS-RS is the same as RS
- In PETS-CEM, the online optimization problem is solved using the cross-entropy method (CEM) to obtain a better solution (a planning sketch follows this list)
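A planning sketch for the shooting methods above, assuming NumPy and batched `dynamics`/`reward_fn` callables (names are illustrative); it implements plain random shooting, with a note on how CEM differs:

```python
import numpy as np

def random_shooting(s0, dynamics, reward_fn, action_dim, horizon=30,
                    num_candidates=1000, action_low=-1.0, action_high=1.0):
    """One random-shooting MPC step: sample K candidate action sequences
    uniformly, score them under the learned dynamics, and return the first
    action of the best sequence (the agent re-plans at every time-step).

    Assumed interfaces: dynamics(s, a) -> s_next and reward_fn(s, a) -> r,
    both operating on batches of states/actions.
    """
    # K x tau x action_dim candidate action sequences
    candidates = np.random.uniform(action_low, action_high,
                                   size=(num_candidates, horizon, action_dim))
    states = np.repeat(s0[None, :], num_candidates, axis=0)
    returns = np.zeros(num_candidates)
    for t in range(horizon):
        returns += reward_fn(states, candidates[:, t])
        states = dynamics(states, candidates[:, t])
    best = np.argmax(returns)
    return candidates[best, 0]  # apply only the first action, then re-plan

# PETS-CEM replaces the uniform sampling with an iterative loop that refits a
# Gaussian over action sequences to the top-scoring (elite) candidates.
```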
### Results
- RS is very effective on simple tasks such as InvertedPendulum, CartPole
- as task difficulty increases, RS is gradually surpassed by PETS-RS and PETS-CEM
- this indicates that modelling uncertainty-aware dynamics is crucial for the performance of shooting algorithms
- PETS-CEM is better than PETS-RS in most of the environments
- MB-MF can jump out of performance local-minima in MountainCar
- Shooting algorithms are effective and robust across different environments.
- They also perform well in noisy environments, suggesting that re-planning successfully compensates for the uncertainty.
## Challenges of MBRL
### Dynamics bottleneck
- the performance does not increase when more data is collected
- models with learned dynamics get stuck at performance local minima significantly worse than using ground-truth dynamics
- The prediction error accumulates with time, and MBRL inevitably involves prediction on unseen states.
- The policy and the learning of the dynamics are coupled, which makes the agents more prone to performance local-minima.
- exploration and off-policy learning are barely addressed in current model-based approaches.
### Planning horizon dilemma
- while increasing the planning horizon provides more accurate reward estimation, it can result in performance drops
- a planning horizon between 20 and 40 works best, both for planners using ground-truth dynamics and for those using learned dynamics.
- this is likely a result of insufficient planning in a search space that increases exponentially with planning depth, i.e., the curse of dimensionality
### Early termination dilemma
- Early termination, when the episode is finalized before the horizon has been reached, is a standard technique in MFRL algorithms to prevent the agent from visiting unpromising states, or states that would damage a real robot.
- MBRL can correspondingly apply early termination in the planned trajectories, or generate early-terminated imaginary data, but it is hard to integrate into the existing MB algorithms.
- early termination does in fact decrease the performance for MBRL algorithms of different types
- to perform efficient learning in complex environments such as Humanoid, early termination is almost necessary, which makes this an important open research question.
---
## References
1. Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011. ^91649c
2. Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. arXiv preprint arXiv:1805.12114, 2018. ^1199c6
3. Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. CoRR, abs/1809.05214, 2018. ^7579c6
4. Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018. ^05604e
5. Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. ICLR, 2019. ^757c77
6. Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015. ^868439
7. Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596, 2017. ^c75bc6
8. Anil V Rao. A survey of numerical methods for optimal control. Advances in the Astronautical Sciences, 135(1):497–528, 2009. ^ad8982
9. Arthur George Richards. Robust constrained model predictive control. PhD thesis, Massachusetts Institute of Technology, 2005. ^6ed903