# Benefits and challenges of different RL methods

## Actor Critic

### Benefits

### Challenges
- Although they deal with the high variance of stochastic gradients by using a learned value function rather than the empirical returns, this introduces bias, which can make the algorithm diverge or converge to a poor solution.

## Policy Gradient

### Benefits
- They directly optimize the cumulative reward and can be used straightforwardly with nonlinear function approximation.

### Challenges
- Data efficiency is a problem: the gradient estimates have high variance, so a large number of samples is required.
    - Value functions can be used as baselines at the cost of some bias, as proposed by [[Generalized Advantage Estimate]] (sketched at the end of this note).
- Stable and steady training is difficult because the incoming data is nonstationary.
    - A bad policy produces bad data, unlike supervised learning, where the data is independent of the learner.
    - This can be mitigated to some degree by preventing the policy from changing drastically based on a small number of samples, as in [[TRPO - Trust-Region Policy Optimization]] and [[PPO - Proximal Policy Optimization]].

## Natural Policy Gradient

### Benefits
- Usually needs less training than regular policy gradients.
- Inherits the advantageous properties of vanilla policy gradients.

### Challenges
- Requires the Fisher information matrix (see the update sketch at the end of this note).
    - Known in closed form only for some standard distributions, e.g. the Gaussian.
- Inherits the disadvantages of policy gradients (e.g., high variance).

## Deterministic Policy Gradient

### Benefits
- In the stochastic case, the policy gradient integrates over both the state and action spaces, whereas in the deterministic case it integrates only over the state space. As a result, computing the stochastic policy gradient may require more samples, especially if the action space has many dimensions (the two gradient theorems are sketched at the end of this note).
- Performs well compared to stochastic policies in high-dimensional action spaces.

### Challenges
- In order to explore the full state and action space, a stochastic policy is often necessary.
    - Use an off-policy algorithm to explore satisfactorily: a stochastic behaviour policy ensures exploration, while a deterministic target policy is learned (see the code sketch at the end of this note).

---
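A minimal sketch of the baseline idea referenced under the Policy Gradient challenges, in standard notation: $V$ is a learned value function, $\gamma$ the discount, and $\lambda$ the exponential-weighting parameter of GAE.

```latex
% Vanilla policy gradient with empirical returns R_t (high variance):
\nabla_\theta J(\theta) = \mathbb{E}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t \right]

% Replacing R_t with an advantage estimate built from a learned value
% function V (lower variance, some bias), as in GAE:
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
\qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}

\nabla_\theta J(\theta) \approx \mathbb{E}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]
```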
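For the Fisher-information point under Natural Policy Gradient, the update in standard notation; $F(\theta)$ is the Fisher information matrix of the policy distribution and $\alpha$ an illustrative step size.

```latex
% Fisher information matrix of the policy (expectation over states and actions):
F(\theta) = \mathbb{E}_{s,a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right]

% Natural gradient ascent: precondition the vanilla gradient with F^{-1}:
\theta \leftarrow \theta + \alpha\, F(\theta)^{-1} \nabla_\theta J(\theta)
```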
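For the integration argument under Deterministic Policy Gradient, the two policy gradient theorems side by side, in the notation of the DPG paper: $\rho^{\pi}$ and $\rho^{\mu}$ are discounted state distributions, $\pi_\theta$ the stochastic policy, and $\mu_\theta$ the deterministic policy.

```latex
% Stochastic policy gradient: expectation over states AND actions.
\nabla_\theta J(\pi_\theta) = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\, \mathrm{d}a\, \mathrm{d}s

% Deterministic policy gradient: expectation over states only.
\nabla_\theta J(\mu_\theta) = \int_{\mathcal{S}} \rho^{\mu}(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}\, \mathrm{d}s
```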
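Finally, a rough code sketch of the last challenge: act with a stochastic behaviour policy (the deterministic policy plus exploration noise, DDPG-style) while the noiseless deterministic policy remains the target being learned. The names here (`behaviour_action`, `noise_scale`, the toy linear `policy`) are illustrative, not from any particular library.

```python
import numpy as np

def behaviour_action(policy, state, noise_scale=0.1, act_low=-1.0, act_high=1.0):
    """Stochastic behaviour policy: deterministic action plus Gaussian noise."""
    a = policy(state)                                     # deterministic mu(s)
    a = a + noise_scale * np.random.randn(*np.shape(a))   # exploration noise
    return np.clip(a, act_low, act_high)                  # keep action in bounds

def target_action(policy, state):
    """Deterministic target policy used for learning / evaluation: no noise."""
    return policy(state)

# Usage with a toy linear policy: 3-dimensional state, 1-dimensional action.
rng = np.random.default_rng(0)
W = rng.normal(size=(1, 3))
policy = lambda s: np.tanh(W @ s)       # bounded deterministic policy

s = rng.normal(size=3)
print(behaviour_action(policy, s))      # noisy action, stored in the replay buffer
print(target_action(policy, s))         # noiseless action of the learned policy
```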