# Benefits and challenges of different RL methods
## Actor Critic
## Benefits
- Reduce the high variance of stochastic policy gradients by using a learned value function (the critic) in place of the empirical returns.
## Challenges
- The learned value function introduces bias, which can make the algorithm diverge or converge to a poor solution; the trade-off is written out below.
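As a rough sketch of that trade-off (notation assumed here: $\pi_\theta$ is the policy, $V_w$ the learned critic, $G_t$ the empirical return): replacing $G_t$ with a one-step bootstrapped target lowers variance, but is biased whenever $V_w \neq V^\pi$.

```latex
% Monte-Carlo policy gradient: unbiased, but G_t has high variance
\nabla_\theta J(\theta)
  = \mathbb{E}\Big[\textstyle\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big],
\qquad G_t = \textstyle\sum_{k \ge t} \gamma^{k-t} r_k

% Actor-critic: replace G_t with a bootstrapped TD target from the critic V_w
\nabla_\theta J(\theta)
  \approx \mathbb{E}\Big[\textstyle\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
    \big(r_t + \gamma V_w(s_{t+1}) - V_w(s_t)\big)\Big]
```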
## Policy Gradient
## Benefits
- They directly optimize the expected cumulative reward and can be used straightforwardly with nonlinear function approximation (the objective and its gradient are written out below).
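For reference, a sketch of the objective and the standard score-function gradient: the gradient only requires $\nabla_\theta \log \pi_\theta$, so any differentiable (e.g. neural-network) policy parameterization can be used.

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_t \gamma^t r_t\Big],
\qquad
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big]
```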
## Challenges
- Data efficiency is a problem: the gradient estimates have high variance, so a large number of samples is required.
- Value functions can be used as baselines at the cost of some bias, as proposed in [[Generalized Advantage Estimate]] (a minimal sketch of the advantage computation follows this list).
- Stable and steady training is difficult because of the nonstationarity of the incoming data.
- A bad policy leads to bad data, unlike supervised learning, where the data are independent of the learning process.
- This can be mitigated to some degree by preventing the policy from changing drastically based on a small number of samples, as in [[TRPO - Trust-Region Policy Optimization]] and [[PPO - Proximal Policy Optimization]].
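A minimal sketch of the baseline/advantage computation referenced above, assuming NumPy for a single episode; the function name and the default `gamma`/`lam` values are illustrative choices, not prescribed by the source.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: exponentially weighted sum of TD errors.

    rewards: shape (T,), rewards r_t of one episode
    values:  shape (T + 1,), value estimates V(s_0) .. V(s_T)
             (the last entry bootstraps the value after the final step)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error at step t
        gae = delta + gamma * lam * gae                          # GAE recursion
        advantages[t] = gae
    return advantages

# Example usage with dummy data
adv = gae_advantages(np.ones(5), np.zeros(6))
```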
## Natural Policy Gradient
## Benefits
- Usually needs less training than regular policy gradients.
- Inherits advantageous properties from vanilla policy gradients.
## Challenges
- Requires the Fisher information matrix (the preconditioned update is written out below).
- The Fisher matrix is known in closed form only for some standard distributions, e.g. the Gaussian.
- Inherits disadvantages from policy gradients (e.g., high variance).
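A sketch of the standard formulation: the natural gradient preconditions the vanilla gradient with the inverse Fisher information matrix of the policy distribution.

```latex
\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \, \nabla_\theta J(\theta),
\qquad
F(\theta) = \mathbb{E}_{s,\, a \sim \pi_\theta}\!\big[
  \nabla_\theta \log \pi_\theta(a \mid s)\,
  \nabla_\theta \log \pi_\theta(a \mid s)^{\top}
\big]

% Closed-form example: a univariate Gaussian policy with parameters (\mu, \sigma)
% has Fisher matrix F = \mathrm{diag}(1/\sigma^2,\; 2/\sigma^2)
```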
## Deterministic Policy Gradient
## Benefits
- In the stochastic case, the policy gradient integrates over both the state and action spaces, whereas in the deterministic case it integrates only over the state space; both forms are written out after this list. As a result, computing the stochastic policy gradient may require more samples, especially if the action space has many dimensions.
- Shows better performance than stochastic policies in high-dimensional action spaces.
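Written out (notation as in the deterministic policy gradient literature, with $\rho$ the discounted state distribution), the stochastic gradient is an expectation over states and actions, while the deterministic one is over states only.

```latex
% Stochastic policy gradient: expectation over states and actions
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\!\big[
      \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)
    \big]

% Deterministic policy gradient: expectation over states only
\nabla_\theta J(\mu_\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\big[
      \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}
    \big]
```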
## Challenges
- In order to explore the full state and action space, a stochastic policy is often necessary.
- This is addressed with off-policy algorithms: a stochastic behaviour policy ensures sufficient exploration, while a deterministic target policy is learned (a minimal sketch follows this list).
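A minimal sketch of that behaviour/target split, assuming NumPy; `actor`, `behaviour_action`, and the noise scale are illustrative names and values, not a specific algorithm's API.

```python
import numpy as np

def behaviour_action(actor, state, noise_scale=0.1):
    """Stochastic behaviour policy: the deterministic actor mu(s) plus exploration noise."""
    a = np.asarray(actor(state))                        # deterministic target policy mu(s)
    return a + noise_scale * np.random.randn(*a.shape)  # exploratory perturbation

# Example with a dummy linear actor on a 3-dimensional state; in practice the
# resulting transitions would be stored and reused for off-policy updates of mu.
actor = lambda s: 0.5 * np.asarray(s)
noisy_a = behaviour_action(actor, np.array([0.2, -0.1, 0.4]))
```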