# Deep Deterministic Policy Gradient (DDPG)
Deterministic because the policy outputs a single action, the one the actor network believes is best, rather than a distribution to sample from.
DDPG is an off-policy algorithm.
DDPG can only be used for environments with continuous action spaces.
DDPG can be thought of as being deep Q-learning for continuous action spaces.
![[DDPG Algorithm.png]]
Based on [[PGT Actor-Critic]]
- Uses two networks (a minimal sketch follows this list)
- Actor(observation) -> action
- Critic(observation, action) -> Q (estimate of reward + discounted Q_next)
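A minimal sketch of the two networks, assuming PyTorch; the hidden width of 256, the tanh squashing, and the `act_limit` bound are illustrative choices, not from the note.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps an observation to a single action."""
    def __init__(self, obs_dim, act_dim, act_limit=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),  # squash into [-1, 1]
        )
        self.act_limit = act_limit

    def forward(self, obs):
        return self.act_limit * self.net(obs)

class Critic(nn.Module):
    """Q-network: maps (observation, action) to a scalar value estimate."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)
```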
Replay buffer stores (obs, action, reward, next obs) transitions (sketch below).
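A minimal replay buffer sketch, assuming NumPy; it also stores a `done` flag, which is standard practice though not listed above, and the capacity and batch size are arbitrary.

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size ring buffer of (obs, action, reward, next obs, done) transitions."""
    def __init__(self, obs_dim, act_dim, size=100_000):
        self.obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.act = np.zeros((size, act_dim), dtype=np.float32)
        self.rew = np.zeros(size, dtype=np.float32)
        self.obs_next = np.zeros((size, obs_dim), dtype=np.float32)
        self.done = np.zeros(size, dtype=np.float32)
        self.ptr, self.count, self.size = 0, 0, size

    def store(self, obs, act, rew, obs_next, done):
        self.obs[self.ptr] = obs
        self.act[self.ptr] = act
        self.rew[self.ptr] = rew
        self.obs_next[self.ptr] = obs_next
        self.done[self.ptr] = done
        self.ptr = (self.ptr + 1) % self.size       # overwrite oldest when full
        self.count = min(self.count + 1, self.size)

    def sample(self, batch_size=128):
        idx = np.random.randint(0, self.count, size=batch_size)
        return (self.obs[idx], self.act[idx], self.rew[idx],
                self.obs_next[idx], self.done[idx])
```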
Training the actor network (minimal sketch after this list)
- assume we have a trained critic; feed the action the actor proposes for an observation into the critic to get a Q value
- maximize that Q (gradient ascent on the critic's output with respect to the actor's parameters)
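A sketch of the actor update, reusing the hypothetical `Actor`/`Critic` classes above; `actor_opt` is assumed to be an optimizer built only over the actor's parameters, so this step leaves the critic unchanged.

```python
def update_actor(actor, critic, actor_opt, obs):
    # Gradients flow through the critic, but actor_opt only holds the
    # actor's parameters, so only the actor is updated here.
    actor_loss = -critic(obs, actor(obs)).mean()  # maximize Q <=> minimize -Q
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return actor_loss.item()
```

For example, `actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)` (the learning rate is illustrative).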
Training the critic network (minimal sketch after this list)
- feed the next observation to the actor to get the next action
- get Q_next from the critic for (next observation, next action)
- minimize the error between Q and the target (reward + discounted Q_next)
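A sketch of the critic update under the same assumptions; it uses the target networks introduced just below to compute Q_next, plus a discount of 0.99 and a `done` mask to cut the bootstrap at episode ends (both illustrative details, not from the note).

```python
import torch
import torch.nn.functional as F

def update_critic(critic, critic_opt, actor_target, critic_target,
                  obs, act, rew, obs_next, done, gamma=0.99):
    with torch.no_grad():
        # Target networks (described below) provide the bootstrap value.
        q_next = critic_target(obs_next, actor_target(obs_next))
        target = rew + gamma * (1.0 - done) * q_next
    # Squared error between Q and (reward + discounted Q_next).
    critic_loss = F.mse_loss(critic(obs, act), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return critic_loss.item()
```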
To break this circular reliance of the networks on each other during training, DDPG uses target networks (copies of the actor and critic).
When training the actor, freeze the critic and update only the actor.
When training the critic, freeze the target networks and use them to get Q_next for the update.
Update the target networks using a moving average of the online networks' parameters (Polyak averaging), an instance of [[Multi-Network Training with Moving Average Target]] (sketch below).
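A sketch of the soft target update (Polyak averaging), assuming PyTorch; tau = 0.005 is a common but illustrative choice.

```python
import torch

def soft_update(target_net, online_net, tau=0.005):
    # target <- (1 - tau) * target + tau * online, parameter by parameter.
    with torch.no_grad():
        for p_targ, p in zip(target_net.parameters(), online_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```

Called for both the target actor and the target critic after each gradient step.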
Training can only start with essentially random actions, since the actor is untrained.
As the actor is trained, it is likely to generate more 'reasonable' actions.
To encourage exploration, add noise: action noise on the actor's output, or parameter noise on the actor network's weights (sketch below).
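A sketch of Gaussian action noise, reusing the hypothetical `Actor` above; the noise scale `sigma` is illustrative, and parameter noise (perturbing the actor's weights directly) is the other option mentioned above.

```python
import numpy as np
import torch

def noisy_action(actor, obs, act_limit=1.0, sigma=0.1):
    # Add Gaussian noise to the deterministic action, then clip to the
    # valid range; parameter noise would instead perturb the actor's weights.
    with torch.no_grad():
        act = actor(torch.as_tensor(obs, dtype=torch.float32)).numpy()
    act = act + sigma * np.random.randn(*act.shape)
    return np.clip(act, -act_limit, act_limit)
```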
## References
1. Aylwin Wei, *DDPG Explained*: https://www.youtube.com/watch?v=oydExwuuUCw
2. OpenAI Spinning Up, *Deep Deterministic Policy Gradient*: https://spinningup.openai.com/en/latest/algorithms/ddpg.html