# Deep Deterministic Policy Gradient (DDPG)

"Deterministic" because the policy outputs a single action, the one it currently believes is best, rather than a distribution over actions. DDPG is an off-policy algorithm. It can only be used for environments with continuous action spaces, and can be thought of as deep Q-learning for continuous action spaces.

![[DDPG Algorithm.png]]

Based on [[PGT Actor-Critic]].

- Uses two networks
    - Actor(observation) -> action
    - Critic(observation, action) -> Q, an estimate of reward + discounted Q_next
- Replay buffer stores (obs, action, reward, obs_next)

Training the actor network
- Assume we already have a trained critic. Feed the actor's action into the critic to get a Q value.
- Update the actor to maximize Q.

Training the critic network
- Feed the next observation to the actor to get the next action.
- Get Q_next from the critic for (obs_next, action_next).
- Update the critic to minimize the error between Q(obs, action) and the target reward + discounted Q_next.

The two networks rely on each other to train. To stabilize this, DDPG keeps target copies of the actor and the critic. When training the actor, the critic is frozen and only the actor is updated. When training the critic, Q_next is computed from the frozen target networks. The target networks are updated as a moving average of the online networks' parameters (Polyak averaging), an instance of [[Multi-Network Training with Moving Average Target]]. See the sketch at the end of this note.

Training starts with essentially random actions (often purely random actions for the first steps, to fill the replay buffer). As the actor trains, its actions become more 'reasonable', which reduces exploration. To encourage exploration, add action noise, or parameter noise on the actor network.

## References

1. DDPG Explained by Aylwin Wei, https://www.youtube.com/watch?v=oydExwuuUCw
2. OpenAI Spinning Up, DDPG, https://spinningup.openai.com/en/latest/algorithms/ddpg.html
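Below is a minimal PyTorch sketch of the update step described above: critic regression toward the TD target computed with the target networks, actor update that maximizes Q with the critic frozen, and a Polyak (moving-average) update of the targets. The network sizes, hyperparameters (`GAMMA`, `TAU`, learning rates, `noise_std`), and the batch layout are illustrative assumptions, not taken from the references.

```python
# Minimal DDPG update sketch (illustrative assumptions, not a full implementation).
import copy
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, GAMMA, TAU = 8, 2, 0.99, 0.005  # assumed sizes and hyperparameters

class Actor(nn.Module):
    """Actor(observation) -> action, squashed to [-1, 1] with tanh."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 64), nn.ReLU(),
            nn.Linear(64, ACT_DIM), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Critic(observation, action) -> Q value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1))
    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

actor, critic = Actor(), Critic()
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(batch):
    # batch is assumed to be tensors sampled from the replay buffer;
    # rew and done have shape [batch_size, 1].
    obs, act, rew, obs_next, done = batch

    # Critic update: minimize (Q(obs, act) - target)^2,
    # where target = reward + gamma * Q_target(obs_next, actor_target(obs_next)).
    with torch.no_grad():  # targets come from the frozen target networks
        act_next = actor_target(obs_next)
        target = rew + GAMMA * (1.0 - done) * critic_target(obs_next, act_next)
    critic_loss = ((critic(obs, act) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: maximize Q(obs, actor(obs)) while the critic is frozen.
    for p in critic.parameters():
        p.requires_grad_(False)
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    for p in critic.parameters():
        p.requires_grad_(True)

    # Polyak (moving-average) update of both target networks.
    with torch.no_grad():
        for net, tgt in ((actor, actor_target), (critic, critic_target)):
            for p, p_t in zip(net.parameters(), tgt.parameters()):
                p_t.mul_(1.0 - TAU).add_(TAU * p)

def act(obs, noise_std=0.1):
    # Exploration: actor output plus Gaussian action noise, clipped to the action range.
    with torch.no_grad():
        a = actor(obs) + noise_std * torch.randn(ACT_DIM)
    return a.clamp(-1.0, 1.0)
```

A full training loop would also need the replay buffer itself and a warm-up phase of random actions before the actor's (noisy) output is used for collection.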