# Deep Deterministic Policy Gradient

Deterministic because the policy always takes the single action the actor network believes is best. Based on [[PGT Actor-Critic]].

- Uses two networks
    - Actor(observation) -> action
    - Critic(observation, action) -> Q (estimate of reward + discounted Q_next)
- Off-policy algorithm
    - Replay buffer stores (obs, action, reward, obs_next)

Training the actor network
- Assume we already have a trained critic network
- Feed the action from the actor network into the critic network to get a Q value
- Update the actor to maximize that Q

Training the critic network
- Feed the next observation to the actor and get the next action
- Get Q_next from the critic network
- Minimize Q - (reward + discounted Q_next)

To break this reliance of the two networks on each other during training, DDPG uses target networks:
- When training the actor, freeze the critic and update only the actor.
- When training the critic, freeze the target networks and use them to get Q_next for the update target.
- Update each target network as a moving average of the corresponding network's individual parameters (soft update).

Exploration
- Training can only start with completely random actions; once the actor is being trained, it is likely to generate more 'reasonable' actions.
- To keep encouraging exploration, add action noise, or parameter noise to the actor network.

A minimal sketch of these updates appears after the references.

---

## References

1. DDPG Explained by Aylwin Wei. https://www.youtube.com/watch?v=oydExwuuUCw
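
## Minimal update sketch

A PyTorch-style sketch of the update loop described above, written as a small self-contained example. All network sizes, names, and hyperparameters (hidden width, gamma, tau, noise scale) are illustrative assumptions, not taken from the video; a target actor and target critic pair stands in for "the target network" above.

```python
# Hypothetical minimal DDPG update step; shapes and hyperparameters are illustrative.
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
gamma, tau = 0.99, 0.005            # discount factor and soft-update rate (assumed values)

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

actor = mlp(obs_dim, act_dim, nn.Tanh())    # Actor(observation) -> action in [-1, 1]
critic = mlp(obs_dim + act_dim, 1)          # Critic(observation, action) -> Q
actor_target = copy.deepcopy(actor)         # target copies, updated only by moving average
critic_target = copy.deepcopy(critic)

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def q(net, obs, act):
    return net(torch.cat([obs, act], dim=-1))

def select_action(obs, noise_std=0.1):
    # Deterministic action from the actor plus Gaussian action noise for exploration.
    with torch.no_grad():
        return (actor(obs) + noise_std * torch.randn(act_dim)).clamp(-1.0, 1.0)

def update(obs, act, rew, obs_next, done):
    # Critic update: minimize (Q - (reward + discounted Q_next))^2,
    # with Q_next coming from the frozen target networks.
    with torch.no_grad():
        act_next = actor_target(obs_next)
        q_next = q(critic_target, obs_next, act_next)
        target = rew + gamma * (1.0 - done) * q_next
    critic_loss = ((q(critic, obs, act) - target) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize Q(obs, actor(obs)); only the actor's optimizer steps,
    # so the critic is effectively held fixed here.
    actor_loss = -q(critic, obs, actor(obs)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update: each target parameter is a moving average of its counterpart.
    for net, target_net in ((actor, actor_target), (critic, critic_target)):
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.data.mul_(1.0 - tau).add_(tau * p.data)

# Dummy replay-buffer batch of (obs, action, reward, obs_next, done) transitions.
batch_size = 32
update(torch.randn(batch_size, obs_dim),
       torch.rand(batch_size, act_dim) * 2 - 1,
       torch.randn(batch_size, 1),
       torch.randn(batch_size, obs_dim),
       torch.zeros(batch_size, 1))
a = select_action(torch.randn(obs_dim))     # noisy action for environment interaction
```

One design note: the actor loss backpropagates through the critic, but because only `actor_opt.step()` is called, the critic's weights do not change during the actor update; the stale critic gradients are cleared by the next `critic_opt.zero_grad()`.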