# Advantage Functions
The advantage function A(s_t, a_t) = Q(s_t, a_t) - V(s_t) measures how much better taking action a_t is than the policy's average action in state s_t, since V(s_t) is the expectation of Q(s_t, a) over actions drawn from the policy.
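As a concrete sketch, advantages can be estimated by substituting sampled returns R_t for Q(s_t, a_t) and subtracting a learned value estimate; the array names and numbers below are illustrative assumptions, not part of the note:

```python
import numpy as np

def advantage(returns: np.ndarray, values: np.ndarray) -> np.ndarray:
    """A(s_t, a_t) ≈ R_t - V(s_t): the sampled return R_t stands in for
    Q(s_t, a_t); values holds the critic's estimates of V(s_t)."""
    return returns - values

# Hypothetical three-step rollout.
returns = np.array([2.0, 1.5, 0.5])   # R_t at each step
values = np.array([1.8, 1.6, 0.7])    # V(s_t) at each state
print(advantage(returns, values))      # ≈ [ 0.2 -0.1 -0.2]
```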
In policy gradient methods, subtracting a baseline b_t(s_t) from the return reduces the variance of the gradient estimator ([[Control Variates]]) while keeping it unbiased, because the baseline does not depend on the action: ∇θ log π(a_t|s_t; θ)(R_t - b_t(s_t)). Choosing the state value V(s_t) as the baseline turns the scaling term R_t - V(s_t) into an advantage estimate, since the sampled return R_t is an estimate of Q(s_t, a_t).
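A minimal PyTorch sketch of this estimator, assuming Monte Carlo returns and a separately trained critic; the tensor names and the toy batch are illustrative:

```python
import torch
from torch.distributions import Categorical

def pg_loss(log_probs: torch.Tensor, returns: torch.Tensor,
            values: torch.Tensor) -> torch.Tensor:
    # Score-function estimator with a state-value baseline:
    # loss = -mean(log pi(a_t|s_t) * (R_t - V(s_t))).
    # detach() stops this loss from backpropagating into the critic;
    # the critic is fit separately (e.g. by regression onto R_t).
    advantages = returns - values.detach()
    return -(log_probs * advantages).mean()

# Hypothetical batch: policy logits, sampled actions, returns, values.
logits = torch.randn(4, 3, requires_grad=True)
dist = Categorical(logits=logits)
actions = dist.sample()
log_probs = dist.log_prob(actions)
returns = torch.tensor([1.0, 0.5, 2.0, -0.5])
values = torch.tensor([0.8, 0.9, 1.5, 0.0])

pg_loss(log_probs, returns, values).backward()  # gradients flow to logits
```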
The advantage function gives positive feedback for better-than-average actions and negative feedback for worse-than-average ones, so updates push probability mass toward actions that outperform the current policy's expectation, which speeds learning.