# Cross entropy

Cross entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme optimized for $q$, rather than for the true distribution $p$, is used.

$H(p, q) = -\sum_i p_i \log_2(q_i)$

If the predicted distribution $q$ equals the true distribution $p$, cross entropy reduces to the entropy $H(p)$.

$H(p, q) = H(p) + D_{KL}(p \| q)$

The number of extra bits by which cross entropy exceeds entropy is the KL divergence $D_{KL}(p \| q)$.
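A minimal NumPy sketch of these definitions (the distributions `p` and `q` below are illustrative examples, not taken from the note), checking that $H(p, q) = H(p) + D_{KL}(p \| q)$ and that $H(p, p) = H(p)$:

```python
import numpy as np

def cross_entropy(p, q):
    """Average bits to encode events drawn from p using a code optimized for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

def entropy(p):
    """Average bits to encode events drawn from p using the optimal code for p."""
    p = np.asarray(p, dtype=float)
    nz = p > 0  # by convention 0 * log2(0) contributes 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl_divergence(p, q):
    """Extra bits paid for coding with q instead of the true distribution p."""
    return cross_entropy(p, q) - entropy(p)

# Example distributions (assumed for illustration)
p = [0.5, 0.25, 0.25]   # true distribution
q = [0.25, 0.25, 0.5]   # predicted distribution

print(cross_entropy(p, q))   # 1.75 bits
print(entropy(p))            # 1.5 bits
print(kl_divergence(p, q))   # 0.25 bits: the gap between the two values above

# When q equals p, cross entropy reduces to entropy
print(np.isclose(cross_entropy(p, p), entropy(p)))  # True
```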