# Focal Loss
Focal loss is amazing and severely underrated. It down-weights easy, confidently classified examples, which shifts the gradient signal toward difficult, misclassified ones.
$$
FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)
$$
where $p_t$ is the model's predicted probability of the true class. It introduces two hyperparameters:
**Alpha (α):** A static class-balancing weight, typically set inversely proportional to class frequency.
**Gamma (γ):** A dynamic focusing parameter that modulates each example's loss based on prediction confidence—higher γ more aggressively suppresses easy/well-classified examples while preserving the loss signal from hard/misclassified ones, regardless of their class.
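As a concrete sketch, here is a minimal binary focal loss in PyTorch; the function name `binary_focal_loss` and its defaults (α = 0.25, γ = 2) are illustrative choices on my part, not a reference implementation:
```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0, reduction="mean"):
    """Focal loss for binary classification.

    logits:  raw model outputs, shape (N,)
    targets: 0/1 labels, shape (N,)
    alpha:   static weight for the positive class (1 - alpha for the negative class)
    gamma:   focusing parameter; gamma = 0 recovers alpha-weighted cross-entropy
    """
    # Per-example cross-entropy, -log(p_t), computed in a numerically stable way.
    ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")

    # p_t is the predicted probability of the true class.
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1.0 - p)

    # Class-balancing weight alpha_t and the modulating factor (1 - p_t)^gamma.
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))
    loss = alpha_t * (1.0 - p_t) ** gamma * ce

    if reduction == "mean":
        return loss.mean()
    if reduction == "sum":
        return loss.sum()
    return loss
```
For multi-class problems the same idea applies on top of softmax cross-entropy, with α typically a per-class weight vector set roughly inversely proportional to class frequency.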
The table below shows how the modulating factor scales the loss at different confidence levels (γ = 2, α = 1):
|$p_t$|Difficulty|CE loss|$(1-p_t)^2$|Focal loss|Reduction (CE / FL)|
|---|---|---|---|---|---|
|0.95|Very easy|0.051|0.0025|0.00013|~400×|
|0.9|Easy|0.105|0.01|0.00105|100×|
|0.7|Moderate|0.357|0.09|0.032|11×|
|0.5|Hard|0.693|0.25|0.173|4×|
|0.1|Very hard|2.303|0.81|1.865|1.2×|
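A few lines of plain Python reproduce these rows (α fixed at 1):
```python
import math

gamma = 2.0
for p_t in (0.95, 0.9, 0.7, 0.5, 0.1):
    ce = -math.log(p_t)          # cross-entropy, -log(p_t)
    mod = (1.0 - p_t) ** gamma   # modulating factor
    fl = mod * ce                # focal loss with alpha = 1
    print(f"p_t={p_t:4}  CE={ce:.3f}  (1-p_t)^2={mod:.4f}  "
          f"FL={fl:.5f}  reduction={ce / fl:.1f}x")
```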
Because α addresses class imbalance and γ addresses example difficulty, focal loss is particularly effective when the imbalance correlates with difficulty, which it often does.
The dynamic weighting mechanism can also be thought of as somewhat related to curriculum learning.
## Calibration and Overconfidence Regularization
Focal loss reduces the gradient contribution from examples on which the model is already confident. This implicitly discourages the network from pushing logits to extremes just to squeeze out marginal loss improvements on easy examples, which is precisely what causes overconfidence. For example, suppose a sample is genuinely ambiguous: the true probability of the positive class is more like 0.8, but the dataset records it as 1.0 because labels are binary. Cross-entropy keeps pushing the predicted probability toward 1.0 to eke out further loss reductions, while focal loss's modulating factor damps that pressure once the prediction is already confident.
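To make the damping concrete, here is a small autograd sketch (PyTorch; the logit values are arbitrary illustrations) comparing the gradient of the loss with respect to the logit for a single positive example under cross-entropy and under the focal term with γ = 2:
```python
import torch

def grad_wrt_logit(loss_fn, logit_value):
    """Return d(loss)/d(logit) for a single positive (target = 1) example."""
    z = torch.tensor(logit_value, requires_grad=True)
    loss_fn(z).backward()
    return z.grad.item()

gamma = 2.0
ce = lambda z: -torch.log(torch.sigmoid(z))
fl = lambda z: (1.0 - torch.sigmoid(z)) ** gamma * (-torch.log(torch.sigmoid(z)))

for logit in (0.0, 1.0, 2.0, 3.0, 5.0):
    p = torch.sigmoid(torch.tensor(logit)).item()
    g_ce = grad_wrt_logit(ce, logit)
    g_fl = grad_wrt_logit(fl, logit)
    # As p approaches 1, the CE gradient shrinks only like (p - 1),
    # while the focal-loss gradient is further suppressed by (1 - p)^gamma.
    print(f"p={p:.3f}  dCE/dz={g_ce:+.4f}  dFL/dz={g_fl:+.4f}")
```
At p ≈ 0.95 the cross-entropy gradient is about −0.05, while the focal-loss gradient is on the order of −0.0004, so an already-confident example barely pushes the logit any further.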
Focal loss usually improves calibration over cross-entropy, but this is not guaranteed; it depends on the γ (gamma) value and the dataset. Research has shown that values around γ = 3 work best for regularizing against overconfidence, and the benefit is largest on imbalanced or noisy data.