# Calibration
Confidence, i.e. the output of the softmax classifier, should be aligned with the observed probability of the prediction being correct. In practice this is usually not the case, as seen in the figure below:
![[calibration-curve.jpg]]
This shows: softmax output ≠ probability
![[reliability_diagrams.png]]
## Expected Calibration Error (ECE)
ECE measures the expected gap between accuracy and confidence: predictions are grouped into $M$ equally spaced confidence bins $B_m$, and the per-bin gaps are weighted by bin size.
$
\mathrm{ECE}=\sum_{m=1}^{M} \frac{\left|B_{m}\right|}{n}\left|\operatorname{acc}\left(B_{m}\right)-\operatorname{conf}\left(B_{m}\right)\right|
$
where $n$ is the total number of samples across all bins. Perfect calibration is achieved when $\mathrm{ECE}=0$, i.e. $\operatorname{acc}\left(B_{m}\right)=\operatorname{conf}\left(B_{m}\right)$ for all bins $m$.
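A minimal NumPy sketch of this computation, assuming `confidences`, `predictions`, and `labels` are arrays of per-sample max-softmax scores, predicted classes, and true classes (the names and the choice of 15 bins are illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE with M equal-width confidence bins over (0, 1]."""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.sum() == 0:
            continue
        acc = (predictions[in_bin] == labels[in_bin]).mean()   # acc(B_m)
        conf = confidences[in_bin].mean()                      # conf(B_m)
        ece += (in_bin.sum() / n) * abs(acc - conf)            # |B_m|/n * |acc - conf|
    return ece
```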
![[calibration_factors.png]]
From the figure, we can see that [[Depth and Trainability#The effect of depth]], [[Depth and Trainability#The effect of depth in wider architectures]], and [[Normalization#Batch normalization]] tend to hurt model calibration; only weight decay seems to improve ECE while also improving accuracy.
## Maximum Calibration Error (MCE)
MCE is appropriate for high-risk applications, where the goal is to minimize the worst-case deviation between confidence and accuracy.
$
\mathrm{MCE}=\max _{m \in\{1, \ldots, M\}}\left|\operatorname{acc}\left(B_{m}\right)-\operatorname{conf}\left(B_{m}\right)\right|
$
## Due to overfitting on loss?
In the figure below, we can see that in early epochs NLL and test error are fairly well correlated, but in later epochs, once the learning rate is reduced, the correlation drops and NLL starts to increase again.
![[NLL-overfitting.jpg]]
Is this the cause of the reduced calibration? It could be:
- In theory, a network trained to minimize NLL is calibrated, provided the global optimum is actually reached.
This proof can be generalized as follows:
- Given a fixed, already-trained function $f(\cdot)$ and a new function $g(\cdot)$
- If $g(f(\cdot))$ is trained to minimize NLL, it is calibrated (under some conditions)
## Solutions
### Temperature Scaling
Use a temperature parameter $T$ in the [[Softmax]] to soften over-confident predictions. The value of $T$ can be estimated like any other hyperparameter, e.g. on a held-out validation set.
$
P(\hat{\mathbf{y}})=\frac{e^{\mathbf{z} / T}}{\sum_{j} e^{z_{j} / T}}
$
![[calibration-temperature.jpg]]
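A sketch of fitting $T$ by minimizing NLL on a held-out split (PyTorch; `val_logits`, `val_labels`, and `test_logits` are assumed tensors of pre-softmax outputs and targets, and the LBFGS setup is one possible choice, not the prescribed one):

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Find a single scalar T > 0 minimizing NLL of softmax(z / T) on held-out data."""
    log_t = torch.zeros(1, requires_grad=True)          # optimize log T so that T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Usage: T = fit_temperature(val_logits, val_labels)
#        calibrated_probs = F.softmax(test_logits / T, dim=1)
```

Note that $T > 1$ flattens the softmax (reducing over-confidence) while leaving the argmax, and hence the accuracy, unchanged.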
### G-layers
The G-layers approach (see the sketch after the figure):
- Strip any softmax layer from the trained network $f$
- Train $g(f(x))$ on a calibration set $X$ to minimize NLL
- The resulting network is calibrated
![[g-layers.jpg]]
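A rough sketch of this recipe, assuming `f` is the trained network returning logits (softmax already stripped), `calib_loader` iterates over the calibration set, and the small two-layer form of `g` is purely illustrative rather than the form used in the original work:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fit_g_layers(f, calib_loader, num_classes, epochs=10):
    """Freeze f and train a small g on top of its logits by minimizing NLL."""
    for p in f.parameters():          # keep the trained network fixed
        p.requires_grad = False
    f.eval()

    g = nn.Sequential(                # assumed form of g: one hidden layer on the logits
        nn.Linear(num_classes, num_classes),
        nn.ReLU(),
        nn.Linear(num_classes, num_classes),
    )
    optimizer = torch.optim.Adam(g.parameters(), lr=1e-3)

    for _ in range(epochs):
        for x, y in calib_loader:
            with torch.no_grad():
                z = f(x)                           # logits from the frozen network
            loss = F.cross_entropy(g(z), y)        # NLL of softmax(g(f(x)))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return g
```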
---
## References
1. Guest Lecture, Thomas Mensink, Google Amsterdam
2. On Calibration of Modern Neural Networks https://arxiv.org/abs/1706.04599
3. The Importance of Calibrating Your Deep Production Model http://alondaks.com/2017/12/31/the-importance-of-calibrating-your-deep-model/