# Uncertainty in Machine Learning
There are two distinct types of uncertainty in the machine learning modeling process: data uncertainty and model uncertainty.
While the model uncertainty can be reduced by training on more data, the data uncertainty is inherent to the data generating process and is irreducible.
## Data Uncertainty
- Data uncertainty arises from the stochastic variability inherent in the data generating process. It is also called *aleatoric uncertainty*.
- For example, the toxicity label y for a comment can vary between 0 and 1 depending on raters’ different understandings of the comment or of the annotation guidelines.
- A learned classifier $f_W(x)$ describes the data uncertainty via its predictive probability, e.g.: $p(y \mid x, W)=\operatorname{sigmoid}\left(f_{W}(x)\right)$
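As a minimal sketch of this point (the logits below are made-up values, not from a trained classifier), the data uncertainty of a binary prediction can be read off as the entropy of the sigmoid output, which is largest when the predictive probability is close to 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli prediction: high when p is near 0.5, low near 0 or 1."""
    return -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

# Hypothetical logits f_W(x) for three comments
logits = np.array([4.0, 0.1, -3.0])
probs = sigmoid(logits)        # predictive probabilities p(y = 1 | x, W)
print(probs)
print(binary_entropy(probs))   # data uncertainty: largest for the logit near 0
```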
## Model Uncertainty
- Model uncertainty arises from the model's lack of knowledge about the world, commonly caused by insufficient coverage of the training data. It is also called *epistemic uncertainty*.
- For example, at evaluation time, the toxicity classifier may encounter neologisms or misspellings that did not appear in the training data, making it more likely to make a mistake.
- A classifier can quantify model uncertainty by using probabilistic methods to learn the posterior distribution of the model parameters: $W \sim p(W)$
- This distribution over $W$ leads to a distribution over the predictive probabilities $p(y \mid x, W)$. As a result, at inference time, the model can sample model weights $\left\{W_{m}\right\}_{m=1}^{M}$ from the posterior distribution $p(W)$, and then compute the posterior sample of predictive probabilities $\left\{p\left(y \mid x, W_{m}\right)\right\}_{m=1}^{M}$. This allows the model to express its model uncertainty through the variance of the posterior distribution $\operatorname{Var}(p(y \mid x, W))$.
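A minimal sketch of this sampling procedure, assuming a toy linear model whose posterior over weights is approximated by Gaussian samples (all sizes and values below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical posterior samples W_m ~ p(W) for a linear model with 5 features
M, D = 100, 5
posterior_weights = rng.normal(loc=0.5, scale=0.3, size=(M, D))

x = rng.normal(size=D)  # a single input

# Posterior sample of predictive probabilities {p(y | x, W_m)} for m = 1..M
probs = sigmoid(posterior_weights @ x)

print(probs.mean())  # average predictive probability
print(probs.var())   # model uncertainty: Var(p(y | x, W))
```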
## Combined Uncertainty
In practice, it is convenient to compute a single uncertainty score capturing both types of uncertainty. To this end, we can first compute the marginalized predictive probability:
$
p(y \mid x)=\int p(y \mid x, W) p(W) d W
$
This marginalization captures both types of uncertainty.
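For deep networks this integral is intractable, so in practice it is approximated with the posterior samples from above via a Monte Carlo average:
$
p(y \mid x) \approx \frac{1}{M} \sum_{m=1}^{M} p\left(y \mid x, W_{m}\right)
$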
## Uncertainty Estimation Methods
For deep learning models, the following are some common methods for estimating uncertainty, taken from Ovadia et al., 2019 (https://arxiv.org/pdf/1906.02530.pdf):
- (Vanilla) Maximum softmax probability (Hendrycks & Gimpel, 2017)
- (Temp Scaling) Post-hoc calibration by temperature scaling using a validation set [[Calibration#Temperature Scaling]] (Guo et al., 2017)
- (Dropout) [[Monte-Carlo Dropout]] (Gal & Ghahramani, 2016; Srivastava et al., 2015) with rate $p$
- (Ensembles) Ensembles of $M$ networks trained independently on the entire dataset using random initialization (Lakshminarayanan et al., 2017); see the sketch after this list
- (SVI) Stochastic Variational Bayesian Inference for deep learning (Blundell et al., 2015; Graves, 2011; Louizos & Welling, 2017, 2016; Wen et al., 2018)
- (LL) Approx. Bayesian inference for the parameters of the last layer only (Riquelme et al., 2018)
- (LLSVI) Mean field stochastic variational inference on the last layer only
- (LL Dropout) Dropout only on the activations before the last layer
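A minimal sketch of the Ensembles entry above, using scikit-learn MLPs trained with different random initializations; the dataset and network sizes are toy placeholders, not the setup from the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

M = 5  # ensemble size
members = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=m).fit(X, y)
    for m in range(M)
]

# Per-member predictive probabilities, shape (M, n_samples, n_classes)
probs = np.stack([clf.predict_proba(X) for clf in members])

mean_prob = probs.mean(axis=0)    # approximates p(y | x)
disagreement = probs.var(axis=0)  # model uncertainty across ensemble members
print(mean_prob[0], disagreement[0])
```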
## Evaluating Uncertainty Quality
### Negative Log-Likelihood (NLL)
Commonly used to evaluate the quality of model uncertainty on a held-out set. Lower is better.
Drawbacks: Although it is a proper scoring rule (the optimal score corresponds to a perfect prediction), it can over-emphasize tail probabilities.
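For a held-out set $\left\{\left(x_{n}, y_{n}\right)\right\}_{n=1}^{N}$, it is the negative log of the predicted probability assigned to each true label:
$
\mathrm{NLL}=-\sum_{n=1}^{N} \log p\left(y_{n} \mid x_{n}, \boldsymbol{\theta}\right)
$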
### Brier Score
Proper scoring rule for measuring the accuracy of predicted probabilities. It is computed as the squared error between the predicted probability vector, $p\left(y \mid \boldsymbol{x}_{n}, \boldsymbol{\theta}\right)$, and the one-hot encoded true response, $y_{n}$. That is,
$
\mathrm{BS}=|\mathcal{Y}|^{-1} \sum_{y \in \mathcal{Y}}\left(p\left(y \mid \boldsymbol{x}_{n}, \boldsymbol{\theta}\right)-\delta\left(y-y_{n}\right)\right)^{2}=|\mathcal{Y}|^{-1}\left(1-2 p\left(y_{n} \mid \boldsymbol{x}_{n}, \boldsymbol{\theta}\right)+\sum_{y \in \mathcal{Y}} p\left(y \mid \boldsymbol{x}_{n}, \boldsymbol{\theta}\right)^{2}\right) .
$
Drawbacks: The Brier score is insensitive to predicted probabilities associated with infrequent events.
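A minimal sketch of the formula above for a single example (the probability vector and label are made-up values):

```python
import numpy as np

def brier_score(prob, true_label):
    """Squared error between the predicted probability vector and the
    one-hot encoding of the true label, averaged over the classes."""
    one_hot = np.zeros_like(prob)
    one_hot[true_label] = 1.0
    return np.mean((prob - one_hot) ** 2)

prob = np.array([0.7, 0.2, 0.1])        # hypothetical p(y | x_n, theta)
print(brier_score(prob, true_label=0))  # lower is better
```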
### [[Calibration#Expected Calibration Error ECE]]
### Predictive Entropy
The smaller the predictive entropy (PE), the more confident the model is about its predictions:
$
\mathrm{PE}=-\sum_{c} \mu_{c} \log \mu_{c}
$
where $\mu_c$ is the model's predictive probability for class $c$.
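A minimal sketch of the formula, applied to two made-up predictive distributions:

```python
import numpy as np

def predictive_entropy(mu, eps=1e-12):
    """PE = -sum_c mu_c * log(mu_c) for a predictive distribution mu over classes."""
    mu = np.asarray(mu)
    return -np.sum(mu * np.log(mu + eps))

print(predictive_entropy([0.9, 0.05, 0.05]))  # confident prediction -> low PE
print(predictive_entropy([0.4, 0.3, 0.3]))    # uncertain prediction -> high PE
```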
### Calibration AUC
A common approach to evaluate a model's uncertainty quality is to measure its [[Calibration]] performance, i.e. whether the model's predictive uncertainty is indicative of the predictive error.
This metric evaluates uncertainty estimation by recasting it as a binary prediction problem, where the binary label is the model's prediction error $\mathbb{I}\left(f\left(x_{i}\right) \neq y_{i}\right)$ and the predictive score is the model uncertainty. This formulation leads to the uncertainty confusion matrix:
| Accuracy / Uncertainty | Uncertain | Certain |
| :--- | :---: | :---: |
| Inaccurate | TP | FN |
| Accurate | FP | TN |
- TP: the prediction is inaccurate and the model is uncertain
- TN: the prediction is accurate and the model is certain
- FN: the prediction is inaccurate but the model is certain, i.e. overconfidence
- FP: the prediction is accurate but the model is uncertain, i.e. underconfidence
- Precision = TP/(TP+FP): the fraction of uncertain examples where the model is inaccurate
- Recall = TP/(TP+FN): the fraction of inaccurate examples where the model is uncertain
- False positive rate (FPR) = FP/(FP+TN): the fraction of accurate examples where the model is uncertain
- Accuracy = (TP+TN)/(TP+TN+FP+FN)
Thus, the model's calibration performance can be summarized by the area under the precision-recall curve (*Calibration AUPRC*) and the area under the ROC curve (*Calibration AUROC*).
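A minimal sketch of both metrics, assuming `errors` marks which predictions were wrong and `uncertainty` is the model's uncertainty score for each example (both arrays are made-up values):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

errors = np.array([1, 0, 0, 1, 0, 1, 0, 0])  # binary labels I(f(x_i) != y_i)
uncertainty = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.3, 0.6, 0.2])  # predictive scores

calibration_auroc = roc_auc_score(errors, uncertainty)            # Calibration AUROC
calibration_auprc = average_precision_score(errors, uncertainty)  # Calibration AUPRC
print(calibration_auroc, calibration_auprc)
```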
---
## References
1. Measuring and Improving Model-Moderator Collaboration using Uncertainty Estimation, Kivlichan et al., 2022
2. Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift, Ovadia et al., 2019