# Uncertainty in Machine Learning

There are two distinct types of uncertainty in this modeling process: data uncertainty and model uncertainty. While model uncertainty can be reduced by training on more data, data uncertainty is inherent to the data-generating process and is irreducible.

## Data Uncertainty

- Data uncertainty arises from the stochastic variability inherent in the data-generating process. It is also called *aleatoric uncertainty*.
- For example, the toxicity label $y$ for a comment can vary between 0 and 1 depending on raters' different understandings of the comment or of the annotation guidelines.
- A learned classifier $f_W(x)$ describes the data uncertainty via its predictive probability, e.g.:

$p(y \mid x, W)=\operatorname{sigmoid}\left(f_{W}(x)\right)$

## Model Uncertainty

- Model uncertainty arises from the model's lack of knowledge about the world, commonly caused by insufficient coverage of the training data. It is also called *epistemic uncertainty*.
- For example, at evaluation time, the toxicity classifier may encounter neologisms or misspellings that did not appear in the training data, making it more likely to make a mistake.
- A classifier can quantify model uncertainty by using probabilistic methods to learn the posterior distribution of the model parameters: $W \sim p(W)$.
- This distribution over $W$ leads to a distribution over the predictive probabilities $p(y \mid x, W)$. As a result, at inference time, the model can sample weights $\left\{W_{m}\right\}_{m=1}^{M}$ from the posterior distribution $p(W)$ and compute the posterior sample of predictive probabilities $\left\{p\left(y \mid x, W_{m}\right)\right\}_{m=1}^{M}$. The model can then express its model uncertainty through the variance of this posterior distribution, $\operatorname{Var}(p(y \mid x, W))$.

## Combined Uncertainty

In practice, it is convenient to compute a single uncertainty score capturing both types of uncertainty. To this end, we can first compute the marginalized predictive probability:

$p(y \mid x)=\int p(y \mid x, W) \, p(W) \, dW$

This marginalization captures both types of uncertainty.

## Uncertainty Estimation Methods

For deep learning models, these are some common methods for estimating uncertainty, taken from Ovadia et al., 2019 (https://arxiv.org/pdf/1906.02530.pdf):

- (Vanilla) Maximum softmax probability (Hendrycks & Gimpel, 2017)
- (Temp Scaling) Post-hoc calibration by temperature scaling using a validation set [[Calibration#Temperature Scaling]] (Guo et al., 2017)
- (Dropout) [[Monte-Carlo Dropout]] (Gal & Ghahramani, 2016; Srivastava et al., 2015) with rate $p$
- (Ensembles) Ensembles of $M$ networks trained independently on the entire dataset using random initialization (Lakshminarayanan et al., 2017)
- (SVI) Stochastic variational Bayesian inference for deep learning (Blundell et al., 2015; Graves, 2011; Louizos & Welling, 2016, 2017; Wen et al., 2018)
- (LL) Approximate Bayesian inference for the parameters of the last layer only (Riquelme et al., 2018)
- (LL SVI) Mean-field stochastic variational inference on the last layer only
- (LL Dropout) Dropout only on the activations before the last layer

## Evaluating Uncertainty Quality

### Negative Log-Likelihood (NLL)

Commonly used to evaluate the quality of model uncertainty on a held-out set. Lower is better.

Drawbacks: Although NLL is a proper scoring rule (its optimum corresponds to a perfect prediction), it can over-emphasize tail probabilities.
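To make the quantities above concrete, here is a minimal NumPy sketch, assuming $M$ posterior samples of binary predictive probabilities (e.g., from MC Dropout or an ensemble) are already available for a held-out batch. The array names (`probs_mxn`, `labels`) are illustrative, not from any particular library; it estimates the marginalized predictive probability, the variance-based model uncertainty, and the held-out NLL.

```python
import numpy as np


def uncertainty_and_nll(probs_mxn: np.ndarray, labels: np.ndarray, eps: float = 1e-12):
    """Summarize M posterior samples of binary predictive probabilities.

    probs_mxn: shape (M, N), p(y=1 | x_n, W_m) for each sampled model W_m.
    labels:    shape (N,), ground-truth labels in {0, 1}.
    """
    # Monte Carlo estimate of the marginalized predictive probability:
    # p(y=1 | x) ≈ (1/M) * sum_m p(y=1 | x, W_m).
    p_mean = probs_mxn.mean(axis=0)

    # Model (epistemic) uncertainty: variance of the sampled predictive probabilities.
    model_uncertainty = probs_mxn.var(axis=0)

    # Negative log-likelihood of the held-out labels under the marginal predictive
    # distribution (lower is better); eps guards against log(0).
    nll = -np.mean(labels * np.log(p_mean + eps) + (1 - labels) * np.log(1 - p_mean + eps))

    return p_mean, model_uncertainty, nll


# Toy usage: M=4 posterior samples for N=3 held-out examples.
probs = np.array([[0.90, 0.20, 0.60],
                  [0.80, 0.30, 0.40],
                  [0.85, 0.25, 0.70],
                  [0.95, 0.15, 0.30]])
labels = np.array([1, 0, 1])
p_mean, model_var, nll = uncertainty_and_nll(probs, labels)
print(p_mean, model_var, nll)
```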
### Brier Score

A proper scoring rule for measuring the accuracy of predicted probabilities. It is computed as the squared error between the predicted probability vector $p\left(y \mid \boldsymbol{x}_{n}, \boldsymbol{\theta}\right)$ and the one-hot encoded true response $y_{n}$. That is,

$\mathrm{BS}=|\mathcal{Y}|^{-1} \sum_{y \in \mathcal{Y}}\left(p\left(y \mid \boldsymbol{x}_{n}, \boldsymbol{\theta}\right)-\delta\left(y-y_{n}\right)\right)^{2}=|\mathcal{Y}|^{-1}\left(1-2 p\left(y_{n} \mid \boldsymbol{x}_{n}, \boldsymbol{\theta}\right)+\sum_{y \in \mathcal{Y}} p\left(y \mid \boldsymbol{x}_{n}, \boldsymbol{\theta}\right)^{2}\right)$

Drawbacks: The Brier score is insensitive to the predicted probabilities associated with infrequent events.

### [[Calibration#Expected Calibration Error ECE]]

### Predictive Entropy

The smaller the predictive entropy (PE), the more confident the model is in its predictions:

$PE=-\sum_{c} \mu_{c} \log \mu_{c}$

where $\mu_{c}$ denotes the predictive probability of class $c$.

### Calibration AUC

A common approach to evaluating a model's uncertainty quality is to measure its [[Calibration]] performance, i.e., whether the model's predictive uncertainty is indicative of its predictive error. This metric evaluates uncertainty estimation by recasting it as a binary prediction problem, where the binary label is the model's prediction error $\mathbb{I}\left(f\left(x_{i}\right) \neq y_{i}\right)$ and the predictive score is the model's uncertainty. This formulation leads to the uncertainty confusion matrix:

| Accuracy ↓ / Uncertainty → | Uncertain | Certain |
| :--- | :---: | :---: |
| Inaccurate | TP | FN |
| Accurate | FP | TN |

- TP: the prediction is inaccurate and the model is uncertain
- TN: the prediction is accurate and the model is certain
- FN: the prediction is inaccurate but the model is certain, i.e., over-confidence
- FP: the prediction is accurate but the model is uncertain, i.e., under-confidence

From this confusion matrix:

- Precision = TP / (TP + FP): the fraction of uncertain predictions that are inaccurate
- Recall = TP / (TP + FN): the fraction of inaccurate predictions on which the model is uncertain
- False positive rate (FPR) = FP / (FP + TN): the fraction of accurate predictions on which the model is uncertain, i.e., under-confident
- Accuracy = (TP + TN) / (TP + TN + FP + FN)

Thus, the model's calibration performance can be measured as the area under the precision-recall curve (*Calibration AUPRC*) and the area under the ROC curve (*Calibration AUROC*). A minimal sketch of computing these metrics is given after the references below.

---

## References

1. Measuring and Improving Model-Moderator Collaboration using Uncertainty Estimation, Kivlichan et al., 2022
2. Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift, Ovadia et al., 2019
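As referenced in the Calibration AUC section above, here is a minimal scikit-learn sketch of computing Calibration AUROC and AUPRC, assuming predicted classes, true labels, and a per-example uncertainty score (e.g., predictive entropy) are already available. The variable names are illustrative; this is a sketch under those assumptions, not a reference implementation from the cited papers.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score


def calibration_auc(y_true: np.ndarray, y_pred: np.ndarray, uncertainty: np.ndarray):
    """Calibration AUROC / AUPRC: recast uncertainty estimation as predicting errors.

    y_true:      shape (N,), ground-truth class labels.
    y_pred:      shape (N,), the model's predicted classes.
    uncertainty: shape (N,), per-example uncertainty score (e.g., predictive entropy).
    """
    # Binary target: 1 if the model's prediction is wrong, 0 if it is correct.
    is_error = (y_pred != y_true).astype(int)

    # A well-calibrated model assigns higher uncertainty to its wrong predictions.
    return {
        "calibration_auroc": roc_auc_score(is_error, uncertainty),
        "calibration_auprc": average_precision_score(is_error, uncertainty),
    }


# Toy usage with a binary classifier's marginal predictive probabilities.
p_mean = np.array([0.90, 0.55, 0.20, 0.70, 0.45])
y_true = np.array([1, 0, 0, 1, 1])
y_pred = (p_mean > 0.5).astype(int)
# Predictive entropy of the binary predictive distribution as the uncertainty score.
entropy = -(p_mean * np.log(p_mean) + (1 - p_mean) * np.log(1 - p_mean))
print(calibration_auc(y_true, y_pred, entropy))
```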