# Conformal Prediction

Conformal prediction is a paradigm for creating statistically rigorous uncertainty sets/intervals, i.e., a set of possible labels/ranges with theoretical guarantees. One can use conformal prediction with any pre-trained model, such as a neural network, to produce sets that are guaranteed to contain the ground truth with a user-specified probability, such as, say, 90%. It is incredibly useful for critical discriminative predictions in real-world applications because of its guarantees and its controllability through the calibration set.

![[conformal_sets_imagenet.png]]

> [!info]
> Conformal prediction is regarded as "distribution-free" because it is (1) agnostic to the model, (2) agnostic to the data distribution, and (3) valid in finite samples, i.e., it offers exact finite-sample coverage guarantees for any sample size.

Formally, suppose we have images as input, each containing one of $K$ classes. We begin with a classifier that outputs estimated probabilities for each class: $\hat{f}(x) \in [0,1]^K$. Then, we reserve a moderate number (e.g., 500) of fresh i.i.d. pairs of images and classes unseen during training, $(X_1, Y_1), \ldots, (X_n, Y_n)$, for use as calibration data. Using $\hat{f}$ and the calibration data, we seek to construct a prediction set of possible labels $\mathcal{C}(X_{\text{test}}) \subset \{1, \ldots, K\}$ that is valid in the following sense:

$$
1-\alpha \leq \mathbb{P}\left(Y_{\text{test}} \in \mathcal{C}\left(X_{\text{test}}\right)\right) \leq 1-\alpha+\frac{1}{n+1}
$$

where $(X_{\text{test}}, Y_{\text{test}})$ is a fresh test point from the same distribution, and $\alpha \in [0,1]$ is a user-chosen error rate. In words, the probability that the prediction set contains the correct label is almost exactly $1-\alpha$; this property is called *marginal coverage*, since the probability is marginal (averaged) over the randomness in the calibration and test points.

**The Procedure**:

1. Identify a heuristic notion of uncertainty using the pre-trained model (the most common being softmax probabilities).
2. Define the score function $s(x, y) \in \mathbb{R}$. (Larger scores encode worse agreement between $x$ and $y$.)
3. Compute $\hat{q}$ as the $\frac{\lceil(n+1)(1-\alpha)\rceil}{n}$ quantile of the calibration scores $s_1 = s(X_1, Y_1), \ldots, s_n = s(X_n, Y_n)$.
4. Use this quantile to form the prediction sets for new examples:

$$
\mathcal{C}\left(X_{\text{test}}\right)=\left\{y: s\left(X_{\text{test}}, y\right) \leq \hat{q}\right\}
$$

![[conformal_procedure_code.png]]

That is really it!
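The screenshot above shows one implementation; here is a minimal, runnable NumPy sketch of the same four steps, using the simple score $s(x, y) = 1 - \hat{f}(x)_y$ from [1]. The function names and the toy Dirichlet stand-in for a real classifier's softmax outputs are hypothetical, not from the original:

```python
import numpy as np

def calibrate(cal_probs, cal_labels, alpha=0.1):
    """Steps 2-3: score the calibration set and compute the quantile q_hat.

    cal_probs:  (n, K) softmax outputs f_hat(X_i) on the calibration set
    cal_labels: (n,)   integer ground-truth labels Y_i
    alpha:      user-chosen error rate (target coverage is 1 - alpha)
    """
    n = len(cal_labels)
    # Score s(x, y) = 1 - f_hat(x)_y: larger means worse agreement.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # The ceil((n+1)(1-alpha))/n empirical quantile of the scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, q_level, method="higher")

def prediction_set(test_probs, q_hat):
    """Step 4: C(x) = {y : s(x, y) <= q_hat}, i.e. f_hat(x)_y >= 1 - q_hat."""
    return test_probs >= 1.0 - q_hat  # (m, K) boolean membership mask

# Toy usage: random Dirichlet vectors stand in for a real model's softmax.
rng = np.random.default_rng(0)
K, n = 10, 500
cal_probs = rng.dirichlet(np.ones(K), size=n)
cal_labels = rng.integers(0, K, size=n)
q_hat = calibrate(cal_probs, cal_labels, alpha=0.1)
sets = prediction_set(rng.dirichlet(np.ones(K), size=5), q_hat)
print(q_hat, sets.sum(axis=1))  # quantile and per-example set sizes
```

Note `method="higher"` (called `interpolation="higher"` in older NumPy): the guarantee needs a conservative empirical quantile rather than the default linear interpolation. The call also assumes $\alpha \geq \frac{1}{n+1}$ so that the quantile level is at most 1; otherwise $\hat{q} = \infty$ and the prediction set is all of $\{1, \ldots, K\}$, matching the edge case in the proof below.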
**Why this is so great:**

1. The size of the predicted set becomes larger when the model is uncertain or the input is intrinsically hard. The size of the set gives you an indicator of the model's certainty! No more "the model is 82% sure..." nonsense.
2. The calibration set doesn't need to be large! Perhaps a couple thousand points is more than enough, meaning that even for a very difficult classification problem, if we can get humans to label ~2000 data points, we should be good to go!
    1. ![[conformal_calibration_dataset_size.png]]
3. It can be used to provide guarantees on model errors!
    1. For example: a type-1 error guarantee that flags as many toxic comments as possible while not flagging more than an $\alpha$ proportion of non-toxic comments (see section 5.4 of [1]).
4. There are several extensions/variations that make it applicable to many real-world scenarios!
    1. Group-balanced or class-conditional coverage (important!)
        1. In classification problems, we might similarly ask for coverage on every ground-truth class. For example, if we had a medical classifier assigning inputs to class normal or class cancer, we might ask that the prediction sets are $95\%$ accurate both when the ground truth is class cancer and also when the ground truth is class normal.
    2. Outlier detection
    3. Prediction under [[Covariate Shift]]
    4. Prediction under wider [[Distribution Shift]]
    5. Selective prediction — only predict with a selective accuracy guarantee

## The Guarantee

**Theorem 1** (Conformal coverage guarantee; Vovk, Gammerman, and Saunders). Suppose $(X_i, Y_i)_{i=1, \ldots, n}$ and $(X_{\text{test}}, Y_{\text{test}})$ are i.i.d. and define $\hat{q}$ and $\mathcal{C}(X_{\text{test}})$ as above. Then the following holds:

$$
P\left(Y_{\text{test}} \in \mathcal{C}\left(X_{\text{test}}\right)\right) \geq 1-\alpha
$$

**Proof of Theorem 1.** Let $s_i = s(X_i, Y_i)$ for $i=1, \ldots, n$ and $s_{\text{test}} = s(X_{\text{test}}, Y_{\text{test}})$. To avoid handling ties, we consider the case where the $s_i$ are distinct with probability 1. Without loss of generality, we assume the calibration scores are sorted so that $s_1 < \cdots < s_n$. In this case, we have that $\hat{q} = s_{\lceil(n+1)(1-\alpha)\rceil}$ when $\alpha \geq \frac{1}{n+1}$, and $\hat{q} = \infty$ otherwise. Note that in the case $\hat{q} = \infty$, $\mathcal{C}(X_{\text{test}}) = \mathcal{Y}$, so the coverage property is trivially satisfied; thus, we only have to handle the case $\alpha \geq \frac{1}{n+1}$. We proceed by noticing the equality of the two events

$$
\left\{Y_{\text{test}} \in \mathcal{C}\left(X_{\text{test}}\right)\right\} = \left\{s_{\text{test}} \leq \hat{q}\right\}.
$$

Combining this with the definition of $\hat{q}$ yields

$$
\left\{Y_{\text{test}} \in \mathcal{C}\left(X_{\text{test}}\right)\right\} = \left\{s_{\text{test}} \leq s_{\lceil(n+1)(1-\alpha)\rceil}\right\}.
$$

Now comes the crucial insight. By exchangeability of the variables $(X_1, Y_1), \ldots, (X_{\text{test}}, Y_{\text{test}})$, we have

$$
P\left(s_{\text{test}} \leq s_k\right) = \frac{k}{n+1}
$$

for any integer $k \in \{1, \ldots, n\}$. In words, $s_{\text{test}}$ is equally likely to fall anywhere among the calibration points $s_1, \ldots, s_n$. Note that above, the randomness is over all variables $s_1, \ldots, s_n, s_{\text{test}}$. From here, we conclude

$$
P\left(s_{\text{test}} \leq s_{\lceil(n+1)(1-\alpha)\rceil}\right) = \frac{\lceil(n+1)(1-\alpha)\rceil}{n+1} \geq 1-\alpha,
$$

which implies the desired result. (A small simulation sanity-checking this guarantee is sketched after the references.)

## References

1. A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification (Angelopoulos & Bates) — http://people.eecs.berkeley.edu/~angelopoulos/publications/downloads/gentle_intro_conformal_dfuq.pdf
2. Awesome Conformal Prediction — https://github.com/valeman/awesome-conformal-prediction
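A quick sanity check of Theorem 1, referenced from the proof above. This is a minimal simulation sketch of my own, not from [1]: because the proof only uses exchangeability of the scores, we can draw them i.i.d. from any continuous distribution and skip the model entirely.

```python
import numpy as np

# Empirical check of the marginal coverage guarantee with synthetic scores.
rng = np.random.default_rng(1)
alpha, n, trials = 0.1, 500, 10_000
k = int(np.ceil((n + 1) * (1 - alpha)))  # rank from step 3 of the procedure

covered = 0
for _ in range(trials):
    scores = rng.normal(size=n + 1)      # play the roles of s_1..s_n, s_test
    cal_scores, s_test = scores[:n], scores[n]
    q_hat = np.sort(cal_scores)[k - 1]   # q_hat = s_(k), the k-th smallest
    covered += int(s_test <= q_hat)

# True coverage is k / (n + 1) = 451/501 ~ 0.9002, i.e. inside
# [1 - alpha, 1 - alpha + 1/(n+1)]; the printout is ~0.90 up to noise.
print(covered / trials)
```

Here $\hat{q}$ is taken directly as the order statistic $s_{\lceil(n+1)(1-\alpha)\rceil}$ from the proof, so the simulated coverage matches the two-sided bound exactly, up to Monte Carlo noise.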