# Classification Metrics and Evaluation

Three factors influence the performance of decisions made by humans and machines:

1. the expertise of the decision maker
2. the bias of the system, i.e. the threshold
3. the class balance of the outcomes, i.e. the prevalence

We want the following characteristics in our metrics:

1. They need to be understandable, so a 1D or 2D metric.
2. They should capture the expertise of the decision maker, i.e. how 'good' the decisions are.
3. They should show how the threshold changes the performance.

## Extended Confusion Matrix

![[Extended-confusion-matrix.jpg]]

- Red: prevalence-invariant metrics
- Blue: prevalence-variant metrics
- Purple: compound metrics

## AUC ROC curve

ROC AUC measures the decision-making capacity of a model. The ROC curve plots FPR (x-axis) against TPR (y-axis), and the AUC roughly describes how far the curve pushes towards the upper-left corner, *across every possible threshold*. This makes AUC invariant to both prevalence and threshold, which is why ROC AUC is a good measure of expertise: it isolates how good the decisions are from the setting in which they are made.

![[archery-indicator-aucroc.gif]]

The ROC curve is also great for choosing a threshold. It intuitively shows the trade-off between false positives and false negatives in a way that other methods (like PR curves) do not.

![[aucroc.gif]]

AUC doesn't change with prevalence, but the real-world error distribution changes dramatically. So alongside AUC we need another metric that says something informative about real-world performance that prevalence-invariant metrics do not.

![[prevalence-aucroc.gif]]

[[Classification Metrics and Evaluation#^66d17f|ROC curves GIF credit]]

## Precision and Recall

[[Precision and Recall]]

![[precision-recall.jpg]]

Precision is a prevalence-variant metric: when class imbalance is high, the ratio of true positives to false positives changes dramatically, so precision is strongly affected.

Intuition: when throwing a net into a pond containing fish and junk, if you catch all the fish but also some junk, your recall is high. If you catch only a few fish but no junk, your precision is high.

## PR curve

A PR curve plots precision on the y-axis and recall on the x-axis. The first thing to recognise here is that ROC curves and PR curves contain the same points: a PR curve is just a non-linear transformation of the ROC curve.

![[PR-curve.jpg]]

The problem with the PR curve is that it no longer isolates expertise, since one of its axes (precision) varies with prevalence. The area under the PR curve also lacks the clear probabilistic interpretation that ROC AUC has.
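To make the prevalence argument concrete, here is a minimal sketch (synthetic Gaussian scores; numpy and scikit-learn assumed available; the `simulate` helper is made up for illustration). The score distributions and the threshold are held fixed, and only the prevalence changes:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def simulate(prevalence, n=200_000):
    # Labels drawn at the given prevalence; scores come from the same two fixed
    # Gaussians either way, so the "expertise" of the scorer never changes.
    y = rng.random(n) < prevalence
    scores = np.where(y, rng.normal(1.0, 1.0, n), rng.normal(-1.0, 1.0, n))
    return y, scores

for prevalence in (0.5, 0.01):
    y, s = simulate(prevalence)
    pred = s > 0.0                       # one fixed operating threshold
    tp = np.sum(pred & y)
    fp = np.sum(pred & ~y)
    fn = np.sum(~pred & y)
    tn = np.sum(~pred & ~y)
    tpr = tp / (tp + fn)                 # recall: prevalence-invariant
    fpr = fp / (fp + tn)                 # prevalence-invariant
    precision = tp / (tp + fp)           # prevalence-variant: drops as positives get rare
    print(f"prevalence={prevalence:<5} ROC AUC={roc_auc_score(y, s):.3f} "
          f"TPR={tpr:.3f} FPR={fpr:.3f} precision={precision:.3f}")
```

TPR, FPR and ROC AUC should come out essentially identical at both prevalences, while precision drops sharply at 1% prevalence. That gap is exactly what the next metric tries to expose.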
## F1 Score

We will always need at least one prevalence-invariant metric and at least one other metric, be it prevalence-variant or compound. That second metric should describe the imbalance towards false positives, because in low-prevalence settings like medicine this is the major problem we face.

F1 combines precision and recall into a single metric: it is their harmonic mean. Increasing F1 generally means trading off precision against recall, and which one to favour should be dictated by whether a false positive or a false negative is more dangerous for our application.

The F1 score doesn't have a clear probabilistic interpretation like AUC, so it is harder to say what a given value means. But it does highlight the real-world weakness of the decision maker in a low-prevalence environment.

F1 uses the harmonic mean because a mean should be taken over quantities on the same scale. Precision and recall both have TP in the numerator, but FP and FN in their denominators respectively, so we take their reciprocals to put them on the same scale before averaging.

$$
F1 = \frac{2}{\frac{1}{\frac{TP}{TP+FP}} + \frac{1}{\frac{TP}{TP+FN}}} = \frac{2 \cdot precision \cdot recall}{precision + recall}
$$

---

## References

1. The philosophical argument for using ROC curves: https://lukeoakdenrayner.wordpress.com/2018/01/07/the-philosophical-argument-for-using-roc-curves/
2. Quest for autism biomarkers faces steep statistical challenges: https://www.spectrumnews.org/opinion/viewpoint/quest-autism-biomarkers-faces-steep-statistical-challenges/

^66d17f
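As a quick sanity check of the harmonic-mean identity above, a tiny sketch with made-up confusion-matrix counts (the TP, FP, FN values are arbitrary, not from any real model):

```python
# Made-up counts purely to check the algebra; not from any real experiment.
tp, fp, fn = 80, 40, 20

precision = tp / (tp + fp)   # TP / (TP + FP)
recall = tp / (tp + fn)      # TP / (TP + FN)

f1_from_reciprocals = 2 / (1 / precision + 1 / recall)           # harmonic-mean form
f1_simplified = 2 * precision * recall / (precision + recall)    # simplified form

print(precision, recall)                   # 0.666..., 0.8
print(f1_from_reciprocals, f1_simplified)  # both 0.7272..., as expected
```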