The following comments are an archive of comments entered under an older blog platform and related to the article In Machine Learning Predictions for Health Care the Confusion Matrix is a Matrix of Confusion by Drew Levy. Most of the comments were written in 2018.
Bob Carpenter: Thanks for the post. I'm a big believer in calibration and we've been pushing that as a baseline for Stan with our model comparison packages. Basically, we want the sharpest predictions for an empirically calibrated model (following Gneiting and Raftery's terminology in their JASA calibration paper; I don't know who invented the terminology in the first place).
With full Bayes, we get calibration for free if our model is well specified; too bad they almost never are, so we have to calibrate empirically.
I believe the fascination with the area under the receiver operating characteristic curve (AUC) in machine learning is that it allows radically uncalibrated techniques like naive Bayes (uncalibrated for real data because of false independence assumptions) to be evaluated without looking totally idiotic.
As an example, I recently (in the last two or three years) had a discussion with a machine learning professor who was predicting failure of industrial components and couldn't figure out how to do so when (a) some components failed every year, and (b) their logistic regression wasn't predicting much higher than a 10% chance of failure for any component. I never managed to convince them that their results might just be calibrated: they seemed to be getting the prevalence of component failures roughly right. Naturally, someone pointed them to AUC, so they went with that to produce a result other than 100% precision, 0% recall.
As Frank points out, precision (positive predictive performance in epidemiology language) and recall (sensitivity) are deficient because they don't count the true negatives. In statistical language, they are not proper scoring rules.
The reason we care about sensitivity and specificity in diagnostic testing, where we eventually have to choose an operating point or at least make decisions on a case-by-case basis, is that, combined with prevalence, they determine the entire confusion matrix. They also let us tune decisions when the quadrants of the confusion matrix (like false positive and true positive) have different utilities (dollar values, expected life expectancy and quality changes, etc.).
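[Editor's note: to make that relationship concrete, here is a minimal Python sketch, with illustrative function names not taken from any package mentioned in this thread, showing how sensitivity, specificity, and prevalence jointly determine all four cells of the confusion matrix, and hence the positive predictive value:]

```python
def confusion_matrix_rates(sensitivity, specificity, prevalence):
    """Return the four confusion-matrix cells as population proportions,
    given sensitivity, specificity, and prevalence of the condition."""
    tp = sensitivity * prevalence            # true positives
    fn = (1 - sensitivity) * prevalence      # false negatives
    tn = specificity * (1 - prevalence)      # true negatives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value (precision) via Bayes' rule."""
    cells = confusion_matrix_rates(sensitivity, specificity, prevalence)
    return cells["TP"] / (cells["TP"] + cells["FP"])
```

At a fixed sensitivity and specificity of 0.9 each, `ppv(0.9, 0.9, 0.01)` is about 0.083, while `ppv(0.9, 0.9, 0.10)` is 0.5, which is exactly the prevalence effect on precision that comes up in the mammography example in this thread.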
In real applications, we often don't have the luxury of varying thresholds. Instead, we have to work to build guidelines with good operating characteristics. Actions might be patient or item-specific, but they're usually not. Usually it's "give the patient a mammogram; if anything looks suspicious, escalate to MRI; and if it still looks suspicious, go with a puncture test; and if that's clean, come back next year" (even though the first two tests have high sensitivity and low specificity, and the last is the reverse). Frank can correct me here, but the guidelines I'm familiar with only tend to indicate when to start the process. For example, women over age 50 have a higher population prevalence of breast cancer than those over 40, which causes a test with a fixed sensitivity and specificity to become more precise (higher positive predictive accuracy, in epidemiology terms). My wife's been resisting the follow-ups "just to be sure" (against medical advice!) from sketchy mammograms now that she knows a bit more statistics. We've sadly had a friend go through the not unusual [positive, positive, negative] result sequence for mammograms for two years and then die of breast cancer the following year.
In my applied work in natural language, the decisions of our customers almost always meant selecting an unbalanced operating point. For example, our customers for spell checking and automatic question answering wanted high positive predictive performance (they didn't want false positives), whereas our defense department customers doing intelligence analysis wanted high sensitivity (they were OK with a false-positive-to-positive ratio of up to 10 to 1, or sometimes higher depending on importance, if the sensitivity was good).
P.S. The popularity of the Brier score (quadratic loss on the simplex) also puzzles me, though it's at least a proper scoring rule for classification. Also, it's not new: it came out of statistics in 1950, about the same time as stochastic gradient descent, another technique that ML people often think was a new invention for big data ("big" is relative to the size of your computer!).
The problem with log loss is that it's very flat through most of its operating region, so despite also being a proper scoring rule, it is not a very sensitive measure. So maybe Brier scoring is better. Do you guys have any insight into that? We're looking for general recommendations for Stan users.
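[Editor's note: for readers comparing the two rules, here is a small self-contained Python sketch of the Brier score and log loss for binary outcomes; the code is illustrative, not drawn from Stan or any package named in this thread. Both are proper scoring rules, but they weight errors very differently:]

```python
import math

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log likelihood of the observed 0/1 outcomes."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)
```

The contrast in sensitivity shows up at the extremes: a confident wrong prediction (say, 0.999 for an outcome of 0) incurs a bounded Brier penalty of at most 1, while its log loss grows without bound, which is one face of the flatness-versus-spikiness trade-off discussed above.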
P.P.S. I had the advantage of my second ML paper (circa 1998) being overseen by a statistician at Bell Labs, so we took some kind of singular value decomposition approach to dimensionality reduction, used some hokey search-based scoring, then calibrated using logistic regression (we didn't quite know enough to build an integrated system). We were building a call routing application that mattered for the customer's enormous call center (the customer was USAA bank).
Frank Harrell: Wow Bob. What rich comments. These comments should be more visible, so I suggest you copy them as a new topic on datamethods.org if you have time. What you and colleagues are doing in the Bayesian prediction world is incredibly exciting. Thanks for taking the time to write this! Just a few comments: The Brier score has a lot of advantages, and there are various decompositions of it into calibration-in-the-large, calibration-in-the-small, and discrimination measures. But its magnitude is still hard to interpret for many, and it depends on the prevalence of Y=1. Regarding sens and spec, I feel that even in the best of circumstances for them, they are still not appropriate. Their backwards-looking nature is not consistent with forward-time decision making. And this raises the whole issue of "positive" and "negative". I try to avoid these terms as much as possible, for reasons stated in the Diagnosis chapter of Biostatistics for Biomedical Research.
My current thinking is that the types of model performance indexes that should be emphasized are a combination of things such as (1) deviance-based measures to describe predictive information/discrimination, (2) explained variation methods related to the variance of predicted values, (3) flexible calibration curve estimation using regression splines with logistic regression or nonparametric smoothers (I still need to understand how this generalizes in the Bayesian world), (4) calibration measures derived from (3). For (4) I routinely use the mean absolute calibration error and the 0.9 quantile of absolute calibration error. One can also temporarily assume a linear calibration and summarize it with a slope and intercept.
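[Editor's note: as a sketch of item (4), assuming a simple Gaussian-kernel smoother in place of the regression-spline or loess calibration-curve estimators Frank mentions; all function names here are hypothetical:]

```python
import math

def smoothed_calibration(probs, outcomes, bandwidth=0.1):
    """Kernel-smoothed estimate of P(Y=1 | predicted probability),
    evaluated at each predicted probability.  A stand-in for the
    spline/loess smoothers discussed in the surrounding comment."""
    curve = []
    for p0 in probs:
        w = [math.exp(-0.5 * ((p - p0) / bandwidth) ** 2) for p in probs]
        curve.append(sum(wi * y for wi, y in zip(w, outcomes)) / sum(w))
    return curve

def calibration_errors(probs, outcomes, bandwidth=0.1):
    """Mean absolute calibration error and its 0.9 quantile."""
    curve = smoothed_calibration(probs, outcomes, bandwidth)
    abs_err = sorted(abs(c - p) for c, p in zip(curve, probs))
    mean_ace = sum(abs_err) / len(abs_err)
    q90 = abs_err[int(0.9 * (len(abs_err) - 1))]
    return mean_ace, q90
```

The mean absolute calibration error averages |estimated P(Y=1 | p) - p| over the observed predictions, and the 0.9 quantile flags how badly calibrated the worst decile of predictions is.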
Drew Levy: Bob: this comment is terrific; it bumps the discussion up to the next level! Thank you.
A key point that you raise, and that I will now pay closer attention to, is the difference between prediction applications (the specific focus in this post) and "guidelines": "…build guidelines with good operating characteristics. Actions might be patient or item-specific, but they're usually not." This may be another prevalent source of ambiguity and confusion, similar to the subtle difference between prediction and diagnostics.
For our purposes here (clarifying appropriate methods), guidelines might be construed as classification with implications for subsequent action(s), but also with the qualification that a guideline is a heuristic subject to modification by additional information (e.g., clinical judgement). The guideline admits incomplete information that is not part of the model rationalizing the guideline, whereas the goal of prediction is to formally include as much of that ancillary information (the structure and the uncertainty/randomness) as possible for accurate individualized prognostication. And "classification", per se, seems like yet a different thing as well. It is not hard to see how something like "classification with implications for subsequent actions" gets confused with prediction/prognostication.
What I am sensing is that the confusion about the specific purposes to which ML and prediction modeling are applied is likely broader than just "prediction vs. classification". There is a lot to unpack here. Elucidating the various objectives, and matching them to appropriate methods, requires more thought and more clarity, and hopefully, eventually, some consensus on definitions.
Thank you for advancing the discussion with your insight.
Jeremy Sussman: This piece blames machine learning for a problem that 100% belongs to medical researchers and medical statisticians.
ML researchers do not use sensitivity and specificity! (More often they use precision and recall, which are even worse, but it's unfair to blame ML for techniques they don't use.)
In prediction work for medical journals, we've published with AUROC and calibration measures. We often include some slightly newer ones (Brier score, calibration slope), but we've never been asked to and no one has ever told us to avoid AUROC or recommended something better.
Please propose things that are better! Ask us to use them. Get them into places like the TRIPOD statement. But this seems unfair.
Frank Harrell: Jeremy, there are two reasons that my experience does not at all resonate with yours. First, I serve as a reviewer on a large number of submissions of machine learning exercises for medical journals, and I see sensitivity and specificity used very often. Second, precision and recall are improper accuracy scoring rules, i.e., they are optimized by a bogus model. Further, it is well known that the c-index (AUROC) is not powerful enough for comparing two models. It is a nice descriptive statistic for a single model, if you want to measure pure discrimination and don't care (for the moment) about calibration. Soon I'll be writing a new blog article describing what I feel are the best current approaches for measuring the added value of new markers, which relates to your plea for recommendations. Some of this is already covered in a recent talk you'll see at the bottom of my blog main page. Explained outcome variation is the watchword.
Drew Levy: Jeremy, thank you for your comment and perspective. First, I would like to say that "blame" is not the intent or spirit in which the post was shared. We should all be learning together.
Secondly, I regret that you seem to have missed the main thrust of the piece: the ROC is derived from sensitivity and specificity. I believe that the underlying fundamental mechanics of the AUROC are opaque to most who use it, and that for the purposes of evaluating ML intended as a prediction tool in health care, the AUROC, which carries through the conditioning of sensitivity and specificity, does not afford what people think it does.
If in your work you have employed calibration and other methods for evaluation of model performance that have conditioning and information appropriate for prospective prediction, then you are to be applauded. That sets a very good and encouraging example.
You are correct to insist that alternatives should be provided as well. That will be forthcoming.