Biomarker evaluation - c-statistic (AUC) and alternatives

I am currently trying to expand my statistical knowledge in the field of “biomarker evaluation”.
In my setting “biomarkers” are more or less measured values from blood samples, meaning values such as troponin, BNP (brain natriuretic peptide), creatine kinase.

I already understand why comparing AUCs (c-statistics) is problematic (low power), even though this is still done in 70% of the literature.
I have already looked at other, superior comparative measures here. This was really insightful and the most helpful source; I can’t thank you enough for that, @f2harrell!!

I am currently finding it very difficult to find good literature to develop this understanding further.
In particular, I find it difficult to articulate the problematic nature of the AUC comparison to my colleagues beyond the purely mathematical “low power” argument. I found the papers by Lobo et al. and Halligan et al. helpful. But the second one in particular leans towards NRI or net benefit, which, if I understood correctly, again have the problems of dichotomization or backward probabilities. (If someone can recommend literature discussing this in more detail, I would be very interested :slight_smile:)

To be honest, although I feel I understand the problems with AUC, ROC curves, sensitivity, specificity and so on fairly well, it is very difficult to convince colleagues, and especially senior researchers, that those problems really matter. To put it bluntly, the opinion (even if well evidenced) of a young PhD student does not carry as much weight as, for example, the remarks at an EMA presentation where the AUC is described as “an ideal measure of performance because it is cut-off independent”.

Therefore, “guidelines” for the “validation” of biomarkers would be very helpful to me.

A very good example of where the struggle begins is this chain of argument: “If there is an increase in the c-statistic, one has evidence for added predictive information, but if the c-statistic does not increase, one does not have evidence for no added value (the c-statistic is insensitive).”
I have the impression that I cannot explain this argument tangibly enough to my clinical colleagues…

Can anyone recommend good and further literature on this “subject”?

Thanks already for everyone trying to support and help me :slight_smile:
Best regards


This is a great question and I hope that others will respond here so that we can compile a list of simple reasons why differences in AUC should not be computed. Here’s my attempt.

  • A c-index (AUROC when the response is binary) estimates the probability that the marker levels are in the right direction when one selects a pair of patients, one with the outcome and one without. If the two marker levels are extremely different, and in the right direction, c gives no more credit than had the two marker values been only 1% apart. This makes c insensitive for quantifying the predictive discrimination of a single marker, and if c is computed on two markers and the two are subtracted, the insensitivity adds up. Two markers having c=0.80 and c=0.81 can have meaningfully different clinical prediction ability.
  • c is equivalent to a Wilcoxon test for whether the distribution of a marker is shifted when comparing patients having the outcome vs. patients not having it. The difference in two $c$s is equivalent to the difference in two Wilcoxon statistics, and no one uses that. For example, if one had a comparison of treatments A and B and a comparison of A and C, one does not compare B and C by subtracting the A-B from the A-C Wilcoxon statistic. Instead one does a head-to-head comparison of B and C. Ranks are not things we generally subtract.
  • The gold standard in traditional statistics is the log-likelihood function, and in Bayesian statistics is the log-likelihood + log-prior (log posterior distribution). The log-likelihood is the most sensitive metric for measuring information from the data, and R^2 measures that use the log-likelihood will be consistent with the most powerful tests for added value. With adjusted pseudo R^2 one can easily penalize the measure for the number of opportunities one had to maximize R^2 (e.g., nonlinear terms, multiple markers); no such penalty is available for c.
  • In addition to sensitive log-likelihood-based measures, a sensitive and clinically very interpretable alternative is the back-to-back histogram of predicted risks. How much does a new marker move the risks towards the tails?
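To make the first two bullets concrete, here is a minimal sketch in Python (standard library only). The marker values are made up for illustration and the function names are hypothetical. It computes c by brute-force pair counting and again from the Wilcoxon/Mann-Whitney rank sum, and shows that c does not change when the events' marker values are moved far away from the non-events, as long as the ordering is preserved.

```python
from itertools import product

def c_index(events, nonevents):
    """Proportion of (event, non-event) pairs in which the event patient
    has the higher marker value; tied pairs count one half."""
    concordant = 0.0
    for e, n in product(events, nonevents):
        if e > n:
            concordant += 1.0
        elif e == n:
            concordant += 0.5
    return concordant / (len(events) * len(nonevents))

def c_from_rank_sum(events, nonevents):
    """Same quantity via the Mann-Whitney U statistic (midranks for ties)."""
    pooled = sorted(events + nonevents)
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    n1, n0 = len(events), len(nonevents)
    u = sum(rank[v] for v in events) - n1 * (n1 + 1) / 2
    return u / (n1 * n0)

# Hypothetical marker values, for illustration only.
nonevents = [1.0, 2.0, 3.0, 4.0]
events_close = [2.5, 3.5, 4.1, 4.2]   # barely above the non-events they beat
events_far = [2.5, 3.5, 40.0, 400.0]  # same ordering, hugely separated

c_close = c_index(events_close, nonevents)
c_far = c_index(events_far, nonevents)
c_rank = c_from_rank_sum(events_close, nonevents)
print(c_close, c_far, c_rank)
```

All three values agree, which is the point: only the ranks enter c, so the size of the separation earns no extra credit.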

The histogram of predicted risks should always be presented and inspected (eyeball test) prior to any calculations.
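As a rough illustration of the eyeball test, the following sketch prints a text-only back-to-back histogram of predicted risks, with non-events on the left and events on the right. The risks and outcomes are simulated, not real data, and the function name `back_to_back_histogram` is hypothetical; in practice one would use a plotting library.

```python
import random

def back_to_back_histogram(risks, outcomes, bins=10, width=30):
    """Count predicted risks per bin, separately by outcome, and build
    mirrored text bars (non-events left, events right)."""
    counts_no = [0] * bins
    counts_yes = [0] * bins
    for r, y in zip(risks, outcomes):
        b = min(int(r * bins), bins - 1)  # risk of exactly 1.0 goes in last bin
        if y == 1:
            counts_yes[b] += 1
        else:
            counts_no[b] += 1
    biggest = max(counts_no + counts_yes) or 1
    lines = []
    for b in range(bins):
        left = "#" * round(width * counts_no[b] / biggest)
        right = "#" * round(width * counts_yes[b] / biggest)
        lines.append(f"{left:>{width}} |{b / bins:4.1f}| {right}")
    return counts_no, counts_yes, lines

# Simulated predicted risks and outcomes (assumption: risks are calibrated).
random.seed(2)
risks = [random.betavariate(2, 5) for _ in range(300)]
outcomes = [1 if random.random() < r else 0 for r in risks]

counts_no, counts_yes, lines = back_to_back_histogram(risks, outcomes)
print("\n".join(lines))
```

Running this once for the base model and once for the model with the new marker shows directly whether the marker pushes risks towards the tails.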

Simulations provide some insights into the ROC curve AUC. In simulations, the histogram is replaced by a population risk distribution, which fully determines the patient/nonpatient risk distributions, the ROC curve, and the ROC curve AUC. (1) Inspection of the integral formulas for the ROC curve AUC reveals that c-index can be interpreted as a Monte Carlo integration of the formula. (2) For several parametric risk distributions, the ROC curve AUC is a linear function of the standard deviation of the risk distribution. (2) So if the ROC curve AUC is not increased by a new marker, the risks probably haven’t been moved toward the tails. It has often been claimed the ROC curve is insensitive, but this reflects the risk shuffling that must occur on addition of new markers, i.e., the resulting increased risk dispersion of every risk stratum is accompanied by increased dispersion of neighboring risk strata, so the overarching population risk distribution may change minimally. (3)
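A small numerical sketch of the idea that the population risk distribution determines the AUC. Assumptions that are mine and not from the cited papers: risks are perfectly calibrated, the risk distribution is a discretized beta-shaped distribution with mean 0.2, and the function names are hypothetical. The code computes the SD of the risks and the implied AUC exactly (no sampling) for four dispersions.

```python
def risk_grid_beta(a, b, m=2000):
    """Discrete stand-in for a Beta(a, b) risk distribution: bin midpoints
    on (0, 1) with beta-shaped weights normalized to sum to 1."""
    ps = [(i + 0.5) / m for i in range(m)]
    ws = [p ** (a - 1) * (1 - p) ** (b - 1) for p in ps]
    total = sum(ws)
    return ps, [w / total for w in ws]

def sd_and_auc(ps, ws):
    """SD of the risks and the AUC implied by a calibrated risk distribution.
    ps must be in ascending order."""
    mean = sum(w * p for p, w in zip(ps, ws))
    var = sum(w * (p - mean) ** 2 for p, w in zip(ps, ws))
    # Risk distributions among patients (events) and nonpatients.
    ev = [w * p / mean for p, w in zip(ps, ws)]
    ne = [w * (1 - p) / (1 - mean) for p, w in zip(ps, ws)]
    auc, cum_ne = 0.0, 0.0
    for e, n in zip(ev, ne):
        auc += e * (cum_ne + 0.5 * n)  # P(nonpatient risk lower) + half of ties
        cum_ne += n
    return var ** 0.5, auc

# Mean risk fixed at 0.2; smaller k means a more dispersed risk distribution.
results = [sd_and_auc(*risk_grid_beta(0.2 * k, 0.8 * k)) for k in (40, 20, 10, 5)]
for sd, auc in results:
    print(f"SD of risk = {sd:.3f}   AUC = {auc:.3f}")
```

In this sketch the AUC rises as the SD of the risks rises; whether the relationship is exactly linear for a given family is a claim of the cited work, not something this code establishes.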

(1) The Risk Distribution Curve and its Derivatives
(2) Interpretation of the Area Under the ROC Curve for Risk Prediction Models
(3) Why Risk Factors Minimally Change the ROC Curve AUC
link censored by website, but can be found on medrxiv


A great addition to the discussion. One small clarification: differences in AUROC are insensitive when AUROC is estimated nonparametrically (which is the way it’s done 97% of the time, I’m guessing). When distributions are assumed and AUROC is estimated parametrically using standard deviations, that will be more sensitive.

Hi Tom!

Generally speaking, I think that every performance metric should be used with a clear context of decision-making and a reasonable interpretability.

My point of view:

  • For individual decision-making related to outcome harm vs treatment harm such as going through biopsy, taking antibiotics, taking statins etc… Decision Curve and NB.

  • For organizational decision-making (limited resources, without the assumption of treatment harm): PPV / Lift vs PPCR.

Dichotomization in the context of data-driven optimization leads to loss of information; dichotomization in the context of domain knowledge incorporates new knowledge and respects the context of the prediction. Choosing a probability threshold (or a range of probability thresholds) leads to meaningful communication with the clinicians: to most patients/clinicians a probability threshold of 0.95 is not important if the outcome is a deadly cancer and the treatment is fairly safe and helpful. The same is true for a probability threshold of 0.01 if the outcome is a meaningless belly ache and the treatment is opioids.

NB respects (unlike sens and spec) the flow of time because it does not require something that you don’t know in order to estimate something you do know. I can estimate the benefit of treating-all-patients, treating-none-of-the-patients or treating selectively according to estimated probabilities provided by a prediction model.
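For readers who have not computed it before, here is a minimal sketch of net benefit at a given probability threshold, using the usual decision-curve definition NB = TP/N − (FP/N) × pt/(1 − pt). The ten patients below are invented for illustration, and the function names are hypothetical.

```python
def net_benefit(risks, outcomes, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(outcomes)
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 1)
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of treating everyone, regardless of predicted risk."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical predicted risks and observed outcomes (illustration only).
risks = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.60, 0.80, 0.90]
outcomes = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

for pt in (0.1, 0.25, 0.5):
    nb_model = net_benefit(risks, outcomes, pt)
    nb_all = net_benefit_treat_all(outcomes, pt)
    # Treating no one has net benefit 0 by definition.
    print(f"threshold={pt:.2f}  model={nb_model:.3f}  treat-all={nb_all:.3f}  treat-none=0.000")
```

Everything here conditions on the predicted risk, which is known at decision time, and compares three strategies at the thresholds the clinicians consider reasonable.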

I made a presentation about NB if you want to have a different angle about the subject:

The c-statistic (which is equivalent to the AUROC for the binary outcomes, but not for time-to-event or competing-risks setting) is a nice way of exploring the performance of the model - but it will not provide you a clear way of quantifying added predictive information (again) in the right context.

I can’t imagine even one real example of two patients selected randomly given their true outcomes while the clinician has to guess which one of them has the outcome. Sounds more like a TV-Show rather than a real case scenario :sweat_smile:

Well stated. One minor issue: when mentioning PPV this implies dichotomization so that a “positive” can be achieved. PPV/NPV have extremely narrow usages so I’d rather just talk about P (probability of disease or outcome of disease).

One might interpret PPV as an individual probability of an outcome given a binary variable such as a binary test. In that case I do agree that it is almost never (or maybe never) useful, because in real life I always know at least one thing that is continuous in nature (age, for a start).

My intention was to use PPV the same way we use Lift: for a given resource constraint (or, more likely, a range of possible, flexible and yet-to-be-known resource constraints), what is the expected proportion of True-Positives in the subpopulation that was predicted as positive because of the enforced resource constraint?

The same way one might look at the “interventions-avoided” version of NB and it will always lead to the same decision but with different numbers and different intuition, one might look at PPV vs PPCR and it will lead to the same conclusion as Lift vs PPCR.

A random guess will lead to a PPV equal to a hypergeometrically distributed variable divided by the size of the targeted subpopulation: it’s like randomly sampling “n” patients, where n is set by a given resource constraint (Predicted-Positives), from the “N” patients in the total population (Total-Number-of-Observations), while “K” stands for the number of Real-Positives. The expected number of True-Positives for a random guess, given a constant number of Predicted-Positives (the resource constraint), is n × K / N, or in confusion-matrix terminology Predicted-Positives × Real-Positives / Total-Number-of-Observations, which is equal to Predicted-Positives × Prevalence.
The expected value of the PPV for a random guess is E(True-Positives / Predicted-Positives) = E(True-Positives) / Predicted-Positives = Predicted-Positives × Prevalence / Predicted-Positives = Prevalence.
The lift of a random guess is expected to be 1 for the same reason.

Perfect prediction is deterministic: it leads to a Lift of 1/Prevalence and a PPV of 1 until there are no more events to be found; after that the PPV falls as Real-Positives / Predicted-Positives towards the Prevalence, and the Lift falls correspondingly towards 1.
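The two reference cases can be tabulated with a few lines of Python. This is only a sketch under my own assumptions: a population of N = 1000 with K = 100 real positives, and hypothetical function names. The random-guess row uses the hypergeometric mean n × K / N; the perfect-prediction row finds min(n, K) true positives.

```python
def random_guess(n_pred_pos, n_total, n_real_pos):
    """Expected PPV and Lift when n_pred_pos patients are picked at random."""
    expected_tp = n_pred_pos * n_real_pos / n_total  # hypergeometric mean
    ppv = expected_tp / n_pred_pos
    prevalence = n_real_pos / n_total
    return ppv, ppv / prevalence

def perfect_prediction(n_pred_pos, n_total, n_real_pos):
    """PPV and Lift when every selected patient is an event until events run out."""
    tp = min(n_pred_pos, n_real_pos)
    ppv = tp / n_pred_pos
    prevalence = n_real_pos / n_total
    return ppv, ppv / prevalence

N, K = 1000, 100  # hypothetical population size and number of real positives
for n in (50, 100, 200, 500, 1000):
    ppcr = n / N  # predicted-positives condition rate (the resource constraint)
    ppv_r, lift_r = random_guess(n, N, K)
    ppv_p, lift_p = perfect_prediction(n, N, K)
    print(f"PPCR={ppcr:.2f}  random: PPV={ppv_r:.2f} Lift={lift_r:.1f}"
          f"  perfect: PPV={ppv_p:.2f} Lift={lift_p:.1f}")
```

Any real model's PPV and Lift curves lie between these two reference curves, which is what makes them easy to read against a resource constraint.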

The way I see it, Lift is the most underused performance metric.

Agreed. I just would rather PPV only be used in the context of a truly binary test where no risk factors (i.e., no variation in pre-test probability) exist. That’s the only way that the first P in PPV (“positive”) is helpful.

They are very good questions Tom … I’ve spent a lot of the last 15 years trying to get my head around evaluating biomarkers and trying to communicate with clinicians about them.
My first thought is, what stage are you at with these biomarkers - is it the first look in clinical data, or are you at the stage of working out how to use them?

ROC curves and AUCs have some use for a first look - but they tell you little about clinical utility. They may help to see which biomarkers could be worth pursuing or not. For a while there I was of the opinion that publishing ROC curves was a waste of time, but I have realised that sometimes their shape can give insight into how a biomarker may have utility (e.g. to stratify to low risk or high risk). Andre Carrington has taken to drawing a line on ROC curves that identifies the “useful area” (message me if you want this). However, similar information can be seen with decision curves and I think Uriah’s advice to look at these is worth following up on.

If, though, you are thinking about the application of the biomarker, then AUCs are inadequate as you must compare performance to current practice. I tend to go first to my clinical colleagues and find out what they want to improve - often this is something like an increase in Sensitivity (or NPV) at some decision threshold when they are trying to rule-out a disease. I tend not to address that directly, but it does allow me to work towards it.
First, I will look at the performance of the current practice/model and then compare it to the same model + the biomarker (as a continuous variable), using decision curves, risk assessment plots, Frank’s chi2 plots, calibration curves and then metrics like Brier skill, IDI (separately for events and non-events) etc. [I have these in an R package: GitHub - JohnPickering/rap: R package for Risk Assessment Plot and reclassification statistical tools].
Second, I will focus on the decision points and see if the biomarker can improve the diagnostic metrics my clinical colleagues care about. NRI may come into it here (but only if presented separately for those with and without events).

These are generalisations of course. I must admit I’ve not had much success in trying to explain why not AUCs - rather I include them alongside other metrics and graphs to try and explain why/why not the biomarker adds value.


My opinion is that the whole ROC curve is a very indirect way to look at low-to-high-risk stratification.


Can you talk about your views on continuous NRI and IDI? Conventional NRI is risk stratification based on grouping. This grouping method (such as using 2.5%, 5%, 10% stratification) is inefficient. However, some would say that continuous NRI and IDI would be better. I didn’t see your take on these metrics and would like to hear it if I could.

Continuous versions are better. See work of Margaret Pepe and others who point out several deficiencies of even the continuous versions. Bottom line: the log likelihood is your friend.
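To show what "use the log likelihood" can look like in practice, here is a self-contained sketch on simulated data (not anyone's real biomarker). It fits nested logistic models by Newton-Raphson and reports the likelihood ratio chi-square for the added marker. Two extra quantities are included as I understand them from this discussion and should be treated as assumptions: the fraction of new information, 1 − LR_base/LR_full, and an adjusted pseudo R² of the form 1 − exp(−(LR − p)/n). All function and variable names are hypothetical.

```python
import math
import random

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_logistic(X, y, iters=25):
    """Maximum likelihood logistic regression via Newton-Raphson.
    X rows must include a leading 1.0 for the intercept.
    Returns (coefficients, maximized log-likelihood)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            eta = sum(bj * xj for bj, xj in zip(beta, row))
            p = 1.0 / (1.0 + math.exp(-eta))
            for a in range(k):
                grad[a] += (yi - p) * row[a]
                for c in range(k):
                    hess[a][c] += p * (1 - p) * row[a] * row[c]
        beta = [bj + sj for bj, sj in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for row, yi in zip(X, y):
        eta = sum(bj * xj for bj, xj in zip(beta, row))
        loglik += yi * eta - math.log1p(math.exp(eta))
    return beta, loglik

# Simulated data: x1 is the established marker, x2 the new one.
random.seed(1)
n = 2000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [1 if random.random() < 1 / (1 + math.exp(-(-1 + a + 0.5 * b))) else 0
     for a, b in zip(x1, x2)]

_, ll_null = fit_logistic([[1.0] for _ in range(n)], y)
_, ll_base = fit_logistic([[1.0, a] for a in x1], y)
_, ll_full = fit_logistic([[1.0, a, b] for a, b in zip(x1, x2)], y)

lr_base = 2 * (ll_base - ll_null)
lr_full = 2 * (ll_full - ll_null)
lr_added = lr_full - lr_base             # LR chi-square for x2, 1 d.f.
fraction_new = 1 - lr_base / lr_full     # assumed definition, see lead-in
r2_adj_full = 1 - math.exp(-(lr_full - 2) / n)  # penalized for p = 2 predictors
print(lr_added, fraction_new, r2_adj_full)
```

In real work one would use an established package rather than a hand-written fitter; the point of the sketch is only that every quantity comes from the same log-likelihood.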

The coefficient of discrimination, the mean risk in patients minus the mean risk in nonpatients, is a simple, intuitive metric. The IDI is calculated by subtracting the coefficients of discrimination of two models. For any population risk distribution, the coefficient of discrimination is SD^2/(mean(1-mean)), where mean is the mean risk in the population. So the IDI is SD2^2/(mean2(1-mean2)) - SD1^2/(mean1(1-mean1)). If one is comparing two models on the same population, the mean risk is the same and the IDI is (SD2^2 - SD1^2)/(mean(1-mean)).
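The identity can be checked exactly on a toy example. The sketch below assumes perfectly calibrated risks (so the risk distribution among patients is proportional to risk × population weight) and uses two made-up discrete risk distributions with the same mean of 0.2; the function names are hypothetical.

```python
def discrimination_direct(ps, ws):
    """Mean risk in patients minus mean risk in nonpatients,
    for calibrated risks ps with population weights ws."""
    mean = sum(w * p for p, w in zip(ps, ws))
    mean_patients = sum(w * p * p for p, w in zip(ps, ws)) / mean
    mean_nonpatients = sum(w * p * (1 - p) for p, w in zip(ps, ws)) / (1 - mean)
    return mean_patients - mean_nonpatients

def discrimination_from_sd(ps, ws):
    """SD^2 / (mean * (1 - mean)) of the population risk distribution."""
    mean = sum(w * p for p, w in zip(ps, ws))
    var = sum(w * (p - mean) ** 2 for p, w in zip(ps, ws))
    return var / (mean * (1 - mean))

# Two hypothetical models on the same population (both have mean risk 0.2).
model1 = ([0.10, 0.30], [0.5, 0.5])
model2 = ([0.05, 0.20, 0.50], [0.5, 0.25, 0.25])

cd1_direct = discrimination_direct(*model1)
cd1_sd = discrimination_from_sd(*model1)
cd2_direct = discrimination_direct(*model2)
cd2_sd = discrimination_from_sd(*model2)
idi = cd2_sd - cd1_sd
print(cd1_direct, cd1_sd, cd2_direct, cd2_sd, idi)
```

Both routes give the same coefficient for each model, and the IDI is simply the difference in risk variance scaled by mean(1-mean).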

Alternatives to the ROC Curve AUC and C-statistic for Risk Prediction Models

I read some of her papers. I do not have a professional background in statistics, so I find it challenging to understand the mathematical theory presented. However, after reading these papers, my understanding is that AUC, NRI, or IDI can offer some information, but most of the time they are not efficient and come with various issues, whether in the categorical or the continuous versions (the latter being slightly better). In contrast, performance evaluations of the model (likelihood ratio, R2, and other metrics you mentioned in your book) seem to be more reliable. Additionally, considerations of costs and the background of study subjects (different improvement in different populations) are necessary. Is my understanding correct?

As I wrote here, it is difficult to understand why so much effort has been put into developing new indexes when we have so many great underutilized old indexes.

I have read the article you wrote. In fact, I have been reading it repeatedly lately, but I’m not smart enough and haven’t fully grasped it yet (so I tried my best to confirm my thoughts with you). I need to explain these things to others, and it’s challenging to make others give up an existing idea (using ROC). I recently graduated, and no one cares about my opinion.

By the way, I feel that these metrics are sometimes developed for the sake of research funding or writing papers (of course, most researchers are not like this). I am currently working at a national institution where researchers are extremely enthusiastic about developing frailty indexes, despite there being at least ten existing ones. I don’t understand the difficulty in determining frailty; existing frailty indicators can efficiently evaluate it. I think it’s a waste of resources, but that’s what some researchers tirelessly engage in (in my opinion).

You raise excellent points. In the big picture of academic research, a lot of good work is done, but perhaps 1/3 to 1/2 of researchers act as if they are more interested in novelty than in doing the best work. Researchers like to get their names associated with a discovery or a method. Bayesian modelers represent a kind of opposite approach where they strive to use a framework that has existed for centuries while putting their intellectual abilities into model specification. This model specification is based on an attempt to genuinely reflect the data generating process and its complexities and measurement limitations. Model specification is a science and an art and I would rather see all of us put our energy into that. In terms of model performance we need to keep refining methods based on the gold standard log likelihood (for Bayes this equates to the log-likelihood plus the log prior).

I’m late to the party responding to this. For a continuous variable like a biomarker or the output of a logistic regression model I don’t think NRI should be used at all - indeed I wrote about this for nephrologists 12 years ago and showed the arbitrariness of the thresholds used for stratification (New Metrics for Assessing Diagnostic Potential of Candidate... : Clinical Journal of the American Society of Nephrology). Having said that, for something like Chronic Kidney Disease, where there is established staging, it may have utility, but only if presented separately for those with the event of interest (NRIevent) and without (NRInon-event). The two should not be added because the resultant NRI is prevalence dependent. Similarly, I think the IDIevent and IDInon-event should be presented rather than the overall IDI (which assumes events and non-events are equally important).
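A minimal sketch of reporting the two NRI components separately, assuming an established staging system so that the categories are not arbitrary. The stages and outcomes below are invented, the function name is hypothetical, and "up" means the new model assigns a higher stage than the old one.

```python
def nri_components(old_stage, new_stage, outcomes):
    """Return (NRI_event, NRI_nonevent) without summing them.
    NRI_event    = P(up | event)      - P(down | event)
    NRI_nonevent = P(down | nonevent) - P(up | nonevent)"""
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for old, new, y in zip(old_stage, new_stage, outcomes):
        if y == 1:
            n_e += 1
            up_e += new > old
            down_e += new < old
        else:
            n_n += 1
            up_n += new > old
            down_n += new < old
    return (up_e - down_e) / n_e, (down_n - up_n) / n_n

# Hypothetical stages (1-3) under the old and new model, and observed outcomes.
old_stage = [1, 2, 2, 3, 1, 1, 2, 2, 3, 3]
new_stage = [2, 3, 2, 3, 1, 1, 1, 2, 3, 2]
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

nri_event, nri_nonevent = nri_components(old_stage, new_stage, outcomes)
print(f"NRI(event) = {nri_event:.3f}, NRI(non-event) = {nri_nonevent:.3f}")
```

Keeping the two numbers apart lets the reader weigh improvements in events and non-events according to the clinical context instead of having that weighting fixed implicitly by a sum.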


Thank you for your helpful paper and suggestions. Since discussing this with Professor Harrell, I have searched for many papers on NRI. I will add your paper to my list. NRI is widely used (perhaps because it seems simple and easy to understand?), and I used to think continuous NRI was decent. However, I found there are many criticisms of this approach in many methodological papers. Now, I follow Professor Harrell’s advice, relying primarily on likelihood ratio statistics to assess model improvement and attempting to use the new function of variance of predictive values from the rms package. Your paper indicates that IDI is better than NRI. To be honest, I’m a bit unsure about IDI because it is often discussed alongside NRI, and some people criticize both, which makes me hesitant. However, I lean towards trying various methods (effective methods recommended by statisticians) for a comprehensive evaluation. As you mentioned, in certain situations, IDI and NRI may also have value. Additionally, the perspective on IDI in the context of event and non-event differentiation is valuable, and I will refer to your advice when using IDI.
Please correct me if my understanding is incorrect. I am still learning and greatly appreciate any opinions and comments.


I am reading the discussion here with great interest. I am currently in a review process for a submitted paper. After some data from cell research and a lot of cellular investigations, we tried to validate the results in clinical data in the last part of our work.
To do this, we calculated a multivariable Cox regression model and quantified the added value of the “new” parameter using Chi-Squared. For all procedures, I was very much guided by your explanations in your blog @f2harrell and showed, for example, prior and post distributions and fraction of new information.
Now we have received the revision and it is exactly this part that is statistically criticized for no reason. The reviewer wrote that it is necessary and a requirement to calculate and report delta c-statistics, categorical NRI, IDI and p-values for all of those. (p-values for delta-c-index and NRI/IDI??? :woozy_face:)
In the meantime, I have also created a net-benefit analysis, and this seems much more useful to me than tools such as categorical NRI and IDI or the c-statistic in the Cox regression case.

Obviously the easy way, especially for me as a junior researcher, would be to just add the required things, but I am very reluctant to use them given their many problems.
So my current plan is to write a response outlining the drawbacks of these quantities and provide appropriate evidence. Which papers are helpful or worth citing here? Or can I even use your blog post as a citation?

I would be very happy for any recommendation of papers or citable resources, maybe one of you also had such a “discussion case” in the past, and already tried to convince the other side. :slightly_smiling_face:
