Biomarker evaluation - c-statistic (AUC) and alternatives

I am currently trying to expand my statistical knowledge in the field of “biomarker evaluation”.
In my setting “biomarkers” are more or less measured values from blood samples, meaning values such as troponin, BNP (brain natriuretic peptide), creatine kinase.

I have already understood well why the comparison of AUCs (c-statistics) is problematic (low power), even though this is still done in 70% of the literature.
I have also already looked at superior alternatives to such comparative measures here. This was really insightful and the most helpful source; I can’t thank you enough for that, @f2harrell!!

I am currently finding it very difficult to find good literature to develop this understanding further.
In particular, I find it difficult to articulate the problematic nature of the AUC comparison to my colleagues beyond the purely mathematical “low power” argument. I found the papers by Lobo et al. and Halligan et al. helpful, but the second one in particular points in a certain way towards NRI or net benefit, which, if I understood correctly, again have the problems of dichotomization or backward probabilities. (If someone can recommend literature discussing this in more detail, I would be very interested :slight_smile:)

To be honest, although I feel I understand the problems with AUC, ROC curves, sensitivity, specificity and so on to a good extent, it is very difficult to argue to colleagues and especially senior researchers why those problems really matter. To put it bluntly, the opinion (even if well evidenced) of a young PhD student does not carry as much weight as, for example, the remarks at an EMA presentation where the AUC is described as “an ideal measure of performance because it is cut-off independent”.

Therefore, “guidelines” for the “validation” of biomarkers would be very helpful to me.

A very good example of where the argumentation struggle begins is the chain of reasoning: “If the c-statistic increases, one has evidence for added predictive information; but if the c-statistic does not increase, one does not have evidence for no added value (the c-statistic is insensitive).”
I have the impression that I cannot explain this argument tangibly enough to my clinical colleagues…

Can anyone recommend good and further literature on this “subject”?

Thanks already for everyone trying to support and help me :slight_smile:
Best regards


This is a great question and I hope that others will respond here so that we can compile a list of simple reasons why differences in AUC should not be computed. Here’s my attempt.

  • A c-index (AUROC when the response is binary) estimates the probability that the marker levels are in the right direction when one selects a pair of patients, one with the outcome and one without. If the two marker levels are extremely different, and in the right direction, c gives no more credit than had the two marker values been only 1% apart. This makes c insensitive for quantifying the predictive discrimination of a single marker, and if c is computed on two markers and the two are subtracted, the insensitivity adds up. Two markers having c=0.80, 0.81 can have meaningfully different clinical prediction ability.
  • c is equivalent to a Wilcoxon test for whether the distribution of a marker is shifted when comparing patients having the outcome vs. patients not having it. The difference in two c-indexes is equivalent to the difference in two Wilcoxon statistics, and no one uses that. For example, if one had a comparison of treatments A and B and a comparison of A and C, one would not compare B and C by subtracting the A-B from the A-C Wilcoxon statistic. Instead one does a head-to-head comparison of B and C. Ranks are not things we generally subtract.
  • The gold standard in traditional statistics is the log-likelihood function, and in Bayesian statistics is the log-likelihood + log-prior (log posterior distribution). The log-likelihood is the most sensitive metric for measuring information from the data, and R^2 measures that use the log-likelihood will be consistent with the most powerful tests for added value. With adjusted pseudo R^2 one can easily penalize the measure for the number of opportunities one had to maximize R^2 (e.g., nonlinear terms, multiple markers); no such penalty is available for c.
  • In addition to sensitive log-likelihood-based measures, a sensitive and clinically very interpretable alternative is the back-to-back histogram of predicted risks. How much does a new marker move the risks towards the tails?
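To make the first two bullets concrete, here is a minimal Python sketch with simulated marker values (all numbers made up) showing that the nonparametric c-index is just the Mann-Whitney/Wilcoxon U statistic rescaled — i.e., subtracting two c-indexes really is subtracting two Wilcoxon statistics:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Simulated marker: shifted upward in the 100 patients with the outcome.
x_neg = rng.normal(0.0, 1.0, 200)   # no outcome
x_pos = rng.normal(1.0, 1.0, 100)   # outcome

# c-index: P(marker ranks the pair correctly) over all outcome/no-outcome pairs.
diffs = x_pos[:, None] - x_neg[None, :]
c_index = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

# The same number from the Mann-Whitney/Wilcoxon U statistic (recent SciPy
# returns the U statistic of the first sample): c = U / (n_pos * n_neg).
u_stat, _ = mannwhitneyu(x_pos, x_neg, alternative="two-sided")
c_from_u = u_stat / (len(x_pos) * len(x_neg))
```

Note that `diffs > 0` only records the direction of each pair; a pair separated by 3 SDs and a pair separated by 0.01 SD contribute identically, which is the insensitivity described above.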

The histogram of predicted risks should always be presented and inspected (eyeball test) prior to any calculations.

Simulations provide some insights into the ROC curve AUC. In simulations, the histogram is replaced by a population risk distribution, which fully determines the patient/nonpatient risk distributions, the ROC curve, and the ROC curve AUC. (1) Inspection of the integral formulas for the ROC curve AUC reveals that the c-index can be interpreted as a Monte Carlo integration of the formula. (2) For several parametric risk distributions, the ROC curve AUC is a linear function of the standard deviation of the risk distribution. (2) So if the ROC curve AUC is not increased by a new marker, the risks probably haven’t been moved toward the tails. It has often been claimed that the ROC curve AUC is insensitive, but this reflects the risk shuffling that must occur on addition of new markers, i.e., the increased risk dispersion of every risk stratum is accompanied by increased dispersion of neighboring risk strata, so the overarching population risk distribution may change minimally. (3)
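A small simulation in the spirit of this argument (hypothetical Beta-distributed risk populations with the same mean but different spread; all parameters made up) illustrates both points: the c-index acts as a Monte Carlo integration over outcomes drawn from the risk distribution, and more dispersed risks yield a larger AUC:

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_auc(risks):
    # Draw each individual's outcome from their true risk, then compute the
    # c-index of risk vs. outcome: a Monte Carlo integration of the AUC formula.
    y = rng.binomial(1, risks)
    pos, neg = risks[y == 1], risks[y == 0]
    d = pos[:, None] - neg[None, :]
    return np.mean(d > 0) + 0.5 * np.mean(d == 0)

n = 4000
risks_narrow = rng.beta(8, 32, n)   # mean 0.2, SD ~0.06
risks_wide = rng.beta(2, 8, n)      # mean 0.2, SD ~0.12

auc_narrow = mc_auc(risks_narrow)
auc_wide = mc_auc(risks_wide)
# The more dispersed risk distribution gives the larger AUC.
```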

(1) The Risk Distribution Curve and its Derivatives
(2) Interpretation of the Area Under the ROC Curve for Risk Prediction Models
(3) Why Risk Factors Minimally Change the ROC Curve AUC
link censored by website, but can be found on medrxiv


A great addition to the discussion. One small clarification: differences in AUROC are insensitive when AUROC is estimated nonparametrically (which is the way it’s done 0.97 of the time, I’m guessing). When distributions are assumed and AUROC is estimated parametrically using standard deviations, that will be more sensitive.
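As a sketch of the parametric route (a binormal model on simulated data; the means and SDs are made up): when normal marker distributions are assumed in cases and controls, AUROC can be estimated from means and standard deviations via the normal CDF rather than from ranks:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Binormal model: marker ~ N(0, 1) in controls and N(0.8, 1.2) in cases.
x0 = rng.normal(0.0, 1.0, 500)   # controls
x1 = rng.normal(0.8, 1.2, 300)   # cases

# Parametric (binormal) AUROC: Phi of the standardized mean separation.
# This uses the distributional assumption directly.
mu0, sd0 = x0.mean(), x0.std(ddof=1)
mu1, sd1 = x1.mean(), x1.std(ddof=1)
auc_parametric = norm.cdf((mu1 - mu0) / np.hypot(sd0, sd1))

# Rank-based (nonparametric) AUROC on the same data, for comparison.
d = x1[:, None] - x0[None, :]
auc_nonparametric = np.mean(d > 0) + 0.5 * np.mean(d == 0)
```

The two estimates agree closely when the normality assumption holds; the parametric version inherits the extra sensitivity of the distributional assumption.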


Hi Tom!

Generally speaking, I think that every performance metric should be used with a clear context of decision-making and a reasonable interpretability.

My point of view:

  • For individual decision-making that weighs outcome harm vs. treatment harm, such as going through biopsy, taking antibiotics, taking statins etc.: Decision Curves and NB.

  • For organizational decision-making (limited resources, without the assumption of treatment harm): PPV / Lift vs. PPCR.

Dichotomization in the context of data-driven optimization leads to loss of information; dichotomization in the context of domain knowledge incorporates new knowledge and respects the context of the prediction. Choosing a probability threshold (or a range of probability thresholds) leads to meaningful communication with the clinicians: to most patients/clinicians, a probability threshold of 0.95 is not reasonable if the outcome is a deadly cancer and the treatment is fairly safe and helpful. The same is true for a probability threshold of 0.01 if the outcome is a meaningless belly ache and the treatment is opioids.

NB respects (unlike sens and spec) the flow of time, because it does not condition on something you don’t know at decision time (the true outcome) in order to estimate something you need to know. I can estimate the benefit of treating all patients, treating none of the patients, or treating selectively according to estimated probabilities provided by a prediction model.
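A minimal sketch of that calculation (simulated, well-calibrated risks; the threshold of 0.10 and the Beta parameters are arbitrary), comparing treat-selectively against the treat-all and treat-none strategies:

```python
import numpy as np

def net_benefit(y, p_hat, pt):
    # NB at threshold pt for the policy "treat if predicted risk >= pt":
    # NB = TP/n - (FP/n) * pt/(1 - pt)
    n = len(y)
    treat = p_hat >= pt
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return tp / n - (fp / n) * pt / (1 - pt)

rng = np.random.default_rng(3)
n = 5000
p_true = rng.beta(2, 8, n)       # well-calibrated risks, prevalence ~0.2
y = rng.binomial(1, p_true)

pt = 0.10                        # illustrative threshold
nb_model = net_benefit(y, p_true, pt)                 # treat selectively
nb_all = y.mean() - (1 - y.mean()) * pt / (1 - pt)    # treat everyone
nb_none = 0.0                                         # treat no one
```

Sweeping `pt` over a range of plausible thresholds and plotting the three curves gives the decision curve.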

I made a presentation about NB if you want a different angle on the subject:

The c-statistic (which is equivalent to the AUROC for binary outcomes, but not in time-to-event or competing-risks settings) is a nice way of exploring the performance of a model, but it will not provide a clear way of quantifying added predictive information (again, in the right context).

I can’t imagine even one real example where two patients are selected randomly given their true outcomes and the clinician has to guess which one of them has the outcome. Sounds more like a TV show than a real case scenario :sweat_smile:


Well stated. One minor issue: when mentioning PPV this implies dichotomization so that a “positive” can be achieved. PPV/NPV have extremely narrow usages so I’d rather just talk about P (probability of disease or outcome of disease).


One might interpret PPV as an individual probability of an outcome given a binary variable such as a binary test; in that case I do agree that it is almost never (or maybe never) useful, because in real life I always know at least one thing that is continuous in nature (age, for a start).

My intention was to use PPV the same way we use Lift: for a given resource constraint (or, more likely, a range of possible, flexible and yet-to-be-known resource constraints), what is the expected proportion of True-Positives within the subpopulation that was predicted as positive under the enforced resource constraint?

The same way one might look at the “interventions-avoided” version of NB and it will always lead to the same decision but with different numbers and different intuition, one might look at PPV vs PPCR and it will lead to the same conclusion as Lift vs PPCR.

A random guess will lead to a PPV that is a Hypergeometric-distributed variable divided by the size of the target population: it is like randomly sampling “n” patients equal to a given resource constraint (Predicted-Positives) out of “N” patients in the total population (Total-Number-of-Observations), where “K” stands for the number of Real-Positives. The expected number of True-Positives for a random guess, given a constant number of Predicted-Positives (the resource constraint), is nK/N, or in confusion-matrix terminology Predicted-Positives × Real-Positives / Total-Number-of-Observations, which is equal to Predicted-Positives × Prevalence.
The expected PPV for a random guess is E(True-Positives/Predicted-Positives) = E(True-Positives)/Predicted-Positives = Predicted-Positives × Prevalence / Predicted-Positives = Prevalence.
The Lift of a random guess is expected to be 1 for the same reasons.

Perfect prediction is deterministic: it leads to a Lift of 1/Prevalence and a PPV of 1 until there are no more events to be found; then the Lift decreases towards 1 and the PPV towards the Prevalence.
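A small simulation (made-up numbers: N = 10,000 patients, K = 1,000 events, a budget of 500 flagged patients) illustrating these two extremes — a random guess gives PPV ≈ Prevalence and Lift ≈ 1, while a perfect ranking under budget gives PPV = 1 and Lift = 1/Prevalence:

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, budget = 10_000, 1_000, 500   # population, real positives, resource constraint

y = np.zeros(N)
y[:K] = 1                           # prevalence = K/N = 0.1

def ppv_and_lift(y, scores, budget):
    # Flag the top-`budget` scores as positive (the resource constraint), then
    # PPV = True-Positives / Predicted-Positives and Lift = PPV / Prevalence.
    top = np.argsort(-scores)[:budget]
    ppv = y[top].mean()
    return ppv, ppv / y.mean()

ppv_rand, lift_rand = ppv_and_lift(y, rng.random(N), budget)             # random guess
ppv_perf, lift_perf = ppv_and_lift(y, y + rng.random(N) * 1e-6, budget)  # perfect ranking
# Random: PPV ~ 0.1, Lift ~ 1.  Perfect (budget <= K): PPV = 1, Lift = 10.
```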

The way I see it, Lift is the most underused performance metric.


Agreed. I just would rather PPV only be used in the context of a truly binary test where no risk factors (i.e., no variation in pre-test probability) exist. That’s the only way that the first P in PPV (“positive”) is helpful.


They are very good questions, Tom … I’ve spent a lot of the last 15 years trying to get my head around evaluating biomarkers and trying to communicate with clinicians about them.
My first thought is: what stage are you at with these biomarkers? Is it the first look in clinical data, or are you at the stage of working out how to use them?

ROC curves and AUCs have some use for a first look, but they tell you little about clinical utility. They may help to see which biomarkers could be worth pursuing or not. For a while there I was of the opinion that publishing ROC curves was a waste of time, but I have realised that sometimes their shape can give insight into how a biomarker may have utility (eg to stratify to low risk or high risk). Andre Carrington has taken to drawing a line on ROC curves that identifies the “useful area” (message me if you want this). However, similar information can be seen with decision curves and I think Uriah’s advice to look at these is worth following up on.

If, though, you are thinking about the application of the biomarker, then AUCs are inadequate, as you must compare performance to current practice. I tend to go first to my clinical colleagues and find out what they want to improve; often this is something like an increase in sensitivity (or NPV) at some decision threshold when they are trying to rule out a disease. I tend not to address that directly, but it does allow me to work towards it.
First, I will look at the performance of the current practice/model and then compare to the same model + the biomarker (as a continuous variable). Decision curves, risk assessment plots, Frank’s chi2 plots, calibration curves and then metrics like Brier skill, IDI (separately for events and non-events) etc. [I have these in an R package GitHub - JohnPickering/rap: R package for Risk Assessment Plot and reclassification statistical tools].
Second, I will focus on the decision points and see if the biomarker can improve the diagnostic metrics my clinical colleagues care about. NRI may come into it here (but only if presented separately for those with and without events).
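For the first step — current model vs. the same model plus the continuous biomarker — the most sensitive assessment of added value is a likelihood-ratio test, in line with the log-likelihood argument earlier in the thread. A self-contained Python sketch (simulated data with made-up coefficients; the logistic log-likelihood is maximized directly with SciPy rather than a dedicated modeling package):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def fit_logistic(X, y):
    # Maximize the binomial log-likelihood; returns the maximized log-likelihood.
    def negll(beta):
        eta = X @ beta
        return np.sum(np.logaddexp(0.0, eta)) - y @ eta  # -sum[y*eta - log(1+e^eta)]
    res = minimize(negll, np.zeros(X.shape[1]), method="BFGS")
    return -res.fun

rng = np.random.default_rng(5)
n = 2000
base = rng.normal(size=n)            # stand-in for the current model's predictor
marker = rng.normal(size=n)          # new biomarker, kept continuous
eta = -1.0 + 0.8 * base + 0.5 * marker
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

ones = np.ones(n)
X0 = np.column_stack([ones, base])           # current practice/model
X1 = np.column_stack([ones, base, marker])   # + the biomarker

ll0 = fit_logistic(X0, y)
ll1 = fit_logistic(X1, y)

lr_stat = 2 * (ll1 - ll0)            # ~ chi-squared(1) if the marker adds nothing
p_value = chi2.sf(lr_stat, df=1)
```

The same fitted models supply the predicted risks for decision curves, calibration curves, and the back-to-back histograms mentioned above.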

These are generalisations of course. I must admit I’ve not had much success in trying to explain why AUCs should not be used; rather, I include them alongside other metrics and graphs to try to explain why or why not the biomarker adds value.


My opinion is that the whole ROC curve is a very indirect way to look at low-to-high-risk stratification.