Hello everyone,
One question that often comes up is whether a model/variable provides consistent levels of risk-discrimination across various subgroups of interest* (e.g., by sex, race, or other similar variables typically analyzed in a categorical fashion). This is straightforward to tackle by using interaction terms (if necessary) and calculating C-statistics in these subgroups.
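For example, a minimal sketch on simulated data (the data frame d, its variables, and the use of survival::concordance here are illustrative assumptions, not a prescribed method):
#C-statistics within categorical subgroups
library(survival)
set.seed(1)
d <- data.frame(sex = sample(c("Female", "Male"), 500, replace = TRUE),
                lp = rnorm(500)) #lp = the model's linear predictor
d$y <- rbinom(500, 1, plogis(d$lp))
concordance(y ~ lp, data = subset(d, sex == "Female"))
concordance(y ~ lp, data = subset(d, sex == "Male"))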
When this is desired for continuous variables (e.g., age), the typical approach is to categorize the variable in some manner (e.g., <65 and >=65) and compare C-statistics in these subgroups. The problems with this are the arbitrary nature of the dichotomization, its susceptibility to manipulation (choosing the "right" cutoff to produce the desired result), and statistical imprecision/wider CIs (which become especially problematic when dichotomizing in a less-than-huge dataset).
I’ve often wondered whether there’s some approach of estimating C-statistics (and, more generally, other metrics of model performance) across the spectrum of a continuous variable without needing to dichotomize in this manner.
*Note that the issue in these scenarios is not so much that the effect size of a given marker changes across the spectrum of the continuous variable of interest (this would be straightforward to investigate using an interaction term), but rather that the distribution of the marker changes markedly across the spectrum of said continuous variable (such that it ceases to “discriminate” because it becomes more or less the same in everyone).
One way I can think of to approach this is to refit your model K times, at each step limiting your dataset to only those patients with the continuous variable greater than some threshold (Xk). You would do this with a large enough number of thresholds that you get a reasonably smooth estimate of how the C-statistic changes in patients with increasingly greater values of the continuous variable of interest.
Any advice or pointers to resources would be very welcome!
I haven't thought this through, but subsetting on an interval of a continuous variable and then computing the c-index will result in a smaller c if the variable being subset on is prognostic. And c is a probability about pairs of patients, so I'm not quite getting why a "subset c" would be useful. Is there another general way to state your goal?
Thanks for your reply Prof. Harrell.
I wrote up a simulated example of the notion I’m getting at. Suppose X1 is a positive-only variable that is not associated with Y, our binary outcome of interest. However, it increases the variance of X2, a variable which does associate with Y. More formally:
- $X_1 \sim \text{Gamma}(\alpha = 1, \theta = 5)$ (a gamma-distributed variable with a shape parameter of 1 and a scale of 5)
- $X_2 \sim N(0, \sigma = X_1)$ (a normally distributed variable with a mean of 0 and a standard deviation equal to $X_1$)
- $Y \sim \text{Bernoulli}(\pi)$, a binary variable whose probability of occurrence ($\pi$) depends on $X_2$ but not $X_1$
This makes it so that X2 has a fan-shaped scatterplot with respect to X1 (since X1 determines its standard deviation):
Consequently, when we plot C-statistics of Y (using X2 as the predictor) at increasing cutoffs of X1, we get a plot that looks like so:
In the above example, X1 is useless at predicting Y (the C-statistic will be 0.5 within simulation error); however, it determines the extent to which X2 is able to predict Y.
I’ve attached the code below in case it helps:
#N of patients
n <- 10000
#Set seed
set.seed(100)
#X1; a continuous positive variable with a mean of 5. Not associated with y but scales x2.
x1 <- rgamma(n, shape = 1, scale = 5)
#X2; a variable that scales with x1. Associated with y.
x2 <- rnorm(n, mean = 0, sd = x1)
plot(x1, x2)
#Y; a binary variable whose probability depends only on x2
#(noise is added to the linear predictor so that discrimination is imperfect)
logits <- x2 + rnorm(n, 0, 10)
probs <- plogis(logits)
y <- rbinom(n, 1, prob = probs)
#Calculate C-statistics
library(survival)
#Overall
concordance(y ~ x2)
#When x1 is above 1 (x2 varies a lot)
concordance(y[x1 > 1] ~ x2[x1 > 1]) #High C-statistic (x2 varies a lot and is thus able to discriminate well)
#When x1 is below 1 (x2 varies little)
concordance(y[x1 < 1] ~ x2[x1 < 1]) #Low C-statistic (x2 varies little and is thus unable to discriminate well)
#Get C-statistics at increasingly greater cutoffs of X1
c_statistic <- lapply(quantile(x1, seq(0.05, 0.95, 0.01)),
                      function(cutoff)
                        concordance(y[x1 > cutoff] ~ x2[x1 > cutoff])$concordance)
#Unlist and plot
c_statistic <- unlist(c_statistic)
plot(c_statistic)
I wasn’t aware that the c-statistic was being used in this way - I would have thought maybe the explained variation in Y given by the variable of interest would be more useful here.
This could be a silly idea, but what if you did it like LOO-CV: leave out each value of the continuous variable and recalculate the AUC? You could maybe find some areas where the model overperformed by leaving out some groups.
@Ahmed_Sayed nice work. What I'm lacking is an understanding of the fundamental clinical question you are trying to answer, and why, as @robinblythe stated, explained outcome variation or some other measure would not be more relevant. If the outcome were continuous, I would have thought the following would be more direct: compute the residual for each person and run a loess smoother relating age to the absolute residual. This estimates absolute prediction error as a function of age.
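A minimal sketch of that suggestion on simulated data (the linear model and variable names are illustrative assumptions, not part of the thread):
#Loess smoother of absolute residuals against age
set.seed(1)
n <- 2000
age <- runif(n, 20, 80)
x <- rnorm(n)
y_cont <- x + rnorm(n, sd = age / 40) #absolute prediction error grows with age
fit <- lm(y_cont ~ x)
abs_resid <- abs(resid(fit))
sm <- loess(abs_resid ~ age)
ord <- order(age)
plot(age, abs_resid, col = "grey")
lines(age[ord], fitted(sm)[ord], lwd = 2) #absolute prediction error vs age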
For continuous outcomes that approach definitely sounds quite doable and would tackle the question in a straightforward manner.
The clinical question I had in mind was like so: We have a marker X which ranges (for example) from 0 to 1,000. We know there’s a strong linear association between levels of said marker and one’s risk of death.
This association does not vary with age. As such, when one measures X in younger and older individuals, one finds the same HR for low vs. high levels of the marker (say, an HR of 5.0 comparing values of 100 vs. 900 in both younger and older individuals).
However, the problem is that X varies very little for younger individuals. Almost everyone has values that are very close to 0. Therefore, despite the strong linear association that does not vary with age, X offers little clinical value in younger individuals.
For older individuals, there’s a more uniform spread of values from 0 to 1,000. Therefore, X provides some important clinical value.
Calculating the C-statistic in these two groups separately would provide us with an important piece of information: X is much better at risk discrimination among older than among younger adults, despite the effect sizes being identical in both age groups.
The question thus becomes: how do we analyze the data so as to describe the age-dependent discriminative value of X?
One idea I had in mind that avoids the need to describe the change in the C-statistic is the following: calculate the 25th and 75th percentiles of X across the spectrum of age using quantile regression, then graph the HRs comparing these age-dependent percentiles of X. This would make it so the HR actually represents comparisons that are realistic for each age (you'd go from an HR of 1.10 at age 20 to an HR of 10.0 at age 80; not because X interacts with age, but because the values of X being compared are very different in those aged 20 vs. those aged 80).
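A rough sketch of that idea on simulated data (the quantreg call, the spline basis, and the assumed constant log-HR are my own illustrative assumptions):
#Age-dependent 25th/75th percentiles of X via quantile regression, and the
#HR implied by comparing them at each age
library(quantreg)
library(splines)
set.seed(2)
n <- 5000
age <- runif(n, 20, 80)
X <- rgamma(n, shape = 2, scale = 5 * (age / 20)^2) #spread of X grows with age
fit_q <- rq(X ~ ns(age, df = 3), tau = c(0.25, 0.75))
grid <- data.frame(age = 20:80)
q <- predict(fit_q, newdata = grid) #columns: 25th and 75th percentiles
beta <- log(5) / 800 #assumed constant log-HR per unit of X (HR 5.0 per 800 units)
hr <- exp(beta * (q[, 2] - q[, 1])) #HR: age-specific 75th vs 25th percentile
plot(grid$age, hr, type = "l", xlab = "Age",
     ylab = "HR (75th vs 25th percentile of X)")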
For continuous variables, I think that’s definitely a viable approach. Thanks for the suggestion! But I’m not sure if it extends to binary variables as well.
I think the LOO might work, but then the question becomes: do I leave out one value at a time (I imagine this might be somewhat noisy in smaller datasets)? Or do I pick a moving window of values to remove?
At its core, my intuition is that this is similar to the threshold approach I used above, where you continuously recalculate the C-index after leaving out values below some threshold of X1 (except that here you'd leave out particular values of X1 rather than everything below a threshold).
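A quick sketch of that moving-window variant, reusing x1, x2, y, and survival::concordance from the simulation above (the window width is an arbitrary assumption):
#Recompute the C-statistic after removing one window of x1 values at a time
centers <- quantile(x1, seq(0.05, 0.95, 0.05))
half_width <- diff(range(x1)) * 0.05 #arbitrary window half-width
c_window <- sapply(centers, function(center) {
  keep <- abs(x1 - center) >= half_width #drop patients inside the window
  concordance(y[keep] ~ x2[keep])$concordance
})
plot(centers, c_window, xlab = "x1 (window center)",
     ylab = "C-statistic with window removed")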
I think the approach used here for binary Y is exactly what you need. See the graph towards the end where the clinical utility of measuring cholesterol varies strongly with age: 19 Diagnosis – Biostatistics for Biomedical Research
When possible I recommend framing questions on a per-patient basis, not on pairs of patients as the c-index does.
Ahh, thank you for that suggestion! I slightly modified the approach to come up with the following (red lines are estimated 10th and 90th percentiles of predicted probabilities based on quantile regression; blue line is mean prevalence):
It's easy to appreciate that the post-test probability estimated from X2 (the prognostic marker) varies markedly with X1 and is really only clinically helpful when X1 is moderate to high (because that's when predictions start to deviate from the blue line representing the average baseline prevalence). One could of course also do this on the log-odds scale instead of the probability scale.
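For reference, a minimal sketch of this modification, reusing x1, x2, and y from the simulation above (the exact quantreg and spline calls are my assumptions about the implementation):
#Quantile regression of predicted probabilities against x1
library(quantreg)
library(splines)
fit <- glm(y ~ x2, family = binomial)
pred <- predict(fit, type = "response") #predicted probabilities
fit_q <- rq(pred ~ ns(x1, df = 3), tau = c(0.1, 0.9)) #10th/90th percentile curves
grid <- data.frame(x1 = seq(min(x1), max(x1), length.out = 200))
q <- predict(fit_q, newdata = grid)
plot(grid$x1, q[, 2], type = "l", col = "red", ylim = c(0, 1),
     xlab = "x1", ylab = "Predicted probability")
lines(grid$x1, q[, 1], col = "red")
abline(h = mean(y), col = "blue") #average prevalence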
To convey uncertainty, I assume one could show the 95% CIs from the quantile regression (though I understand those don't incorporate the uncertainty in the logistic model's predictions).
To formally "test" whether there is sufficient evidence that the added predictive value of X2 depends on X1, I assume one could report the slope (with 95% CIs / P-values) for either or both quantile regression lines, or perhaps interquartile differences if the quantile regression uses splines.
Wonderful. For an assessment of changes in relative impact (logit scale), formal interaction tests, possibly including nonlinear terms, are called for.
In this case, I think an interaction term within the model itself (X1:X2) wouldn't quite capture the effect, since the coefficient of X2 does not depend on X1; instead, the same coefficient gets multiplied by increasingly large (negative or positive) values of X2 as X1 grows. I think the only way to get at this would be to test the predicted log-odds directly, since the model itself has no notion of the relation between X1 and X2 (namely, that the former increases the variance of the latter).
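To make that concrete, here's a sketch of testing the predicted log-odds directly, reusing the simulated data from above (the quantile-regression slope test is my reading of the suggestion earlier in the thread, not an established procedure):
#How the percentiles of the predicted log-odds fan out with x1
library(quantreg)
fit <- glm(y ~ x2, family = binomial)
lp <- predict(fit, type = "link") #predicted log-odds
fit_q <- rq(lp ~ x1, tau = c(0.1, 0.9))
summary(fit_q, se = "boot") #slopes with bootstrap SEs and P-values for each tau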