Hi Harrell,
Thank you for creating this interesting forum and for your contributions to this community.
I have a question about the C-index that I would like to confirm with you and others in the community.
Is the original C-index not a proper scoring rule? More specifically, is it improper when we use the predicted survival time (e.g., RMST, median, or mean) as the prediction?
The reason I ask is that I have developed an evaluation toolbox in Python (SurvivalEVAL, on GitHub as shi-ang/SurvivalEVAL), in which I recommend using Harrell’s C-index with predicted survival time to evaluate survival models. However, in recent years, when I submit papers and review other papers, I often receive feedback stating that Harrell’s C-index should not be used because it is not a proper scoring rule.
I believe much of this impression stems from the following two papers:
- [1] The c-index is not proper for the evaluation of t-year predicted risks. Biostatistics, 2019
- [2] Survival Regression with Proper Scoring Rules and Monotonic Neural Networks. AISTATS, 2022
As I understand it, [1] states that the C-index is not proper only when using the survival probability at t years ($S(t | x)$) as the risk score. And [2] states that Antolini’s C-index is not proper—not Harrell’s. To my knowledge, the original C-index (which suggests using predicted survival time) is still proper.
Could you please confirm whether my understanding is correct? Or do you know of any results that show the original C-index is not a proper scoring rule?
Thank you very much for your time and insights!
Welcome to datamethods Shi-ang!
No method based on ranks can be a proper scoring rule in my opinion. It doesn’t matter on what scale you state the prediction; that won’t change c except to subtract it from 1.0 depending on the metric chosen. As discussed here I think that it is not a good choice to use concordance probabilities or their equivalent rank correlation coefficients to evaluate regression models, especially when comparing two models. Rank measures are not sensitive enough.
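A minimal sketch of the scale-invariance point, assuming a plain O(n²) implementation of Harrell's C with predicted survival time as the score. The function name `harrell_c` and the simulated data are hypothetical; they are not taken from SurvivalEVAL or any other package.

```python
import numpy as np

def harrell_c(time, event, pred_time):
    """Harrell's C with predicted survival time as the score (longer = better prognosis)."""
    concordant, usable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a pair is usable only if the shorter time is an observed event
        for j in range(n):
            if time[i] < time[j]:
                usable += 1
                if pred_time[i] < pred_time[j]:
                    concordant += 1.0
                elif pred_time[i] == pred_time[j]:
                    concordant += 0.5  # tied predictions count one half
    return concordant / usable

# Simulated data (an assumption for illustration only)
rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
true_time = rng.exponential(scale=np.exp(x))        # larger x -> longer survival
cens_time = rng.exponential(scale=3.0, size=n)      # independent censoring
time = np.minimum(true_time, cens_time)
event = true_time <= cens_time
pred_time = np.exp(x + rng.normal(scale=0.5, size=n))  # noisy predicted survival time

c_raw = harrell_c(time, event, pred_time)
c_scaled = harrell_c(time, event, 10.0 * pred_time)   # predictions 10x too large
c_log = harrell_c(time, event, np.log(pred_time))     # another increasing transform
c_flipped = harrell_c(time, event, -pred_time)        # decreasing transform (a risk-type score)

print(c_raw, c_scaled, c_log, c_flipped)
```

Any increasing transformation of the prediction leaves c unchanged, and a decreasing one (with no tied predictions) returns 1 - c, which is the sense in which the scale of the prediction does not matter.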
With a proper scoring rule, such as the log-likelihood or, with binary Y, the Brier score, the score is optimized when the model is perfectly well fitting. A rank measure such as c can be very high when the predicted survival probabilities are $10\times$ too large.
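A small simulation sketch of this contrast for binary Y, under assumed settings: true event probabilities between 0.01 and 0.10, and a second set of predictions that is exactly 10 times too large. The helper names `brier` and `c_index` are hypothetical, written here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
p_true = rng.uniform(0.01, 0.10, size=n)     # true event probabilities
y = (rng.random(n) < p_true).astype(float)   # binary outcomes drawn from them
p_inflated = 10.0 * p_true                   # predictions 10x too large, still at most 1

def brier(p, y):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return float(np.mean((p - y) ** 2))

def c_index(p, y):
    """Concordance probability: P(prediction for an event > prediction for a non-event)."""
    pos, neg = p[y == 1], p[y == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)

brier_true, brier_inflated = brier(p_true, y), brier(p_inflated, y)
c_true, c_inflated = c_index(p_true, y), c_index(p_inflated, y)

print("Brier:", brier_true, brier_inflated)
print("c:    ", c_true, c_inflated)
```

In this setup the Brier score penalizes the inflated predictions, while c is the same for both sets because the ranking of subjects is unchanged.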
The purpose of c is to provide an easier-to-interpret measure of pure predictive discrimination for a single model. It is used as a descriptive measure and not to make decisions. We need to move more towards pseudo adjusted $R^2$ (see this) and AIC, plus plotting the distribution of predicted outcomes or survival probabilities. When models are well calibrated, wider distributions of predicted risk mean stronger models, as emphasized here.
Hi Harrell,
Thanks very much for getting back to me.
As to your comment:
No method based on ranks can be a proper scoring rule in my opinion.
A rank measure such as c can be very high when the predicted survival probabilities are $10\times$ too large.
I agree that Harrell's C-index is not a strictly proper scoring rule; that is, it does not single out the true data-generating distribution as the unique maximiser. Yet I am unsure why you believe no rank-based metric can ever be (weakly) proper.
My understanding is that a scoring rule is proper if, in expectation, its maximum is achieved by the oracle model, even if other, miscalibrated models may tie. A severely mis-scaled predictor can indeed achieve the same C-index as the oracle, but as long as the oracle also attains that optimum, doesn't the rule still qualify as proper (albeit not strictly)?
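For reference, the definition being used here can be written as follows. This is a sketch in generic notation; the symbols $S$, $P$, $Q$ are assumptions introduced for illustration (with larger scores being better) and are not taken from the two papers.

$$
\begin{aligned}
S(P, Q) &= \mathbb{E}_{Y \sim Q}\,\big[S(P, Y)\big] && \text{(expected score of prediction } P \text{ when the truth is } Q\text{)} \\
S(Q, Q) &\ge S(P, Q) \quad \text{for all } P, Q && \text{(proper)} \\
S(Q, Q) &= S(P, Q) \iff P = Q && \text{(strictly proper, in addition to the line above)}
\end{aligned}
$$

Under this definition, a rule for which miscalibrated predictions can tie with the oracle but never beat it would be proper without being strictly proper.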
The two papers I listed state that (variants of) the C-index are improper, in the sense that the oracle does not achieve the optimal score.
I’m not as clear regarding strong vs. weak, but we have maximum likelihood for a reason. In the absence of a prior it gives the optimum solution. Bayes adds a log prior to the log likelihood. The MLE will not agree with the $\hat{\beta}$ that optimizes c.
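One way to see part of this, as a hedged sketch in a simpler setting than survival analysis: in a one-covariate logistic model (an assumed simulation, with hypothetical helper names `log_lik` and `c_index`), c is identical for every positive slope, so it cannot distinguish the correct slope from a badly wrong one, whereas the log-likelihood can.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
beta_true = 1.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-beta_true * x))).astype(float)

def log_lik(beta):
    """Bernoulli log-likelihood of a no-intercept logistic model with slope beta."""
    eta = beta * x
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))  # logaddexp(0, eta) = log(1 + e^eta)

def c_index(score):
    """Concordance probability between the score and the binary outcome."""
    pos, neg = score[y == 1], score[y == 0]
    return float((pos[:, None] > neg[None, :]).mean()
                 + 0.5 * (pos[:, None] == neg[None, :]).mean())

betas = [0.25, 1.0, 4.0]
results = {b: (log_lik(b), c_index(b * x)) for b in betas}

for b, (ll, c) in results.items():
    print(f"beta={b:<5} log-lik={ll:10.1f}  c={c:.4f}")
```

Among these three candidate slopes the log-likelihood is highest at the data-generating value, while c is the same for all of them; this only illustrates the insensitivity of c to the size of the coefficient and is not a general proof.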
So you are saying that whether a scoring rule is (strictly) proper is not that important. Interesting, I had never thought of it that way.
A further question comes out of this: as being Bayesian is super helpful in optimization, do you think we should go Bayesian in the evaluation as well?
Go Bayes if you have at least one of the following:
- you know of constraints to place on parameters or combinations of them
- you have external quantitative trustworthy information about that
- you want to express predictions as fuzz instead of the somewhat arbitrary point estimates we’re always using
Thanks. This is very interesting, I will definitely think more about it.