Thoughts on Harrell's c-index

I have a series of thoughts about the C-index. This measure estimates the proportion of pairs in which the individual with the worse predicted prognosis (higher risk score) has the shorter survival.
If so, the c-index may not capture an intrinsic property of the model by itself; it reflects both the model's performance and the distribution of failure times in the sample. In other words, the same c-index could be interpreted differently in different situations.
For example, if many individuals fail close to the same point (e.g., the median), even the best possible model might misorder approximately half of the pairs formed within that bulk of subjects. However, this failure is not qualitatively the same as confusing the prognosis of subjects whose failure times are far apart. Thus a mediocre c-index might not always indicate a bad model; it may instead reflect a particular distribution of failure times. For example, among pairs with survival times of 5 to 7 months, the model might be unable to predict which subject will fail sooner, yet it may still correctly predict that survival will be about 6 months, not 3 or 20 months.
Is it possible to take these considerations into account? Does a c-index weighted by the magnitude of the error make sense?
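
To make the pairwise definition concrete, here is a minimal sketch of the counting behind the c-index. It assumes no censoring, treats a higher score as a worse prognosis, skips pairs with tied failure times, and counts tied scores as half-concordant. The function name `c_index` and the toy data are illustrative only, not taken from any library.

```python
from itertools import combinations

def c_index(times, scores):
    """Proportion of usable pairs in which the higher score fails first.

    Assumptions (illustrative): no censoring, higher score = worse prognosis.
    """
    concordant = 0.0
    usable = 0
    for (t_i, s_i), (t_j, s_j) in combinations(zip(times, scores), 2):
        if t_i == t_j:
            continue  # tied failure times cannot be ordered, so skip the pair
        usable += 1
        if s_i == s_j:
            concordant += 0.5  # tied scores count as half-concordant
        elif (s_i > s_j) == (t_i < t_j):
            concordant += 1.0  # higher score failed earlier
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable

# Toy data: only the pair failing at 5 and 6 months is misordered.
times = [3, 5, 6, 7, 20]
scores = [5, 2, 3, 1, 0]
print(c_index(times, scores))
```

Note that the single misordered pair costs the same whether the two failure times are 1 month or 17 months apart, which is the point raised above.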

These are good points. But taking the magnitude of error into account limits the generality of the c-index and makes it parametric instead of nonparametric. In general we need to move to measures that are more sensitive than c and that take some of the magnitude problem into account. See for example this.


I read the document in the link and found it very interesting. However, none of these measures seems to capture the magnitude of the error directly, in qualitative terms. I wanted to ask what you think of this proposal for a weighted c-index (maybe it already exists):

  1. Count all pairs, which would be the denominator.
  2. For the numerator, use the weighted number of classification failures. The weighting factor may be proportional to the magnitude of the error. For example, in a 20-month survival time range, if the failure times of the two members of a pair differ by 1 month, the weighting factor would be 1/20. If one subject fails at baseline and the other at month 20, the weighting factor would be 1.
  3. The c-index would be 1 minus the quotient between the two.

Does something like this already exist?
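
The three steps above can be sketched as follows. This is only one reading of the proposal, under several assumptions that the post does not state: no censoring, a "classification failure" is a pair in which the subject with the higher risk score survives longer, tied scores count as half a failure, and the weight is the absolute difference in failure times divided by a user-supplied `time_range`. The name `weighted_c_index` is hypothetical.

```python
from itertools import combinations

def weighted_c_index(times, scores, time_range):
    """1 - (sum of weighted misclassified pairs) / (number of all pairs).

    Assumptions (illustrative): no censoring, higher score = worse prognosis,
    weight = |t_i - t_j| / time_range.
    """
    pairs = list(combinations(zip(times, scores), 2))
    if not pairs:
        raise ValueError("need at least two subjects")
    weighted_errors = 0.0
    for (t_i, s_i), (t_j, s_j) in pairs:
        if t_i == t_j:
            continue  # identical failure times: weight is zero anyway
        weight = abs(t_i - t_j) / time_range
        if s_i == s_j:
            weighted_errors += 0.5 * weight  # assumption: tie = half a failure
        elif (s_i > s_j) != (t_i < t_j):
            weighted_errors += weight  # higher score survived longer
    return 1.0 - weighted_errors / len(pairs)

# One misordered pair, 1 month apart on a 20-month range: weight 1/20.
print(weighted_c_index([5, 6, 7], [2, 3, 1], time_range=20))
# One misordered pair spanning the whole range: weight 1.
print(weighted_c_index([0, 20], [0, 1], time_range=20))
```

One consequence of this definition is that a model with random scores no longer scores 0.5: each pair is misordered with probability one half, so the expected value is 1 minus half the average weight, which is above 0.5 whenever some failure times are close together.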

Are you sure you are really interested in the C-statistic here? I mean the examples you give like:

That sounds more like a question of whether the predicted survival probabilities of the model are similar to the observed. In that case, wouldn’t measures like the mean squared or absolute error give more or less the answer you are looking for?

For your proposed new index, what exactly would you consider to be a classification failure? All situations where the case does not get a higher probability compared to the non-case?

It was just a humble idea I was thinking about as I left the hospital and walked home. I’ll try to explain it better:
Discrimination, as far as I know, refers to the model’s ability to distinguish between good and bad prognosis patients. However, this refers to the model’s ability to discriminate in a particular population. So it may be that the same model discriminates well in a heterogeneous population, but discriminates poorly in a more homogeneous population.
I have tried to validate prognostic models for some tumors, and I have found that c-indexes may be low (e.g., around 0.620), but calibration curves could still be quite good at several points (e.g., at 6, 12, 18 months, etc).
When the distribution of failure times is evaluated in such cases, it turns out that early and late deaths are infrequent. Most subjects fail at relatively similar time points, in the middle of the survival curve.
Thus, it is quite likely that a model that supposedly discriminates poorly may actually distinguish quite well between subjects with and without early mortality, between late and middle events, and so on. Consequently, most of the errors that reduce the c-index in such examples will tend to be of little importance in pragmatic terms. Because the c-index does not distinguish between the qualitative consequences of different errors, a model that is effective in practical terms can end up being considered a low-performance model.
This, I believe, is different from the mean error and similar measures, since a model with a given amount of error may nevertheless discriminate efficiently between some subjects and others.
For this reason, I suggested using a weighted c-index, which takes into account both the quantitative and the qualitative aspects of discrimination.
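
The dependence on population heterogeneity described above can be illustrated with a small simulation. This is a sketch under invented assumptions, not an analysis of any real data: log survival time equals minus the true risk plus standard normal noise, there is no censoring, and the "model" is given the true risk as its score. The only thing that changes between the two runs is the spread of risk (`risk_sd`) in the population.

```python
import math
import random
from itertools import combinations

def c_index(times, scores):
    """Harrell-type c for uncensored data; higher score = worse prognosis."""
    concordant, usable = 0.0, 0
    for (t_i, s_i), (t_j, s_j) in combinations(zip(times, scores), 2):
        if t_i == t_j:
            continue  # tied times are not usable
        usable += 1
        if s_i == s_j:
            concordant += 0.5
        elif (s_i > s_j) == (t_i < t_j):
            concordant += 1.0
    return concordant / usable

def simulate_c(risk_sd, n=300, seed=0):
    """Simulate n subjects and score them with their true risk.

    Hypothetical data-generating model: log(time) = -risk + N(0, 1) noise.
    """
    rng = random.Random(seed)
    risks = [rng.gauss(0.0, risk_sd) for _ in range(n)]
    times = [math.exp(-r + rng.gauss(0.0, 1.0)) for r in risks]
    return c_index(times, risks)

c_heterogeneous = simulate_c(risk_sd=2.0)  # wide spread of prognosis
c_homogeneous = simulate_c(risk_sd=0.3)    # most subjects look alike
print(f"heterogeneous population: c = {c_heterogeneous:.3f}")
print(f"homogeneous population:   c = {c_homogeneous:.3f}")
```

In both runs the score is the true risk, so the model cannot be improved; the lower c in the homogeneous population comes only from the narrower spread of risk relative to the noise.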

I’m not in favor of it, for the reasons given above. But you can use the same U-statistic idea that c uses and define any kernel you are interested in.
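
As a rough illustration of the U-statistic idea, the sketch below averages a user-supplied kernel over all pairs of subjects. It assumes no censoring and no tied failure times, in which case the concordance kernel reproduces c. The names `u_statistic`, `concordance_kernel` and `make_weighted_kernel` are hypothetical, and the weighted kernel is just one possible encoding of the proposal discussed earlier in this thread.

```python
from itertools import combinations

def u_statistic(observations, kernel):
    """Average of a symmetric two-argument kernel over all pairs."""
    pairs = list(combinations(observations, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)

def concordance_kernel(a, b):
    """1 if the higher score fails first, 0.5 for tied scores, else 0."""
    (t_a, s_a), (t_b, s_b) = a, b
    if s_a == s_b:
        return 0.5
    return 1.0 if (s_a > s_b) == (t_a < t_b) else 0.0

def make_weighted_kernel(time_range):
    """Kernel that penalizes a misordered pair by |t_a - t_b| / time_range."""
    def kernel(a, b):
        (t_a, _), (t_b, _) = a, b
        weight = abs(t_a - t_b) / time_range
        error = 1.0 - concordance_kernel(a, b)
        return 1.0 - weight * error
    return kernel

# Each observation is a (failure time, risk score) tuple; toy data only.
observations = list(zip([3, 5, 6, 7, 20], [5, 2, 3, 1, 0]))
print(u_statistic(observations, concordance_kernel))
print(u_statistic(observations, make_weighted_kernel(time_range=20)))
```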

I would rather concentrate on R^2-like and root-mean-squared-error-like measures.

I’m not sure that “is there a better test statistic/measure to use” is really the core issue here. There may be a fundamental mismatch between the model that’s been built, and the questions you’re asking.

If a model is built (trained) on a heterogeneous population but used on a different, more homogeneous population, performance is likely to be different than expected.

The model seemingly predicts good prognosis vs bad prognosis, but you want it to discriminate based on survival time.