Confidence score for survival model

I am hoping to get advice on how to construct a confidence score for a Cox PH survival model. I used an electronic medical record (EMR) dataset to train a survival model. For each patient, a time-weighted average of all past clinical data is used to predict survival time. Now, for a patient coming into clinic, I would like to display to the doctor two things:

  1. a predicted survival curve
  2. a confidence score for this patient, indicating how likely the true survival distribution is to be close to the predicted one.

I am aware that bootstrapping can be used to generate confidence intervals for predictive models, as in this link:

But for my use case, I suspect that using data from outside the Cox model could improve the prediction of confidence. With EMR data, some patients will have large amounts of missing data, so model performance will vary with how much data is available for each patient. For example, take nlab, the number of laboratory results in the EMR for a patient. Patients with high nlab will tend to have better predictive performance, but if nlab were included in the Cox model, its coefficient might be close to 0, and its value might not affect the width of the bootstrapped confidence intervals for the predicted survival curve.
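For concreteness, the bootstrapped intervals mentioned above can be sketched as follows: resample patients, refit the model, and take pointwise quantiles of the predicted curves. This is a minimal sketch, not my actual pipeline; the number of resamples, the time grid, and the new patient's frail value are made up for illustration.

```r
library(survival)

set.seed(1)
B <- 200
new.pt <- data.frame(frail = 1.2)
times <- seq(10, 500, by = 10)

# Refit the Cox model on bootstrap resamples of patients and collect
# the predicted survival curve for the new patient each time
surv.boot <- replicate(B, {
  idx <- sample(nrow(kidney), replace = TRUE)
  fit <- coxph(Surv(time, status) ~ frail, data = kidney[idx, ])
  summary(survfit(fit, newdata = new.pt), times = times, extend = TRUE)$surv
})

# Pointwise 95% band around the predicted survival curve
ci <- apply(surv.boot, 1, quantile, probs = c(0.025, 0.975))
```

As noted above, the width of these bands reflects only sampling variability in the fitted model, not how much data the EMR holds for this particular patient.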

One solution could be to fit a second model, which predicts the Cox model’s squared deviance residual, or maybe log likelihood, for each observation. When the second model predicts a large residual or low log likelihood, we would say that the Cox model’s confidence score is low. But, this seems hacky so I would love to hear if there is a more theoretically justified solution.

Here is some sample code which illustrates the concept of the second model that predicts the performance of the Cox model.

library(rms)       # cph()
library(survival)  # kidney data

cph.model <- cph(Surv(time, status) ~ frail, data = kidney, x = TRUE, y = TRUE)
kidney$resid_squared <- residuals(cph.model, type = "deviance")^2
conf.model <- lm(resid_squared ~ frail + age, data = kidney)
conf.scores <- predict(conf.model, kidney)

This is quite a challenge, and it's worth pausing over several design issues, including how time zero is defined, how patients are selected for the model-development cohort, what kinds of measurement error are present, and what the causes of missing data are. If data are "informatively missing," that is a major problem to tackle all by itself.

The data will not be able to tell you how reliable a point estimate is; they can only tell you the precision of an estimate (standard error or confidence interval). This is related to the amount of support there is in the data for the type of patient being predicted. But there is a way to quantify the limits of prediction in relation to the amount of missing data for the patient being predicted. If you use multiple imputation to develop the model in the first place, and use multiple imputation to make predictions for the new patient, then the distribution of predicted values over, say, 50 multiple imputations provides a nice representation of the uncertainty that comes from not having complete data on the new patient. A paper that covers multiple imputation for predicting incomplete cases, and compares it with several other methods of prediction on incomplete data, may be found here.
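A minimal sketch of this idea, using the mice package as one of several possible MI implementations (Hmisc::aregImpute would also work). Here artificial missingness is added to age purely for illustration, and the new patient's row is appended before imputation so that their missing value is imputed jointly with the training data:

```r
library(survival)
library(mice)

set.seed(1)
# Illustrative only: make `age` partially missing in the kidney data
kid <- kidney[, c("time", "status", "age", "frail")]
kid$age[sample(nrow(kid), 15)] <- NA

# Append the new patient (age unknown) so mice() imputes it as well
new.pt <- data.frame(time = NA, status = NA, age = NA, frail = 1.2)
all.dat <- rbind(kid, new.pt)
n <- nrow(kid)

imp <- mice(all.dat, m = 50, printFlag = FALSE)

# For each imputation: fit the Cox model on the training rows,
# then predict the linear predictor for the new patient's row
lp <- sapply(seq_len(imp$m), function(i) {
  d <- complete(imp, i)
  fit <- coxph(Surv(time, status) ~ age + frail, data = d[1:n, ])
  predict(fit, newdata = d[n + 1, ], type = "lp")
})

# Spread across imputations = extra uncertainty from missing data;
# a wide interval here corresponds to a low confidence score
quantile(lp, c(0.025, 0.975))
```

The same recipe extends from the linear predictor to full predicted survival curves: predict a curve per imputation and display the envelope of curves to the clinician.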

Note that many machine learning practitioners would just throw the data into a black box, not attempt to explain it, and many physicians would use it without question. But all of the issues remain. The black box approach is delusional when data are messy.


Thanks so much for the feedback and ideas. Your point about informatively missing data is important, and these sorts of datasets almost certainly also have informative censoring and other violations of standard models.