Calibration error metrics from smooth calibration curve

I am externally validating a competing risks prognostic model (all-cause and breast-cancer-specific death at 5 and 10 years). I have made a smooth calibration curve following the guidelines in this article (though I adapted the modelling approach slightly):

Briefly, I used a cause-specific discrete-time generalized additive model (GAM) with the sole predictor being the cloglog-transformed risk score, modelled using a smoothing spline. The resulting model-based "observed" risks are then plotted against the predictions from the model being validated. The GAM includes a time interaction term to relax the proportional hazards (PH) assumption, though this tensor-product term is specified with higher shrinkage.
From the fitted calibration curve, we can compute error metrics such as the mean prediction error, the integrated calibration index (ICI), etc. However, I am unsure how to compute confidence intervals for these metrics.

My hunch is that we need to bootstrap the whole process, i.e. re-fit the model used to create the smooth calibration curve (not the model we are validating) on each bootstrap resample and re-compute each calibration error index of interest, before computing quantile-based confidence intervals. This appears to take into account the uncertainty in fitting the calibration curve itself.

However, re-fitting the calibration curve in such a bootstrap procedure might not be feasible due to computing time, so I wonder if we might be OK using a normal approximation. This would fit the calibration curve only once, and the standard error (from which the normal approximation gives the confidence intervals) would be obtained from the single variable representing the absolute difference between the model-based ("observed") risk and the predicted risk (from the model we are validating). I would greatly appreciate some advice. Thank you.
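To make the bootstrap idea concrete, here is a minimal sketch in R. It assumes a validation data frame `val_data` with a column `pred` of predicted risks, and a hypothetical helper `fit_calibration_curve()` that re-fits the smooth calibration curve on a resample and returns the model-based "observed" risks; both names are placeholders, not from the original post.

    # Bootstrap CIs for the ICI: resample the validation data,
    # re-fit the calibration curve each time, and take percentile intervals.
    set.seed(1)
    B <- 500
    ici_boot <- replicate(B, {
      idx <- sample(nrow(val_data), replace = TRUE)
      d   <- val_data[idx, ]
      obs <- fit_calibration_curve(d)   # placeholder: re-fit smooth curve,
                                        # return model-based "observed" risks
      mean(abs(obs - d$pred))           # ICI on this resample
    })
    quantile(ici_boot, c(0.025, 0.975)) # percentile 95% CI

The key point is that the calibration curve is re-fit inside the loop, so the interval reflects uncertainty in the curve itself, not just in the averaging of the absolute errors.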


I think your hunch is correct and bootstrapping the whole process is the way to go. Could you say more about how the GAM is fit (mgcv?) and what standard error you were thinking of using?

One alternative would be a kind of "simulation-based inference" approach, where you sample coefficient values using the estimated variance-covariance matrix and simulate curves from those values. This is doable with a binary outcome (e.g. here), but I'm not sure about a PH model, as you would need the baseline hazard (although perhaps this can be held fixed and just the coefficients sampled).
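For the binary-outcome case mentioned above, a minimal sketch of this simulation-based approach with an mgcv fit might look as follows. It assumes a logit-link GAM `fit` and a data frame `new_df` holding the predicted risks in a column `pred`; both are hypothetical names for illustration.

    library(mgcv)
    library(MASS)

    # Linear-predictor matrix for the evaluation grid, then draw
    # coefficient vectors from their approximate posterior.
    Xp    <- predict(fit, newdata = new_df, type = "lpmatrix")
    betas <- mvrnorm(1000, coef(fit), vcov(fit))      # 1000 coefficient draws
    curves  <- plogis(Xp %*% t(betas))                # simulated calibration curves
    ici_sim <- colMeans(abs(curves - new_df$pred))    # ICI for each draw
    quantile(ici_sim, c(0.025, 0.975))                # simulation-based 95% CI

This avoids re-fitting the model: each simulated curve is just a matrix multiplication, so it is much cheaper than a full bootstrap.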

If you're willing to change the approach to fitting the calibration curve a little, my package, pmcalibration, will calculate bootstrap CIs for a right-censored survival outcome. According to the paper you referenced, if you censor everyone at the time you're generating predictions for, non-proportional hazards probably isn't a huge issue (the package currently doesn't allow for interactions involving time).


Hi Stephen,

Thank you for your thoughts.

Here is some pseudo-R code for the GAM I had in mind:

library(pammtools)
library(mgcv)
library(tidyverse)

# complementary log-log transform of a risk (probability)
cloglog <- function(p) log(-log(1 - p))

# transform to piecewise exponential data (PED) format
ped_df <-
  as_ped(Surv(time, event) ~ ., data = data) %>%
  as_tibble()

# cause-specific formula when a competing-risks `cause` column is present
form <- if ("cause" %in% names(ped_df)) {
  ped_status ~
    factor(cause) +
    s(tend, bs = "cr") +
    s(cloglog(marker), bs = "cr") +
    s(tend, by = factor(cause), bs = "cs") +
    s(cloglog(marker), by = factor(cause), bs = "cs") +
    ti(tend, cloglog(marker), bs = "cs")
} else {
  ped_status ~
    s(tend, bs = "cr") +
    s(cloglog(marker), bs = "cr") +
    ti(tend, cloglog(marker), bs = "cs")
}

pam_csh <- pamm(
  formula  = form,
  discrete = TRUE,
  engine   = "bam",
  data     = ped_df
)

  

The reason I thought of this approach is that it is motivated by rms::val.surv(method = 'hare', ...), which does not assume proportional hazards.

Refitting the calibration curve just seemed to be the right approach and I’m glad you appear to agree.

My thinking for getting an approximate confidence interval for, e.g., the integrated calibration index (ICI):

Let d be the absolute difference between the estimated risk (the model-based estimate from the model used to construct the calibration curve) and the predicted risk (from the model we are externally validating). Then the mean of d is the ICI. The approximate 95% CI I was thinking of is mean(d) ± 1.96 × SE, where the computation of the SE needs to account for the fact that d is bounded between 0 and 1.
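As a minimal sketch of that normal approximation, assuming vectors `observed` (model-based risks from the fitted calibration curve) and `predicted` (risks from the model being validated), with a plain standard error of the mean:

    d   <- abs(observed - predicted)   # pointwise absolute calibration error
    ici <- mean(d)                     # integrated calibration index
    se  <- sd(d) / sqrt(length(d))     # naive SE, treating the curve as fixed
    c(lower = ici - 1.96 * se, upper = ici + 1.96 * se)

Note that this treats the fitted calibration curve as fixed, so it ignores the uncertainty from estimating the curve itself, which is exactly what the full bootstrap would capture.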


Side note: the optimum smoother seems to be an adaptive ordinal regression model. I hope to add this to val.surv. You’ll see a comparison with hare in the link.


I'm not familiar with these pamm models, but if it's a wrapper for a Poisson GAM it may well be possible to use the simulation-based approach mentioned above. Otherwise, you could bootstrap a simpler model with administrative censoring at the prediction time.

It might be possible to derive a closed-form SE, but one of the above approaches is probably going to be simpler.

Thank you Frank and Stephen for your helpful thoughts.