I’m writing this post in order to ask for your help and wisdom regarding the external validation of an existing coxph model for a different population than it was originally made.
We started with our cohort, of 300 individuals (a dataset of Time/Event data and covariates). In the original article the authors provide the function for the linear predictor which was used to calculate the linear predictor for each individual (thus, creating the vector “LinearPredictor” in our dataset). Secondly, the authors provide with the So(t) as a function depending only on time (t).
For discrimination, we first estimated the regression coefficient on the linear predictor coefficient by fitting a Cox proportional hazards model with only the linear predictor as follows:
cox.mod ← coxph(Surv(time = Time,event = Event) ~ LinearPredictor, data=dataset)
However, the β of the linear predictor is 0.4 instead of a value close to 1. Does that means that there’s something wrong with our data or approach? Next we calculated the C-Statistic (or Harrell’s C statistic, it’s an honour addressing to the person who concieved this himself) and the (R^2 D) to evaluate fit performance. The results were within acceptable limits.
- The most challenging part, calibration. As mentioned in Royston et al..
3.1) There is a reported baseline survival function and obtained directly from the publication . How is it possible in r to create a Kaplan- Meier‒like estimate of the baseline survival function in our data by fitting a Cox model with no covariates other than the linear predictor, with regression coeffect constrained to 1. while also comparing it to the baseline survival function in the same graph in r?
Our attempt in code was:
t ← seq(0:284)
ti ← (t+0.1)/100
Sot ← 1.0003754 - 0.1131641 * (ti)^2 + 0.0964763 * (ti^2 * log(ti))
kmcurve ← survfit(Surv(time = Time,event = Event) ~ offset(1*LinearPredictor), data=dataset)
3.2) We’ve categorized the subjects in risk groups according to quantiles (16th, 50th, 84th) of their respective LP and obtained their KM curves. However, how is it possible to provide in the same graph the mean predicted survival curve for each group? Every attempt using both survival and rms package using survest() has ended in a failure.
3.3) Lastly, the hardest part. val.surv. in order to compare the predicted survival probability with the observed percentage stratified by deciles and calculate the Kaplan-Meier statistics.
We used the survest to calculate the predicted survival:
u ← 60
coxph ← cph(S~ dataset$LinearPredictor, data = dataset, surv=TRUE, time.inc=u, x=TRUE, y=TRUE)
est ← survest(coxph , newdata=dataset, times = 60)$surv
val ← val.surv(fit=coxph, newdata=dataset, S=Surv(time = dataset$Time,event = dataset$Event), u=60)
plot(val2, lim= c(0.2, 1), xlab=“Observed Survival Probability at 5-year”, ylab = “Predicted Survival Probability at 5-year”)
groupkm(est, S=Surv(time = dataset$Time, event = dataset$Event), m=30,
u=60, pl=TRUE, add=TRUE)
we used the m=30 due to the cohort size, being 300, regardless of their risk stratified by deciles.
Have we used the val.surv correctly? How can we overcome the obstacle of 3.1), 3.2) ?
I deeply thank you for your time and for inspiring me and the newer generations of data scientists.