RMS Describing, Resampling, Validating, and Simplifying the Model

Let me clarify. I am comparing 2 treatments (drug vs placebo). There is an interaction with treatment in the model and a binary outcome. I am using PPV to assess the proportion of actual responders at a given threshold. We are trying to find a reasonable cutoff to select patients with baseline score at an acceptable prevalence level. The x-axis will actually be the difference in score between the two treatment groups, the y-axis will be performance.

No, this will not work. There is no way to tell who responded unless you have a multi-period crossover study and so you observe each patient multiple times under each treatment. PPV only applies to trivial cases where there is a single binary predictor. Cutoffs play no role in anything here, and performance doesn’t either.

1 Like

Ok. Considering we observe an interaction with treatment in our model. How would you recommend using that information to select patients in future trials who would be more likely to respond to treatment?

Simply plot the value of the interacting factor on the x-axis and the estimated hazard ratio, risk difference, etc. on the y-axis. Most useful to patients is two curves showing absolute risks by treatment vs. x=interacting factor. See Avoiding One-Number Summaries of Treatment Effects for RCTs with Binary Outcomes | Statistical Thinking .
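A minimal sketch of such a display with `rms`, assuming a logistic model with a treatment × age interaction (all variable, data frame, and knot choices here are illustrative, not from the original post):

```r
# Sketch: absolute risk by treatment across the interacting factor
# (model, variable, and data-frame names are illustrative)
library(rms)
dd <- datadist(mydata); options(datadist = "dd")
f <- lrm(y ~ treatment * rcs(age, 4), data = mydata)

# Two curves of predicted probability vs. the interacting factor,
# one per treatment arm
p <- Predict(f, age, treatment, fun = plogis)
plot(p)
```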

Greetings all. I have a process question. I (clinician researcher) am working with statisticians to create a risk prediction model for a binary outcome in a large registry with missing data. We used MICE to impute the missing data, then the `fit.mult.impute` function in `rms` to run the logistic regression using the imputed datasets. We then did internal validation using bootstrapping of this process. Given a little overfitting, we are interested in doing uniform shrinkage of our regression parameters using the bootstrapped calibration slope value. Question is, we cannot find methods to reasonably perform the shrinkage across the imputed datasets. I welcome any and all thoughts on this approach. Thanks in advance.


You could do something like this: Multiple imputation and bootstrapping for internal validation - #4 by Elias_Eythorsson . Simply iterate over multiple singly imputed datasets and average the results.
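The iterate-and-average idea might be sketched as follows (the single-imputation helper `impute_once()`, the model formula, the number of imputations, and `B` are all illustrative assumptions):

```r
# Sketch: validate within each singly imputed dataset, then average
# the optimism-corrected indexes (impute_once() is hypothetical)
library(rms)
n_imp <- 10
vals <- lapply(seq_len(n_imp), function(i) {
  completed <- impute_once(rawdata, seed = i)  # one stochastic single imputation
  f <- lrm(y ~ x1 + x2, data = completed, x = TRUE, y = TRUE)
  validate(f, B = 300)
})
# e.g. pooled optimism-corrected Dxy and calibration slope
mean(sapply(vals, function(v) v["Dxy",   "index.corrected"]))
mean(sapply(vals, function(v) v["Slope", "index.corrected"]))
```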

1 Like

I have a general question regarding calibration plots that I can’t find addressed in Regression Modeling Strategies (2nd edition), Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating, Cross Validated, or other relevant papers. How should one interpret apparent miscalibration at extremes of the predicted values that are not practically relevant for the intended use of the prediction model?

Let’s say a prediction model has been developed that is based on logistic regression. We supply clinicians with continuous predicted probabilities and perform a decision curve analysis knowing the approximate cut-offs of probability that clinicians will be using to guide their decisions. For argument’s sake, let’s assume that most clinicians will consider giving a treatment if a patient’s predicted probability of experiencing the outcome is larger than 10-20%. The calibration plot for this particular model is optimally calibrated between predicted risks of 0% to 40%, but at and above 50% predicted risk the relationship between predicted risk and the LOWESS smooth of actual risk is almost horizontal. Should we consider this model miscalibrated in light of its intended use?

There is a logical flaw in that setup: you require unanimity of clinicians’ thresholds and you won’t find it.

But to your question you can ignore extreme tails where there are almost no data, with the exception that you’d need to mention that if future predictions ever appear in such tails such predictions are of unknown accuracy.

1 Like

I am trying to get calibration curves for my time-to-event model in an external, “geographically distinct” validation cohort, but I find it difficult to test whether I have done the analysis the right way. Although changing the newdata to the derivation set and also defining the survival object from this set gives a near-straight diagonal line, it appears the probabilities are not 100% concordant (small deviations), and there is an “overestimation dip” at high predicted/observed probabilities. Thus I am not really sure the code is working as intended.

Below I try to evaluate calibration at 6 years in the external data. Is the code correct?

fit <- cph(Surv(time,event)~x1+x2, deriv_df, x=TRUE, y=TRUE) # pre-specified model fitted on the derivation data

estim_surv <- survest(fit,
                       newdata=test_df, # geographically distinct cohort but same care setting
                       times=6)$surv # years

ext_val <- val.surv(fit,
             S = Surv(test_df$time, test_df$event),
             est.surv = estim_surv,  # use the 6-year survival estimates computed above
             u = 6)                  # evaluate calibration at the 6-year horizon

plot(ext_val, xlab="Predicted probability at 6 years", ylab="Observed probability at 6 years", las=1)

Also, does one have to specify u, and e.g. show calibration plots for multiple time points, or is there a way to illustrate calibration across all time points using rms? Both cohorts have ~300 patients, of which ~150 have a recorded event within the follow-up period. Many thanks in advance.

1 Like

You appear to be doing this correctly. I take it that your concern comes from an imperfect calibration plot when evaluated on the training sample. This can definitely happen. The overall slope on the log-log S(t) scale will have to be 1.0 in that case but you can see deviations somewhere along the whole curve.

You can’t automatically get validations at multiple time horizons. You’ll have to run those in a loop and save the results and customize a multi-curve plot. Adding add=TRUE, riskdist=FALSE to all but the first call to plot() will help.
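The loop might look like this (time horizons, model formula, and `B` are illustrative; `calibrate` for `cph` expects `time.inc` in the fit to match `u`, so the model is refitted per horizon):

```r
# Sketch: overlaid calibration curves at several time horizons
# (variable/data names and horizons are illustrative)
library(rms)
times <- c(2, 4, 6)   # horizons in years
for (i in seq_along(times)) {
  f <- cph(Surv(time, event) ~ x1 + x2, data = deriv_df,
           x = TRUE, y = TRUE, surv = TRUE, time.inc = times[i])
  cal <- calibrate(f, u = times[i], B = 300)
  if (i == 1) plot(cal)
  else        plot(cal, add = TRUE, riskdist = FALSE)
}
```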


Many thanks! Indeed that was my concern, but good to know it does not have to be 100% calibrated when fitting on the same data used to train.

Does the plot() of a val.survh object fit a loess line on the stored values in p vs. actual and add a histogram of p (the riskdist 1-dimensional scatter plot)? I am wondering because I would like to extract the probability values and plot using ggplot since it is more flexible than base.

I have a question on calibration using rms.

Is it OK to use calibrate() to assess how well a model that was not fitted with rms (support-vector machine) is calibrated over the range of predictions?

For the goodness-of-fit evaluation I set the coefficients of the intercept and y_hat to 0 and 1, respectively, but I am not sure that this is a correct solution. I have tried to define an explicit formula for this purpose but could not figure out how to do it.


calibrate and validate only work with rms fit objects because they have to re-run the same fitting procedure hundreds of times.

I am currently working on internally validating new prognostic models for event-free survival using optimism-corrected bootstrapping with 500 bootstrap samples. To do this, I used the validate function from the rms package in R on a univariable Cox proportional hazards model using the prognostic index (risk score). I receive the following output for one of my models at 5 years:

      index.orig training   test optimism index.corrected n
Dxy       0.6986   0.6971 0.6986  -0.0015          0.7001
R2        0.2549   0.2546 0.2549  -0.0003          0.2552
Slope     1.0000   1.0000 1.0039  -0.0039          1.0039
D         0.1112   0.1108 0.1112  -0.0004          0.1116
U        -0.0003  -0.0003 0.0002  -0.0005          0.0002
Q         0.1116   0.1112 0.1111   0.0001          0.1115
g         1.4889   1.4872 1.4889  -0.0017          1.4906

From the output, the optimism-corrected calibration slope is 1.0039, which differs from the mean coefficient from the univariable model of 0.0422. The rms package documentation says of calibration slopes: ‘The “corrected” slope can be thought of as a shrinkage factor that takes into account overfitting.’ Could you please advise on how the calibration slope is calculated by the validate function in R?

Welcome to datamethods Anita. I don’t think we’ve had this topic come up before. The way that overfitting happens is that when we look at extremes of predictions (X\hat{\beta}) after sorting predictions from low to high, we find that very high predictions are too high and low predictions are too low. The sorting from low to high is very reliable when you have a single predictor, and not so reliable when you have lots of predictors, so that you have to estimate many regression coefficients that go into the sorting. When you have one predictor the sorting has to be perfect, except in the rare case of getting the wrong sign on the regression coefficient. Any index such as D_{xy} that depends only on the rank order of X\hat{\beta} will show no overfitting. Only absolute indexes such as the slope and intercept can show any telltale sign of overfitting, and even then it will be very slight.

All of that assumes X is fitted with a single coefficient, which means you are assuming it has a linear effect. If you don’t want to assume that, and you fit a polynomial, spline, or fractional polynomial in X, you’ll find a bit of overfitting if your number of events is not very large.

My take-home message is that with a single candidate predictor (i.e., you did not entertain any other predictors) it’s best to relax the linearity assumption unless you know from the literature that the effect is linear on the linear predictor scale, fit that model, and check the proportional hazards assumption (or whichever assumption applies to the survival model you are using), but not do a formal validation.
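A minimal sketch of that advice with `rms` (variable and data-frame names are illustrative):

```r
# Sketch: single candidate predictor fitted flexibly, with a PH check
# (variable/data names are illustrative)
library(rms)
library(survival)

f <- cph(Surv(time, event) ~ rcs(risk_score, 4), data = d,
         x = TRUE, y = TRUE)
anova(f)      # includes a test of nonlinearity for the spline terms
cox.zph(f)    # proportional hazards check from the survival package
```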

I’m doing clinical predictive studies with the help of logistic regression and Cox regression using Harrell’s rms and Hmisc packages. More precisely, I use lrm and cph, but also glm(binomial) and coxph from R and survival.

I have some, probably elementary, questions to help me proceed. I use aregImpute for imputation and carry on with fit.mult.impute to get the averaged model, and further use validate and plot(calibrate) to estimate the “optimism in the performance measures” and the “calibration” of the averaged model. So, what is actually bootstrapped (with validate and plot(calibrate)) in this situation: the original data with missing values? each and every imputed dataset? or something else? I assume that the “optimism in performance measures” and the “calibration” of the averaged model are reflected in the results from validate and plot(calibrate) as an internal validation of the results from the original data with missing values.

Is there an “easy and meaningful” way to add confidence intervals (or bands) to the plot(calibrate) graph for lrm?

How can predicted (or fitted) individual survival be calculated (including the baseline survival and the effects of the coefficients for the individual subject) in Cox regression in this situation (imputation and averaging the imputed-data results into one model)?
For cph?
For coxph? I realize that I must correct the standard errors with the covariance matrix from fit.mult.impute for coxph and glm(binomial), but apart from that, how do I get fitted (predicted) individual survival? After the correction of the standard errors, do I no longer have a model to use with the predict function?

With thanks in advance for any help

Fredrik Lundgren

PS. I hope I have complied correctly with the suggestion from f2harrell. DS

1 Like

(I edited all the quote marks to use proper backquotes for function and package names).

Your last question is the easier one. fit.mult.impute makes it so that any estimate or difference in estimates (contrasts) will automatically get the right standard errors, accounting for multiple imputation. To my knowledge no one has taken this further to propagate imputation uncertainties to survival curve estimates. This is one reason I like full likelihood methods, which make this much easier. If you don’t have too many distinct failure times you could use the Bayesian ordinal regression capabilities of the rmsb package’s blrm function, which allows for left, right, and interval censoring. Then use posterior stacking with multiple imputation, which is more accurate than using Rubin’s rule in the frequentist domain. But the full semiparametric likelihood method doesn’t scale to more than a few hundred distinct failure times.

In terms of validation in the context of missing values see DOI: 10.1002/sim.8978 and these notes from others:

From Ewout Steyerberg

There have been proposals by Royston in the direction of stacking:

  • here
  • Recently in Stat Med more on doing analyses over all sets here

Daniele Giardiello d.giardiello@nki.nl 2017-03-28

I am Daniele a PhD student currently working with professor Ewout Steyerberg.
Here a list of the papers that can be useful for you:

  • Wahl (2016) - Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation
  • Wood (2015) – The estimation and use of predictions for the assessment of model performance using large samples with multiply imputed data

In the simplest way, you could calculate the apparent c-index for each imputed dataset and average them to get the pooled apparent c-index.
In case of large samples the correction for optimism could be avoided.
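That simplest approach might be sketched as follows (the list of completed datasets and the model formula are illustrative assumptions):

```r
# Sketch: apparent c-index in each imputed dataset, then averaged
# (completed_datasets is a hypothetical list of imputed data frames)
library(rms)
cs <- sapply(completed_datasets, function(d) {
  f <- lrm(y ~ x1 + x2, data = d)
  f$stats["C"]   # apparent c-index for this imputation
})
mean(cs)         # pooled apparent c-index
```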

1 Like

Thank you for your kind answer.

Evidently, things are more complicated than I thought, and I cannot use lrm or cph after fit.mult.impute as I had assumed. However, if I use lrm or cph on every imputed dataset, then validate (with B = how many?), and average the resulting performance indexes and/or the performance optimism from each imputed dataset into a single value (e.g. Dxy or slope), could that count as a reasonable internal validation of the whole model?

Most of my work can be handled with logistic regression, but in some situations I also need Cox regression. In coxph you can do predict(cox.obj, type = 'survival') to get the individual survival probability. Is there a way to do the same thing with cph?
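For what it’s worth, `survest` (used earlier in this thread) appears to be the `cph` analogue; a sketch, with illustrative variable and data-frame names:

```r
# Sketch: individual predicted survival from a cph fit via survest()
# (variable/data names and the time point are illustrative)
library(rms)
f <- cph(Surv(time, event) ~ x1 + x2, data = d,
         x = TRUE, y = TRUE, surv = TRUE)
survest(f, newdata = new_df, times = 5)$surv  # per-subject P(survival past t = 5)
```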

How large is “large”, i.e. when could the correction for optimism be avoided?
The two projects with predictive ambitions look like this; the others are mainly hypothesis testing or estimation.

                                    Project 1  Project 2
n                                        3155       4989
Deaths                                    631        116
Deaths/15                                  42          8
Continuous variables                        2          2
Binary variables (or 3 categories)          6          7

Large enough?

Again, thanks for your kind answer and I hope I got the backquotes correct this time.

Fredrik Lundgren

Yes, the long way ’round that you described should work. You may just have overfitting problems with the second dataset.
