RMS Describing, Resampling, Validating, and Simplifying the Model

Regression Modeling Strategies: Describing, Resampling, Validating, and Simplifying the Model

This is the fifth of several connected topics organized around chapters in Regression Modeling Strategies. The purposes of these topics are to introduce key concepts in the chapter and to provide a place for questions, answers, and discussion around the chapter’s topics.

Overview | Course Notes

This is a key chapter for really understanding RMS.

Section 5.1.1, p. 103

  • “Regression models are excellent tools for estimating and inferring associations between X and Y given that the “right” variables are in the model.” [emph. dgl]
  • “[Credibility depends on] … design, completeness of the set of variables that are thought to measure confounding and are used for adjustment, … lack of important measurement error, and … the goodness of fit of the model.”
  • …“These problems [interpreting results of any complexity] can be solved by relying on predicted values and differences between predicted values.”

Other important concepts for the chapter:

  • Effects are manifest in differences in predicted values
  • The change in Y when Xj is changed by an amount that is subject matter relevant is the sensible way to consider and represent effects
  • Error measures and prediction accuracy scores include (i) Central tendency of prediction errors, (ii) Discrimination measures, (iii) Calibration measures.
  • Calibration-in-the-large compares the average Y-hat with the average Y (the right result on average). Calibration-in-the-small (high-resolution calibration) assesses the absolute forecast accuracy of the predictions at individual levels of Y-hat. Calibration-in-the-tiny is patient-specific.
  • Predictive discrimination is the ability to rank order subjects; but calibration should be a prior consideration.
  • “Want to reward a model for giving more extreme predictions, if the predictions are correct”
  • The Brier score is good for comparing two models, but its absolute value is hard to interpret. The c-index is OK for evaluating one model, but not for comparing models (it is not sensitive enough).
  • “Run simulation study with no real associations and show that associations are easy to find.” The Bootstrapping Ranks of Predictors example (rms.pdf notes, section 5.4 and figure 5.3) is an elegant demonstration of “uncertainty.” This improves intuition and reveals our biases toward “over-certainty.”
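The “no real associations” simulation in the last bullet can be sketched in a few lines. This is only an illustrative sketch, not the example from the course notes: the sample size, the number of predictors, the seed, and the function names `pearson` and `max_noise_correlation` are all assumptions made for the illustration. Every predictor is pure noise, yet the strongest observed correlation with Y is usually far from zero.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def max_noise_correlation(n_subjects=50, n_predictors=100, seed=1):
    """Largest |r| between Y and any of n_predictors pure-noise predictors.

    All settings are hypothetical; Y and every X are independent
    standard normal draws, so every true correlation is exactly zero.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_subjects)]
    strongest = 0.0
    for _ in range(n_predictors):
        x = [rng.gauss(0, 1) for _ in range(n_subjects)]
        strongest = max(strongest, abs(pearson(x, y)))
    return strongest


# The "winner" among 100 noise predictors looks like a real association.
print(round(max_noise_correlation(), 2))
```

Rerunning with different seeds changes which noise predictor “wins,” which is the same instability that bootstrapping the ranks of predictors makes visible.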

—Drew Levy


Q&A From May 2021 Course

  1. Is the scaled Brier score an alternative to the Brier score that is easier to interpret? Extract from the Steyerberg (2010) manuscript (DOI: 10.1097/EDE.0b013e3181c30fb2):

The Brier score is a quadratic scoring rule, where the squared differences between actual binary outcomes Y and predictions p are calculated: (Y − p)^2. We can also write this in a way similar to the logarithmic score: Y × (1 − p)^2 + (1 − Y) × p^2. The Brier score for a model can range from 0 for a perfect model to 0.25 for a noninformative model with a 50% incidence of the outcome. When the outcome incidence is lower, the maximum score for a noninformative model is lower, e.g., for 10%: 0.1 × (1 − 0.1)^2 + (1 − 0.1) × 0.1^2 = 0.090. Similar to Nagelkerke’s approach to the LR statistic, we could scale Brier by its maximum score under a noninformative model: Brier scaled = 1 − Brier / Brier max, where Brier max = mean(p) × (1 − mean(p)), to let it range between 0% and 100%. This scaled Brier score happens to be very similar to Pearson’s R^2 statistic.
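The arithmetic in the extract can be checked with a short sketch. The function names and the toy data are assumptions for illustration; the formulas follow the extract, with Brier max computed from the mean prediction as written there.

```python
def brier_score(y, p):
    """Mean squared difference between binary outcomes y and predictions p."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


def brier_max(mean_p):
    """Brier score of a noninformative model predicting mean_p for everyone."""
    return mean_p * (1 - mean_p)


def brier_scaled(y, p):
    """Scaled Brier score: 1 - Brier / Brier max, with Brier max from mean(p)."""
    mean_p = sum(p) / len(p)
    return 1 - brier_score(y, p) / brier_max(mean_p)


# Toy data (hypothetical): 10% incidence, everyone predicted at 0.1.
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
p_noninformative = [0.1] * 10

print(brier_max(0.5))                         # noninformative model, 50% incidence
print(brier_max(0.1))                         # noninformative model, 10% incidence
print(brier_score(y, p_noninformative))       # matches brier_max(0.1) up to rounding
print(brier_scaled(y, p_noninformative))      # approximately 0: no information
```

A model that predicts every outcome perfectly has a Brier score of 0 and therefore a scaled Brier score of 1.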

  1. Do you have a good source for interpreting R2 type measures in nonlinear context? Say for dose-response? Papers by Maddala et al in econometrics.

  2. Can we use the validate() function in rms to get optimism corrected c-index? Yes; back-compute from Dxy.

  3. I wonder why RMSE, which is often used in ML, hasn’t been mentioned as a measure of model fit. It’s fine for continuous Y, and it is related to R^2.

  4. Should we add model-based description to the other 3 regression strategies? Yes

  5. Does anyone have a reference for Steyerberg’s paper studying p > 0.5 not being quite so damaging for variable elimination from a model? I think the RMS book has the reference.

  6. For the losers vs winners example, it seems like you are comparing a model with 10 df for the winners with a model with 990 df for the losers. Would R^2 be a fair comparison between these two models? Yes, if the 990 df model is penalized or the R^2 is overfitting-corrected.

  7. Do you know of any literature that highlights the issues with the Charlson Comorbidity index? We use it a lot and I’d like to have a reference to bring to my supervisor if possible! (Unsure where this question should be placed, but we brought it up in the validation section :slight_smile:) See the Harrell and Steyerberg letter to the editor in J Clin Epi.

  8. :question: Would you recommend bootstrapping ranks of predictors in practice instead of FDR corrected univariate screening?

  9. :question: We mentioned today that c-index performs well in settings with class imbalance, but I found a paper that tries to argue that AUROC is suboptimal in the case of class imbalance (“The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets,” Saito & Rehmsmeier 2015). Am I missing something basic here where c-index is different from AUC?
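Two of the questions above turn on how the c-index is defined. For a binary outcome, the c-index is the proportion of event/non-event pairs in which the event has the higher predicted value (ties counted as one half), which is the same quantity as the AUROC. Somers’ Dxy is a rescaling of it, Dxy = 2 × (c − 0.5), so a bias-corrected Dxy can be converted back with c = Dxy / 2 + 0.5. The sketch below illustrates both facts; the function names and the toy data are assumptions for illustration, not output from `validate()`.

```python
def c_index(y, p):
    """Concordance probability for binary y: share of (event, non-event)
    pairs where the event has the higher prediction; ties count 1/2."""
    concordant = ties = pairs = 0
    for yi, pi in zip(y, p):
        if yi != 1:
            continue
        for yj, pj in zip(y, p):
            if yj != 0:
                continue
            pairs += 1
            if pi > pj:
                concordant += 1
            elif pi == pj:
                ties += 1
    return (concordant + 0.5 * ties) / pairs


def c_to_dxy(c):
    """Somers' Dxy from the c-index."""
    return 2 * (c - 0.5)


def dxy_to_c(dxy):
    """c-index back-computed from Somers' Dxy."""
    return dxy / 2 + 0.5


# Toy data (hypothetical): 3 events, 3 non-events.
y = [0, 0, 1, 1, 0, 1]
p = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

c = c_index(y, p)
print(c, c_to_dxy(c))

# Back-computing c from a hypothetical bias-corrected Dxy of 0.60.
print(dxy_to_c(0.60))
```

Because the conversion is linear, subtracting the optimism from Dxy and then converting gives the same c-index as converting first and subtracting half the optimism.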

If I were using repeated Cross Validation to estimate a performance metric (e.g. PPV) that would be used to determine a cutoff of the model score for selecting patients more likely to benefit from treatment in a future trial, what should I use as my confidence intervals for the performance metric? To clarify, I envision a figure with model score (trained from model using continuous versions of biomarkers of course) on the x-axis and performance on the y-axis. I see 2 options:

  1. Empirical distribution of performance metric estimated from each of the repeats of the Cross Validation (this confidence interval would reflect uncertainty in the validation, but not necessarily in the actual performance metric calculated in any given prediction)
  2. Use some version of binomial approximation to estimate confidence interval of performance metric for the median cross validation result.

Which confidence interval do you think is preferred (if any)?

There are several big issues here. First, the benefit on an absolute risk or life expectancy scale, for almost any treatment, will expand as the base risk increases. This is a smooth mathematical phenomenon and doesn’t necessarily require any data to deduce. So, depending on what you mean by benefit, just choose higher-risk patients. Second, there is no threshold, no matter how you try to define it. Treatment benefit will be a smooth function of patient characteristics with no sudden discontinuity. Performance metrics don’t relate to any of this. Differential treatment benefit is somewhat distinct from resampling validation.

Let me clarify. I am comparing 2 treatments (drug vs placebo). There is an interaction with treatment in the model and a binary outcome. I am using PPV to assess the proportion of actual responders at a given threshold. We are trying to find a reasonable cutoff to select patients with baseline score at an acceptable prevalence level. The x-axis will actually be the difference in score between the two treatment groups, the y-axis will be performance.

No, this will not work. There is no way to tell who responded unless you have a multi-period crossover study and so you observe each patient multiple times under each treatment. PPV only applies to trivial cases where there is a single binary predictor. Cutoffs play no role in anything here, and performance doesn’t either.

Ok. Considering we observe an interaction with treatment in our model. How would you recommend using that information to select patients in future trials who would be more likely to respond to treatment?

Simply plot the value of the interacting factor on the x-axis and the estimated hazard ratio, risk difference, etc. on the y-axis. Most useful to patients are two curves showing absolute risk for each treatment against x = the interacting factor. See Avoiding One-Number Summaries of Treatment Effects for RCTs with Binary Outcomes | Statistical Thinking.
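As a rough sketch of what goes into such a plot, the code below tabulates absolute risk under each treatment, and the risk difference, across values of the interacting factor. It assumes a binary logistic model with a treatment × x interaction; the coefficients in `COEF` and the function names are entirely hypothetical and stand in for estimates from a fitted model.

```python
import math

# Hypothetical coefficients on the log-odds scale (not from any real fit).
COEF = {"intercept": -2.0, "x": 0.8, "treat": -0.2, "treat_x": -0.3}


def expit(z):
    """Inverse logit: converts log odds to a probability."""
    return 1 / (1 + math.exp(-z))


def risk(x, treat):
    """Predicted absolute risk at interacting-factor value x; treat is 0 or 1."""
    lp = (COEF["intercept"] + COEF["x"] * x
          + COEF["treat"] * treat + COEF["treat_x"] * treat * x)
    return expit(lp)


def risk_table(xs):
    """Rows of (x, risk on control, risk on treatment, risk difference)."""
    rows = []
    for x in xs:
        r0, r1 = risk(x, 0), risk(x, 1)
        rows.append((x, r0, r1, r1 - r0))
    return rows


print("   x  control  treated  difference")
for x, r0, r1, diff in risk_table([0, 1, 2, 3, 4]):
    print(f"{x:>4}  {r0:7.3f}  {r1:7.3f}  {diff:+10.3f}")
```

Plotting the control and treated columns against x gives the two curves; the difference column changes smoothly with x, with no point at which a cutoff suggests itself.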

Greetings all. I have a process question. I (clinician researcher) am working with statisticians to create a risk prediction model for a binary outcome in a large registry with missing data. We used MICE to impute the missing data, then the ‘fit.mult.impute’ function in ‘rms’ to run the logistic regression using the imputed datasets. We then did internal validation using bootstrapping of this process. Given a little overfitting, we are interested in doing uniform shrinkage of our regression parameters using the bootstrapped calibration slope value. Question is, we cannot find methods to reasonably perform the shrinkage across the imputed datasets. I welcome any and all thoughts on this approach. Thanks in advance.