RMS Discussions

Thanks, that's exactly what I was looking for.

Section 4.1 of the book advocates fitting a saturated model for detecting potentially flat relationships. Just to confirm my understanding: if we see a certain trend of flatness (small chi-square, large p-values) in the anova plot, is the "best practice" for fitting the final model, when we do not desire data reduction or when the data-reduction methods (varclus, PCA) do not yield better results, to use penalization (pentrace)? And if we believe the main effect of x1 is strong but x2 is flat, would penalizing all the simple terms be a reasonable approach?

Penalization should be based on prior knowledge and the effective sample size and not on observed relationships. The strategy to which you refer is for the case where neither penalization nor data reduction are being used. This strategy just gives fewer parameters to less promising predictors, while never removing any predictors entirely. It helps to avoid overfitting weak predictors.
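A rough illustration of that degree-of-freedom spending idea (hypothetical data and variable names; x1 is believed strong a priori, x2 weak):

```r
library(rms)
# Give the a-priori stronger predictor x1 more knots (more parameters) and the
# less promising x2 fewer, without dropping x2 from the model
f <- lrm(y ~ rcs(x1, 5) + rcs(x2, 3) + x3, data = d)
```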

Thanks for the explanation. If I have a prior belief that the effect of one categorical predictor is weak, what adjustment could I make (aside from collapsing rare categories)? Is it possible to put a penalty on only certain variables in the rms package, instead of penalizing in the simple/nonlinear fashion?

You can specify a penalty matrix to the lrm and ols functions in rms and can set rows and columns to zero to indicate that their parameters are not being penalized. Look in the help file for lrm.
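A minimal sketch of what that could look like (hypothetical data; an identity penalty matrix is used purely for illustration, whereas lrm's default penalty matrix scales the terms, and the scalar penalty argument is assumed here to scale the supplied matrix — check the lrm help file):

```r
library(rms)
# Fit once with x=TRUE to obtain the column names of the non-intercept design matrix
f  <- lrm(y ~ rcs(x1, 4) + x2, data = d, x = TRUE, y = TRUE)
pm <- diag(ncol(f$x))                        # simple illustrative penalty matrix
dimnames(pm) <- list(colnames(f$x), colnames(f$x))
keep <- grep("x1", colnames(f$x))            # parameters belonging to x1
pm[keep, ] <- 0                              # zero rows/columns -> x1 unpenalized
pm[, keep] <- 0
# Refit, penalizing only the x2 parameters
g <- lrm(y ~ rcs(x1, 4) + x2, data = d, penalty = 2, penalty.matrix = pm)
```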

Hi Dr. Harrell. Can lrm() accommodate data from surveys with a complex sampling design, for example the National Health and Nutrition Examination Survey (NHANES)?

lrm will handle sampling weights but no other aspect of complex sample designs. The R survey package can handle all aspects, I think. Note that the use of weights is controversial, as it is used only for obtaining unconditional population estimates, and this unconditioning lowers the effective sample size, thus increasing variances.
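A minimal sketch of the two options (hypothetical variable names; psu, stratum, and wt stand in for the design variables in your survey file):

```r
library(rms)
library(survey)
# Weights only, via lrm (clusters and strata are ignored)
f <- lrm(y ~ rcs(age, 4) + sex, data = d, weights = wt)
# Full complex design, via the survey package
des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt, nest = TRUE, data = d)
g   <- svyglm(y ~ age + sex, design = des, family = quasibinomial())
```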


Thank you very much for your reply!

For this site just “like” a post and don’t create a separate “thanks” post (although I appreciate the thanks). I’ll remove these last 2 posts later today.

Hi Prof. Harrell,
I am trying to catch up with the course using the recorded videos. Fantastic course by the way.
Predictive mean matching (PMM) is the method used and proposed for multiple imputation. What are the advantages or disadvantages compared with other imputation methods such as random forests (available in MICE)? Is there any reason to choose one over another in certain settings?

In my R Hmisc package the aregImpute function is a good compromise for fitting imputation models because it assumes additivity but not linearity in the effects of the variables you are using to impute. Random forests often overfit, which would overimpute, but if you use RF for predictive mean matching and not for direct regression imputation, the overfitting may be less impactful. The main disadvantages of RF would be lack of interpretability, longer execution time, and too many tuning parameters to have to deal with.
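A minimal sketch of that workflow (hypothetical data and variable names): flexible additive imputation with predictive mean matching via aregImpute, followed by pooled model fits over the completed data sets with fit.mult.impute.

```r
library(rms)   # also loads Hmisc, which provides aregImpute and fit.mult.impute
set.seed(1)
# Flexible additive imputation models using predictive mean matching
a <- aregImpute(~ y + x1 + x2 + x3, data = d, n.impute = 20, nk = 4)
# Fit the substantive model on each completed data set and pool the results
f <- fit.mult.impute(y ~ rcs(x1, 4) + x2 + x3, lrm, a, data = d)
```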

Thank you for the response. With aregImpute I was having problems with one of the outcome variables that had a low number of events, but I finally removed it.

I have two questions:
  • Despite the suggestion to delete subjects with missing outcomes after MI (von Hippel 2007), I came across this paper: https://academic.oup.com/aje/article/182/6/528/82447. What are your thoughts on using imputed missing outcomes?

  • What is the best strategy for rare outcomes/a low number of events when your sample size is limited? Does a Bayesian approach work better? (I assume it does.)

I was not aware of the Sullivan paper. Thanks for pointing this out. Looks like something to worry about.

On your question about rare outcomes I assume that is outside the context of missing data. Bayes will help just a bit through some shrinkage to avoid infinite regression coefficients. But it won’t work any magic.
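As a rough illustration of that shrinkage (not from the post; sketched here with the brms package and hypothetical data), a weakly informative prior on the log odds ratios keeps coefficients finite even when events are very rare:

```r
library(brms)
# Normal(0, 1.5) priors on the log odds ratios pull estimates away from +/- infinity
fit <- brm(y ~ x1 + x2, data = d, family = bernoulli(),
           prior = set_prior("normal(0, 1.5)", class = "b"))
```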

Dear all,
I would like to perform a full external validation of a published prediction model that is based on a Cox regression. I went through much of the RMS course videos and read parts of Prof. Harrell's and Prof. Steyerberg's books on prediction models. I also got some advice on stackexchange (https://stats.stackexchange.com/questions/500075/external-validation-of-a-published-cox-ph-model) but am still unsure how to proceed effectively. I wonder whether I can use val.surv from the rms package to assess model discrimination and calibration. Is it even possible to apply the coefficients from a Cox regression to new data and calculate absolute risks, given that the baseline hazard is not estimated (at least not provided by the authors of the original model)? Does anyone know of a step-by-step guide on this?
Any guidance would be highly appreciated!

The most important thing to validate is absolute predictive accuracy, and you can’t compute the needed predicted probabilities without getting the baseline survival function S_{0}(t) from the original authors, or at least the function at a few selected time points. And you also need to obtain the covariate centering constants that were used in computation of S_{0}(t).

Without that, the best you can hope for is to find that the published model has poor predictive discrimination (e.g., c or D_{xy}), so that calibration validation is moot.
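If you do obtain those quantities from the authors, a sketch of the validation (hypothetical object names: b is the vector of published log hazard ratios, xbar the centering constants, S0_5 the baseline survival at 5 years, and newdata your validation data with all predictors coded as in the original model):

```r
library(rms)   # loads Hmisc and survival
# Centered linear predictor and predicted 5-year survival for the new data
lp <- drop(as.matrix(newdata[names(b)]) %*% b) - sum(b * xbar)
p5 <- S0_5 ^ exp(lp)
# Discrimination: Somers' Dxy / c-index against observed survival
rcorr.cens(p5, Surv(newdata$time, newdata$event))
# Calibration at 5 years using the externally computed probabilities
v <- val.surv(est.surv = p5, S = Surv(newdata$time, newdata$event),
              u = 5, fun = function(p) log(-log(p)))
plot(v)
```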

Thanks a lot for this response, that is really helpful.

The authors reported the "Average 5-year probability of remaining disease free: 0.974 (Kaplan-Meier estimate at the mean values of the risk factors)". Is this the same as the baseline hazard? And can I derive the baseline survival function from this value?

In addition, they provide a table like this:

Points of the score    Outcome risk at 5 years (%)
0                      0.09
2                      0.15
4                      0.25
6                      0.43
25                     47.36

Can I use this information to assess the predictive accuracy?

Kaplan-Meier can’t estimate at the mean covariate values, so I assume they mean "adjusted Kaplan-Meier estimates", AKA Cox model estimates stratifying on treatment. That should help. Do you want to validate their continuous prediction model or their categorized version of it (the point score)? If the latter, then the table may give you what you need to validate.
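For the point-score version, a sketch of what the comparison could look like (hypothetical variable names; score_group is the published point-score category in your validation data):

```r
library(survival)
# Observed 5-year disease-free probability within each published score group
km <- survfit(Surv(time, event) ~ score_group, data = newdata)
summary(km, times = 5)   # 1 - survival gives the observed 5-year risk per group,
                         # to be compared with the published risks in the table
```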

Thanks for the response. Actually, any method would be great. I guess using their continuous prediction model would allow me to make use of the tools provided by your rms package.

For the approach with the table I could calculate the predicted 5-year risk. But how would I calculate the actual 5-year risk? I mean, each individual either survives to 5 years or not, no?

How do you recommend using a model built from continuous predictors to stratify patients for a new clinical trial? I understand the importance of not dichotomizing any of these predictors during the model-building process, but once the model is built and validated and we want to use it to either stratify or select patients for a future trial, is there a recommended approach? Should we try to identify a cutoff (and if so, do you have a recommended approach)? Is there an alternative recommended approach?
Thanks.

This is a good question. I assume that the goal is "risk enrichment" to increase the amount of information in the clinical trial sample, i.e., to maximize effective sample size by not randomizing patients who are "too well." One might also think of a risk beyond which it is too late to treat a patient. There are two main approaches I can think of:

  • enumerate a sample of target patients and randomize the m highest risk patients
  • choose a risk cutoff based on an in-depth analysis of the utilities of various designs married to the clinical question

In neither of these approaches should one consider cutoffs on individual risk factors; instead, use the risk estimate as a one-dimensional risk summary. There is no magic way to choose a cutoff unless you know the target average risk. For example, if you wanted to randomize patients such that those in the control arm would have a cumulative incidence of the outcome of c, one may find that randomizing those with risk > c - 0.1 achieves this.
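A sketch of both approaches (hypothetical names; risk holds validated-model predictions for the screened candidates):

```r
# Predicted risks from the validated model, e.g. an lrm fit
risk <- predict(fit, newdata = screened, type = "fitted")
# Approach 1: randomize the m highest-risk patients
m <- 400
enrolled <- screened[order(risk, decreasing = TRUE)[seq_len(m)], ]
# Approach 2: risk cutoff aimed at a target control-arm cumulative incidence c
c_target <- 0.25
cutoff   <- c_target - 0.1          # starting guess from the post; verify empirically
mean(risk[risk > cutoff])           # average predicted risk among those enrolled
```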