RMS Discussions

This is a place for questions and discussions about the R rms package, for archived discussions arising from Frank Harrell's Regression Modeling Strategies full or short course, and for regression modeling topics from the MSCI Biostatistics II course. New non-software questions and discussions about regression modeling strategies should be posted in the appropriate topic in datamethods.org that has been created for each chapter in RMS. Links to these topics are found below. You can also go to datamethods.org/rmsN where N is the RMS book chapter number.

Other Information

R rms Package Frequently Asked Questions

Implementing and interpreting chunk tests / contrasts

Contrasts are differences in some transformation of predicted values—either a single difference or a series of differences that may or may not be multiplicity-adjusted (for confidence intervals). These are obtained with the contrast.rms function (contrast for short) and detailed examples may be obtained by typing ?contrast.rms at the console. Chunk tests are multiple parameter (multiple degree of freedom tests). There are two main ways to obtain these composite tests:

  1. Fitting a full model and a submodel and getting a likelihood ratio test (e.g., using the lrtest function) to test the difference between models, which tests for the joint importance of all the variables omitted from the smaller model.
  2. Using the anova.rms function (short name anova).
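As a minimal sketch of the first approach, assuming a hypothetical data frame d with a binary outcome y and predictors age, sex, and chol:

  require(rms)
  dd <- datadist(d); options(datadist = 'dd')   # d and its variables are hypothetical

  full <- lrm(y ~ rcs(age, 4) + sex + rcs(chol, 4), data = d)
  sub  <- lrm(y ~ rcs(age, 4) + sex, data = d)

  lrtest(full, sub)   # likelihood ratio chunk test for all terms involving chol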

There are two main types of “chunks” used by anova.rms:

  1. Testing the combined impact of all the components of a predictor that requires more than one parameter to appear in the model. Examples: combined effect of k-1 indicator variables for a k-category predictor; combined effects of linear and nonlinear pieces of a splined continuous predictor. This is done automatically by anova(fit).
  2. Testing the combined impact of multiple predictors. This is done by running for example anova(fit, age, sex) to get a combined test—a test of H_0: neither age nor sex is associated with Y. Any interaction terms involving age and sex are automatically included in the test as are any nonlinear terms for age.
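Continuing the hypothetical example above, the anova.rms chunk tests and a simple contrast might look like the following (the age-by-sex interaction and the level 'female' are assumptions made for illustration):

  f <- lrm(y ~ rcs(age, 4) * sex + rcs(chol, 4), data = d)

  anova(f)             # automatic chunk tests: combined, nonlinear, and interaction components
  anova(f, age, sex)   # combined test of everything involving age or sex

  # a single contrast: age 50 vs. age 30 at sex = 'female'
  # (unspecified predictors are set to datadist adjust-to values)
  contrast(f, list(age = 50, sex = 'female'),
              list(age = 30, sex = 'female'))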

Difference between standard R modeling functions and the corresponding rms functions — when to use one or the other?

rms fitting functions exist for many of the most commonly used models. Some of them, such as lrm and orm, are more efficient than their standard R counterparts. The rms functions give rise to automatic tests of linearity from anova and to much easier model validation using resampling methods. rms functions make it easier to produce graphics of predicted values, effects, and nomograms. They also allow you to use latex(fit) to state the fitted model algebraically for easier interpretation.

Some of the fitting functions, such as lrm and especially orm, run faster than built-in R functions. rms functions also have print methods that include statistical indexes tailored to the particular model, such as R^2_\text{adj} and rank measures of predictive discrimination.

Unlike standard R, which uses summary methods to get standard errors and statistical tests for model parameters, the rms print methods do that, and the rms summary method (summary.rms) computes things like inter-quartile-range effects and, for categorical predictors, comparisons against the reference cell (by default the most frequent category).
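For example, with the same hypothetical data as above, the two workflows differ roughly as follows (x=TRUE, y=TRUE is needed for resampling validation, and summary requires a datadist to have been set):

  # standard R: fit, then summary() for coefficients and tests
  g <- glm(y ~ age + sex, family = binomial, data = d)
  summary(g)

  # rms: print() gives coefficients, tests, and model-specific indexes;
  #      summary() gives inter-quartile-range effects (e.g., odds ratios)
  f <- lrm(y ~ rcs(age, 4) + sex, data = d, x = TRUE, y = TRUE)
  print(f)
  summary(f)
  anova(f)               # automatic test of linearity for age
  validate(f, B = 200)   # bootstrap model validation
  latex(f)               # algebraic statement of the fitted model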

Plotting hazard ratios — partial effects in the Cox PH model

The datadist function — purpose and how it is used?

datadist computes descriptive summaries of a series of variables (typically all the variables in the analysis dataset) so that predictions, graphs, and effects are easier to get. For example, when a predictor value is not given to Predict(), the predictor will be set to the median (mode for categorical predictors). When a range is not given for a predictor being plotted on the x-axis to get predicted values on the y-axis, the range defaults to the 10th smallest to the 10th largest predictor value in the dataset. The default range for continuous predictors for getting effect measures such as odds ratios or differences in means is the inter-quartile-range. All the summary statistics needed for these are computed by datadist and stored in the object you assign the datadist result to. You have to run options(datadist=...) to let rms know which object holds the datadist results.
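A typical setup, again using a hypothetical data frame d:

  require(rms)
  dd <- datadist(d)           # descriptive summaries of every variable in d
  options(datadist = 'dd')    # tell rms which object holds them

  f <- lrm(y ~ rcs(age, 4) + sex, data = d)
  Predict(f, age)   # age ranges over the 10th smallest to 10th largest values;
                    # other predictors are set to median/mode
  summary(f)        # inter-quartile-range odds ratios by default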

Arguments and details of functions for missing data: transcan and aregImpute

transcan is used for nonlinear principal component analysis and for single imputation. aregImpute is used for multiple imputation (preferred). There are many arguments and many examples of using these functions in their help files and also in case studies in the RMS text and course notes.
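A minimal sketch of the multiple imputation workflow, with hypothetical variable names:

  require(rms)
  set.seed(1)
  a <- aregImpute(~ y + age + sex + chol, data = d, n.impute = 10)

  # fit the model on each completed dataset and combine estimates (Rubin's rules)
  f <- fit.mult.impute(y ~ rcs(age, 4) + sex + rcs(chol, 4), lrm, a, data = d)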

When to use data.table vs standard R data.frames?

data.table is highly recommended for a variety of reasons related to the ease of manipulating data tables in simple and complex ways. All data tables are data frames, so data tables are valid inputs to rms functions.
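A small illustration, assuming a hypothetical data frame d (column names are made up):

  require(rms)
  require(data.table)
  d <- as.data.table(d)                       # a data.table still inherits from data.frame
  d[, .(mean.age = mean(age)), by = sex]      # example data.table summary
  f <- ols(y ~ rcs(age, 4) + sex, data = d)   # and it can be passed straight to rms fitters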

Working with labels and how they interact with rms functions

Labels are typically specified with the Hmisc upData function or with label(x) <- 'some label'. Labels are also created automatically by various dataset import functions. Labels are used by several rms functions, primarily for axis labels. When a variable has a units attribute in addition to a label, the units appear in a smaller font at the end of the label when forming an axis label. For some functions there is an option to use variable names in place of labels in the output.
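For example (variable names hypothetical):

  require(Hmisc)
  d <- upData(d,
              labels = c(age = 'Age', sbp = 'Systolic Blood Pressure'),
              units  = c(age = 'years', sbp = 'mmHg'))

  # or one variable at a time
  label(d$age) <- 'Age'
  units(d$age) <- 'years'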


Here is the first question of the RMS 2020 short course.

Is it fair to say that the overarching objectives of regression models in general (and RMS in particular) are (1) to develop models that will make accurate predictions of responses for future observations, (2) to properly estimate and represent uncertainty, and (3) to support valid inference (whatever that actually is)? And that everything else follows from these and is just the necessary detail?

If not, how can this proposition be improved?


In the session, we talked about the SNR (signal-to-noise ratio), and it was pointed out that machine learning may be best where this is very high or effectively infinite, such as in games or in visual processing. This might be a qualifier to add.


In reference to course philosophy and the perspective that models are usually the best descriptive statistics (because descriptive statistics do not work in higher dimensions), I was wondering on which side of the inferential-model vs. descriptive-statistics divide you’d place techniques like multidimensional scaling. Do you see value in such techniques? I’m thinking in particular of things like ordination in ecology and of situations where the signal-to-noise ratio is very low in a high-dimensional setting (and one still needs to assess/summarize it in a reasonable way).

Can you please clarify the usage of regression splines, based on the discussion from today’s lecture? Is the consideration of how many knots to include for each continuous term in a regression model on the same scale as how many covariates one can include (i.e., how many degrees of freedom can be afforded)? From the lecture it seemed like one can regularly allow at least 3 knots regardless of how many other terms are in the model.

I had a similar question today, Abraham. I assumed that when we were discussing the number of ‘variables’ that could be supported by a particular sample size, this referred to the number of actual parameters (betas) in the model, including dummy variables and non-linear terms needed to model all of the variables.

I was wondering if there was a minimum number of unique values you need for a “continuous” x to be suitable for modelling by restricted cubic splines?

I think that techniques such as multidimensional scaling have very good utility, especially in relation to data reduction and to better understanding interrelationships among predictors or among multiple outcome variables.

Adding knots is almost like adding a new variable to the model. Perhaps one could say that having a variable with 4 knots (3 d.f.) is like adding 2.5 variables to the model. If nonlinear terms are penalized it would be less. And the ability to use 3 knots will be hampered if there are many continuous variables in the model, unless the sample size is very large.
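As an illustration of how knots translate into parameters (hypothetical data d), anova and specs show the degrees of freedom and the knot locations:

  f <- lrm(y ~ rcs(age, 4) + sex, data = d)
  anova(f)   # age gets 3 d.f. (1 linear + 2 nonlinear); the nonlinear piece is tested separately
  specs(f)   # shows the knot locations chosen by the default quantile-based placement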

Sometimes I require 10 or more distinct values. But a slightly different problem occurs if you have a lot of ties at one value, making it hard for default knot placement algorithms to work.

Are data reduction techniques (like multidimensional scaling), in and of themselves, modeling or descriptive statistics (of high dimensional data)?

I’d like to learn more about the problems hierarchical random-effects models have with nonlinear (e.g., logistic) regression, which Prof. Harrell referred to at the end of this afternoon’s class.

Gelman is enthusiastic about hierarchical logistic regression, and I’ve read a lot of his work on the subject, so I’d like to also know more about the limitations and what to worry about in using such models.

Good question. I tend to use them descriptively but they are kind of modeling too.

I’ll bet that there is a single best blog article of his on andrewgelman.com. If you find it and point us to it that would be great.

A lot of Gelman’s blog posts on nonlinear hierarchical modeling focus on specific details rather than the big picture, but there’s a 2018 paper by Weber, Gelman, Lee, Betancourt, Vehtari, and Racine-Poon on hierarchical nonlinear regression related to meta-analysis and drug development: http://www.stat.columbia.edu/~gelman/research/published/AOAS1122.pdf

Here is a recent blog post from Gelman, saying that nonlinear hierarchical models can be handled by the stan_nlmer() function in the rstanarm package: https://statmodeling.stat.columbia.edu/2020/03/30/fit-nonlinear-regressions-in-r-using-stan_lmer/

And here’s an R Journal paper by Paul Buerkner about the brms package, which has a worked example of fitting a nonlinear hierarchical model using brms (see Example 3 on pp. 403ff).

I’ll keep looking for a great blog post by Gelman, but the links above may be good starting points.


Datamethods.org would only allow me to put 2 links in a post, so here’s a link to Buerkner’s R-journal paper that I cite in my previous comment: https://journal.r-project.org/archive/2018/RJ-2018-017/RJ-2018-017.pdf


My impression from today’s discussions is that you do not advocate dropping variables from a model or simplifying non-linearities. But you also talk about block tests. Wouldn’t block tests be a way to see if clusters of variables and/or non-linearities are necessary? If a simpler model has an equivalent AIC/BIC/c-index, is well calibrated, and the lrtest does not favor the complex model, would this be enough to justify keeping the simple model? If not, can you please clarify when to use block tests and for what purpose?


I have questions related to the rms package and today’s discussions:

  • My take from today is that variable selection is to be avoided if possible and taken with a grain of salt if not. rms has implemented fastbw, which performs fast backward variable selection with an AIC or p-value rule. I am not familiar with the procedure, but does this avoid the pitfalls of variable selection mentioned today? If not, what was the motivation behind its implementation?

  • rms has pentrace to implement penalized regression with continuous and binary outcomes. Is there something similar for cph? And for polytomous/ordinal regression?

  • In the RMS book, Section 5.5.2, you mention a backward selection from the ‘gold standard’ model based on R-squared. Is this implemented in the rms package, and can it be specified within the validate and calibrate functions?


Great question. We’ll soon talk in class about chunk tests with large degrees of freedom being a safe way to possibly delete a pre-specified block of variables.

fastbw implements a fast approximate (exact if doing a linear model) stepdown. It doesn’t avoid any of the pitfalls except that backwards stepdown works slightly better than forward selection.
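If stepdown must be used anyway, a rough sketch of fastbw and of carrying the selection into resampling validation (assuming a fit f created with x=TRUE, y=TRUE):

  fastbw(f, rule = 'aic')           # fast approximate backward stepdown
  validate(f, B = 200, bw = TRUE)   # repeat the stepdown inside every bootstrap resample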

The survival package implements some forms of penalization for the coxph function. rms::lrm implements penalization for binary and ordinal regression. Someone has done some work on penalized polytomous regression.
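For lrm, the usual pattern is to search for a penalty with pentrace and then refit; a rough sketch (the fit, data, and penalty grid are made up, and the fit needs x=TRUE, y=TRUE):

  f <- lrm(y ~ rcs(age, 4) + sex, data = d, x = TRUE, y = TRUE)
  p <- pentrace(f, seq(0, 20, by = 0.5))    # trace effective AIC over candidate penalties
  p                                         # prints the best-performing penalty
  f.pen <- update(f, penalty = p$penalty)   # refit with that penalty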

No specific code has been written to do model validation in the context of model approximation. Some of my students have identified one possible problem with model approximation.


Laura, these are good questions and your inquiry will be of interest to many, I am sure.

Frank will have a better answer, but until then, let me take a shot at it for you.

Your expressed impressions are largely correct: Frank advises against dropping variables and other simplifications because, when these ‘parsimonious’ models are presented, they are reported and interpreted as though the final model was correctly divined, and the standard errors, CIs, and p-values are all too small, overstating the certainty and precision of the model. This is another instance where “phantom df’s” come into play: degrees of freedom that were part of the model development process, in the variables and the various parameterizations considered and evaluated, are ‘disappeared’. This leads to misrepresentation of the real degree of randomness in the (entire) process of predicting responses.

This is the essence of RMS: statistical forensics, exposing unwitting statistical misdirection that misstates the real uncertainty. These over-optimistic models tend not to validate well or to be reproducible. This stuff is subtle. But important.

“Chunk tests” are used to guide decisions while minimizing the temptation to hack significance by scrutinizing individual terms. They are a more disciplined approach to making judicious, conservative modeling decisions.

I hope that is helpful.

Frank, if I have this wrong, please set us straight.


Thank you and Drew very much for all these answers; I’m looking forward to the discussions about chunk tests and more!


Hi @f2harrell,

Regarding my question about imputing missing data for new data using a fitted object imp = aregImpute(): you said that it is not possible because the imputation model should include the outcome variable. But when we are doing external validation, we do have the outcome variable. So, my question is: does it make sense to impute the missing data in the validation data using the imputation model trained on the derivation data set? Or would you recommend fitting an imputation model to the validation data separately?

Would it be possible to allow the aregImpute() function to perform something like imp.newx = predict(imp, newx)?

Thank you,
Robson

I’d also like to hear more about the advantages/disadvantages of hierarchical (with random effects) vs. the nonlinear fixed-effects models described so far.