This is a place for questions and discussions about the R `rms` package, for archived discussions arising from Frank Harrell’s Regression Modeling Strategies full or short course, and for regression modeling topics from the MSCI Biostatistics II course. New non-software questions and discussions about regression modeling strategies should be posted in the appropriate topic in `datamethods.org` that has been created for each chapter in RMS. Links to these topics are found below. You can also go to `datamethods.org/rmsN`, where `N` is the RMS book chapter number.

## Other Information

## R `rms` Package Frequently Asked Questions

### Implementing and interpreting chunk tests / contrasts

Contrasts are differences in some transformation of predicted values—either a single difference or a series of differences that may or may not be multiplicity-adjusted (for confidence intervals). These are obtained with the `contrast.rms` function (`contrast` for short), and detailed examples may be obtained by typing `?contrast.rms` at the console. Chunk tests are multiple-parameter (multiple-degree-of-freedom) tests. There are two main ways to obtain these composite tests:

- Fitting a full model and a submodel and getting a likelihood ratio test (e.g., using the `lrtest` function) to test the difference between models, which tests the joint importance of all the variables omitted from the smaller model.
- Using the `anova.rms` function (short name `anova`).

There are two main types of “chunks” used by `anova.rms`:

- Testing the combined impact of all the components of a predictor that requires more than one parameter to appear in the model. Examples: the combined effect of k-1 indicator variables for a k-category predictor; the combined effects of linear and nonlinear pieces of a splined continuous predictor. This is done automatically by `anova(fit)`.
- Testing the combined impact of multiple predictors. This is done by running, for example, `anova(fit, age, sex)` to get a combined test—a test of H_0: neither age nor sex is associated with Y. Any interaction terms involving age and sex are automatically included in the test, as are any nonlinear terms for age.
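Both routes can be sketched as follows with simulated data (the variable names `age`, `sex`, `y` are hypothetical, not from the original post):

```r
# Sketch only: simulated data and hypothetical variable names
library(rms)
set.seed(1)
d <- data.frame(age = rnorm(200, 50, 10),
                sex = factor(sample(c('female', 'male'), 200, replace = TRUE)))
d$y <- rbinom(200, 1, plogis(-2 + 0.04 * d$age + 0.5 * (d$sex == 'male')))
dd <- datadist(d); options(datadist = 'dd')

f <- lrm(y ~ rcs(age, 4) + sex, data = d)
anova(f)             # automatic chunk tests: age (linear + nonlinear combined), sex
anova(f, age, sex)   # combined chunk test of age and sex together
g <- lrm(y ~ sex, data = d)
lrtest(g, f)         # likelihood ratio chunk test for all age terms
contrast(f, list(age = 60), list(age = 40))  # difference in log odds, age 60 vs 40
```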

### Difference between standard R modeling functions and the corresponding `rms` functions — when to use one or the other?

`rms` fitting functions exist for many of the most commonly used models. Some of them, such as `lrm` and `orm`, are more efficient than their standard R counterparts. The `rms` functions give rise to automatic tests of linearity from `anova` and much easier-to-do model validation using resampling methods. `rms` functions make graphics of predicted values, effects, and nomograms easier to produce. They also allow you to use `latex(fit)` to state the fitted model algebraically for easier interpretation.

Some of the fitting functions, such as `lrm` and especially `orm`, run faster than built-in R functions. `rms` functions also have `print` methods that include statistical indexes tailored to the particular model, such as R^2_\text{adj} and rank measures of predictive discrimination.

Unlike standard R, which uses `summary` methods to get standard errors and statistical tests for model parameters, `print` methods in `rms` do that, and the `rms` `summary` methods (`summary.rms`) compute things like inter-quartile-range effects and comparisons against the largest reference cells for categorical predictors.
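A minimal sketch (simulated data, hypothetical variable names) of this division of labor between `print` and `summary`:

```r
# Sketch only: simulated data
library(rms)
set.seed(1)
d <- data.frame(age = rnorm(150, 50, 10))
d$y <- rbinom(150, 1, plogis(-2 + 0.05 * d$age))
dd <- datadist(d); options(datadist = 'dd')
f <- lrm(y ~ age, data = d)
print(f)    # coefficients, SEs, Wald tests, plus model-tailored indexes (R2, C, Dxy)
summary(f)  # effect (odds ratio) for an inter-quartile-range change in age
```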

### The `datadist` function — purpose and how it is used?

`datadist` computes descriptive summaries of a series of variables (typically all the variables in the analysis dataset) so that predictions, graphs, and effects are easier to get. For example, when a predictor value is not given to `Predict()`, the predictor will be set to the median (the mode for categorical predictors). When a range is not given for a predictor being plotted on the x-axis to get predicted values on the y-axis, the range defaults to the 10th smallest to the 10th largest predictor value in the dataset. The default range for continuous predictors for getting effect measures such as odds ratios or differences in means is the inter-quartile range. All the summary statistics needed for these are computed by `datadist` and stored in the object you assign the `datadist` result to. You have to run `options(datadist=...)` to let `rms` know which object holds the `datadist` results.
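The usual workflow can be sketched as follows (simulated data; `age`, `sex`, `y` are hypothetical names):

```r
# Sketch only: simulated data
library(rms)
set.seed(1)
d <- data.frame(age = rnorm(200, 50, 10),
                sex = factor(sample(c('female', 'male'), 200, replace = TRUE)))
d$y <- rbinom(200, 1, plogis(-2 + 0.04 * d$age))
dd <- datadist(d)        # compute and store the descriptive summaries
options(datadist = 'dd') # tell rms which object holds them
f <- lrm(y ~ rcs(age, 4) + sex, data = d)
Predict(f, age)   # sex defaults to its mode; age spans 10th smallest..10th largest
summary(f)        # age effect computed over its inter-quartile range by default
```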

### Arguments and details of functions for missing data: `transcan` and `aregImpute`

`transcan` is used for nonlinear principal component analysis and for single imputation. `aregImpute` is used for multiple imputation (preferred). These functions have many arguments, and many examples of their use appear in their help files and in case studies in the RMS text and course notes.

### When to use data.table vs standard R data.frames?

`data.table` is highly recommended for a variety of reasons related to the ease of manipulating data tables in simple and complex ways. All data tables are data frames, so data tables are valid inputs to `rms` functions.
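For instance, a data table can be passed straight to an `rms` fitting function (a minimal sketch with simulated data):

```r
# Sketch only: simulated data
library(data.table)
library(rms)
set.seed(1)
dt <- data.table(age = rnorm(100, 50, 10))
dt[, y := rbinom(.N, 1, plogis(-2 + 0.05 * age))]
dd <- datadist(dt); options(datadist = 'dd')
f <- lrm(y ~ age, data = dt)   # a data.table is a data.frame, so this just works
```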

### Working with labels and how they interact with `rms` functions

Labels are typically specified with the `Hmisc` `upData` function or with `label(x) <- 'some label'`. Labels are also created automatically by various dataset imports. Several `rms` functions use labels, primarily for axis labels. When a variable has a `units` attribute in addition to a `label`, the `units` appear in a smaller font at the end of the label when forming an axis label. For some functions there is an option to use variable names in place of labels in the output.