This is a place for questions and discussions about the R rms package, for archived discussions arising from Frank Harrell’s Regression Modeling Strategies full or short course, and for regression modeling topics from the MSCI Biostatistics II course. New non-software questions and discussions about regression modeling strategies should be posted in the appropriate topic in
datamethods.org that has been created for each chapter in RMS. Links to these topics are found below. You can also go to
N is the RMS book chapter number.
Contrasts are differences in some transformation of predicted values: either a single difference or a series of differences that may or may not be multiplicity-adjusted (for confidence intervals). These are obtained with the contrast.rms function (contrast for short), and detailed examples may be obtained by typing ?contrast.rms at the console.
Chunk tests are multiple-parameter (multiple degree of freedom) tests. There are two main ways to obtain these composite tests:
- Fitting a full model and a submodel and getting a likelihood ratio test (e.g., using the lrtest function) to test the difference between models, which tests for the joint importance of all the variables omitted from the smaller model.
- Using the anova.rms function (short name anova).
There are two main types of “chunks” used by anova:
- Testing the combined impact of all the components of a predictor that requires more than one parameter to appear in the model. Examples: combined effect of k-1 indicator variables for a k-category predictor; combined effects of linear and nonlinear pieces of a splined continuous predictor. This is done automatically by anova.
- Testing the combined impact of multiple predictors. This is done by running, for example, anova(fit, age, sex) to get a combined test of H_0: neither age nor sex is associated with Y. Any interaction terms involving age and sex are automatically included in the test, as are any nonlinear terms for age.
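The contrasts and chunk tests described above can be sketched roughly as follows. This is only an illustration on simulated data: the variable names (age, sex, sbp, y), the effect sizes, and the model formula are assumptions made for the example, not something taken from the course material.

```r
library(rms)

# Simulated data; names and effect sizes are illustrative assumptions
set.seed(1)
n <- 500
d <- data.frame(
  age = rnorm(n, 50, 10),
  sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  sbp = rnorm(n, 120, 15)
)
d$y <- rbinom(n, 1, plogis(-3 + 0.04 * d$age + 0.5 * (d$sex == "male")))

# datadist supplies default values for predictors not named in contrast()
dd <- datadist(d)
options(datadist = "dd")

full    <- lrm(y ~ rcs(age, 4) * sex + sbp, data = d)
reduced <- lrm(y ~ sbp, data = d)

# Contrast: male vs. female at age 60 on the log odds scale;
# sbp is held at its datadist default
k <- contrast(full, list(sex = "male", age = 60),
                    list(sex = "female", age = 60))
print(k)

# Chunk test, first way: likelihood ratio test of full model vs. submodel
lr <- lrtest(full, reduced)
print(lr)

# Chunk test, second way: Wald tests from anova
print(anova(full))           # one chunk per predictor, plus nonlinear and interaction pieces
a <- anova(full, age, sex)   # combined test of age and sex
print(a)
```

Here the likelihood ratio test and anova(full, age, sex) address the same joint hypothesis (everything involving age or sex), one through a likelihood ratio statistic and the other through a Wald statistic, so the two results will usually be similar but not identical.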
Difference between standard R modeling functions and the corresponding rms functions: when to use one or the other?
rms fitting functions exist for many of the most commonly used models. Some of them, such as lrm and especially orm, are more efficient and run faster than their standard R counterparts. The rms functions give rise to automatic tests of linearity from anova and make model validation using resampling methods much easier. They also make it easier to produce graphics of predicted values and effects, and to draw nomograms, and they allow you to use latex(fit) to state the fitted model algebraically for easier interpretation.
rms functions also have
Unlike standard R, which uses summary methods to get standard errors and statistical tests for model parameters, rms print methods do that, and the summary method (summary.rms) computes things like inter-quartile-range effects and comparisons against the largest reference cell for categorical predictors.
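A minimal sketch of the difference, assuming simulated data with hypothetical variables age, sex, and y: the same kind of logistic model is fit with glm and with lrm, and the print, summary, anova, and validate methods are compared.

```r
library(rms)

# Simulated data; names and effect sizes are illustrative assumptions
set.seed(2)
n <- 300
d <- data.frame(
  age = rnorm(n, 50, 10),
  sex = factor(sample(c("female", "male"), n, replace = TRUE))
)
d$y <- rbinom(n, 1, plogis(-2 + 0.03 * d$age + 0.4 * (d$sex == "male")))

dd <- datadist(d)
options(datadist = "dd")

# Standard R: standard errors and tests come from summary()
g <- glm(y ~ age + sex, family = binomial, data = d)
print(summary(g)$coefficients)

# rms: print() shows coefficients, standard errors, and Wald tests
f <- lrm(y ~ rcs(age, 4) + sex, data = d, x = TRUE, y = TRUE)
print(f)

# summary() gives the inter-quartile-range effect for age and a
# comparison against a reference cell for sex
s <- summary(f)
print(s)

# anova() includes a test of nonlinearity for the splined age term
av <- anova(f)
print(av)

# Bootstrap validation; x = TRUE, y = TRUE above are required for this
v <- validate(f, B = 50)
print(v)
```

The number of bootstrap repetitions (B = 50) is kept small only so the example runs quickly.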
datadist computes descriptive summaries of a series of variables (typically all the variables in the analysis dataset) so that predictions, graphs, and effects are easier to get. For example, when a predictor value is not given to Predict(), the predictor will be set to the median (mode for categorical predictors). When a range is not given for a predictor being plotted on the x-axis to get predicted values on the y-axis, the range defaults to the 10th smallest to the 10th largest predictor value in the dataset. The default range for continuous predictors for getting effect measures such as odds ratios or differences in means is the inter-quartile range. All the summary statistics needed for these are computed by datadist and stored in the object you assign the datadist result to. You have to run options(datadist=...) to let rms know which object holds the datadist results.
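The usual workflow can be sketched as below, assuming a simulated data frame d with hypothetical variables age, sex, and y. The object name dd is a common convention, not a requirement.

```r
library(rms)

# Simulated data; names and effect sizes are illustrative assumptions
set.seed(3)
n <- 200
d <- data.frame(
  age = rnorm(n, 50, 10),
  sex = factor(sample(c("female", "male"), n, replace = TRUE,
                      prob = c(0.6, 0.4)))
)
d$y <- 10 + 0.2 * d$age + 2 * (d$sex == "male") + rnorm(n)

# Compute the summaries once and tell rms where to find them
dd <- datadist(d)
options(datadist = "dd")
print(dd)   # shows adjust-to values and effect and plotting ranges

f <- ols(y ~ rcs(age, 4) + sex, data = d)

# sex is not given, so it is set to its datadist default;
# age varies over the default plotting range
p1 <- Predict(f, age)

# Both predictors given explicitly: a single predicted value
p2 <- Predict(f, age = 50, sex = "male")
print(p2)

# The age effect is computed over the range stored by datadist
print(summary(f))
```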
transcan is used for nonlinear principal component analysis and for single imputation.
aregImpute is used for multiple imputation (preferred). There are many arguments and many examples of using these functions in their help files and also in case studies in the RMS text and course notes.
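As a rough sketch of the multiple imputation route, assuming simulated data with hypothetical variables y, x1, and x2 (with some x2 values set to missing): aregImpute creates the imputations and the Hmisc function fit.mult.impute fits the model on each completed dataset and combines the results. The number of imputations and knots here are arbitrary choices for the example, and transcan is not shown.

```r
library(Hmisc)   # provides aregImpute and fit.mult.impute
library(rms)     # provides ols

# Simulated data; names and effect sizes are illustrative assumptions
set.seed(4)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + d$x1 + 0.5 * d$x2 + rnorm(n)
d$x2[sample(n, 60)] <- NA   # make 20% of x2 missing

# Multiple imputation using all variables, including the outcome
a <- aregImpute(~ y + x1 + x2, data = d, n.impute = 5, nk = 3, pr = FALSE)

# Fit the model on each completed dataset and combine the fits
f <- fit.mult.impute(y ~ x1 + x2, ols, a, data = d, pr = FALSE)
print(f)
```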
data.table is highly recommended for a variety of reasons related to the ease of manipulating data tables in simple and complex ways. All data tables are data frames, so data tables are valid inputs to rms functions.
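A small sketch of this, assuming the data.table package is installed and using hypothetical variables age, age_decade, and y:

```r
library(rms)
library(data.table)

# Simulated data; names and effect sizes are illustrative assumptions
set.seed(5)
n <- 200
d <- data.table(age = rnorm(n, 50, 10))
d[, y := rbinom(.N, 1, plogis(-2 + 0.04 * age))]

# Derived variable added by reference, without copying the table
d[, age_decade := age / 10]

# A data.table is also a data.frame
print(is.data.frame(d))

# Passed directly as the data argument of a fitting function
f <- lrm(y ~ age_decade, data = d)
print(f)
```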
Labels are typically specified with the upData function or with label(x) <- 'some label'. Labels are automatically created with various dataset imports. Labels are used by several rms functions, primarily for axis labels. When a variable has a units attribute in addition to a label, the units appear in a smaller font at the end of the label when forming an axis label. For some functions there is an option to use variable names in place of labels in the output.
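Both ways of attaching labels and units can be sketched as follows, using the Hmisc functions label, units, and upData. The data frame and its variables (age, sbp) are made up for the example.

```r
library(Hmisc)

# Small made-up data frame
d <- data.frame(age = c(34, 51, 67), sbp = c(118, 131, 142))

# One variable at a time
label(d$age) <- "Age"
units(d$age) <- "years"

# Several variables at once with upData; print = FALSE suppresses its report
d <- upData(d,
            labels = c(sbp = "Systolic Blood Pressure"),
            units  = c(sbp = "mmHg"),
            print  = FALSE)

print(label(d$age))
print(units(d$age))
print(label(d$sbp))
print(units(d$sbp))
```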