This is a place for questions and discussions about the R rms package, for archived discussions arising from Frank Harrell’s Regression Modeling Strategies full or short course, and for regression modeling topics from the MSCI Biostatistics II course. New non-software questions and discussions about regression modeling strategies should be posted in the appropriate datamethods.org topic that has been created for each chapter in RMS. Links to these topics are found below. You can also go to datamethods.org/rmsN, where N is the RMS book chapter number.
Other Information
R rms Package Frequently Asked Questions
Implementing and interpreting chunk tests / contrasts
Contrasts are differences in some transformation of predicted values: either a single difference or a series of differences that may or may not be multiplicity-adjusted (for confidence intervals). These are obtained with the contrast.rms function (contrast for short), and detailed examples may be obtained by typing ?contrast.rms at the console. Chunk tests are multiple-parameter (multiple-degrees-of-freedom) tests. There are two main ways to obtain these composite tests:
- Fitting a full model and a submodel and getting a likelihood ratio test (e.g., using the lrtest function) to test the difference between the models, which tests the joint importance of all the variables omitted from the smaller model.
- Using the anova.rms function (short name anova).
There are two main types of “chunks” used by anova.rms:
- Testing the combined impact of all the components of a predictor that requires more than one parameter to appear in the model. Examples: the combined effect of the k-1 indicator variables for a k-category predictor; the combined effects of the linear and nonlinear pieces of a splined continuous predictor. This is done automatically by anova(fit).
- Testing the combined impact of multiple predictors. This is done by running, for example, anova(fit, age, sex) to get a combined test, i.e., a test of H_0: neither age nor sex is associated with Y. Any interaction terms involving age and sex are automatically included in the test, as are any nonlinear terms for age. See the sketch after this list.
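A minimal sketch of both routes to a chunk test, plus a simple contrast. The data are simulated and the variable names (age, sex, y) are hypothetical; this assumes a binary logistic model fitted with lrm:

```r
library(rms)

set.seed(1)
d <- data.frame(age = rnorm(200, 50, 10),
                sex = factor(sample(c('female', 'male'), 200, replace = TRUE)),
                y   = rbinom(200, 1, 0.5))
dd <- datadist(d); options(datadist = 'dd')

f <- lrm(y ~ rcs(age, 4) * sex, data = d)

# Automatic chunk tests: the age chunk pools its linear, nonlinear, and interaction terms
anova(f)

# Combined chunk test of age and sex (nonlinear and interaction terms included)
anova(f, age, sex)

# Likelihood ratio chunk test: full model vs. submodel omitting all sex terms
g <- lrm(y ~ rcs(age, 4), data = d)
lrtest(f, g)

# Contrast: male vs. female difference in log odds at age 60
contrast(f, list(sex = 'male', age = 60), list(sex = 'female', age = 60))
```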
Difference between standard R modeling functions and the corresponding rms functions: when to use one or the other?
rms fitting functions exist for many of the most commonly used models. The rms functions give rise to automatic tests of linearity from anova and make model validation using resampling methods much easier to do. rms functions make graphics of predicted values, effects, and nomograms easier to produce. They also allow you to use latex(fit) to state the fitted model algebraically for easier interpretation.
Some of the fitting functions such as lrm and especially orm run faster than the built-in R functions. rms functions also have print methods that include statistical indexes tailored to the particular model, such as R^2_\text{adj} and rank measures of predictive discrimination.
Unlike standard R, which uses summary methods to get standard errors and statistical tests for model parameters, print methods in rms do that, and the rms summary methods (summary.rms) compute things like inter-quartile-range effects and comparisons against the largest (most frequent) reference cell for categorical predictors.
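A minimal sketch of this workflow, using hypothetical simulated data, showing what print, summary, validate, and latex provide for an lrm fit:

```r
library(rms)

set.seed(2)
d <- data.frame(age   = rnorm(300, 55, 12),
                chol  = rnorm(300, 200, 40),
                death = rbinom(300, 1, 0.4))
dd <- datadist(d); options(datadist = 'dd')

f <- lrm(death ~ rcs(age, 4) + chol, data = d, x = TRUE, y = TRUE)

print(f)             # coefficients, SEs, Wald tests, plus model-specific indexes (e.g., c, R2)
summary(f)           # inter-quartile-range odds ratios with confidence limits
validate(f, B = 200) # bootstrap model validation (optimism-corrected indexes)
latex(f)             # algebraic statement of the fitted model
```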
Plotting hazard ratios — partial effects in the Cox PH model
- See this excellent discussion
- Graphical methods to illustrate the nature of the relation between a continuous variable and the outcome when using restricted cubic splines with a Cox proportional hazards model by Peter Austin
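A minimal sketch of plotting the partial effect of a splined continuous predictor from a Cox model fitted with cph; the data are simulated and the variable names are hypothetical:

```r
library(rms)
library(ggplot2)

set.seed(3)
d <- data.frame(age   = rnorm(400, 60, 10),
                time  = rexp(400, 0.1),
                event = rbinom(400, 1, 0.7))
dd <- datadist(d); options(datadist = 'dd')

f <- cph(Surv(time, event) ~ rcs(age, 4), data = d)

# Log relative hazard as a function of age, with pointwise confidence bands
ggplot(Predict(f, age))

# Hazard ratios relative to the adjust-to (median) age, on the ratio scale
ggplot(Predict(f, age, ref.zero = TRUE, fun = exp))
```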
The datadist function: purpose and how it is used
datadist computes descriptive summaries of a series of variables (typically all the variables in the analysis dataset) so that predictions, graphs, and effects are easier to get. For example, when a predictor value is not given to Predict(), the predictor will be set to its median (mode for categorical predictors). When a range is not given for a predictor being plotted on the x-axis to get predicted values on the y-axis, the range defaults to the 10th smallest to the 10th largest value of that predictor in the dataset. The default range for continuous predictors for getting effect measures such as odds ratios or differences in means is the inter-quartile range. All the summary statistics needed for these are computed by datadist and stored in the object you assign the datadist result to. You have to run options(datadist=...) to let rms know which object holds the datadist results.
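A minimal sketch of the datadist workflow; the data are simulated and the variable names are hypothetical:

```r
library(rms)

set.seed(4)
d <- data.frame(age = rnorm(250, 50, 10),
                sex = factor(sample(c('female', 'male'), 250, replace = TRUE)),
                y   = rnorm(250))

dd <- datadist(d)         # compute per-variable summaries (quantiles, modes, plotting ranges)
options(datadist = 'dd')  # tell rms which object holds them

f <- ols(y ~ rcs(age, 4) + sex, data = d)

Predict(f, age)  # age varies over its default range; sex is held at its mode
summary(f)       # inter-quartile-range effect for age; sex compared with its reference cell
```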
Arguments and details of functions for missing data: transcan and aregImpute
transcan is used for nonlinear principal component analysis and for single imputation. aregImpute is used for multiple imputation (preferred). There are many arguments and many examples of using these functions in their help files and also in case studies in the RMS text and course notes.
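A minimal sketch of multiple imputation with aregImpute followed by fit.mult.impute; the data are simulated with missingness introduced artificially, and the variable names are hypothetical:

```r
library(rms)   # also attaches Hmisc, which provides aregImpute and fit.mult.impute

set.seed(5)
d <- data.frame(age  = rnorm(200, 50, 10),
                chol = rnorm(200, 200, 40),
                y    = rbinom(200, 1, 0.5))
d$chol[sample(200, 30)] <- NA          # make some cholesterol values missing

a <- aregImpute(~ age + chol + y, data = d, n.impute = 5)

# Fit the model on each completed dataset and combine estimates across imputations
f <- fit.mult.impute(y ~ rcs(age, 4) + chol, lrm, a, data = d)
print(f)
```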
When to use data.table vs standard R data.frames?
data.table is highly recommended for a variety of reasons related to the ease of manipulating data tables in simple and complex ways. All data tables are data frames, so data tables are valid inputs to rms functions.
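A minimal sketch showing a data.table passed directly to an rms fitting function; the data are simulated and the variable names are hypothetical:

```r
library(rms)
library(data.table)

set.seed(6)
dt <- data.table(age  = rnorm(150, 50, 10),
                 chol = rnorm(150, 200, 40),
                 y    = rbinom(150, 1, 0.5))
dt[, chol := pmin(chol, 300)]   # ordinary data.table manipulation by reference

dd <- datadist(dt); options(datadist = 'dd')
f <- lrm(y ~ rcs(age, 4) + chol, data = dt)   # a data.table is also a data.frame
```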
Working with labels and how they interact with rms functions
Labels are typically specified with the Hmisc upData function or with label(x) <- 'some label'. Labels are automatically created by various dataset import functions. Labels are used by several rms functions, primarily for axis labels. When a variable has a units attribute in addition to a label, the units appear in a smaller font at the end of the label when forming an axis label. For some functions there is an option to use variable names in place of labels in the output.