RMS General Regression

Regression Modeling Strategies: General Aspects of Fitting Regression Models

This is the second of several connected topics organized around chapters in Regression Modeling Strategies. The purposes of these topics are to introduce key concepts in the chapter and to provide a place for questions, answers, and discussion around the chapter’s topics.

Overview | Course Notes

While maybe not the sexiest part of RMS, a firm grasp of the notation can be especially important for accessing important RMS concepts and maneuvers, such as chunk tests involving interactions, and for interpreting and expressing effects. If you are new-ish to modeling, skipping over notation might not actually save time and effort.

The regression function describes interesting properties of Y that may vary across individuals in the population. C(Y|X) denotes a property of the distribution of Y conditional on X, expressed through a weighted sum of the X's (the linear predictor).
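Concretely, the general form is C(Y|X) = g(X\beta) = g(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p); in binary logistic regression, for example, C(Y|X) = \Pr(Y = 1 | X) = 1/(1 + \exp(-X\beta)).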

Interaction (aka “effect modification”) indicates that the effects of two variables cannot be separated: i.e., the effect of X_1 on Y depends on the level of X_2, and vice versa. This makes it difficult to interpret \beta_1 and \beta_2 in isolation. Importantly, rather than just a difference in slopes, interaction may take the form of a difference in shape (or distribution) of a covariate-response relationship conditional on another variable.
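In the simplest two-predictor case, C(Y|X) = g(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2), so the effect of a one-unit increase in X_1 on the linear predictor is \beta_1 + \beta_3 X_2; it cannot be read off \beta_1 alone.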

Almost every statistical test is focused on one specific pattern to detect; and therefore inference from every statistical test should be appropriately qualified by this specific pattern.

One of the most valuable maneuvers for general practice is the use of regression splines to relax linearity for continuous covariates. Its low prevalence in general practice does not seem proportionate to its merit. For folks new to rms::rcs() (or now rms::gTrans(), etc.), don't get intimidated or frustrated by Sections 2.4.2 through 2.4.6; just pragmatically emulate FH's practice in the examples (and see links below), and let full understanding follow in good time. 'Trust in the Force' of these magical functions: use splines to avoid categorizing continuous variables or imposing naive linearity.
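As a minimal sketch of what that looks like in practice (simulated data; the variable names age and y are hypothetical, not from the course notes):

require(rms)
set.seed(1)
age <- runif(300, 30, 80)
y   <- 0.03 * (age - 55)^2 + rnorm(300, sd = 4)   # a curved true relationship
dd  <- datadist(age); options(datadist = 'dd')    # datadist is needed for Predict() below
f   <- ols(y ~ rcs(age, 4))                       # restricted cubic spline, 4 knots at default quantile locations
ggplot(Predict(f, age))                           # partial effect plot; no need to interpret individual coefficients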

Categorizing continuous variables leads to loss and distortion of information. Categorizing continuous variables does violence to statistical operating characteristics and is unnecessary. (Don’t do it!)

Likewise for chunk tests (Section 2.6): their low prevalence in practice does not seem proportionate to their merit. Emulating FH's practice in evaluating the complexity of effects (interactions and spline terms) or redundancy (groups of covariates, etc.) has mitigated much statistical complication.

A reasonable approach for modeling and evaluating complex interactions is, for each predictor separately, to simultaneously test the joint importance of all interactions involving that predictor.
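A hedged sketch of that chunk test using anova.rms (simulated data; the variable names are hypothetical):

require(rms)
set.seed(2)
n   <- 500
age <- rnorm(n, 60, 10)
bp  <- rnorm(n, 130, 15)
sex <- factor(sample(c('female', 'male'), n, replace = TRUE))
y   <- rbinom(n, 1, plogis(-9 + 0.1 * age + 0.02 * bp + 0.4 * (sex == 'male')))
dd  <- datadist(age, bp, sex); options(datadist = 'dd')
f   <- lrm(y ~ rcs(age, 4) * sex + rcs(bp, 4) * sex)
anova(f)   # the 'All Interactions' rows jointly test every interaction involving a given predictor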

Additional links

RMS2


Q&A From the May 2021 Course

  1. :question: Greenland (e.g., in Modern Epidemiology, 2008) holds that the concept of the regression model—which comprehends the design, the expression of the model form, and assumptions—is a concept separate from the regression function; and that regression modeling (fitting, estimation, and evaluation) is moreover also a separate notion. Is this correct? Useful? Helpful? Important or pedantic?

  2. I think I am struggling conceptually with how much data you need to actually fit models like this (restricted cubic splines specifically); while more accurate, they would seem to require a very large sample size to uncover associations that aren’t incredibly large. Or am I just being overwhelmed by the complexity behind anything that is truly useful?
    [dgl: Analysis is a craft/art, and it will benefit from thinking about the information in the data and where the information is (subject-matter knowledge). A lot has to do with the nature and quality of the response variable and your prior understanding/beliefs about relationships. Actually, splines allow you to use limited data more efficiently, so I think it mischaracterizes the challenge to presume lots of data are necessary for splines/relaxing linearity assumptions. It also has a lot to do with the quality of your inference: how you qualify the patterns you see and what your uncertainty is. Let’s discuss more.] FH: This is very well said. We will be talking about sample size on Wednesday.

  3. We talked about categorising in modelling. What about categorising for the purpose of presenting descriptive data? Is it ever useful? [dgl: I recommend you consider how to use graphics or good visualizations before categorizing. Categorizing always obscures information and, more often than not, distorts understanding.] Thanks! FH: We’ll see in the Titanic case study that categorizing for the purpose of descriptive statistics can be a complete disaster.

  4. Is it important to evaluate the linearity assumption for each continuous candidate variable, or is it appropriate to model using splines as a default? [dgl: I think starting with splines is a good default, unless you have very strong subject-matter presumptions (such as pressure vs. volume of a gas)] If you have enough effective sample size you can spline every continuous variable. But we’ll discuss on Wednesday an approach that allows you to use different numbers of parameters for different predictors. And we’re not evaluating the linearity assumption per se but are just allowing for nonlinearity.

  5. Can you share your thoughts on B-Splines? Any guidance on choosing between B-Splines and Restricted Cubic Splines? FH: B-splines are fine, and have less collinearity among terms. But they are more complex, so are not usually worth the trouble unless using a fitting routine that doesn’t handle collinearity well (e.g., SAS).

  6. Is any kind of action based on a model diagnostic considered data dredging due to the inference being conditioned on complete model certainty after alteration (the resulting model wasn’t pre-specified)? So ideally, would the only place for model diagnostics be to assess the appropriateness of an analysis after it’s done? Running diagnostics and changing the model can indeed distort statistical inference. I like to choose models that require fewer diagnostics as a result, e.g., semiparametric models.

  7. 2.4.5. In a model where we use restricted cubic splines of some continuous covariates, and you really need to interpret the effect of these covariates in the model, how can you do this? Is there a workaround to address the issue of interpretability? Could one workaround be to fit a piecewise linear model for ease of interpretation and a restricted cubic spline model for graphical representation? FH: One does not need to fit an alternate model. Splines are interpreted graphically.

  8. Can you elaborate on the difference between natural cubic splines and restricted cubic splines? Are the settings for their application different? FH: They are one and the same, only most people use B splines when fitting natural splines.

  9. Any thoughts on Dennis Cook’s envelope methods, which (as I understand it) reduce the dimension of the predictors while using the information in Y (i.e., not masked) to do so in a way that won’t lose the part that matters?

  10. Thoughts on relaxed lasso? FH: It can’t work. The shrinkage has to be carried forward or model calibration will be bad.

  11. Do we make different choices about data reduction and model selection if we’re doing an exploratory (hypothesis-generating) study rather than a confirmatory (hypothesis-testing) study? FH: probably

  12. Can you/do you interpret coefficients of a cubic spline? FH: you don’t

  13. So if you care about estimating (say) the average effect of one variable on the outcome, then you can’t model that variable with cubic splines. FH: I don’t know what “average effect” means.

  14. Do issues of phantom degrees of freedom apply to Bayesian approaches or is it a frequentist specific issue? FH: Good question. Bayes provides a more cohesive approach in which you put extra parameters into the model to absorb lack of fit, and put priors on them.

  15. Can you clarify when it is appropriate to explore for interactions prior to fitting the final model? Seems inappropriate to use the data to inform the presence of interactions between candidate variables, then to add them to the final model using the same data. FH: The only “safe” exploration is to pre-specify a list of interactions and test them all jointly with a chunk test. Then either keep all or remove all.

  16. I’m a bit confused about the use of splines (possibly with interactions) and the idea of estimation. Frank said you shouldn’t spend too much time interpreting individual coefficients. You could do a chunk test of all the different X variables that include your particular X variable of interest, but that is a hypothesis test as I understand it, and I was under the impression we wanted to move away from testing to estimation. If you can’t interpret individual coefficients (which give you an estimate of the effect of that variable on the outcome?), then what inferences can you make? about e.g. the effect of X1 on Y. I suppose that you can make better predictions with your model than you might otherwise have been able to. FH: The main approach here is partial effect plots as we’ll see on Wednesday.

  17. How do you assess fit of splines? FH: primarily we posit a spline that must fit as well as needed within the restriction that we don’t overfit. So if the spline doesn’t fit there is nothing we can do about it anyway.

  18. So there is no metric to formally assess fit? We mainly rigorously check the fit of the overall model, e.g. calibration curves for absolute accuracy. For a single variable the fit is relative, e.g., how bad is a fit with 4 knots compared to putting 6 knots on the variable. But since we should start out with the maximum number of knots we are comfortable with, this is somewhat moot.

  19. In what situations do you use chunk tests? FH: many!

  20. My clinician colleagues are used to seeing results of a continuous variable from a logistic regression written in OR/95% CI per unit of the variable as it is commonly modeled as linear. When modeled as a RCS, how should the results best be communicated (this goes for a manuscript as well) if we are to avoid evaluating/interpreting coefficients? FH: covered on Wednesday.

  21. 2.8. In the example of rates of hospital admission, if one had found that the jump effect was not significant (and assuming testing the jump was not our primary interest), would it make sense to report results from a final model without the jump effect? FH: No. We need to abandon “significance” and we’ll distort inference if we remove “insignificant” predictors. Use the confidence interval for the variable whether it’s “significant” or not.

  22. 2.4.5. If we test for non-linearity in a model fitted with splines and find it significant, but the predictions produced by the spline model are very similar to the predictions produced by the linear model, could it be appropriate to report results of the linear model because the latter is much simpler to interpret? FH: It’s OK to use estimates from a linear fit but you must use the confidence band from the spline fit to be accurate. So you might as well keep the spline fit. The approach you outlined creates model uncertainty.

  23. What is the best way to examine for nonlinearity in our continuous variables? Should this be examined in each variable prior to model building, if so how? Should we assume nonlinearity based on previous literature/knowledge? Should we compare model fit with and without nonlinearity? Other? FH: If you don’t have good reason to assume linearity before looking at the data, allow for nonlinearity. We will go over a strategy where you force some unpromising variables to be linear though.

  24. Are there any important differences between multivariate adaptive regression splines and the approach we have considered? FH: MARS is a little more prone to overfitting but offers some nice interaction modeling capabilities I believe.

  25. In section 2-55 (page 84) there is some code: off <- list(stdpop = mean(d$stdpop))  # offset for prediction (383464.4). I couldn’t follow from the comment what this is doing. Could you elaborate? (Perhaps similarly) what is the purpose of the offset() function in the Glm formula that follows: f <- Glm(aces ~ offset(log(stdpop)) + rcs(time, 6), data=d, family=poisson)? FH: Offsets are the way that Poisson regression adjusts for the denominator (normalizes) so that you are then modeling rates and not population-change-dependent counts. The linear predictor is a log rate after adding covariate effects and the offset. [Thank you - I didn’t realise it was related to the fact that it was Poisson regression, so I didn’t do effective searches. I found the explanation here useful too: When to use an offset in a Poisson regression? - Cross Validated] FH: Thanks for the very helpful link, which I’ve added to the course notes.

  26. On page 2-19 (p48 of pdf) I have a Q about notation. What does the capital “I” mean when forcing continuity? (not in the R code, but in the actual regression equation) FH: It just means a 0/1 indicator for condition not holding/condition holding. I should have used the Knuth notation [condition] without the I.

  27. Right, thank you. So in that case it is just an indicator to tell the model whether to include the second slope.

  28. Suppose we are doing a case-control trial to test the effectiveness of vaccine 1 for a certain disease. 50% of the subjects who are vaccinated with vaccine 1 are vaccinated with vaccine 2, which is intended for the same disease. The reason they are recommended to be vaccinated sequentially is both of them have distinctive benefits. Is it preferable to restrict the analysis population to vaccine 2 free subjects (and double the study size) or just perform logistic regression adjusting for vaccine 2 (and including interaction term ) ? Does the percentage of vaccine overlap (50% in this example) impact the choice? FH: Probably a better question for datamethods.org.

  29. What are your thoughts on other information criteria? Throughout the course the emphasis has been on avoiding parsimonious models; is there any situation where you would see yourself using the Bayesian Information Criterion (BIC) over AIC? BIC applies to the case where there is a true, finite-dimensional model. I don’t think that’s ever the case in my field. BIC penalizes towards underfitted models.

  30. What is your opinion on linear model trees (such as the algorithm mentioned in https://doi.org/10.1198/106186008X319331)? I think it’s better than regular trees but would need to see added value over the usual flexible modeling.

  31. Instead of considering y as a known value in regression, should we incorporate an error function to account for variability in the response variable? Ordinary linear models already account for this through the usual error term. For other models it depends on what you mean by variability. If it’s measurement error, this can cause problems. If it’s variability due to not accounting for an important variable, the main solution is to account for that variable.

I am attaching some numerical examples to explain my questions.

I have 2 continuous non-linear variables, namely A and B. The use of rcs with 4 knots produced the following output.

Model 1: Surv(Death_Time, Death) ~ A
Model 2: Surv(Death_Time, Death) ~ rcs(A, 4)

L.R. Chisq d.f. P
2.252038e+01 2.000000e+00 1.287546e-05

Model 1: Surv(Death_Time, Death30d) ~ B
Model 2: Surv(Death_Time, Death30d) ~ rcs(B, 4)

L.R. Chisq d.f. P
6.87369492 2.00000000 0.03216593

By running the multivariable Cox regression analysis with the two binary variables C and D and the three-level categorical variable E, the output was the following:

                      Model Tests                 Discrimination Indexes
Obs      13734        LR chi2      321.92         R2     0.082
Events     241        d.f.             10         Dxy    0.595
Center  1.0435        Pr(> chi2)   0.0000         g      1.065
                      Score chi2   517.98         gr     2.901
                      Pr(> chi2)   0.0000

Coef S.E. Wald Z Pr(>|Z|)
A 0.0070 0.0334 -0.21 0.8340
A’ 0.0868 0.0816 1.06 0.2876
A’’ -0.1248 0.2720 -0.46 0.6464
B 0.2161 1.7052 0.13 0.8992
B’ 8.7123 6.4028 1.36 0.1736
B’’ -31.1359 14.8107 -1.26 0.0979
C=2 0.6491 0.1698 3.82 0.0001
D=2 0.5393 0.1392 3.88 0.0001
E=E2 0.3796 0.2235 1.70 0.0895
E=E3 0.8292 0.2135 3.88 0.0001

Wald Statistics Response: S

Factor Chi-Square d.f. P
A 135.88 3 <.0001
Nonlinear 13.40 2 0.0012
B 69.54 3 <.0001
Nonlinear 5.22 2 0.0735
C 14.62 1 <.0001
D 15.02 1 <.0001
E 18.28 2 <.0001
TOTAL NONLINEAR 20.74 4 0.0001
TOTAL 388.08 10 <.0001

Question #1: In contrast with the same model including the variables A and B without rcs transformation (linear), in which both variables showed highly significant associations, the output attached above does not show significant P values for any component of A and shows a significant P value only for the first component of B. If I have understood correctly, A/A’/A’’ and B/B’/B’’ describe the cubic spline components, restricted so that the functions of A and B are linear beyond the outer knots. Should I be concerned about such output?

Question #2: By running the ANOVA, the nonlinear component of the variable B was not significant. According to this result, should I avoid using rcs for B, or is this unimportant if B is significant overall (chi-square 69.54, d.f. 3, P < 0.0001)?

Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
A 55.00000 72.5610 17.56100 0.905100 0.22260 0.46880 1.341400
Hazard Ratio 55.00000 72.5610 17.56100 2.472200 NA 1.59810 3.824400
B 0.79891 1.0984 0.29945 0.732340 0.14288 0.45230 1.012400
Hazard Ratio 0.79891 1.0984 0.29945 2.079900 NA 1.57190 2.752100
C - 2:1 1.00000 2.0000 NA 0.649140 0.16977 0.31639 0.981880
Hazard Ratio 1.00000 2.0000 NA 1.913900 NA 1.37220 2.669500
D - 2:1 1.00000 2.0000 NA 0.539270 0.13915 0.26654 0.812010
Hazard Ratio 1.00000 2.0000 NA 1.714800 NA 1.30540 2.252400
E - E1:E2 3.00000 1.0000 NA -0.89407 0.21627 -1.317900 -0.470200
Hazard Ratio 3.00000 1.0000 NA 0.40899 NA 0.267680 0.624880
E - E3:E2 3.00000 2.0000 NA -0.20755 0.12514 -0.452810 0.037724
Hazard Ratio 3.00000 2.0000 NA 0.81258 NA 0.635840 1.038400

Question(s) #3 (Most Important): What is the most appropriate and straightforward way (for medical doctors) to report the results of such a multivariable model that includes transformed continuous variables? The interpretation of the HRs and 95% CIs for the linear versions of these continuous variables is relatively simple (the HR per one-unit change in the covariate), but after rcs transformation I am struggling to find a simple and efficient way to report the results (for medical doctors). The function “summary.rms” provides single coefficients and standard errors / HRs and 95% CIs for the continuous variables handled by rcs, and these are quite large compared with standard linear covariates, where, for example, the HR and 95% CI per one-year increase in age are much smaller. What do the HRs and 95% CIs of the rcs-transformed variables A and B mean according to the “summary.rms” output?

I extracted the mathematical representation of the model (“Function”). I know that this is probably a good solution from a statistical point of view, but you might agree that it would be extremely hard for non-statisticians to understand.

What does “significant” mean? Much less than you think.

You are implying that models are things you can play with to get different answers (I know you don’t mean to imply this but that’s the effect of this line of reasoning). Pre-specify the model from general principles and knowledge of the data. This usually entails using a full model. Don’t think about removing “insignificant” terms such as nonlinear effects.

Interpretation comes from the use of partial effect plots and from plotting contrasts against a constant value of a predictor, over a grid of, say, 200 values of the predictor. The contrast function lets you have full control of the contrasting constant if you don’t want to use the default partial effect plot generated with ggplot(Predict(...)).
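As a hedged sketch, using the (hypothetical) Cox model fit f from the example above in which A is modeled with rcs() (the grid range and the reference value A = 60 are made up for illustration):

gr <- seq(40, 90, length = 200)                                  # a grid of 200 values of A
k  <- contrast(f, list(A = gr), list(A = rep(60, length(gr))))   # log hazard ratios for A = gr vs. A = 60
k  <- as.data.frame(k[c('A', 'Contrast', 'Lower', 'Upper')])
ggplot(k, aes(x = A, y = exp(Contrast))) + geom_line() +
  geom_ribbon(aes(ymin = exp(Lower), ymax = exp(Upper)), alpha = 0.2) +
  ylab('Hazard Ratio vs. A = 60')
ggplot(Predict(f, A, fun = exp))   # default partial effect plot on the hazard ratio scale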


Thank you for your answers. I think your comments address the first two questions. What about the last question (#3)?

I did answer that. You have three choices, all of them intuitive to non-statisticians:

  • to expose the relationship of one continuous variable to outcome holding all other predictors constant, show a partial effect plot
  • you can use a single-axis nomogram for the same purpose. When there is only one predictor, one side of the axis can be labeled with an outcome prediction (e.g. 3-year survival probability) and not just relative effect (e.g., relative log hazard)
  • when there are multiple predictors and you don’t want to fix most of them you can use a full nomogram to get predicted values on many scales (mean survival time, mean, 3y, etc.). Nomograms show direction, strength, shape of effects and also provide overall predicted values.

In short, don’t burden clinicians with equations; give them graphs.
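A hedged sketch of the nomogram option for the Cox model discussed in this thread (the knot counts and the 3-year = 1095-day time point are assumptions; surv=TRUE must be requested when fitting so that Survival() can be used):

f    <- cph(Surv(Death_Time, Death) ~ rcs(A, 4) + rcs(B, 4) + C + D + E,
            data = d, surv = TRUE, x = TRUE, y = TRUE)
surv <- Survival(f)                         # survival probability as a function of time and linear predictor
nom  <- nomogram(f, fun = function(lp) surv(1095, lp),
                 funlabel = '3-year Survival Probability')
plot(nom)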

I see, probably the model nomogram is the best choice.
I have, however, one last unaddressed question. What is the meaning of the HR / 95% CI of the variables A and B in the example above, as generated by “summary” after fitting a model with “lrm”? The interpretation cannot be the same as for a covariate without rcs transformation; the estimates and CIs are excessively large for a unit change in the covariate. Thank you in advance for your time and your patience.

summary computes a point estimate and corresponding compatibility interval that mean the same whether you splined the predictor or not. The default point estimate is the antilog of the difference in predicted log hazard at b versus a, where a and b are respectively the lower and upper quartiles of the predictor. Such a point estimate is a good summary only in the context where someone invalidly forces you to summarize an entire relationship with one number.

Note that inter-quartile-range hazard ratios don’t mean as much if the relationship is very nonlinear, and don’t mean anything if the relationship is non-monotonic.
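For example (a hedged illustration using the hypothetical fit f above), the range that summary antilogs can be set explicitly instead of defaulting to the inter-quartile range:

summary(f, A = c(60, 70))   # hazard ratio for A = 70 vs. A = 60, other predictors held at adjust-to values
plot(summary(f))            # graphical display of the default inter-quartile-range effects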

Section 2.4.1 describes how dichotomizing a continuous predictor can lead to results that are arbitrary and easily manipulated. I wrote a short simulation showing how an odds ratio calculated using a dichotomized predictor can vary with the inclusion criteria, while logistic regression using the continuous predictor produces a stable estimate.
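The original simulation is not reproduced here, but a minimal sketch of that kind of comparison (simulated data; the cutpoint of 50 and the inclusion thresholds are arbitrary assumptions) might look like:

require(rms)
set.seed(3)
sim <- function(lower) {
  x     <- runif(5000, lower, 100)                  # inclusion criterion: only subjects with x > lower
  y     <- rbinom(5000, 1, plogis(-3 + 0.04 * x))   # true effect is linear in x on the logit scale
  cut50 <- as.numeric(x > 50)                       # dichotomize at an arbitrary cutpoint
  c(OR_dichotomized = exp(coef(lrm(y ~ cut50))[['cut50']]),
    OR_per_unit     = exp(coef(lrm(y ~ x))[['x']]))
}
sapply(c(0, 20, 40), sim)   # the dichotomized OR drifts with the inclusion criterion; the per-unit OR stays stable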


Hello,

Apologies if this is not where this question should be posted, but I am stuck trying to reproduce the example on p. 55 of the handout. Can you help? This is my code (copied from the handout):

require(rms)
Hmisc::getHdata(sicily)
d ← sicily
dd ← rms::datadist(d); options(datadist=‘dd’) # the options command is needed for Predict later (why exactly I am not sure)
options(prType=‘Latex’)

g ← function (x) exp(x) * 100000
w ← geom_point(aes(x=time , y=rate), data=d)
v ← geom_vline(aes(xintercept =37, col=I(‘red’)))
y1 ← ylab('Acute Coronary Cases Per 100,000 ')
off ← list(stdpop=mean(d$stdpop))
f ← glm(aces ~ offset(log(stdpop)) + rms::rcs(time , 6), data = sicilyDF, family = poisson)
f$aic

ggplot(Predict(f, fun=g, offset=off)) + w + v + yl

I get the following error:

Error in .subset2(x, i, exact = exact) :
attempt to select less than one element in get1index

This is posted in the wrong topic, uses symbols that cannot be copied and pasted into an R command window, uses the wrong prType (use ‘latex’ noting the case), and doesn’t provide the version of rms being used. Also don’t use formula elements such as rms::rcs. Use require(rms) at the top.

I see. Where should this be posted?

Here is the sessionInfo():

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

other attached packages:
[1] rms_6.3-0
.

Sorry, this is the right topic. I hope that removal of rms:: from a formula term fixed your problem.

That did not solve it, no. Any other suggestions welcome!

Are you using ‘glm’ in your code or ‘Glm’? Lower case is the native R function, upper case the rms version (the latter being the one used in the handout, while the former is what you seem to be using). Have you tried this?


Yes, switching ‘glm’ to ‘Glm’ was what I needed to do. Thank you for the help!
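For readers following this thread, a consolidated version of the snippet with the fixes discussed above (plain ASCII assignment and quotes, lower-case 'latex', data=d, matching object names, and Glm rather than glm) would be roughly:

require(rms)                                   # also attaches Hmisc and ggplot2
getHdata(sicily)                               # getHdata() is from Hmisc
d  <- sicily
dd <- datadist(d); options(datadist = 'dd')    # datadist is needed so Predict() knows predictor ranges
options(prType = 'latex')

g   <- function(x) exp(x) * 100000
w   <- geom_point(aes(x = time, y = rate), data = d)
v   <- geom_vline(aes(xintercept = 37, col = I('red')))
yl  <- ylab('Acute Coronary Cases Per 100,000')
off <- list(stdpop = mean(d$stdpop))           # offset used for prediction
f   <- Glm(aces ~ offset(log(stdpop)) + rcs(time, 6), data = d, family = poisson)
ggplot(Predict(f, fun = g, offset = off)) + w + v + yl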


How do you decide whether to use a linear or nonlinear form in a model for prediction?
I need some advice. Conditional distribution plots show a nonlinear relation. In a prediction-model setting, how do we decide whether or not to include a nonlinear form like RCS? Can we prespecify two models, one with only linear terms and the other with RCS, and decide based on an LR test, or test each against a supermodel?

The RMS book and course notes cover in depth the issue of model pre-specification and why looking at the data is not as good as thinking about what you know before looking at the data. Some of the advice, oversimplified, is to allow, if the sample size permits, variables that are not known to act linearly to be flexibly modeled. Linearity occurs infrequently in nature. If you do want to do a formal test of linearity (not recommended for decision making, as explained above) then a likelihood ratio test where you compare with a “supermodel” is a good idea, but only if done once. You might allow all variables to be nonlinear, and compare that with a model with all variables linear. This gives rise to a many-d.f. “chunk test” leading to a decision of keeping all variables nonlinear or none of them. AIC may be a little better for this purpose.
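A hedged sketch of that one-time chunk comparison (hypothetical data frame d with a binary outcome y and continuous predictors x1, x2, x3):

f_linear <- lrm(y ~ x1 + x2 + x3, data = d)
f_spline <- lrm(y ~ rcs(x1, 4) + rcs(x2, 4) + rcs(x3, 4), data = d)
lrtest(f_linear, f_spline)    # one joint likelihood ratio (chunk) test of all nonlinear terms
AIC(f_linear); AIC(f_spline)  # or compare AIC; keep all variables nonlinear or none of them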


Thanks, dear professor, for clarifying doubts and reinforcing the concepts. Your notes, book, and rms course were really helpful, though I need to go back again and again to get things clearer, not being a statistician.
In our setting, I find it very difficult to explain to clinical epidemiologists here why I decide to go for a pre-specified model, or models with a priori selected candidate predictors from literature search and expert opinion (available at the point of prediction and routinely collected), rather than data-driven variable selection or backward elimination. I still try to persist, at the risk of being looked down upon.


Many epidemiologists have not been taught about model uncertainty and the hazards of stepwise variable selection. The largest single damage done by these methods is invalidating standard errors of effect estimates, making the standard errors too small. The main reason we push pre-specification is that the alternative performs so poorly, with model instability also being a major problem.
