RMS General Regression

Regression Modeling Strategies: General Aspects of Fitting Regression Models

This is the second of several connected topics organized around chapters in Regression Modeling Strategies. The purposes of these topics are to introduce key concepts in the chapter and to provide a place for questions, answers, and discussion around the chapter’s topics.

Overview | Course Notes

While maybe not the sexiest part of RMS, a firm grasp of notation can be especially important for accessing key RMS concepts and maneuvers, such as chunk tests involving interactions, and for interpreting and expressing effects. If you are new-ish to modeling, skipping over notation might not actually save time and effort.

The regression function describes interesting properties of Y that may vary across individuals in the population. C'(Y|X) denotes a function of some property of the distribution of Y conditional on X (for example, the logit of a probability), expressed as a weighted sum of the X's (the linear predictor).

Interaction (aka “effect modification”) indicates that the effects of two variables cannot be separated: i.e., the effect of X_1 on Y depends on the level of X_2, and vice versa. This makes it difficult to interpret \beta_1 and \beta_2 in isolation. Importantly, rather than just a difference in slopes, interaction may take the form of a difference in the shape (or distribution) of a covariate-response relationship conditional on another variable.
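
As a small worked form of the statement above, assume the simplest case of two continuous predictors and a single product term (the symbol \beta_3 for the product-term coefficient is notation introduced here for illustration):

$$
C'(Y|X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2
$$

$$
C'(Y|X_1 = a + 1, X_2) - C'(Y|X_1 = a, X_2) = \beta_1 + \beta_3 X_2
$$

Under this form, \beta_1 by itself is the effect of a one-unit increase in X_1 only when X_2 = 0, which is why it is hard to interpret in isolation.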

Almost every statistical test is designed to detect one specific pattern, so inference from any test should be qualified by that pattern.

One of the most valuable maneuvers for general practice is the use of regression splines to relax linearity for continuous covariates. Its low prevalence in general practice does not seem proportionate to its merit. For folks new to rms::rcs() (or now rms::gTrans(), etc.), don’t get intimidated or frustrated by sections 2.4.2 through 2.4.6; just pragmatically emulate FH’s practice in the examples (and see links below), and let full understanding follow in good time. ‘Trust in the Force’ of these magical functions: use splines to avoid categorizing continuous variables or imposing naive linearity.
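
For anyone who wants to see the idea run before reading sections 2.4.2 through 2.4.6, here is a minimal sketch on simulated data. It is not FH's code: it uses `splines::ns()` from base R as a stand-in so that it runs without extra packages, and the data, seed, and 3 degrees of freedom are all assumptions made for illustration. A natural cubic spline with 3 df is the same kind of function that `rcs(x, 4)` fits, although the default knot placement differs.

```r
library(splines)  # ships with base R

set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x / 2) + rnorm(n, sd = 0.3)   # simulated, deliberately nonlinear
d <- data.frame(x, y)

# Natural cubic spline with 3 df (linear beyond the boundary knots)
fit <- lm(y ~ ns(x, df = 3), data = d)

# Interpret the fit graphically: predicted mean and 95% confidence band on a grid
grid <- data.frame(x = seq(0, 10, length.out = 101))
pred <- predict(fit, newdata = grid, interval = "confidence")

plot(d$x, d$y, col = "grey", xlab = "x", ylab = "y")
matlines(grid$x, pred, lty = c(1, 2, 2), col = "black")
```

The individual spline coefficients are not looked at; the plotted curve and its band carry the interpretation.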

Categorizing continuous variables leads to loss and distortion of information, does violence to statistical operating characteristics, and is unnecessary. (Don’t do it!)
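
A quick way to see the information loss is to simulate a smooth relationship and compare a dichotomized predictor with a spline. This is only a sketch under assumptions chosen for illustration: the data are simulated, the cutpoint at 5 is arbitrary, and `splines::ns()` stands in for `rms::rcs()`.

```r
library(splines)

set.seed(2)
n <- 500
x <- runif(n, 0, 10)
y <- 0.5 * x + sin(x) + rnorm(n)       # simulated smooth relationship

# Dichotomize at an arbitrary (hypothetical) cutpoint of 5
x_cat <- cut(x, breaks = c(0, 5, 10), include.lowest = TRUE)

fit_cat    <- lm(y ~ x_cat)            # step function: one jump at the cutpoint
fit_spline <- lm(y ~ ns(x, df = 4))    # smooth curve using the full range of x

r2 <- c(dichotomized = summary(fit_cat)$r.squared,
        spline       = summary(fit_spline)$r.squared)
round(r2, 2)
```

The dichotomized fit assumes that everyone below the cutpoint has the same expected response, which throws away the within-group variation in x that the spline fit uses.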

Likewise for chunk tests (Section 2.6): their low prevalence in practice does not seem proportionate to their merit. Emulating FH’s practice of using them to evaluate complexity of effects (interactions and spline terms) or redundancy (groups of covariates, etc.) has mitigated much statistical complication.
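
One common chunk test asks whether all the nonlinear spline terms for a predictor are jointly needed. Below is a minimal sketch with base R, comparing nested models on simulated data; the variable names, seed, and 3 df are assumptions for illustration. With an rms fit, `anova()` on the fitted model reports this kind of pooled test directly.

```r
library(splines)

set.seed(3)
n <- 300
x <- runif(n, 0, 10)
z <- rnorm(n)
y <- log1p(x) + 0.5 * z + rnorm(n, sd = 0.5)   # simulated; nonlinear in x

fit_linear <- lm(y ~ x + z)
fit_spline <- lm(y ~ ns(x, df = 3) + z)   # contains the linear fit as a special case

# Joint F test of the 2 extra (nonlinear) terms, tested as one chunk
chunk <- anova(fit_linear, fit_spline)
chunk
```

The test has 2 degrees of freedom because the 3 df spline adds two terms beyond the straight line, and they are tested together rather than one at a time.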

A reasonable approach for modeling and evaluating complex interactions is, for each predictor separately, to test simultaneously the joint importance of all interactions involving that predictor.
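
The per-predictor approach can be sketched with nested model comparisons in base R. Everything here is an assumption made for illustration: three simulated predictors, all two-way interactions in the full model, and only the x1 by x2 interaction truly present.

```r
set.seed(4)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- with(d, x1 + x2 + 0.5 * x1 * x2 + rnorm(n))   # simulated; only x1:x2 interacts

full <- lm(y ~ (x1 + x2 + x3)^2, data = d)           # all two-way interactions

# For each predictor, the reduced model drops every interaction involving it
reduced <- list(
  x1 = lm(y ~ x1 + x2 + x3 + x2:x3, data = d),
  x2 = lm(y ~ x1 + x2 + x3 + x1:x3, data = d),
  x3 = lm(y ~ x1 + x2 + x3 + x1:x2, data = d)
)

# One 2 df chunk test per predictor
chunk <- lapply(reduced, function(r) anova(r, full))
sapply(chunk, function(a) c(df = a$Df[2], p = a[2, "Pr(>F)"]))
```

Each test pools the two interactions that involve the predictor, so there are three tests in total instead of one per product term.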

Additional links



Q&A From the May 2021 Course

  1. Greenland (e.g., in Modern Epidemiology, 2008) holds that the concept of the regression model (which comprehends the design, the expression of the model form, and assumptions) is separate from the regression function, and that regression modeling (fitting, estimation, and evaluation) is yet another separate notion. Is this correct? Useful? Helpful? Important or pedantic?

  2. I think I am struggling conceptually with how much data you need to actually be able to fit models like this (restricted cubic splines specifically). While more accurate, they must require a very large sample size to uncover associations that aren’t incredibly large. Or am I just being overwhelmed with the complexity behind anything that is truly useful?
    [dgl: Analysis is a craft / art, and it will benefit from thinking about the information in the data and where the information is (subject matter knowledge). A lot has to do with the nature and quality of the response variable and your prior understanding/beliefs about relationships. Actually, splines allow you to use limited data more efficiently, so I think it mischaracterizes the challenge to presume lots of data are necessary for splines/relaxing linearity assumptions. It also has a lot to do with the quality of your inference: how you qualify the patterns you see and what your uncertainty is. Let’s discuss more.] FH: This is very well said. We will be talking about sample size on Wednesday.

  3. We talked about categorising in modelling. What about categorising for the purpose of presenting descriptive data? Is it ever useful? [dgl: I recommend you consider how to use graphics or good visualizations before categorizing. Categorizing always obscures information and, more often than not, distorts understanding.] Thanks! FH: We’ll see in the Titanic case study that categorizing for the purpose of descriptive statistics can be a complete disaster.

  4. Is it important to evaluate the linearity assumption for each continuous candidate variable, or is it appropriate to model using splines as a default? [dgl: I think starting with splines is a good default, unless you have very strong subject matter presumptions (such as pressure vs volume of a gas).] If you have enough effective sample size you can spline every continuous variable. But we’ll discuss on Wednesday an approach that allows you to use different numbers of parameters for different predictors. And we’re not evaluating the linearity assumption per se but are just allowing for nonlinearity.

  5. Can you share your thoughts on B-Splines? Any guidance on choosing between B-Splines and Restricted Cubic Splines? FH: B-splines are fine, and have less collinearity among terms. But they are more complex, so are not usually worth the trouble unless using a fitting routine that doesn’t handle collinearity well (e.g., SAS).

  6. Is any kind of action based on a model diagnostic considered data dredging due to the inference being conditioned on complete model certainty after alteration (the resulting model wasn’t pre-specified)? So ideally, would the only place for model diagnostics be to assess the appropriateness of an analysis after it’s done? Running diagnostics and changing the model can indeed distort statistical inference. I like to choose models that require fewer diagnostics as a result, e.g., semiparametric models.

  7. 2.4.5. In a model where we use restricted cubic splines of some continuous covariates, and you really need to interpret the effect of these covariates in the model, how can you do this? Is there a workaround to address the issue of interpretability? Could one workaround be to fit a piecewise linear model for ease of interpretation and a restricted cubic spline model for graphical representation? FH: One does not need to fit an alternate model. Splines are interpreted graphically.

  8. Can you elaborate on the difference between natural cubic splines and restricted cubic splines? Are the settings for their application different? FH: They are one and the same, only most people use B-splines when fitting natural splines.

  9. Any thoughts on Dennis Cook’s envelope methods, which (as I understand it) reduce the dimension of the predictors while using the information in Y (i.e., not masked) to do so in a way that won’t lose the part that matters?

  10. Thoughts on relaxed lasso? FH: It can’t work. The shrinkage has to be carried forward or model calibration will be bad.

  11. Do we make different choices about data reduction and model selection if we’re doing an exploratory (hypothesis-generating) study rather than a confirmatory (hypothesis-testing) study? FH: probably

  12. Can you/do you interpret coefficients of a cubic spline? FH: you don’t

  13. So if you care about estimating (say) the average effect of one variable on the outcome, then you can’t model that variable with cubic splines. FH: I don’t know what “average effect” means.

  14. Do issues of phantom degrees of freedom apply to Bayesian approaches or is it a frequentist specific issue? FH: Good question. Bayes provides a more cohesive approach in which you put extra parameters into the model to absorb lack of fit, and put priors on them.

  15. Can you clarify when it is appropriate to explore for interactions prior to fitting the final model? Seems inappropriate to use the data to inform the presence of interactions between candidate variables, then to add them to the final model using the same data. FH: The only “safe” exploration is to pre-specify a list of interactions and test them all jointly with a chunk test. Then either keep all or remove all.

  16. I’m a bit confused about the use of splines (possibly with interactions) and the idea of estimation. Frank said you shouldn’t spend too much time interpreting individual coefficients. You could do a chunk test of all the different X variables that include your particular X variable of interest, but that is a hypothesis test as I understand it, and I was under the impression we wanted to move away from testing to estimation. If you can’t interpret individual coefficients (which give you an estimate of the effect of that variable on the outcome?), then what inferences can you make about, e.g., the effect of X1 on Y? I suppose that you can make better predictions with your model than you might otherwise have been able to. FH: The main approach here is partial effect plots as we’ll see on Wednesday.

  17. How do you assess fit of splines? FH: primarily we posit a spline that must fit as well as needed within the restriction that we don’t overfit. So if the spline doesn’t fit there is nothing we can do about it anyway.

  18. So there is no metric to formally assess fit? We mainly rigorously check the fit of the overall model, e.g. calibration curves for absolute accuracy. For a single variable the fit is relative, e.g., how bad is a fit with 4 knots compared to putting 6 knots on the variable. But since we should start out with the maximum number of knots we are comfortable with, this is somewhat moot.

  19. In what situations do you use chunk tests? FH: many!

  20. My clinician colleagues are used to seeing results for a continuous variable from a logistic regression written as OR/95% CI per unit of the variable, as it is commonly modeled as linear. When it is modeled as an RCS, how should the results best be communicated (this goes for a manuscript as well) if we are to avoid evaluating/interpreting coefficients? FH: covered on Wednesday.

  21. 2.8. In the example of rates of hospital admission, if one had found that the jump effect was not significant (and assuming testing the jump was not our primary interest), would it make sense to report results from a final model without the jump effect? FH: No. We need to abandon “significance” and we’ll distort inference if we remove “insignificant” predictors. Use the confidence interval for the variable whether it’s “significant” or not.

  22. 2.4.5. If we test for non-linearity in a model fitted with splines and find it significant, but the predictions produced by the spline model are very similar to the predictions produced by the linear model, could it be appropriate to report results of the linear model because the latter is much simpler to interpret? FH: It’s OK to use estimates from a linear fit but you must use the confidence band from the spline fit to be accurate. So you might as well keep the spline fit. The approach you outlined creates model uncertainty.

  23. What is the best way to examine for nonlinearity in our continuous variables? Should this be examined in each variable prior to model building, if so how? Should we assume nonlinearity based on previous literature/knowledge? Should we compare model fit with and without nonlinearity? Other? FH: If you don’t have good reason to assume linearity before looking at the data, allow for nonlinearity. We will go over a strategy where you force some unpromising variables to be linear though.

  24. Are there any important differences between multivariate adaptive regression splines and the approach we have considered? FH: MARS is a little more prone to overfitting but offers some nice interaction modeling capabilities I believe.

  25. In section 2-55 (page 84) there is some code: ‘off <- list(stdpop = mean(d$stdpop))  # offset for prediction (383464.4)’. I couldn’t follow from the comment what this is doing. Could you elaborate? Perhaps similarly, what is the purpose of the offset() function in the GLM formula that follows: ‘f <- Glm(aces ~ offset(log(stdpop)) + rcs(time, 6), data = d, family = poisson)’? FH: Offsets are the way that Poisson regression adjusts for the denominator (normalizes) so that you are then modeling rates and not population-change-dependent counts. The linear predictor is a log rate after adding covariate effects and the offset. [Thank you - I didn’t realise it was related to the fact that it was Poisson regression so didn’t do effective searches. I found the explanation here useful too: When to use an offset in a Poisson regression? - Cross Validated] FH: Thanks for the very helpful link which I’ve added to the course notes.

  26. On page 2-19 (p48 of pdf) I have a Q about notation. What does the capital “I” mean when forcing continuity? (not in the R code, but in the actual regression equation) FH: It just means a 0/1 indicator for condition not holding/condition holding. I should have used the Knuth notation [condition] without the I.

  27. Right, thank you. So in that case it is just an indicator to tell the model whether to include the second slope.

  28. Suppose we are doing a case-control trial to test the effectiveness of vaccine 1 for a certain disease. 50% of the subjects who are vaccinated with vaccine 1 are also vaccinated with vaccine 2, which is intended for the same disease. The reason they are recommended to be given sequentially is that both have distinctive benefits. Is it preferable to restrict the analysis population to vaccine 2 free subjects (and double the study size) or just perform logistic regression adjusting for vaccine 2 (and including an interaction term)? Does the percentage of vaccine overlap (50% in this example) impact the choice? FH: Probably a better question for datamethods.org.

  29. What are your thoughts on other information criteria? Throughout the course the emphasis has been on avoiding parsimonious models. Is there any situation in which you would see yourself using the Bayesian Information Criterion (BIC) over AIC? BIC applies to the case where there is a true, finite dimensional model. I don’t think that’s ever the case in my field. BIC penalizes towards underfitted models.

  30. What is your opinion on linear model trees (such as the algorithm mentioned in https://doi.org/10.1198/106186008X319331)? I think it’s better than regular trees but would need to see added value over the usual flexible modeling.

  31. Instead of considering y as a known value in regression, should we incorporate an error function to account for variability in the response variable? Ordinary linear models already account for this through the usual error term. For other models it depends on what you mean by variability. If it’s measurement error, this can cause problems. If it’s variability due to not accounting for an important variable, the main solution is to account for that variable.
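
For the offset question (item 25), the mechanics can be sketched with base R on simulated data. This is not the course's dataset or code: the variable names are borrowed from the quoted snippet, but the population sizes, event rate, and seed are hypothetical, and `glm()` with `splines::ns(time, df = 5)` stands in for `Glm()` with `rcs(time, 6)`.

```r
library(splines)

set.seed(5)
d <- data.frame(time = 1:60)                    # hypothetical monthly series
d$stdpop <- round(380000 + 500 * d$time)        # hypothetical population at risk
d$aces   <- rpois(60, lambda = d$stdpop * 1e-4 * exp(-0.005 * d$time))

# offset(log(stdpop)) enters the linear predictor with its coefficient fixed at 1,
# so the remaining terms describe log(expected count / stdpop), i.e. a log rate
f <- glm(aces ~ offset(log(stdpop)) + ns(time, df = 5),
         family = poisson, data = d)

# Predict counts for one fixed reference population (here the mean),
# so that changes over time are not driven by population change
off  <- data.frame(time = d$time, stdpop = mean(d$stdpop))
pred <- predict(f, newdata = off, type = "response")
rate <- pred / mean(d$stdpop)                   # predicted events per person per month
```

Note that no coefficient is estimated for `stdpop`; supplying a single population value at prediction time is what the quoted `off` object does.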