RMS Multivariable Modeling Strategies

The study objective is to compare the effect of a 4-level discrete exposure (A: 1, 2, 3, 4) on an outcome (Y), adjusted for confounders x1, x2, x3, x4. The main model y ~ A + x1 + x2 + x3 + x4 works fine. When I run this model, I get 3 estimates for my exposure (A): beta_a=2, beta_a=3, beta_a=4 (a=1 is the reference level).

As part of a sensitivity analysis I am interested in estimating the effect of subsets of the exposure on outcome Y: the effect of a=1 on y, of a=2 on y, of a=3 on y, and of a=4 on y, in 4 different models. The problem is that the X variables (x1, x2, x3, x4) that worked well in the main model do not all work in the subset models. For example:
the model with only a=1 works when I include only x1 and x2:
y ~ A(a=1) + x1 + x2
the model with only a=2 works when I include only x3 and x4, and so on:
y ~ A(a=2) + x3 + x4
y ~ A(a=3) + x1 + x3
y ~ A(a=4) + x2 + x4

So my question is: when comparing the effect of A on y, can I compare the point estimates from the main model y ~ A + x1 + x2 + x3 + x4 (for example, beta_a=2) with the point estimates from the subset model y ~ A(a=2) + x3 + x4, when the confounders are different? Would this not be an apples-to-oranges comparison? Please advise.

I think it's best not to think of this as fitting separate models but as formulating well-defined contrasts within the overall model.
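For concreteness, a minimal sketch of that idea with rms (the data frame dat, the variable names, and the use of ols are assumptions here; A is taken to be a factor with levels "1"-"4"):

    library(rms)
    # dat assumed to contain y, the factor exposure A, and confounders x1-x4
    dd <- datadist(dat); options(datadist = "dd")
    fit <- ols(y ~ A + x1 + x2 + x3 + x4, data = dat)
    # Exposure level 2 vs. the reference level 1, adjusted for all confounders
    contrast(fit, list(A = "2"), list(A = "1"))
    # Any other well-defined comparison comes from the same fit, e.g. 3 vs. 2
    contrast(fit, list(A = "3"), list(A = "2"))

Every contrast uses the same adjustment set, so the comparisons stay on a common footing.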


Thank you Frank, that makes sense.

Does it make sense to log-transform an outcome variable that is a z-score, assuming I don't have any negative z-scores? In other words, my outcome is a continuous variable from 0 to 100. It is highly skewed (there are more subjects below the median), so this skewed continuous outcome variable is standardized to a z-score. Does it make sense to further log-transform an already standardized z-score (the outcome variable), even if hypothetically there were no negative z-scores, only positive ones?

Z-scores only apply to symmetrically distributed variables, as the standard deviation does not work correctly when skewness is present.


Ah, yes, I remember you mentioned this during the lecture or in your book, I can't remember where, but I do remember you mentioning it before. That makes sense. I agree. Thanks as always Frank.


This all looks very good with the only problem being the examination of a bit too many models.

Dear @f2harrell,

I'm hoping you could help clear up a bit of my confusion about one of the methods of assigning d.f. to predictors that's mentioned in Ch. 4, specifically within the context of hypothesis testing.

I understand the use of plot(anova(fit)) to guide assignment of d.f., with the total available d.f. being based on one of the rules of thumb (e.g., available d.f. = n/15). And I understand using AIC to guide knot selection, assuming that within each model tested, you use the same number of knots for each continuous predictor (e.g., mod1: all continuous vars are linear; mod2: all continuous vars have 3 knots; ...; modX: all continuous vars have X knots).

It's my understanding that both those methods (assignment of n/15 d.f. according to plot(anova(fit)), and using AIC) are acceptable within the context of hypothesis testing, and no adjustment to alpha (e.g., for the number of models examined) is necessary if you use either of those methods.

The section in Ch. 4 about rules of thumb refers to a third method of assigning d.f.: the use of van Houwelingen & le Cessie's shrinkage factor heuristic, which is explained in detail later in Ch. 4. I understand that method to be:

  1. Use rule of thumb to determine max available d.f.
  2. Assign d.f. according to plot(anova(fit)), theory, and prior research
  3. Fit this full model
  4. Get the LR chi2 of the full model and calculate (LR - d.f.)/LR
  5. If the result of that calculation is much less than 0.9, calculate (LR - d.f.)/9 to get a new total d.f. budget that is more appropriate to your data (see the sketch after this list)
  6. Assign d.f. as in step 2. This represents your final model
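A rough code sketch of steps 4-5 (the model formula, predictor names, and data frame are hypothetical, and a binary outcome fit with lrm is assumed):

    library(rms)
    full <- lrm(y ~ rcs(x1, 5) + rcs(x2, 5) + x3, data = dat, x = TRUE, y = TRUE)
    LR <- full$stats["Model L.R."]    # likelihood ratio chi-square of the full model
    p  <- full$stats["d.f."]          # total model d.f.
    (LR - p) / LR                     # van Houwelingen & le Cessie shrinkage heuristic
    # If the heuristic is much less than 0.9, a revised total d.f. budget is roughly
    (LR - p) / 9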

What I'm confused by is the next statement in this section, which, if I'm reading it correctly, seems to say that if you're doing hypothesis testing, you should be fitting the full model (decided upon in step 2, above), and that you shouldn't be fitting a model that's based on the method described in steps 4-6 above.

Am I reading that right? Or is the shrinkage factor heuristic method fine to use with hypothesis testing? I ask because some models I've fit using the rule-of-thumb method have shown relatively poor shrinkage factor heuristics (e.g., around 0.7), despite using fewer d.f. than what the n/15 rule of thumb suggested is OK. I've taken this to mean my data have a lower signal:noise ratio than the data the rules of thumb are based on, and I wanted to make sure I wasn't jamming more d.f. into an analysis than my data could support.

P.S. Many thanks go to you (and the other members of this community) for all the resources you devote to spreading statistical know-how. I've shared some of the techniques you discuss in the book/on hbiostat/here (particularly the use of restricted cubic splines) with some of my colleagues, and they have been genuinely amazed (as I am) at how easy it is to model complex relationships with these techniques.


Thank you Professor for the valuable feedback, it's much appreciated.

Indeed, I was worried that this could be a problem.
Maybe 3 models would be the maximum then?
In that case, does this number (3) take into account the model used for the chunk test (for testing added value)? If yes, then I take it that I should aim for 2 models (without the new predictors) for the AIC comparison, and then have a third model (with the new predictors in) for the chunk test. Is that fair?

Final questions:

  1. At the initial stage of model comparison using AIC, is it ok to compare models with different predictors or should I restrict myself to comparing models that differ only in terms of the assumptions of linearity and additivity of the effects?

  2. I take it that even for inference (hypothesis testing) I should still validate my model, e.g. through the bootstrap, because a model that is overfitted would not be appropriate for inference. Is that correct? If yes, what would be a reasonable criterion for judging whether or not the model is too overfitted? In other words, how much of a drop in R^2 during validation is acceptable before concluding that overfitting has occurred?

As always, thank you very much for your time and patience.

This was not for assigning d.f. per se but was for giving a rough estimate of whether or how much to shrink.


Hi Professor,

Is this a reply to my last questions above?
Anyway,
Is the shrinkage factor given by \hat{\gamma} = \frac{{\rm model}\ \chi^{2} - p}{{\rm model}\ \chi^{2}} equivalent to the slope of the calibration line?
This value is returned automatically by the function validate().
Values of \hat{\gamma} closer to 1 are better, then; the farther from 1, the more overfitting. Is that right?

Thank you!

Yes, that ratio is a good estimate of the calibration slope on the linear predictor scale if p is honest. validate does not use that formula but estimates the calibration slope directly using the Efron-Gong optimism correction. Smaller values of the slope indicate more overfitting.
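A brief sketch of obtaining that optimism-corrected slope (the fit, predictors, and data frame are hypothetical; the fit needs x=TRUE, y=TRUE for validate to resample):

    library(rms)
    fit <- lrm(y ~ rcs(x1, 4) + x2 + x3, data = dat, x = TRUE, y = TRUE)
    set.seed(1)
    v <- validate(fit, B = 300)      # Efron-Gong optimism bootstrap
    v["Slope", "index.corrected"]    # corrected calibration slope; values well below 1 suggest overfitting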


Ok, got it.
Many thanks Professor!

Thanks @f2harrell. I see. In that case, if one is hypothesis testing and is trying to determine the number of knots for the continuous variables, how would one use van Houwelingen & le Cessie's heuristic? Is it simply a matter of:

  1. Set the knots for each continuous variable to the max number of knots suggested by n/15
  2. Calculate the shrinkage heuristic
  3. If the heuristic is close to 0.9, use this model for hypothesis testing. However, if the heuristic is much less than 0.9, reduce the number of knots (or make the variables linear) and go back to step 2

I.e., should one use an iterative process that keeps removing d.f. until the heuristic gets close to 0.9?

Don't do that. Pre-specify the number of knots or use the strategy in Chapter 4 for specifying complexity based on predictive potential.


Thanks @f2harrell - I appreciate your patience.

You mention in the book that AIC can be used to determine the number of knots for RCS, but only when each model you entertain uses the same number of knots for all continuous predictors within that model, e.g. you run the following models and use the one with the lowest AIC as the final model (a code sketch follows the list):

mod1: all Xs linear
mod2: all Xs have 3 knots
mod3: all Xs have 4 knots
mod4: etc
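For illustration, a sketch of that comparison (the ols fits, the predictors x1-x3, and the data frame dat are all assumptions; x1-x3 are taken to be continuous):

    library(rms)
    fits <- list(
      linear = ols(y ~ x1 + x2 + x3,                         data = dat),
      k3     = ols(y ~ rcs(x1, 3) + rcs(x2, 3) + rcs(x3, 3), data = dat),
      k4     = ols(y ~ rcs(x1, 4) + rcs(x2, 4) + rcs(x3, 4), data = dat),
      k5     = ols(y ~ rcs(x1, 5) + rcs(x2, 5) + rcs(x3, 5), data = dat)
    )
    sapply(fits, AIC)   # pick the common number of knots with the lowest AIC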

Within the context of hypothesis testing, why is this “same k for all Xs” approach OK, but altering k for each X variable individually is not? I gather the latter is akin to stepwise variable selection. Isn't that also true for the “same k for all Xs” approach?

Yes, it's because it's like stepwise variable selection. The “fix k across all X” approach is not so problematic, in the same sense that selecting principal components in descending order of variance explained is OK. The orders are pre-defined, unlike the in-and-out behavior of stepwise.

But note that the AIC strategy we're discussing is not that reasonable, as forcing a variable that is very weak to have the same number of knots as a strong predictor makes little sense. That's why I present the semi-supervised “allocate more knots to predictors with more predictive potential” method.
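A rough sketch of that semi-supervised allocation (the predictor names, a binary outcome, and the lrm fit are assumptions):

    library(rms)
    sat <- lrm(y ~ rcs(x1, 5) + rcs(x2, 5) + rcs(x3, 5), data = dat)
    plot(anova(sat))   # by default plots Wald chi-square minus d.f. per predictor
    # Spend more knots where apparent predictive potential is greater,
    # e.g. keep 5 knots for the strongest predictor, fewer (or linear) for the weakest:
    final <- lrm(y ~ rcs(x1, 5) + rcs(x2, 4) + x3, data = dat)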


Thanks @f2harrell - that makes a lot of sense.

My remaining question is about the semi-supervised method you recommend. What I'm struggling with is what to do in the following situation:

  1. You calculate n/15 and see you have enough d.f. to comfortably fit 5 knots to each X for the initial saturated model, which you'll use to assess the predictive potential of each X (i.e., how many knots to actually fit to each X in your final model), but
  2. The plot(anova(fit)) graph for this saturated model indicates some of the Xs have negative chisq - df values

I understand this amounts to “the predictive potential of those Xs is so low that they don't warrant many knots”. But given I can afford to give those Xs 5 knots, and I'm not trying to save knots for a more important predictor (because I can easily give all of them a lot of knots), I'm not sure it's defensible to deny knots to those unimportant predictors. Denying them knots when I can afford them seems a bit like stepwise variable selection, and against the general RMS ethos of “spend knots wherever you can afford them and don't take them back”.

Is it simply a matter of “give them the maximum knots because you can afford to”?

Dear Professor @f2harrell,

I have a question about Section 4.1 (Prespecification of Predictor Complexity Without Later Simplification) from the RMS course notes.

When deciding which strategy to use to allocate df to predictors (strategy from Section 4.1.1 or Section 4.1.2), I made the following interpretation:

If I have predictors that are too collinear, then I should use strategy 4.1.2 (Spearman's \rho^2, which is a bivariable, unadjusted measure of association). If collinearity is not a problem, but confounding is, then I should use strategy 4.1.1 (\chi^2 - df, which is an adjusted measure of association obtained from a saturated additive model).
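For concreteness, a rough sketch of the two measures (the predictors, outcome, and data frame names are hypothetical):

    library(rms)
    # Strategy 4.1.2: unadjusted predictive potential via generalized Spearman rho^2
    plot(spearman2(y ~ x1 + x2 + x3, data = dat, p = 2))
    # Strategy 4.1.1: adjusted potential via chi-square minus d.f. from a saturated additive fit
    sat <- lrm(y ~ rcs(x1, 5) + rcs(x2, 5) + rcs(x3, 5), data = dat)
    plot(anova(sat))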

Is my interpretation correct?

Thank you