Thank you Professor, I appreciate the feedback.
I just want to make sure I got this right…
The REV approach is much more straightforward, but is the other approach wrong (Question 1)?
By the other approach I mean fitting a base/reference model and then a separate model for each new predictor (i.e., base model + new predictor), and comparing these models in terms of their predictive performance, using discrimination, calibration, and prediction error indexes, e.g., D_{xy}, E_{max}, mean absolute prediction error, etc.
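To make Question 1 concrete, this is roughly what I am doing (a sketch only: `x1`, `x2`, `z`, and `d` are placeholders for my actual base covariates, new predictor, and data):

```r
library(rms)
f0 <- lrm(y ~ x1 + x2,     data = d, x = TRUE, y = TRUE)  # base/reference model
f1 <- lrm(y ~ x1 + x2 + z, data = d, x = TRUE, y = TRUE)  # base + new predictor
set.seed(1)
v0 <- validate(f0, B = 300)      # bootstrap overfitting-corrected indexes
v1 <- validate(f1, B = 300)
v0["Dxy", "index.corrected"]     # compare corrected D_xy across models
v1["Dxy", "index.corrected"]     # (Emax, slope, etc. are in the same table)
```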
If this approach is correct, then what would be the most appropriate index for choosing the best model (Question 2)? I ask because these indexes sometimes disagree: by index a, model w is best, but by index b, model z is best.
Section 4.10 of RMS contains a list of criteria for comparing models. If I understood correctly, calibration seems to be the most important criterion when the goal is to obtain accurate probabilities for future patients, whereas discrimination is more relevant if we want to accurately predict mean y.
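For calibration, for example, I have been looking at bootstrap calibration curves (assuming the placeholder fit `f1` sketched above):

```r
set.seed(1)
cal <- calibrate(f1, B = 300)  # bootstrap overfitting-corrected calibration curve
plot(cal)                      # annotates mean and 0.9 quantile of |calibration error|
```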
This matters to me because in my case (PO models fitted with `lrm`) the model has 35 intercepts, and all the bootstrap validations and calibrations are performed using the middle intercept; but from a clinical perspective I think it is more relevant to predict mean y. Does that mean I should focus on discrimination (Question 3)?
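The only mean-scale check I could come up with is something crude like the following (again with the placeholder fit `f1`; I am not sure this is a valid substitute for the formal indexes):

```r
m    <- Mean(f1)                 # function mapping the linear predictor to mean y
yhat <- m(predict(f1))           # in-sample predicted mean y
cor(yhat, as.numeric(d$y), method = "spearman")  # rank agreement with observed y
```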
Finally, considering the recommendations in Section 13.4.5 of RMS, when describing my selected best model, besides presenting partial effect plots and a nomogram, both on the mean scale, what else could I do (Question 4)? I think it could also be relevant to report predictions (on the mean scale, with confidence limits) for new subjects using `expand.grid` and `predict.lrm`. But I'm having trouble with the arguments of `Mean.lrm`:
```r
new <- expand.grid()                          # grid of new covariate values (to be filled in)
m <- Mean(f, codes = TRUE, conf.int = TRUE)   # is conf.int really an argument of Mean()?
# what about the design matrix 'X' (Question 5)?
lp <- predict(f, new)                         # linear predictor for the new subjects
m(lp)                                         # predicted mean y
```
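In case it helps you see where I go wrong, my current guess is the sketch below. I am assuming that `conf.int` is an argument of the function returned by `Mean` rather than of `Mean` itself, that `predict(..., type = "x")` gives the design matrix the confidence limits need, and that the same function can be passed to `Predict` and `nomogram` to stay on the mean scale (covariate names are placeholders again):

```r
m   <- Mean(f, codes = TRUE)               # returns a function of the linear predictor
new <- expand.grid(x1 = c(0, 1), x2 = 50, z = c(10, 20))  # placeholder grid
lp  <- predict(f, new)                     # linear predictor for the new subjects
X   <- predict(f, new, type = "x")         # design matrix, required when conf.int > 0
m(lp, X = X, conf.int = 0.95)              # predicted mean y with confidence limits

# Mean-scale displays for the write-up (Question 4):
plot(Predict(f, fun = m))                  # partial effect plots on the mean scale
plot(nomogram(f, fun = m, funlabel = "Mean y"))
```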
Thank you for your time, Professor.