RMS Describing, Resampling, Validating, and Simplifying the Model

Were the 3 predictors completely pre-specified?

202 patients is too small a sample for data splitting, by a factor of roughly 100. Validation will have to be by resampling (100 repeats of 10-fold cross-validation or 400 bootstrap resamples).

While training the model (n = 159), I did 10-fold CV. Do you mean I should use all 202 patients for CV training? If so, how can I test my model without test data? A one-variable HL test was used to select 3 predictors from 10. It's not perfect, I know.

The fact that you would probably be overfitted even with 3 variables, yet really had 10 candidates, means that there are serious issues. HL should not be used for variable selection. One repeat of 10-fold CV is not sufficient. All patients should be used for model development and for model validation. That means that the 100 repeats of 10-fold CV (or 400 bootstraps) need to re-execute *all* supervised learning steps that involved the 10 candidate predictors. With your setup the only hope, I believe, is to do data reduction (unsupervised learning) to reduce the dimensionality of the predictors down to perhaps 3 (e.g., 3 variable-cluster scores such as the first principal component, PC_1, of each cluster), then fit the model on those 3 scores. Variable selection is completely unreliable in your setting.
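For concreteness, here is a minimal sketch of that kind of unsupervised data reduction with Hmisc/rms. The dataset d, predictor names x1–x10, outcome y, and the three cluster memberships are hypothetical; the real clusters would come from the dendrogram.

library(rms)    # loads Hmisc, which provides varclus
# Cluster the 10 candidate predictors without looking at Y (unsupervised)
vc <- varclus(~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10, data = d)
plot(vc)        # choose about 3 clusters from the dendrogram

# Summarize each cluster by its first principal component (PC_1)
pc1 <- function(vars) princomp(d[, vars], cor = TRUE)$scores[, 1]
d$score1 <- pc1(c('x1', 'x2', 'x3'))          # hypothetical cluster 1
d$score2 <- pc1(c('x4', 'x5', 'x6', 'x7'))    # hypothetical cluster 2
d$score3 <- pc1(c('x8', 'x9', 'x10'))         # hypothetical cluster 3

# Fit the 3-d.f. logistic model on the cluster scores only
f <- lrm(y ~ score1 + score2 + score3, data = d, x = TRUE, y = TRUE)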

1 Like

Let us assume that the selection of predictors was correct. How can I test my model without splitting the data? For example, my sample size is 200 or 1000. I train my model with CV. Next, what? Without a test dataset, I cannot check my model. How can I calibrate it?
I need a robust algorithm for my actions in future research.
If my dataset has fewer than 20,000 records (which is actually a huge dataset), I train my model with CV. OK! I have a model. Do I assess the model using the same data (accuracy, sensitivity, specificity, and so on)? In that case I can get very good metrics and AUC > 0.8. But is that correct or not? I need a test dataset, which I don't have. If I use the training data for calibration, I will also get a very good calibration plot.

There are more details in Regression Modeling Strategies (the text by Frank), but you can use bootstrap resamples to repeat the fits on resampled data, testing your entire modeling process. Here is a paper describing the process.
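A minimal rms sketch of that idea (the formula and data are placeholders): validate() refits the model in each bootstrap resample and reports optimism-corrected indexes.

library(rms)
# x=TRUE, y=TRUE store the design matrix and response so resampling can refit the model
fit <- lrm(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)   # placeholder formula and data
validate(fit, B = 400)   # repeats the entire fitting process in 400 bootstrap resamples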

2 Likes

And if you don't get the text, you can access the full course notes at Regression Modeling Strategies.

1 Like

Thanks everyone for the replies. I understand that bootstrap (or CV) validation is the best approach when building a model. But if I don't have data for external validation, what should my next step or conclusion be?

Please read the material from RMS. Strong internal validation through resampling estimates the likely future performance of the model without wasting any of the sample. You develop the model using the whole sample, then repeat the entire development process a few hundred times using samples with replacement from the original sample (if bootstrapping) or using repeated CV. Resampling methods estimate how much the model falls apart when applied to a different sample. And the bootstrap is particularly good at estimating the volatility of variable selection. In other words, the bootstrap will be honest in showing tremendous disagreement in which variables are selected across resamples, which is a reason not to do variable selection at all.
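As a hedged illustration of that last point (reusing the placeholder lrm fit from the sketch above, stored with x = TRUE, y = TRUE), validate() can rerun fast backward stepdown inside every bootstrap resample and print which variables survive each time:

# bw=TRUE reruns backward stepdown in each resample; the printed matrix of
# selected variables shows how unstable the selection is across resamples
validate(fit, method = 'boot', B = 200, bw = TRUE, rule = 'p', sls = 0.05)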

1 Like

Sure, thank you so much for your help!

You’re very welcome. There are many approaches to modeling, but I favor approaches that live within the available information content in the data and that try not to make too many decisions (e.g., selecting variables) that each have a probability of being correct of only between 0.2 and 0.8.

Let me continue, please.
I use all my data (n = 240) to train the model. So I have two options for that (both without stepwise selection):

  1. Library 'caret', CV
     model1 <- train(Y ~ ., data = data,
                     trControl = trainControl(method = "repeatedcv", number = 10, repeats = 100),
                     method = "glm", family = "binomial")

  2. Library 'rms'
     model2 <- lrm(Y ~ ., data = data, x = TRUE, y = TRUE)  # x, y needed by validate()/calibrate()
     validate(model2, B = 400)
     After modeling, I can create a smooth calibration plot.

Please correct me if I am wrong. Using the two options above, two similar models with internal validation can be created. If so, what are the next steps?

I would like to produce ROC curves and confusion matrices. External validation data are not available to me. Should I apply these models to the training data?
In my conclusion, how should I describe the parameters of internal validation? Calibration plot / AUROC / accuracy, or something else?

Show the table produced by validate() and the plot produced by plot(calibrate(...)). ROC curves and confusion matrices are at odds with decision making and provide no useful information in this context.
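A short sketch of those two outputs, assuming model2 above was fitted with x = TRUE, y = TRUE:

print(validate(model2, B = 400))    # table of optimism-corrected indexes (Dxy, slope, ...)
cal <- calibrate(model2, B = 400)   # bootstrap overfitting-corrected calibration
plot(cal)                           # smooth calibration curve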

2 Likes

Dr. Harrell, what is the best way to check model stability? Which quantities reflect it?

There are two kinds of stability, with most people speaking of the second. First, stability of the overall predicted values with respect to overfitting is checked with resampling and the validate function, using various metrics such as pseudo R^2 or the c-index (derived from Somers' D_{xy}; equivalent to AUROC when Y is binary). Second, if using variable selection (validate only implements backwards stepdown, but with a variety of stopping rules), validate prints a matrix of which variables were selected in each resample. When there is a lot of variation in the set of variables selected, the feature selection, and hence the model, is unstable.
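For example, the overfitting-corrected c-index can be computed from the validate() output (using the model2 fit sketched earlier, which must have x = TRUE, y = TRUE):

v   <- validate(model2, B = 400)
Dxy <- v['Dxy', 'index.corrected']
c_index <- 0.5 + Dxy / 2    # c-index (AUROC for binary Y) from Somers' Dxy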

1 Like

Could you please tell me how I can extract predicted probabilities from a nomogram for all of my data? Is this probability the same as the one I get using the predict function of my lrm model?

You can use plot(nomogram()), print(nomogram()) (shows points tables), predict, and Predict. But nomograms are meant for one-observation-at-a-time calculations.
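For example (a hedged sketch using the model2 fit and data from earlier in the thread), the nomogram can be drawn with a probability axis, while predicted probabilities for every row come from predict():

dd <- datadist(data); options(datadist = 'dd')   # nomogram() needs a datadist
nom <- nomogram(model2, fun = plogis, funlabel = 'Predicted probability')
plot(nom)                                        # manual, one-patient-at-a-time tool
p <- predict(model2, data, type = 'fitted')      # probabilities for all rows at once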

1 Like

Is it possible to upload reproducible code?

A nomogram is simply a visual tool you can use to make manual predictions based on your model. You can get predicted probabilities for n cases by using the “predict” function of your model. There is no need to extract probabilities from the nomogram.

library(rms)
model <- lrm(Y ~ X1 + Xn, data = data)
predict(model, data, type = "fitted")

Dr. Harrell, please help. How do I code the validate function with the cross-validation method in the rms library?

validate(MyModel, method = "boot", B = 400) works.
validate(MyModel, method = "crossvalidation", B = 100) does not work;
I get an error. I want to get 100 repeats of 10-fold cross-validation.

Use method='crossvalidation', B=10, but repeat the whole process 100 times and average. For code that does the averaging, see this, where you'll find the val function:

val <- function(fit, method, B, r) {
  contains <- function(m) length(grep(m, method)) > 0
  # Map a descriptive method string to validate()'s method argument
  meth <- if(contains('Boot')) 'boot' else
          if(contains('fold')) 'crossvalidation' else
          if(contains('632')) '.632'
  z <- 0
  # Sum the index.corrected column over r runs of validate(), then average
  for(i in 1:r) z <- z + validate(fit, method=meth, B=B)[
          c("Dxy","Intercept","Slope","D","U","Q"),'index.corrected']
  z/r
}

For your case the B argument to val would be 10 and r=100. You’ll have to modify the vector of names of accuracy measures to suit your needs.
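So a call matching the question above might look like this (any method string containing 'fold' routes to cross-validation inside val; MyModel must be an lrm fit stored with x = TRUE, y = TRUE):

# 100 repeats of 10-fold CV, averaging the index.corrected accuracy measures
cv_avg <- val(MyModel, method = '10-fold', B = 10, r = 100)
cv_avg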

2 Likes