Statistically Efficient Ways to Quantify Added Predictive Value of New Measurements

Sorry for the bug. It’s fixed now.

Thank you very much for fixing this issue. :+1: :+1: :+1:

I am doing an analysis to evaluate whether and how data obtained during and immediately at the end of a procedure can help with patient prognosis (I don’t like the term, but the question is “independent risk factor”-like). We have widely known prognostic variables, and I’ve done an analysis looking at the associations of three new variables adjusted for this a priori information.

We acknowledge that some variables might have been ignored. There is a quite good prognostic model (M1) based on preprocedural data in our field, and most patients in our dataset are eligible for its application. I have evaluated the model performance in our sample and it had good calibration and discrimination. To strengthen our analysis, I aim to assess the added value of the new variables after accounting for this model.

I’ve done this using a regression model with the new variables and a restricted cubic spline of the M1 linear predictor as independent variables. On the one hand, using the linear predictor as an RCS probably gives better adjustment and prevents the new predictors from appearing associated merely because of a poor fit in some range of the predictions. On the other hand, the model predictions come straight from the linear predictor, so I am not sure whether it would be conceptually more sound to use it as a linear covariate or even as an offset.

Thoughts?

1 Like

If you get an improved fit by splining the old linear predictor, with or without inclusion of the 3 new variables, that indicates that either

  • the linear predictor was on the wrong scale for the new model (e.g., it is a linear predictor from a Cox model and now you are inserting it into a logistic model) or
  • the existing model fails a smooth calibration curve assessment

Using an offset assumes that the LP has a slope of 1, which may not be safe when adding new variables that overlap with components of the LP.

Just for the purpose of estimating the added value of the 3 new variables, you could argue that the LP should be given a maximum chance of fitting, by including the spline. You could go so far as to include all the component variables making up the LP.

It would also be interesting to predict each of the 3 new variables from all the variables making up the LP to find out about information overlap.
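In rms notation, a minimal sketch of these checks might look like the following (the data frame d, binary outcome y, frozen M1 linear predictor lp, new variables x1–x3, and LP component variables z1–z3 are all hypothetical names; with a survival outcome, cph would play the role of lrm):

# Minimal sketch, not a prescription; all variable names are hypothetical
library(rms)
dd <- datadist(d); options(datadist = 'dd')

f_lin   <- lrm(y ~ lp, data = d)                          # LP entered linearly
f_rcs   <- lrm(y ~ rcs(lp, 4), data = d)                  # LP allowed to recalibrate via a spline
f_added <- lrm(y ~ rcs(lp, 4) + x1 + x2 + x3, data = d)   # spline of LP plus the 3 new variables

lrtest(f_lin, f_rcs)     # does splining the old LP improve the fit?
lrtest(f_rcs, f_added)   # added value of the 3 new variables, 3 d.f.

# Information overlap: predict each new variable from the components of the LP
ols(x1 ~ z1 + z2 + z3, data = d)$stats['R2']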

1 Like

Hello Frank, this page is a fabulous resource – thank you very much!

My question: what is the most appropriate metric to test whether an AI model for risk prediction has added predictive value compared with a Cox model for risk prediction, both on the same group of patients? I assume that the C-statistic may not be the best option.

Background: I listened to talks by Frank and read his paper “Statistically Efficient Ways to Quantify Added Predictive Value of New Measurements”. I understand the criticism of using the C-index and the preference for the likelihood ratio χ2 test (LR) and AIC. However, that paper deals with the added value of an additional variable in nested models.

I am interested in comparing non-nested models:

Typical case in our field: imaging approach A versus imaging approach B as predictor of fracture risk. The simplest classical model would be an age-adjusted Cox model of a single image-based predictor variable (e.g. bone mineral density). As the competitor model we have a deep learning imaging approach that already combines age and imaging aspects in one risk prediction output. So I have a prediction output per person from the Cox model and another prediction output from the AI model. On the outcome side I also have the gold standard event, the incident fracture yes/no, and the competing risk of death.

I wish to compare the classical model with the AI model. I run a training/validation process and then test the performance on an independent holdout test set.

  • For the training of the classical model I run the Cox model and get the beta coefficients for age and the imaging predictor variable, and the Cox based risk probability for each patient of the training dataset. Then I apply the same (fixed) model on the test dataset and get the Cox based risk probability for each patient of the test dataset along with the gold standard event outcome.
  • For the AI model I run the training/validation e.g. 5-fold cross-validation or a bootstrapping approach and get a final model and the risk probability for each patient of the training/validation dataset. Then I run the AI in inference mode on the test dataset and get the AI risk probability for each patient of the test dataset along with the gold standard event outcome.

Which metric should I use to test whether the AI model surpasses the Cox model in risk prediction?

I wish to investigate two different metrics:

  • one that is most powerful for finding significant differences in risk predictive power and
  • another one that would be intuitive for clinicians to understand (i.e., one that allows them to judge whether the observed difference is large enough to be of clinical value, which of course is somewhat subjective). I read a bit about net benefit and related concepts, but I did not find them intuitive enough for clinicians.

Which metrics would you recommend for comparison in the training/validation dataset and in the test dataset?

Thank you for your advice.

1 Like

If I understand your process correctly, I think you will need to seriously evaluate whether you have enough observations to do the validation by holding out data. Many of the recommendations of @f2harrell are based upon important papers by Julian Faraway:

Faraway, J. J. (1992). On the cost of data analysis. Journal of Computational and Graphical Statistics, 1(3), 213-229. (PDF)

A complete discussion on the use of split data methods must also include:

Faraway, J. J. (2016). Does data splitting improve prediction?. Statistics and computing, 26, 49-60. (PDF)

At least for medical applications, a strong argument can be made that split data approaches lose too much information relative to model specification and bootstrap validation.

Related thread:
https://discourse.datamethods.org/t/rigor-and-reproducibility-in-a-research-network/7161/15

While taking into account the great comments from @R_cubed, I think it would also be good to explore whether the machine learning approach can be completely restarted for each resample if you were using 100 repeats of 10-fold cross-validation or the bootstrap (which can fail in extreme overfitting situations). If it is too hard to do this and if both the training and test samples are very large, it will be easier to do the following (a code sketch follows the list):

  • Use the same training sample for both the classical and the machine learning training
  • For each method get the predicted log-log survival probability (linear predictor in a Cox model, for example). This is done only to prepare for the next step.
  • For each method compute the out-of-sample predicted log-log survival probability at the chosen fixed time. Here you use the two “frozen” models and reduce each of them to a specific vector of constants (one predictor).
  • Fit a model in the test sample that contains both linear predictors (classical + machine learning)
  • Do the “added value” likelihood ratio \chi^2 assessments
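A minimal sketch of the last three steps, assuming a test-sample data frame test with follow-up time t, event indicator e, and the two frozen out-of-sample linear predictors lp_classical and lp_ml (all names hypothetical):

# Minimal sketch; the two frozen linear predictors enter as fixed covariates
library(rms)
dd <- datadist(test); options(datadist = 'dd')

f_classical <- cph(Surv(t, e) ~ lp_classical,         data = test)
f_ml        <- cph(Surv(t, e) ~ lp_ml,                data = test)
f_super     <- cph(Surv(t, e) ~ lp_classical + lp_ml, data = test)

lrtest(f_classical, f_super)   # added value of the machine learning predictor, 1 d.f.
lrtest(f_ml,        f_super)   # added value of the classical predictor,        1 d.f.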

Thank you both @R_cubed and @f2harrell for your most valuable advice. We read the papers recommended and discussed this in our group.

We think we understand the comments (specifically the downsides of split samples). We are more clear about how to proceed but a few questions remain and we hope that we can get some further input from you. It may or may not be helpful for you to read some background information about our specific research question, so I put that optional information to the end of this comment.

A. Why we think we need to run a split sample approach for our specific problem

We have a very simple model situation without feature selection or model architecture selection, just comparing fracture incidence prediction based on 2 Cox models:

  1. one predictor based on the AI model output (from image analysis); the AI model needs to be optimized (hyperparameter search, etc.) on the data at hand.
  2. an established fixed predictor: bone mineral density (BMD), all results known for all patients

In order to limit the risk of overfitting of the AI model, we want to freeze the AI model in the training/validation dataset and then compare the frozen AI model to the frozen BMD model in a hold-out test dataset, both run as Cox models. If we did not split the data, the AI model would be at an advantage compared to the BMD model, since it was optimized on the same data, unlike BMD.

Question 1 : Is this reason for splitting acceptable or is there a better approach?

B. How to run the supermodel based comparison in non-nested models

  1. Preparation:
  • For the AI model we wish to get the strongest model (optimizing hyperparameter selection …). We use 60% of our data for this purpose and then fit the AI output in a Cox model, which is then frozen.
  • We fit the Cox BMD model on the same 60% of our data and freeze this model.
  2. Comparison of predictive power
    We wish to compare the AI and the classical BMD model to find out which is the better predictor in the out-of-sample set (40% of our data). Following your advice we calculate the log-log survival probability for the frozen AI model and can directly calculate the likelihood score (LAI) for this dataset. We do the same for the BMD model and get its likelihood score (LBMD). Third, we run a combined Cox supermodel, getting the likelihood score (Lsuper) for the supermodel. We run an LR test of Lsuper versus LBMD with df = 1 to test whether AI adds to BMD, and similarly an LR test of Lsuper versus LAI with df = 1 to test whether BMD adds to AI. So far we think we are OK.

Question 2: How can we test whether AI adds more to BMD compared to BMD adding to AI? We can compare the two increases in likelihood scores and get a numerical difference but is this statistically significant? Since these are non-nested models we think we cannot use LAI - LBMD and we do not know the df of such a test.

Question 3: Shouldn’t we fit the supermodel on the training dataset (60%), freeze that model, then run that frozen supermodel on the out-of-sample set and calculate the Lsuper that way?

C. Bootstrapping

We have been convinced by the arguments in favor of bootstrapping. It would be valuable to run it on the out-of-sample data to understand how much variability we have in the predictions.

Question 4: How many bootstrap runs are considered sufficient?

Question 5: Is there a standard R-package you can recommend for this bootstrapping analysis?

Thank you very much for helping us on our journey to find the best analysis strategy.

Claus

Optional background information on our research question:

We compare classical bone mineral density (BMD) tests to an AI model in order to test whether the AI is stronger than BMD in predicting risk of fracture.

BMD for each participant had been calculated from the femur DXA images using the manufacturer’s software. This is not based on AI and actually represents the clinical standard in our field. The AI model has been run on the same femur DXA images and we wish to find out whether it has added predictive power.

We have about 9000 cases and about 1000 fracture events. With a 60/40 split we have about 400 events in the test data set.

We cannot use bootstrapping for AI model development for computing power reasons. We do not claim that we have found the best AI model, we just would like to assess the performance of our AI model.

We could run bootstraps on the out-of-sample test data set.

Currently we are not adjusting for any confounder but we will have to at least adjust for age in refined models yet to be run (because age adjustment is always done in our field and also makes medical sense, i.e. to differentiate aging from disease processes).

Very helpful blogpost and insights.
I am wondering how one would proceed in the case of multiple new variables and how to compare them.
Normally I would compute a multivariable model with the clinically important variables and all the new variables, and look at the χ2 - d.f. in a performance plot at the variable level.
But what would be the correct way if my sample size does not allow for all new variables in one model at the same time?
Is it okay to do it like:
Base Model
Base Model + new_var1
Base Model + new_var2
Base Model + new_var3
… and so on
and compare their added value by comparing their Delta LR and fraction of new information (roughly as sketched below)?
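In code, I imagine something like this rough sketch (hypothetical data frame d, outcome y, and base covariates b1 and b2; lrm is used only as an illustration, a survival model would play the same role):

# Rough sketch; all names are made up
library(rms)
f_base <- lrm(y ~ b1 + b2,            data = d)
f_v1   <- lrm(y ~ b1 + b2 + new_var1, data = d)
f_v2   <- lrm(y ~ b1 + b2 + new_var2, data = d)
f_v3   <- lrm(y ~ b1 + b2 + new_var3, data = d)

lr <- function(f) unname(f$stats['Model L.R.'])

# Delta LR for each new variable and the fraction of new information it adds
lr_new   <- c(new_var1 = lr(f_v1), new_var2 = lr(f_v2), new_var3 = lr(f_v3))
delta_lr <- lr_new - lr(f_base)
frac_new <- 1 - lr(f_base) / lr_new
delta_lr
frac_new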

And in a special case where new_var1, new_var2 and new_var3 are correlated (Spearman ρ ≈ 0.3–0.4), it would be even worse to put them together in one model, right?

1 Like

Meanwhile I re-read some chapters of your incredibly helpful book “Regression Modeling Strategies”, @f2harrell, where in Section 4.10 you write:

Frequently, one model will have a similar discrimination index to another model, but the likelihood ratio χ2 statistic is meaningfully greater for one. Assuming corrections have been made for complexity, the model with the higher χ2 usually has a better fit for some subjects, although not necessarily for the average subject. A crude plot of predictions from the first model against predictions from the second, possibly stratified by Y, can help describe the differences in the models.

So it is correct to compare χ2, but does this hold true for Delta LR as well?

And how would a predictions plot look for survival data? Do I plot the predicted risk separately for cases and non-cases for one model against another? :thinking:

I think \Delta LR is going to give you the same idea as looking at LR.

Best to start with not stratifying on outcome, and plotting high-resolution histograms of predicted survival probability at a single follow-up time.
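For example, a minimal sketch assuming two cph fits f1 and f2 (fitted with x=TRUE, y=TRUE, surv=TRUE), a data frame d of predictors, and a chosen follow-up time u (all hypothetical names):

# Minimal sketch; survest() returns predicted survival probabilities at time u
library(rms)
s1 <- survest(f1, newdata = d, times = u)$surv
s2 <- survest(f2, newdata = d, times = u)$surv

# High-resolution histograms of the two prediction distributions (no stratification on outcome)
hist(s1, nclass = 100, xlab = 'Predicted survival probability at time u', main = 'Model 1')
hist(s2, nclass = 100, xlab = 'Predicted survival probability at time u', main = 'Model 2')

# Crude plot of one model's predictions against the other's
plot(s1, s2, xlab = 'Model 1', ylab = 'Model 2'); abline(0, 1, lty = 2)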

1 Like

Thanks for your answer. After some time of thinking, I now understand the concepts of LR and Delta LR much better.

One follow-up question regarding overfitting in the case of looking at added value:
In my field of clinical research, risk prediction scores are often used to stratify patients into risk groups. In one of my current projects, we identified a group of patients that is not well predicted by the risk score but is by no means rare in practice. We thought of showing that including a corresponding parameter in the risk prediction score could yield better predictions. I read some papers about the rule of thumb that in Cox regression one should have approximately 10 events per variable. The problem is that the risk prediction score in use already includes 8 parameters.
Does this mean that, if we build models like this:
Model 1: risk_score
Model 2: risk_score + new_parameter
that we would need approximately 90 events to avoid overfitting?
Or would it be 20 events because the risk_score counts as one variable?

My tendency is towards the first, so in this case my other question would be: is there any way to get an idea of whether the new_parameter improves the risk prediction, even if the evidence would only be weak? (Maybe by including only the most important parameters used in the risk_score?)

Thank you in advance. I am very happy to get so much information and such insightful discussion!

If the risk score was fitted on a different sample, it was not binned in any way, and it is represented in the model in a way that is likely to behave linearly, then it counts as one variable. You need more like 20 events per variable in total, but you can still make added-value assessments in the presence of overfitting if you penalize the LR \chi^2 for degrees of freedom. Look at the proportion of LR gained by adding new variables.
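A minimal sketch of that penalization, assuming two cph fits f_base (the frozen risk score as a single linear term) and f_new (risk score plus the new parameter); names are hypothetical:

# Minimal sketch; AIC-style penalization of the LR chi-square for complexity
library(rms)
lr_base <- f_base$stats['Model L.R.']; p_base <- f_base$stats['d.f.']
lr_new  <- f_new$stats['Model L.R.'];  p_new  <- f_new$stats['d.f.']

lr_base - 2 * p_base          # penalized LR chi-square, base model
lr_new  - 2 * p_new           # penalized LR chi-square, base + new parameter
(lr_new - lr_base) / lr_new   # proportion of the total LR gained by the new parameter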

1 Like

Well, very good to know!
Can you recommend a good resource to dive deeper into this topic (model evaluation with LR, penalization of LR, overfitting, etc.)?
And are there any papers or books I could look into, and later refer to, as a foundation for judging in which cases overfitting is a concern? I think I understand the principle, but without references it is always difficult to make an argument in discussions with seniors. :see_no_evil:
Thank you very much!

1 Like

https://hbiostat.org/rmsc, especially chapters 4, 5, and 7, is a good starting point.

1 Like

Dear Frank,

My question builds on your advice in #5 and #30 from above:

“Create two models that are nested within the super (grand) model that contains both measures A and B. You get a clear answer with a likelihood ratio χ2 test if A adds to B but B does not add to A. If they both add to each other then you have evidence of needing both measurements.”

We would like to know which of the non-nested models A or B is the stronger predictor when used alone (we cannot do both A and B in patients). For example, adding B to A may lead to a higher increase in likelihood scores in the supermodel than the increase observed when adding A to B. How can we test whether the numerical difference in the increases in likelihood scores is statistically significant?

Thanks for your time and advice, Claus

I guess you could get a bootstrap confidence interval for any statistic representing a difference. But I don’t think of this so much as a place where “statistical significance” is very interesting. Rather I would estimate the amount of improvement. A potentially good way to do this, with uncertainty intervals, is relative explained variation, covered here.
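For the bootstrap confidence interval, a minimal sketch assuming a data frame d with follow-up time t, event indicator e, and frozen predictors lpA and lpB (hypothetical names):

# Minimal sketch; bootstrap CI for the difference between the LR chi-square gained
# by adding A to B and the LR chi-square gained by adding B to A
library(rms)
library(boot)

gain_diff <- function(data, i) {
  b <- data[i, ]
  lrA  <- cph(Surv(t, e) ~ lpA,       data = b)$stats['Model L.R.']
  lrB  <- cph(Surv(t, e) ~ lpB,       data = b)$stats['Model L.R.']
  lrAB <- cph(Surv(t, e) ~ lpA + lpB, data = b)$stats['Model L.R.']
  unname((lrAB - lrB) - (lrAB - lrA))   # algebraically equals lrA - lrB
}

set.seed(1)
bs <- boot(d, gain_diff, R = 1000)
boot.ci(bs, type = 'perc')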

Dear Dr. Harrell,

I like your proposed idea of using partial likelihood ratio test (LRT) chi-square statistics to assess feature importance in a model (i.e., dividing the partial LRT chi-square by the overall model chi-square for each feature).

I’m curious what your thoughts are on using partial R-squared or partial pseudo R-squared to assess the relative feature importance of each feature in a GLM?

For example, here is one approach for partial R2s in GLMs:

Zhang, Dabao. “A coefficient of determination for generalized linear models.” The American Statistician 71.4 (2017): 310-316.

People have developed pseudo R2 for GLMM as well:

Nakagawa, Shinichi, and Holger Schielzeth. “A general and simple method for obtaining R2 from generalized linear mixed-effects models.” Methods in Ecology and Evolution 4.2 (2013): 133-142.

I have two thoughts.

  • As detailed here, adjusted (for complexity/overfitting) R^2 may be easily obtained from partial likelihood ratio \chi^2 statistics (a small sketch follows this list)
  • By linking the linear predictor in your general linear model to an ordinary linear model you can use ordinal partial R^2 to compute relative explained variation as detailed here.
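To make the first bullet concrete, a small sketch using a Maddala/Cox-Snell-style likelihood-based R^2, with the LR \chi^2 reduced by its degrees of freedom as a simple complexity correction (the exact adjustment described in the linked material may differ in detail; the numbers below are arbitrary):

# Minimal sketch; lr = (partial) LR chi-square, p = its d.f., n = sample size
r2_lr     <- function(lr, p, n) 1 - exp(-lr / n)               # unadjusted
r2_lr_adj <- function(lr, p, n) 1 - exp(-max(lr - p, 0) / n)   # crude d.f. adjustment

r2_lr(30, 4, 500)
r2_lr_adj(30, 4, 500)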

For models fitted using maximum likelihood estimation, can compute the partial likelihood ratio statistic for each set and divide that by the total model LR (Source)

I like this idea a lot since it seems really easy to implement in practice for any likelihood based model. Correct me if I’m wrong, but is the idea to compute the LRT chi-square statistics for each individual predictor, then divide the chi-square by the overall model chi-square?

For example:

suppressPackageStartupMessages(library(rms))

dd <- datadist(mtcars)
options(datadist = 'dd')

f <- orm(mpg ~ hp + vs + am, data = mtcars, x = TRUE, y = TRUE)

# LRT of each feature
lrt <- anova(f, test = 'LR')
lrt.df <- as.data.frame(as.matrix(lrt))

# divide partial LRT chi-square by overall model chi-square
lrt.df$feature.importance <- lrt.df$`Chi-Square` / f$stats['Model L.R.']
lrt.df
#>       Chi-Square d.f.            P feature.importance
#> hp     22.844885    1 1.756161e-06         0.39444013
#> vs      4.373859    1 3.649426e-02         0.07551912
#> am     20.787326    1 5.132148e-06         0.35891429
#> TOTAL  57.917242    3 1.637135e-12         1.00000000

Created on 2024-10-13 with reprex v2.1.1