Crowdsourcing best-practice alternative methods for the CARG-BC clinical prediction model

The CARG-BC clinical prediction model published recently in JCO [1] strikes me as methodologically flawed on multiple levels, even decision-theoretically and ethically. In this post, I hope to explore specifically the gap between the methods employed and ‘best-practices’ prediction modeling, such as might be done at the leading edge of ‘mainstream biostats’ as represented in this forum.

The paper is paywalled, and not even on Sci-Hub yet. But this screenshot of p.3 exposes most of the statistical suboptimalities that I can see, and it seems to fall within ‘fair use’ to post it:

I of course don’t even like the problem framing—why passively predict toxicities when you could actively avert them?!—but taking the authors’ clinical prediction problem for granted, what would be an ideal statistical approach?

  1. Magnuson A, Sedrak MS, Gross CP, et al. Development and Validation of a Risk Tool for Predicting Severe Toxicity in Older Adults Receiving Chemotherapy for Early-Stage Breast Cancer. J Clin Oncol. Published online January 14, 2021:JCO2002063. doi:10.1200/JCO.20.02063 PMID: 33444080

See also several images from this tweet by Sumanta Pal:


In case it helps spark conversation, the model development and validation featured:

  • Dichotomous everything, on both sides of the regression
  • Stepwise variable selection
  • No calibration plot shown [thanks to @scboone for correction below!]
  • Hosmer-Lemeshow goodness-of-fit testing
  • Temporal splitting of data, apparently to enable claim of ‘external’ validation (vis-à-vis presumed secular trends) despite the availability of multiple centers in this multicenter trial
  • A non-reproducible model selection procedure
  • Subjective model selection (per ‘clinical relevance’ of included variables) after automated model-proliferation process
  • No effort to shrink model parameter estimates AFAICT
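As a concrete contrast to the last two bullets, here is a minimal Python sketch (entirely illustrative, on simulated data; the predictor names and effect sizes are hypothetical and not taken from the paper) of keeping candidate predictors continuous and shrinking all coefficients with a cross-validated ridge penalty, instead of running a non-reproducible stepwise search:

```python
# Illustrative sketch only: a shrinkage-based alternative to stepwise
# selection, on simulated data (names and effects are hypothetical).
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
n = 500

# Keep candidate predictors continuous instead of dichotomizing them.
age = rng.uniform(65, 90, n)
hemoglobin = rng.normal(13, 1.5, n)
creatinine_cl = rng.normal(70, 20, n)   # pure noise predictor here
X = np.column_stack([age, hemoglobin, creatinine_cl])

# Hypothetical data-generating model for grade >= 3 toxicity.
logit = -8 + 0.08 * age + 0.3 * (13 - hemoglobin)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Ridge-penalized logistic regression: every coefficient is shrunk toward
# zero, with the penalty strength chosen by cross-validation. The noise
# predictor is retained with a small coefficient rather than being bounced
# in and out of the model by an automated selection procedure.
model = LogisticRegressionCV(Cs=10, penalty="l2", cv=5, max_iter=5000)
model.fit(X, y)

print("intercept:", model.intercept_[0])   # report it, per TRIPOD
print("coefficients:", model.coef_[0])
```

The point of the sketch is not the particular penalty (lasso, elastic net, or Bayesian priors would serve similarly) but that the full shrunken model, intercept included, is what gets reported and validated.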

Browsed through the article. Some other things that I was wondering about:

  • No deaths outside of the single one reported as an adverse event? Only 1 death is reported as an AE, and no one else died? I find this hard to believe in a population aged ≥ 65 years with a mean age of 70. (In turn, this makes me wonder whether a survival-analysis-oriented modeling strategy might be more appropriate in this setting.) Edit: never mind this; the treatment period is quite short, so this is probably not an issue.
  • No model intercept published: just the ORs for the predictors and the simplified scoring system that is based on the model.
  • Calibration plot is included but all the way at the end and only for the development cohort.
  • In line with the lack of assessment of calibration in the validation sample, all judgment of model performance ultimately seems to rest on the AUC?
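To make the last bullet concrete, here is a minimal Python sketch (simulated data only; nothing here comes from the CARG-BC model) of the kind of validation-sample check being asked for: the calibration slope and intercept from a logistic recalibration of the development model's linear predictor, reported alongside the AUC rather than instead of it:

```python
# Sketch of calibration slope/intercept on a "validation" sample,
# using simulated data (not the paper's model or data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def simulate(n):
    x = rng.normal(0, 1, (n, 2))
    logit = -1 + 0.9 * x[:, 0] + 0.5 * x[:, 1]
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    return x, y

# "Development" and "validation" samples drawn from the same process.
x_dev, y_dev = simulate(1000)
x_val, y_val = simulate(1000)

dev_model = LogisticRegression().fit(x_dev, y_dev)

# Linear predictor of the development model on the validation sample.
lp_val = dev_model.decision_function(x_val)

# Logistic recalibration: regress the observed outcome on the linear
# predictor. A slope near 1 and intercept near 0 indicate good
# calibration; a high AUC alone cannot show this.
recal = LogisticRegression().fit(lp_val.reshape(-1, 1), y_val)
cal_slope = recal.coef_[0, 0]
cal_intercept = recal.intercept_[0]

print(f"calibration slope:     {cal_slope:.2f}")
print(f"calibration intercept: {cal_intercept:.2f}")
print(f"AUC:                   {roc_auc_score(y_val, lp_val):.2f}")
```

A smoothed calibration plot over the validation sample would be the natural graphical companion to these two numbers.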

Very good points here, Sebastian — and thank you especially for correcting me on the calibration plot. TRIPOD indeed requests published intercepts, as seen in this excerpt from Table 1 in [2]:

In fact, the authors cite [2] (their ref. 17) as justification for the temporal splitting of data. Here is a quote from the relevant section [2,p.W3], where this practice is termed ‘temporal or narrow validation’:

“External validation may use participant data collected by the same investigators, typically using the same predictor and outcome definitions and measurements, but sampled from a later period (temporal or narrow validation); by other investigators in another hospital or country (though disappointingly rare [27]), sometimes using different definitions and measurements (geographic or broad validation); in similar participants, but from an intentionally different setting (for example, a model developed in secondary care and assessed in similar participants, but selected from primary care); or even in other types of participants (for example, model developed in adults and assessed in children, or developed for predicting fatal events and assessed for predicting nonfatal events) (19, 20, 26, 28–30).”
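Given that this was a multicenter trial, one alternative to the temporal split would have been internal–external (leave-one-center-out) cross-validation, which yields a genuine geographic assessment from the same data. A hypothetical Python sketch on simulated data (center labels and effects invented for illustration):

```python
# Hypothetical sketch of internal-external (leave-one-center-out)
# validation. All data are simulated; nothing comes from the trial.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
centers = ["A", "B", "C", "D"]
data = {}
for c in centers:
    x = rng.normal(0, 1, (300, 2))
    logit = -1 + 0.8 * x[:, 0] + 0.4 * x[:, 1]
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    data[c] = (x, y)

aucs = {}
for held_out in centers:
    # Develop on all other centers, validate on the held-out one.
    x_dev = np.vstack([data[c][0] for c in centers if c != held_out])
    y_dev = np.concatenate([data[c][1] for c in centers if c != held_out])
    model = LogisticRegression().fit(x_dev, y_dev)
    x_val, y_val = data[held_out]
    aucs[held_out] = roc_auc_score(y_val, model.decision_function(x_val))

for c, auc in aucs.items():
    print(f"held-out center {c}: AUC = {auc:.2f}")
```

The spread of performance across held-out centers speaks to transportability in a way a single before/after temporal split cannot; in practice one would examine calibration per held-out center as well, not just the AUC.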

Indeed, I see a section of [2] that presents an example of reporting backward stepwise without any hint of disparaging the practice:

This raises the question of whether the TRIPOD Statement, by focusing on accurately describing whatever has been done—even if it is bad—might be read as tacitly endorsing the bad things.

  1. Moons KGM, Altman DG, Reitsma JB, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration. Annals of Internal Medicine. 2015;162(1):W1. doi:10.7326/M14-0698

I hope the discussion picks up some more. But I was thinking of your last statement:

What I’ve run into a few times recently are prediction studies that reference the TRIPOD statement but then seem to ignore a lot of its advice and simply use the TRIPOD checklist as some sort of seal of approval/quality. Frequently they include the TRIPOD checklist at the end of the manuscript with details on the location of various items in their manuscript, e.g.:

But this is more or less the end of their adherence to the TRIPOD statement. They just fill out the squares, attach the checklist, and that is it. I already have several recent references like this that cite TRIPOD and its reporting recommendations yet omit important information such as model equations and key details of the statistical analyses.


The TRIPOD group seems to be open to that type of criticism:


When TRIPOD was being formulated, I strongly voiced the reservation that bad statistical practice was not being proscribed in the guidelines. I lost the argument.