This is a topic for questions, answers, and discussions about Sessions 14-15 of the Biostatistics for Biomedical Research web course, airing on 2020-03-13, and for the Session 15 interactive Q&A on 2020-04-03. Session topics are listed here. The sessions cover multiple linear regression and the dangers of percentiling.
I am almost up to speed with the BBR course and recently encountered your discussion about internal/external validity. This issue has come up in many contexts since I started following BBR and datamethods, e.g. regarding heterogeneity of treatment effects, observational vs. randomized studies, etc. If I understand your position correctly, you advocate using the full original dataset for derivation of the predictive model. I would like to hear your opinion about one of the most highly regarded studies in my field (pediatric emergency medicine), the PECARN head injury study. This was a multi-center, prospective observational study which included:
- 42,412 children, with the outcome of interest occurring in 376 (0.9%)
- They divided the patients into two cohorts: 10,718 under 2 years & 31,694 aged 2 years and older
- Approximately 20% of the patients were used for validation (2,216/10,718 and 6,411/31,694)
- The authors used binary recursive partitioning to derive the predictive model
- 10-fold cross validation was used for internal validation
The prediction rule has since been validated by at least two other independent datasets, both of which had a smaller sample size than the original 20% validation groups. It is unlikely that any larger dataset would be available since the original paper included the largest multi-center collaborative in North America.
My question is two-fold: if you have your hands on what is, in all likelihood, the largest dataset that will ever be available, would it not be prudent research practice to still seek some level of external validation, instead of using your entire dataset for derivation?
My second question is: would it be possible to conduct a bootstrap-type process for the external validation part? You have argued that by partitioning the data, one only externally validates an example model. What if you repeated the entire process of partitioning -> derivation -> internal validation -> external validation a few hundred times, drawing the 20% partition randomly each time? Would that not mitigate both issues, i.e. over-fitting and arriving at an example model?
There are so many problems with the authors’ approach that it is very difficult to know where to begin …
- The number of events is too small to do data splitting.
- 10-fold cross-validation must be repeated many times to be reliable.
- Splitting the data by age is not a good idea.
- Recursive partitioning requires roughly a 3-fold larger sample size than what was available.
- I doubt that the model was truly validated. Show me a no-binning smooth calibration curve and I might be convinced otherwise.
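To illustrate the point about repeating cross-validation, here is a minimal sketch in Python (using scikit-learn on synthetic data; the dataset, model, repeat count, and AUC metric are illustrative choices, not those of the PECARN study):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for a low-prevalence classification problem
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# A single 10-fold CV yields one noisy performance estimate; repeating
# it (here 50 times with different random fold assignments) and
# averaging stabilizes the estimate of out-of-sample performance.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=50, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")

print(f"mean AUC over {len(scores)} folds: {scores.mean():.3f} "
      f"(SD across folds: {scores.std():.3f})")
```

The spread across repeats gives a sense of how unstable a single 10-fold run would have been.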
Since the effective sample size is more like the number of events than the number of patients, this is not a large dataset. Anything that makes it effectively even smaller (e.g. data splitting) should be avoided.
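To make the "effective sample size" point concrete, a rough illustration using the figures quoted above and the common events-per-candidate-parameter heuristic (the 15:1 figure is one frequently cited rule of thumb, not a hard threshold):

```python
# Figures from the PECARN study as quoted above
n_patients = 42_412
n_events = 376  # outcome of interest (0.9%)

# For a binary outcome the limiting (effective) sample size is roughly
# the smaller of the event and non-event counts, not the patient count.
effective_n = min(n_events, n_patients - n_events)

# Rule-of-thumb heuristic: ~15 events per candidate model parameter
events_per_parameter = 15
max_candidate_params = effective_n // events_per_parameter

print(f"effective sample size ≈ {effective_n}")
print(f"supports roughly {max_candidate_params} candidate parameters")
```

So despite 42,412 patients, the dataset supports only a modest number of candidate predictors before overfitting becomes a serious concern, and any splitting shrinks that budget further.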
To answer your question about the prudence of external validation: what do you mean by "external"? And if the original sample size is insufficient, why would that external data source not be used in model development anyway?
The bootstrap only does internal validation (though it does it well) and there is no place for also doing a 20% partitioning when using the bootstrap.
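For readers unfamiliar with bootstrap internal validation, a minimal sketch of the optimism-correction idea (synthetic data, logistic regression, and AUC as the performance index are all illustrative assumptions, not the models discussed above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

def auc(model, X, y):
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

# Apparent performance: fit and evaluate on the full sample
full_model = LogisticRegression(max_iter=1000).fit(X, y)
apparent = auc(full_model, X, y)

# Bootstrap estimate of optimism: refit on each resample, then compare
# performance on the resample (inflated) vs. the original sample
optimisms = []
for _ in range(200):
    idx = rng.integers(0, len(y), len(y))
    boot_model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    optimisms.append(auc(boot_model, X[idx], y[idx]) - auc(boot_model, X, y))

corrected = apparent - np.mean(optimisms)
print(f"apparent AUC {apparent:.3f}, optimism-corrected AUC {corrected:.3f}")
```

Note that every model is fit on a full-size sample; nothing is held out, which is why the bootstrap avoids the effective-sample-size penalty of a 20% partition.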
I hope @Ewout_Steyerberg sees this. I’m sure he’ll have a lot to add.
Thank you so much for your answer; I understand that your time and expertise are in great demand these days, so I really appreciate it. One clarification on the age-splitting: it is actually motivated from a clinical perspective, since small children present with different signs & symptoms, such as an open fontanelle, which cannot be evaluated in older children. But I understand your point that any splitting leading to an even smaller sample size compromises model validity.
Regarding your other points, I suspect you are right as well, but I unfortunately lack the modeling knowledge to reflect upon them. I have started reading @Ewout_Steyerberg’s great book in conjunction with the BBR course (regarding which I just wanted to note that I miss the opportunity to listen to the episodes podcast-style).
What I am wondering now is, of course, how different (if at all) this prediction model would look if derived according to the suggested best practices. And on a more theoretical level, when can a model really be called validated?
In clinical practice we usually don’t start using a clinical decision rule until a separate validation study has been published, but in cases such as this, the validation studies may have an even smaller sample size than the derivation cohort. And with a model such as this, where the goal is to maximize negative predictive value for a rare event, it is very unlikely that any individual physician or even clinic will see patients misclassified as false negatives by an imprecise model. That is, one might inadvertently buy into a model which doesn’t do exactly what one expects it to do, having a lower-than-advertised NPV, while still bringing considerable benefits (decreased downstream testing). The pragmatic solution, instead of rejecting this “example” model, might be to employ it, reap the benefits, and still maintain a sliver of skepticism for cases where things don’t really fit. Paraphrasing George E. P. Box: all models are wrong, many are even inappropriately derived, but even these can potentially be useful. Do you have any comments on this pragmatic approach?