This is a topic for questions, answers, and discussions about sessions 14-15 of the Biostatistics for Biomedical Research web course airing on 2020-03-13, and for the session 15 interactive Q&A on 2020-04-03. Session topics are listed here. The sessions cover multiple linear regression and dangers of percentiling.
I am almost up to speed with the BBR course and recently encountered your discussion about internal / external validity. This issue has come up in many contexts since I started following BBR and datamethods, e.g., regarding heterogeneity of treatment effects, observational vs randomized studies, etc. If I got your position right, you seem to advocate for using the full original dataset for derivation of the predictive model. I would like to hear your opinion about one of the most highly regarded studies in my field (pediatric emergency medicine), the PECARN head injury study. This was a multi-center, prospective observational study which included:
- 42,412 children, with the outcome of interest occurring in 376 (0.9%)
- They divided the patients into two cohorts: 10,718 under 2 yrs & 31,694 aged 2 yrs and older
- Approximately 20% of the patients were used for validation (2,216/10,718 and 6,411/31,694)
- The authors used binary recursive partitioning to derive the predictive model
- 10-fold cross validation was used for internal validation
The prediction rule has since been validated by at least two other independent datasets, both of which had a smaller sample size than the original 20% validation groups. It is unlikely that any larger dataset would be available since the original paper included the largest multi-center collaborative in North America.
My question is two-fold: if you have your hands on what is, in all likelihood, the largest dataset that will ever be available, would it not be prudent research practice to still pursue some level of external validation, instead of just using your entire dataset for derivation?
My second Q is: would it be possible to conduct a bootstrap-type process for the external validation part? I mean, you have argued that by partitioning the data, one only does external validation of an example model. What if you repeated the entire process of partitioning -> derivation -> internal validation -> external validation a few hundred times, randomly selecting a different 20% partition each time? Would that not mitigate both issues, i.e., over-fitting and arriving at an example model?
There are so many problems with the authors’ approach that it is very difficult to know where to begin …
- The number of events is too small to do data splitting.
- 10-fold cross-validation must be repeated many times to be reliable.
- Splitting the data by age is not a good idea.
- Recursive partitioning requires a sample size about 3-fold larger than what was available.
- I doubt that the model was truly validated. Show me a no-binning smooth calibration curve and I might be convinced otherwise.
Since the effective sample size is more like the number of events than the number of patients, this is not a large dataset. Anything that makes it effectively even smaller (e.g. data splitting) should be avoided.
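The instability of a single cross-validation run with rare events can be seen in a quick simulation. This sketch is not from the thread: the simulated data, logistic model, and AUC metric are all illustrative assumptions, chosen only to show how much the mean of a 10-fold run wanders from repeat to repeat at roughly the outcome prevalence discussed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Simulated data with a rare outcome (~1% prevalence, similar to the study discussed)
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.99],
                           random_state=0)

# 50 complete 10-fold cross-validations, each on a different random fold assignment
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=50, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="roc_auc", cv=cv)

# Each block of 10 scores is one full 10-fold run; the run means vary noticeably,
# which is why a single 10-fold CV is unreliable and many repeats are needed.
run_means = scores.reshape(50, 10).mean(axis=1)
print(f"spread across 50 ten-fold runs: SD = {run_means.std():.3f}")
```

The averaged result over many repeats is what should be reported, not one arbitrary 10-fold split.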
To answer your question about the prudence of external validation: what do you mean by "external"? And if the original sample size is insufficient, why would that external data source not be used in model development anyway?
The bootstrap only does internal validation (though it does it well) and there is no place for also doing a 20% partitioning when using the bootstrap.
I hope @Ewout_Steyerberg sees this. I’m sure he’ll have a lot to add.
Thank you so much for your answer; I understand that your time and expertise are in great demand these days, so I really appreciate it. One clarification on the age splitting: it is actually motivated from a clinical perspective, since small children present with different signs & symptoms that cannot be evaluated in older children, such as an open fontanelle. But I understand your point that any splitting leading to an even smaller sample size compromises model validity.
Regarding your other points, I suspect you are right as well, but I am unfortunately lacking the modeling knowledge to reflect upon them. I have started reading @Ewout_Steyerberg's great book, in conjunction with the BBR course (regarding which I just wanted to note that I am missing the opportunity to listen to the episodes podcast-style).
What I am wondering now, of course, is how different (if at all) this prediction model would look if derived according to the suggested best practices. And also, on a more theoretical level, when can a model really be called validated?
In clinical practice we usually don't start using a clinical decision rule until a separate validation study has been published, but in cases such as this, the validation studies may have an even smaller sample size than the derivation cohort. And with a model such as this, where the goal is to maximize negative predictive value for a rare event, it is very unlikely that any individual physician or even clinic will see patients misclassified as false negatives by an imprecise model. I.e., one might inadvertently buy into a model which doesn't do exactly what one expects it to do, having a lower-than-advertised NPV, but still bringing considerable benefits (decreased downstream testing). The pragmatic solution, instead of rejecting this "example" model, might be to still employ it, reap the benefits, but maintain a sliver of skepticism for cases where things don't really fit. Paraphrasing George E.P. Box: all models are wrong, many are even inappropriately derived, but even these can potentially be useful. Do you have any comments on this pragmatic approach?
Re section 10.7.5 (Hypothesis Testing) - Is the partial F test for comparing a full and reduced model the same as a “chunk test” that I have seen you refer to elsewhere?
It’s a pretty good approach. I just hesitate on the use of “negative predictive value” because formally it is only to be used for a truly binary all-or-nothing test output. And different decision makers have different risk thresholds. So keeping everything continuous until the last moment pays off.
Thank you for a really enjoyable course so far! I am currently catching up on the material in my own time and I was very interested in interaction effects discussed in Chapter 15, and how we can interpret hazard ratios when we are dealing with a continuous measure affecting the outcome differently across groups. This is something I rarely see in published models and often wonder what the overall influence would be if it was accounted for.
I now really want to be able to understand how to interpret the case of a continuous measure affecting the outcome (time-to-event) differently dependent on the value of another continuous variable. Is there any material you can recommend for understanding this or tips for the best way to approach this idea? I feel I probably need a full overview at this stage but I do have a few specific questions and apologise if they lack basic understanding.
In a Cox proportional hazards model, where the coefficient on the interaction term describes whether there is an increase or decrease relative to the effect of the variable, can you only get an overall measure of the effect of one continuous variable at given values of the other?
How do we approach this if we know that higher values of both variables lead to a worse outcome, but we do not know the shape of this relationship or whether any thresholds exist in advance? Could we explore this with interaction diagrams, or would this require stratifying one of the continuous measures (something we usually want to avoid)?
And lastly, how do you interpret any significance testing that is reported on the main and interaction terms in the model?
I hope this is clear but do let me know if I have not explained this well.
Hi Angie - these are great questions. I cover this in some detail in Regression Modeling Strategies - see especially the course notes, Section 2.3.2.
When the sample size is not small, allowing a continuous \times continuous interaction to be flexible using a tensor spline is advised. This is best presented as a color image plot similar to a heatmap.
See here for some related ideas that allow these complex relationships to be examined in more detail. But a starting point is the amount of log-likelihood due to specific types of terms (interaction, nonlinear interaction, etc.). You can approximate that with the anova function in the R rms package, which gives you the chi-square explained by each type of term in the model.
Emphasize the likelihood ratio \chi^2 tests. The hypothesis tests for interactions test for additivity/synergism/effect modification. Chunk tests combining main effect and interaction terms test other interesting hypotheses. For example, if you have a model with a spline in x1, a spline in x2, and cross-product terms of these (to create a tensor spline), the overall x1 test tests whether x1 is associated with Y for any level of x2.
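The overall (chunk) likelihood ratio test described here can be sketched in code. This is a hypothetical illustration, not the rms workflow: a logistic model stands in for the Cox model, patsy's bs() splines stand in for restricted cubic splines, and the data are simulated; the chunk-test logic (full vs. reduced model, dropping every term involving x1) is the same.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
d = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
lin = 0.8 * d.x1 + 0.5 * d.x2 + 0.6 * d.x1 * d.x2
d["y"] = (rng.random(n) < 1 / (1 + np.exp(-lin))).astype(int)

# Full model: spline in x1, spline in x2, crossed to form a tensor spline;
# reduced model: drop all terms involving x1 (main effects AND interactions)
full = smf.logit("y ~ bs(x1, df=4) * bs(x2, df=4)", data=d).fit(disp=0)
reduced = smf.logit("y ~ bs(x2, df=4)", data=d).fit(disp=0)

# Chunk LR chi-square tests whether x1 is associated with y at ANY level of x2
lr = 2 * (full.llf - reduced.llf)
df = full.df_model - reduced.df_model
p = stats.chi2.sf(lr, df)
print(f"chunk LR chi-square = {lr:.1f} on {df:.0f} df, p = {p:.2g}")
```

In rms this whole comparison comes out of a single anova(fit) call; the sketch just makes explicit which log-likelihoods are being differenced.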