External validation or a bigger modelling sample?

I am an oncologist trying to build a robust clinical prediction model to predict outcomes in operated oral cancer.
Based on a review of the literature:

  • I have 10 candidate predictor parameters of which 3 are continuous variables.
  • Other published models report a c-index between 0.67 and 0.78. Relatively few have published their LR.
  • Based on the 2019 recommendations and using the pmsampsize package, I need between 642 and 800 cases and approximately 160-200 events.
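
To show where figures of this size come from, here is a minimal sketch of just one of the criteria in the 2019 guidance (Riley et al.): the sample size needed so that the expected uniform shrinkage factor is at least 0.9. The pmsampsize package evaluates further criteria and takes the largest result, so this is not a replacement for it. The function name and the inputs (10 parameters, an anticipated Cox-Snell R-squared of 0.10) are illustrative assumptions, not the values used above.

```python
import math

def min_n_for_shrinkage(n_params, r2_cs, shrinkage=0.9):
    """Smallest n giving an expected shrinkage factor of at least `shrinkage`.

    Uses n = p / ((S - 1) * ln(1 - R2_cs / S)), where p is the number of
    candidate predictor parameters and R2_cs is the anticipated Cox-Snell R^2.
    """
    if not 0 < r2_cs < shrinkage:
        raise ValueError("r2_cs must lie strictly between 0 and the shrinkage target")
    n = n_params / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage))
    return math.ceil(n)

# Hypothetical inputs: 10 parameters, anticipated Cox-Snell R^2 of 0.10.
n_required = min_n_for_shrinkage(n_params=10, r2_cs=0.10)
print(n_required)
```

Note that doubling the number of parameters doubles the requirement, which is why the count of parameters (not predictors) matters.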

I have 920 cases from my institution, but I am concerned about external validation. Fortunately, many colleagues are happy to join and contribute their data. I am expecting another 600-900 cases with follow-up from them over the next year.

My question is, should I:

  1. Develop my own model and validate on the external dataset?
  2. Combine the entire dataset for modelling and then use bootstrapping for validation to get optimism correction for the model evaluation criteria?
  3. Or use a combination of both in some established way (that I am ignorant about)?

As you can understand, I have no formal statistical background but I know that we must first ask the statisticians before embarking on this perilous journey. Will be grateful to receive any critical feedback.

Good question.
First, I recommend the paper by Steyerberg and Harrell (https://www.jclinepi.com/article/S0895-4356(15)00175-4/fulltext). It gives the answer: do both internal and external validation.

The number of events is key. If some of your candidate predictors are ordinal, or you need to use splines for your continuous variables then you will have a lot of degrees of freedom (more than 10) which will require more events. So, be careful here.
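
As a rough illustration of how quickly the degrees of freedom add up, here is a small sketch. The split of the 10 predictors into 5 binary, 2 four-level ordinal, and 3 continuous variables with 4-knot restricted cubic splines is purely hypothetical, as is the helper name.

```python
def model_df(n_binary, ordinal_levels, spline_knots):
    """Count regression degrees of freedom, excluding the intercept."""
    df = n_binary                              # 1 df per binary predictor
    df += sum(m - 1 for m in ordinal_levels)   # m-level factor, dummy coded: m - 1 df
    df += sum(k - 1 for k in spline_knots)     # restricted cubic spline, k knots: k - 1 df
    return df

# Hypothetical mix of 10 candidate predictors.
total_df = model_df(n_binary=5, ordinal_levels=[4, 4], spline_knots=[4, 4, 4])
print(total_df)  # 5 + 6 + 9 = 20
```

Under these assumptions, 10 predictors become 20 parameters, and it is the 20 that should go into the sample size calculation.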

Checking both predictive performance and calibration in an external validation data set is key. I have seen cases where the overall discrimination metrics look good but calibration is poor, meaning re-calibration is likely necessary (one can then also build a model using all the data). In terms of diagnostic utility, check performance in the context in which you expect clinicians to use it (e.g. are they using it to decide whom to treat or not treat, and what risk of a poor outcome are they prepared to accept compared with the risks of the operation and/or of not treating?).
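
One common way to make the treat/not-treat trade-off explicit is net benefit at a chosen risk threshold (decision curve analysis). The sketch below assumes a binary outcome at a fixed time horizon with no censoring, which is a simplification for survival data; the data and the 30% threshold are made up for illustration.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold.
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the default strategy of treating every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical outcomes (1 = poor outcome) and predicted risks for 10 patients.
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
risk = [0.8, 0.6, 0.2, 0.5, 0.1, 0.1, 0.2, 0.05, 0.15, 0.25]

nb_model = net_benefit(y, risk, threshold=0.30)
nb_all = net_benefit_treat_all(y, threshold=0.30)
print(nb_model, nb_all)
```

A model is only useful at a given threshold if its net benefit exceeds both treat-all and treat-none (which is zero by definition).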

Hope this helps and all the best with the work.


Many thanks. This is a very useful article. I could try to use internal-external cross-validation in the model development phase.
A further clarification:

  • A practical problem might be that, with some centers being big and others small, different runs of the cross-validation will have different sample sizes, so the results may be confusing.
  • What is essentially the benefit of taking this approach instead of taking all the cases in model development and using the center (or its treatment pattern) as a covariate? To make myself clearer: Let’s say that out of 4 centers, 1 uses modern RT techniques for all patients and has a surgical nodal yield average of 40 per case, and another center mainly used simpler radiation techniques and has an acceptable but lower lymph node yield of 25 per case on average. I do expect these differences to lead to a small impact on survival. If I do center by center cross validation I will get more variability in my metrics, but if I can just combine the cases and use these factors as covariates with a larger sample size and just do bootstrapping, how much do I lose out? Especially so since my final model after internal external cross validation will include data from all centers anyway.
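
For concreteness, the internal-external cross-validation discussed here can be sketched as a leave-one-center-out loop: fit on all centers but one, evaluate on the one held out, and rotate. Everything below is a toy under stated assumptions: the center data are invented, the "model" is a single predictor with a fitted direction, and the c-statistic is for a binary outcome (Harrell's c would be the analogue for censored survival times). Reporting the held-out sample size next to each estimate addresses the unequal-center-size concern.

```python
def c_statistic(y_true, score):
    """Share of event/non-event pairs where the event scores higher (ties count 0.5)."""
    events = [s for y, s in zip(y_true, score) if y == 1]
    nonevents = [s for y, s in zip(y_true, score) if y == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    concordant = sum(1.0 if e > ne else 0.5 if e == ne else 0.0
                     for e in events for ne in nonevents)
    return concordant / (len(events) * len(nonevents))

def internal_external_cv(centers, fit, predict, metric):
    """Hold out each center in turn; return {center: (n_held_out, metric)}."""
    results = {}
    for held_out in centers:
        train_X, train_y = [], []
        for name, (X, y) in centers.items():
            if name != held_out:
                train_X.extend(X)
                train_y.extend(y)
        model = fit(train_X, train_y)          # refit from scratch each time
        X_test, y_test = centers[held_out]
        results[held_out] = (len(y_test), metric(y_test, predict(model, X_test)))
    return results

def fit_direction(X, y):
    """Toy model: the sign of the mean difference in the single predictor."""
    ev = [row[0] for row, yi in zip(X, y) if yi == 1]
    ne = [row[0] for row, yi in zip(X, y) if yi == 0]
    return 1.0 if sum(ev) / len(ev) >= sum(ne) / len(ne) else -1.0

def predict_score(sign, X):
    return [sign * row[0] for row in X]

# Hypothetical centers of unequal size: name -> (predictor rows, outcomes).
centers = {
    "A": ([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]], [0, 0, 0, 1, 0, 1]),
    "B": ([[1.5], [2.5], [3.5], [4.5]], [0, 0, 1, 1]),
    "C": ([[2.0], [5.0]], [0, 1]),
}
results = internal_external_cv(centers, fit_direction, predict_score, c_statistic)
for name, (n, c) in results.items():
    print(name, n, c)
```

The estimate from the two-patient center is almost uninformative on its own, which is the reason center-specific results are usually weighted by their precision rather than read one by one.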

Interesting discussion, thanks for posting.

Your last two points remind me of two development+validation papers:

  1. ISARIC 4C Deterioration model (in-hospital deterioration of COVID-19 patients): they used random-effects meta-analysis for internal-external validation. This is nice because it gives you an idea of the range of discrimination/calibration/net benefit metrics to expect in a new, randomly sampled center.

  2. ADNEX model (risk of ovarian cancer before surgery): they used the center as a random effect, so they could account for center-specific survival and predict for new centers using partial pooling. They performed temporal validation and, because the model was calibrated, they updated it using all available data.
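
To make the first point concrete, here is a minimal sketch of pooling center-specific performance estimates with a random-effects meta-analysis. It uses the DerSimonian-Laird estimator as an assumption on my part (the cited paper may have used a different estimator), and the calibration slopes and variances are invented. The t critical value is passed in by hand to keep the example free of external libraries.

```python
import math

def random_effects_pool(estimates, variances):
    """DerSimonian-Laird pooling; returns (pooled, standard error, tau^2)."""
    k = len(estimates)
    if k < 2:
        raise ValueError("need at least two centers")
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)        # between-center variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

def prediction_interval(pooled, se, tau2, t_crit):
    """Approximate range for the metric in a new center (t_crit on k - 2 df)."""
    half_width = t_crit * math.sqrt(tau2 + se ** 2)
    return pooled - half_width, pooled + half_width

# Hypothetical calibration slopes from 4 held-out centers and their variances.
slopes = [0.90, 1.10, 1.00, 0.80]
slope_vars = [0.010, 0.020, 0.015, 0.040]

pooled, se, tau2 = random_effects_pool(slopes, slope_vars)
# 4.303 is the 97.5th percentile of the t distribution with 2 degrees of freedom.
lower, upper = prediction_interval(pooled, se, tau2, t_crit=4.303)
print(pooled, se, tau2, lower, upper)
```

The prediction interval, not the confidence interval of the pooled value, is what speaks to performance in a new center. Metrics such as the c-statistic are often pooled on the logit scale instead.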

Hope those are helpful.


Thank you. This is very useful.

Remember that external validation that is not based on a huge sample can be misleading. That’s why resampling is usually a better approach (100 repeats of 10-fold cross-validation or 400 bootstrap resamples).

Sharing a link with relevant papers on that in case it is helpful (Sample size for external validation of a prognostic model section):

Thanks, Prof. So if I don’t have a big enough sample from an external institute, as suggested by the validation sample size paper by Riley et al., should I include all patients and run 400 bootstrap resamples?

Thanks @giuliano-cruz. This is a goldmine.

Yes, I would do strong internal validation, making sure to repeat all supervised learning steps afresh, hundreds of times, which also fully exposes instability in any variable selection used (variable selection is not advised at any rate). Its goal is to estimate likely future model performance from the same stream of patients. And if you include patients from different geographical areas or health systems in the model development and validation, you can model geographical effects. The same goes for secular trends, by splining absolute calendar time.
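
A minimal sketch of the bootstrap optimism correction, to show what "repeat all supervised learning steps afresh" means in practice. Everything here is a toy: the data are pure noise, the outcome is binary, and the "model" deliberately includes a variable selection step (not recommended, as noted above) only to show that selection has to happen inside the loop. All function names are hypothetical.

```python
import random

def c_statistic(y_true, score):
    """Share of event/non-event pairs where the event scores higher (ties 0.5)."""
    ev = [s for y, s in zip(y_true, score) if y == 1]
    ne = [s for y, s in zip(y_true, score) if y == 0]
    conc = sum(1.0 if e > n else 0.5 if e == n else 0.0 for e in ev for n in ne)
    return conc / (len(ev) * len(ne))

def fit_with_selection(X, y):
    """Toy model: keep the one column with the largest |mean difference|."""
    best_col, best_diff = 0, 0.0
    for j in range(len(X[0])):
        ev = [row[j] for row, yi in zip(X, y) if yi == 1]
        ne = [row[j] for row, yi in zip(X, y) if yi == 0]
        diff = sum(ev) / len(ev) - sum(ne) / len(ne)
        if abs(diff) > abs(best_diff):
            best_col, best_diff = j, diff
    return best_col, (1.0 if best_diff >= 0 else -1.0)

def predict(model, X):
    col, sign = model
    return [sign * row[col] for row in X]

def optimism_corrected(X, y, fit, predict, metric, n_boot=400, seed=1):
    """Return (apparent, mean optimism, optimism-corrected) performance."""
    rng = random.Random(seed)
    n = len(y)
    apparent = metric(y, predict(fit(X, y), X))
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue                                 # skip degenerate resamples
        model_b = fit(Xb, yb)                        # ALL modelling steps redone
        boot_perf = metric(yb, predict(model_b, Xb)) # performance where it was fit
        test_perf = metric(y, predict(model_b, X))   # same model on original data
        optimism.append(boot_perf - test_perf)
    mean_optimism = sum(optimism) / len(optimism)
    return apparent, mean_optimism, apparent - mean_optimism

# Hypothetical noise data: 60 patients, 5 unrelated predictors, 20 events.
data_rng = random.Random(0)
X = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(60)]
y = [1 if i % 3 == 0 else 0 for i in range(60)]

apparent, mean_optimism, corrected = optimism_corrected(
    X, y, fit_with_selection, predict, c_statistic)
print(apparent, mean_optimism, corrected)
```

If the selection step were done once on the full data and only the final fit were bootstrapped, the optimism estimate would miss the overfitting introduced by the selection itself.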