About External Validation

ArminAll · February 19, 2025, 12:37pm

Hi everyone,
I have three datasets: A, B, and C. Datasets A and B come from the same source, while dataset C comes from a different source. I applied unsupervised dimensionality reduction using all three datasets and then trained and developed the model using dataset A. I subsequently evaluated the model using datasets B and C.
Would using dataset C in this context be considered external validation?
Thank you in advance for your insights!

samw235711 · February 19, 2025, 1:40pm

You cannot use all three datasets for the dimensionality reduction; you have to fit it on the training set only (A and B), or probably just A if you want to keep B as an internal test set.

Otherwise, if C is from a different source, it would appear to be external validation.
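For illustration, here is a minimal sketch of the train-only approach described above: the reduction is learned from A alone and then applied unchanged to B and C. The data are synthetic, and the helper names `fit_pca` and `apply_pca` are hypothetical; PCA is assumed here only as an example, since the thread does not say which reduction method was used.

```python
import numpy as np

def fit_pca(X, n_components):
    """Learn the centering vector and principal axes from X only."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def apply_pca(X, mean, components):
    """Project X using a centering vector and axes learned elsewhere."""
    return (X - mean) @ components.T

# Synthetic stand-ins for the three datasets (assumption: 10 predictors each).
rng = np.random.default_rng(0)
X_A = rng.normal(size=(200, 10))           # training data
X_B = rng.normal(size=(80, 10))            # same source as A
X_C = rng.normal(loc=0.3, size=(120, 10))  # different source

# Fit the reduction on A only, then reuse it for every dataset.
mean_A, comps_A = fit_pca(X_A, n_components=3)
Z_A = apply_pca(X_A, mean_A, comps_A)
Z_B = apply_pca(X_B, mean_A, comps_A)
Z_C = apply_pca(X_C, mean_A, comps_A)
```

With this layout, nothing about B or C (not even their predictor distributions) influences the features the model is trained on.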

f2harrell · February 20, 2025, 2:08pm

But there are some cases where doing data reduction on all data combined is OK, i.e., when any instability in the data reduction patterns does not systematically bias predictive modeling once the outcome is revealed.

Whether C gives you an external validation depends on how C was selected.
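A rough sketch of the pooled, outcome-blind workflow mentioned above may help make the distinction concrete. It assumes PCA as the reduction and ordinary least squares as the model, neither of which is specified in the thread, and all data are synthetic. The point it illustrates is only that the reduction step never sees an outcome, while the supervised step uses A alone; it does not settle whether C was selected in a way that makes the validation external.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 8  # number of predictors (assumption)

# Synthetic predictors for A, B, C and outcomes for A and C.
X_A, X_B, X_C = (rng.normal(size=(n, p)) for n in (150, 60, 90))
beta = rng.normal(size=p)
y_A = X_A @ beta + rng.normal(size=150)
y_C = X_C @ beta + rng.normal(size=90)

# Step 1: unsupervised reduction on the pooled predictors; no outcome is used.
X_all = np.vstack([X_A, X_B, X_C])
mean_all = X_all.mean(axis=0)
_, _, Vt = np.linalg.svd(X_all - mean_all, full_matrices=False)
components = Vt[:3]

def scores(X):
    """Project predictors onto the pooled principal axes."""
    return (X - mean_all) @ components.T

# Step 2: supervised fitting uses A only (intercept plus 3 component scores).
D_A = np.column_stack([np.ones(len(X_A)), scores(X_A)])
coef, *_ = np.linalg.lstsq(D_A, y_A, rcond=None)

# Step 3: evaluate on C, whose outcome played no role in steps 1 and 2.
D_C = np.column_stack([np.ones(len(X_C)), scores(X_C)])
pred_C = D_C @ coef
mse_C = float(np.mean((y_C - pred_C) ** 2))
```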

ArminAll · February 20, 2025, 5:21pm

Thank you, Dr. Harrell,
Just a question.
What are those conditions?
C is a regional representative cohort that is temporally and otherwise independent of datasets A and B, and it is being used only for the dimensionality reduction and then the external validation.

Thanks

f2harrell · February 20, 2025, 11:49pm

Isn’t C informative enough, including estimating time trends, to include it in model development?

samw235711 · February 21, 2025, 12:07am

You’re right! Apologies. I think I just had a knee-jerk reaction. For example, consider an unsupervised dimensionality reduction that just takes the first covariate: it gives the same result no matter which datasets it is run on.
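The first-covariate example can be written out as a toy sketch. The function name `first_covariate` and the synthetic data are hypothetical; the only claim is that a reduction which ignores the rows it is "fitted" on cannot leak anything from the validation set.

```python
import numpy as np

def first_covariate(X_fit, X_apply):
    """A trivial 'reduction': ignore X_fit and keep column 0 of X_apply."""
    return X_apply[:, [0]]

rng = np.random.default_rng(2)
X_A = rng.normal(size=(50, 5))
X_C = rng.normal(size=(30, 5))

# "Fit" on A only versus on A and C pooled; the reduced C is the same.
Z_from_A = first_covariate(X_A, X_C)
Z_from_all = first_covariate(np.vstack([X_A, X_C]), X_C)
print(np.array_equal(Z_from_A, Z_from_all))  # True
```

Real reductions such as PCA sit somewhere between this extreme and a fully data-dependent transformation, which is why the answer depends on how unstable the reduction is.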

ArminAll · February 21, 2025, 1:36pm

It’s informative, but since we needed an external validation, we set it aside for that purpose. Can we use it as external validation in this way?

f2harrell · February 24, 2025, 12:26pm

Why did you need an external validation? Did you do a sample size calculation to show that both the training sample size and the external validation sample size are adequate?