About External Validation

Hi everyone,
I have three datasets: A, B, and C. Datasets A and B come from the same source, while dataset C comes from a different source. I applied unsupervised dimensionality reduction using all three datasets and then trained and developed the model using dataset A. I subsequently evaluated the model using datasets B and C.
Would using dataset C in this context be considered external validation?
Thank you in advance for your insights!

You cannot use all three datasets for the dimensionality reduction; it has to be fit on the training set only (A and B), or on A alone if you want to keep B as an internal test set.

Otherwise, if C is from a different source, evaluating on it would appear to qualify as external validation.
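
To make the train-only approach concrete, here is a minimal sketch, assuming PCA as the unsupervised reduction (the thread does not say which method was used). The arrays `A`, `B`, and `C` are random stand-ins for the three datasets, and the helper names `fit_pca` and `apply_pca` are made up for illustration. The centering vector and principal axes are learned from A alone and then applied unchanged to B and C.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Learn the centering vector and principal axes from the training data only."""
    mean = X_train.mean(axis=0)
    # SVD of the centered training matrix; rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_components]

def apply_pca(X, mean, components):
    """Project any dataset using parameters that were learned on the training set."""
    return (X - mean) @ components.T

# Hypothetical stand-ins for the three datasets (random numbers, illustration only).
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10))           # development data
B = rng.normal(size=(100, 10))           # same source as A, internal test set
C = rng.normal(loc=0.5, size=(150, 10))  # different source, candidate external validation

# Fit on A only, then reuse the same parameters everywhere.
mean_A, comps_A = fit_pca(A, n_components=3)
Z_A = apply_pca(A, mean_A, comps_A)  # used to develop the model
Z_B = apply_pca(B, mean_A, comps_A)  # never seen during fitting
Z_C = apply_pca(C, mean_A, comps_A)  # never seen during fitting
```

With this layout, nothing about B or C can leak into the reduction step, at the cost of not using their covariate information when estimating the components.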

But there are some cases where doing data reduction on all data combined is OK, i.e., when any instability in the data reduction patterns does not systematically bias the predictive modeling once the outcome is revealed.

Whether C gives you an external validation depends on how C was selected.

Thank you, Dr. Harrell,
Just a question.
What are those conditions?
C is a regionally representative cohort that is temporally and otherwise independent of datasets A and B. It is being used only for the dimensionality reduction and then for the external validation.

Thanks

Isn’t C informative enough, including estimating time trends, to include it in model development?

You’re right! Apologies. I think I just had a knee-jerk reaction. For example, consider a trivial unsupervised dimensionality reduction: just taking the first covariate. That does not depend on which datasets are pooled.

It’s informative, but since we needed an external validation, we set it aside for that purpose. Can it serve as external validation when used in this way?

Why did you need an external validation? Did you do a sample size calculation to show that both the training sample size and the external validation sample size are adequate?