We are studying a horrible rare skin disease in which the affected body surface area (BSA) is a key parameter in treatment decisions and prediction models. We believe we can train neural networks to estimate BSA given enough training images. We will report network performance as the error in estimated BSA relative to expert analysis of the same images.
To develop this, we have three major global centers that can send us images, but their population characteristics differ substantially, and we know the disease's appearance is very different on dark vs. light skin. What would you recommend for splitting this three-center population into training and held-out test sets?
Specifically, let’s say the centers have the following patient populations:
Center A: 300 light-skinned patients
Center B: 200 light, 100 dark-skinned patients
Center C*: 100 light, 50 dark-skinned patients
*Center C is a referral center specializing in an unusual-appearing variant (call it the “shiny” variant), so it is highly enriched for this type of disease.
Because training the network is a time-consuming process (unlike fitting a simple regression model), we can probably run only about a dozen rounds of training and testing (experiments). We would report each experiment's performance separately for light- vs. dark-skinned patients, and for the shiny vs. regular variant.
My initial thought on the splits would be something like the following 11 experiments:
Pooled experiments: train on a random 80% sample of all pooled data; test on the remaining 20% (repeated as 5-fold cross-validation)
Single training sets: train on center A; test on centers B and C (repeat with each center as the training set)
Leave-one-center-out: train on centers A and B; test on center C (repeat for all 3 possibilities)
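To make the three designs concrete, here is a minimal sketch of how the splits could be generated with scikit-learn. It assumes one image per patient and mirrors the center populations listed above; the arrays, strata definition, and random seed are illustrative choices, not part of the original plan.

```python
# Sketch of the three split designs using scikit-learn (illustrative only).
import numpy as np
from sklearn.model_selection import StratifiedKFold, LeaveOneGroupOut

# Mirror the center populations described above (750 patients total).
centers = np.array(["A"] * 300 + ["B"] * 300 + ["C"] * 150)
skin = np.array(
    ["light"] * 300 +                   # Center A
    ["light"] * 200 + ["dark"] * 100 +  # Center B
    ["light"] * 100 + ["dark"] * 50     # Center C
)
X = np.arange(len(centers))  # stand-in for image indices

# 1) Pooled 5-fold CV, stratified on center x skin tone so each fold
#    preserves the subgroup proportions.
strata = np.char.add(centers, skin)
pooled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pooled_splits = list(pooled_cv.split(X, strata))

# 2) Single training sets: train on one center, test on the other two.
single_center = [
    (np.where(centers == c)[0], np.where(centers != c)[0])
    for c in ["A", "B", "C"]
]

# 3) Leave-one-center-out: train on two centers, test on the third.
loco_splits = list(LeaveOneGroupOut().split(X, groups=centers))

print(len(pooled_splits), len(single_center), len(loco_splits))  # -> 5 3 3
```

If several images per patient were available, the pooled CV would need to group by patient as well (e.g. `StratifiedGroupKFold`) so no patient appears in both training and test folds.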
Ultimately, we would release to the world a network trained on the full pooled data set. The important question for a future center using the network is what performance it might expect. Would the best prediction of performance for a random dark-skinned patient at a future new center be the pooled cross-validation experiment above? And would the best prediction for a random dark-skinned patient with the shiny variant similarly come from subgroup analysis of the pooled cross-validation experiment?
I am also wondering whether the center-level training/testing results could help us estimate how a network trained on, say, 20 centers might perform when applied to yet another center's population.
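One common way to use the leave-one-center-out results for this is to treat each held-out center's error as a draw from a distribution over centers, and report the mean plus between-center spread as a rough forecast for an unseen center. A minimal sketch, where the per-center MAE values are made-up placeholders for your actual experiment outputs:

```python
# Hypothetical sketch: summarizing leave-one-center-out results to forecast
# performance at a future, unseen center. Error values are placeholders.
import numpy as np

# Mean absolute BSA error (% BSA) when each center was held out (hypothetical).
held_out_mae = {"A": 2.1, "B": 2.8, "C": 4.0}

errs = np.array(list(held_out_mae.values()))
center_mean = errs.mean()
# Between-center SD captures how much performance shifts across sites;
# with only 3 centers this is a very rough estimate (ddof=1 for sample SD).
center_sd = errs.std(ddof=1)

print(f"expected new-center MAE ~ {center_mean:.2f} +/- {center_sd:.2f} % BSA")
```

With 20 centers instead of 3, the same summary (ideally per subgroup, e.g. dark-skinned or shiny-variant patients only) would give a much more stable estimate of the between-center variability.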
Thank you for any thoughts!