Clustered data for developing an algorithm to assess a rare skin disease

We are studying a horrible rare skin disease in which the affected body surface area (BSA) is a key parameter in treatment decisions and prediction models. We believe we have an approach to train neural networks to estimate BSA, provided enough training images are available. We will report the network's performance as the error in its BSA estimate relative to expert analysis of the same images.

To develop this, we have three major global centers able to send us images, but their population characteristics are quite different, and we know the disease appears very different in dark vs. light skin. What are the recommendations for splitting this three-center population into training and held-out test sets?

Specifically, let’s say the centers have the following patient populations:
Center A: 300 light-skinned patients
Center B: 200 light, 100 dark-skinned patients
Center C*: 100 light, 50 dark-skinned patients
*Center C is a referral center specializing in an unusual-appearing variant, let’s say the “shiny” disease variant, so is highly enriched for this type of disease.

Because training the network is a time-consuming process (this is not a simple regression model), we can probably run only about a dozen rounds of training and testing (experiments). We would report each experiment's performance separately for light- and dark-skinned patients, and also for the shiny vs. regular variant.

My initial thought on the splits would be something like the following 11 experiments (sketched in code after the list):
Pooled experiments: train on a random sample of 80% of all pooled data; test on the remaining 20% (repeated as 5-fold cross-validation)

Single training sets: train on center A; test on centers B and C (repeat with each center as the training set)

Leave-one-center-out: train on centers A and B; test on center C (repeat for all 3 possibilities)
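
To make these splits concrete, here is a minimal sketch, assuming a hypothetical `patients` DataFrame with `center`, `skin_tone`, and `variant` columns and a placeholder `train_and_evaluate()` that trains the network and returns per-patient BSA errors; none of these names come from an actual pipeline.

```python
# Sketch of the 11 proposed experiments: 5-fold pooled CV, 3 single-center
# training runs, and 3 leave-one-center-out runs. `patients` is a hypothetical
# pandas DataFrame and train_and_evaluate() is a placeholder that trains the
# network and returns per-patient BSA errors indexed like `patients`.
from sklearn.model_selection import KFold

centers = ["A", "B", "C"]
experiments = {}

# 1-5: pooled 5-fold cross-validation (each fold is a random 80/20 split)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, te) in enumerate(kf.split(patients)):
    experiments[f"pooled_fold_{fold}"] = train_and_evaluate(
        patients.index[tr], patients.index[te])

# 6-8: train on a single center, test on the other two
for c in centers:
    experiments[f"train_{c}_only"] = train_and_evaluate(
        patients.index[patients.center == c],
        patients.index[patients.center != c])

# 9-11: leave one center out (train on two centers, test on the third)
for c in centers:
    experiments[f"leave_out_{c}"] = train_and_evaluate(
        patients.index[patients.center != c],
        patients.index[patients.center == c])
```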

Ultimately, we would release to the world a network trained on the full set of pooled data. The important question for a future center using the network is what kind of performance it might expect. Would the best prediction of performance for a random dark-skinned patient at a future new center be the pooled cross-validation experiment above? Would the best prediction of performance for a random dark-skinned patient with the shiny disease variant similarly come from a subgroup analysis of the pooled cross-validation experiment?
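
For reference, the subgroup estimates from the pooled cross-validation would just be error summaries within each skin-tone by variant cell. A minimal sketch, reusing the hypothetical `experiments` and `patients` objects from the sketch above (the out-of-fold errors from the five pooled folds together cover every patient):

```python
# Subgroup analysis of the pooled CV predictions: mean absolute error of the
# predicted BSA (%) within each skin-tone x variant cell.
import pandas as pd

oof_errors = pd.concat(experiments[f"pooled_fold_{k}"] for k in range(5))
summary = (
    pd.DataFrame({"abs_error": oof_errors.abs(),
                  "skin_tone": patients.skin_tone,
                  "variant": patients.variant})
    .groupby(["skin_tone", "variant"])["abs_error"]
    .agg(n="size", mae="mean", sd="std")
)
print(summary)  # note how few patients land in the rarest cells
```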

I am wondering whether the results of the center-level training/testing could help us estimate how a network trained on, say, 20 centers might perform when applied to yet another center's population.
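
One rough way to use the center-level experiments for that purpose would be to treat each leave-one-center-out error summary as a draw from the distribution of "new center" performance and look at its spread. With only three centers this is purely illustrative, but with ~20 centers the same summary becomes more informative. A minimal sketch, again reusing the hypothetical `experiments` dictionary:

```python
# Spread of performance across held-out centers from the leave-one-center-out
# runs. The mean and between-center SD of the MAE give a rough sense of what a
# future, unseen center might expect.
import numpy as np

center_mae = {c: experiments[f"leave_out_{c}"].abs().mean() for c in ["A", "B", "C"]}
maes = np.array(list(center_mae.values()))
print(center_mae)
print(f"mean MAE across held-out centers: {maes.mean():.2f}")
print(f"between-center SD of MAE: {maes.std(ddof=1):.2f}")
```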

Thank you for any thoughts!


This is a good question. I don’t have an answer, but I have a general opinion that splitting on things known to be different is a recipe for failure to validate. Contrast that with explicitly modeling center or skin tone effects. In statistical models this would be done by pre-specifying selected interactions to include.
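
For a continuous BSA outcome, one way to read "explicitly modeling center or skin tone effects" is a regression of the expert BSA on an image-derived score with pre-specified interaction terms, rather than separate models per subgroup. A minimal sketch with made-up variable names (`image_score`, plus the hypothetical `patients` table from the earlier sketch), intended only as an illustration of the idea:

```python
# Pre-specified interactions of a hypothetical image-derived score with skin
# tone and variant, plus a center effect, in a single regression model.
import statsmodels.formula.api as smf

fit = smf.ols(
    "bsa_expert ~ image_score * skin_tone + image_score * variant + C(center)",
    data=patients,  # hypothetical table containing these columns
).fit()
print(fit.summary())
```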

In my opinion, I would use an additional reference standard as a "gold" standard for testing network performance, e.g., have three independent experts analyze all patient documentation and images, and use that consensus assessment to compare the performance of the NN against the experts as well (because a single expert can make mistakes too). As a performance measure, I would recommend a benefit-harm ratio for misclassified patients (weighing the consequences of false positives and false negatives) rather than standard machine learning performance measures. It is much more clinically meaningful than, e.g., the harmonic mean (F1 score). In other words, maximising the harmonic mean will not always give you the best balance of clinical consequences for patients (in many cases, you can get better clinical performance at lower levels of specificity/sensitivity).
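
As one way to operationalize the multi-expert reference standard for a continuous BSA outcome, here is a minimal sketch using the median of three hypothetical expert readings as the consensus (a simplification, since each expert is then compared against a consensus that includes their own reading):

```python
# Consensus reference standard: per-image median of three experts' BSA (%)
# readings, then compare both the network and each individual expert to it.
# All arrays are hypothetical.
import numpy as np

expert_bsa = np.column_stack([expert1_bsa, expert2_bsa, expert3_bsa])  # (n_images, 3)
consensus = np.median(expert_bsa, axis=1)

nn_mae = np.mean(np.abs(nn_bsa - consensus))
per_expert_mae = np.mean(np.abs(expert_bsa - consensus[:, None]), axis=0)
print(f"network MAE vs consensus: {nn_mae:.2f}")
print("individual expert MAE vs consensus:", per_expert_mae.round(2))
```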
Exploratively, in your situation, I would run an unsupervised learning analysis (if possible) to understand the clusters better. Maybe that can help you make a more informed decision about the data split.
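
A minimal sketch of that exploratory step, assuming a hypothetical `features` matrix (e.g., image embeddings) and the `patients` table from earlier; the cross-tabulations show whether the clusters track center or skin tone:

```python
# Exploratory clustering of image-level features to see whether natural
# clusters line up with center, skin tone, or variant before committing to a
# split. `features` is a hypothetical (n_patients, n_features) array.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(features))
print(pd.crosstab(labels, patients.center, rownames=["cluster"]))
print(pd.crosstab(labels, patients.skin_tone, rownames=["cluster"]))
```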

Thank you Vlad - how do you determine a false positive for a continuous variable (in this case, the outcome of interest is the % of body surface area affected)?

@f2harrell – I would guess a mean squared error might be useful if the outcome is a % and the “prediction” is also a percent. It is not so much a “false positive” problem as a standard actual-vs-predicted/calibration problem (similar to an ordinary least squares problem).
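
A minimal sketch of that kind of error and calibration summary for a percent outcome, with hypothetical `bsa_true` and `bsa_pred` arrays:

```python
# Error and calibration for a continuous BSA (%) outcome: squared/absolute
# error plus a simple calibration check (regress observed on predicted; a
# slope near 1 and intercept near 0 suggest good calibration).
import numpy as np

residuals = bsa_pred - bsa_true          # hypothetical arrays of BSA (%)
mse = np.mean(residuals ** 2)
mae = np.mean(np.abs(residuals))

slope, intercept = np.polyfit(bsa_pred, bsa_true, deg=1)  # observed ~ predicted
print(f"MSE: {mse:.2f}   MAE: {mae:.2f}")
print(f"calibration slope: {slope:.2f}, intercept: {intercept:.2f}")
```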
