Power calculation for limits of agreement of imaging algorithm vs expert, to be applied with a lay human in the loop

Would really appreciate anyone’s advice on how to fairly estimate the performance of a measurement algorithm when a human in the loop will determine when it is applied.

Specifics:
We have preliminary data for a deep learning image analysis algorithm that helps providers estimate the fraction of abnormal skin in a high-fidelity, calibrated 3D photo by marking out the affected region and reporting the percentage of the photographed body surface area (%BSA) affected. We also have ground truth for the true affected region from a highly trained human. Performance is measured simply as the difference in %BSA between the algorithm and the human, with limits of agreement (LoA) per Bland-Altman (Lancet, 1986), and we are considering the power calculation proposed by Lu (PMID 27838682) for a larger study.
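For concreteness, here is a minimal sketch of the LoA calculation we are using on the preliminary data; `bsa_algo` and `bsa_expert` are hypothetical stand-ins for our paired %BSA measurements:

```r
# Bland-Altman limits of agreement for paired %BSA measurements
# (hypothetical example data; replace with the real paired values)
set.seed(1)
bsa_expert <- runif(40, 0, 30)                  # expert-marked %BSA
bsa_algo   <- bsa_expert + rnorm(40, 0, 1.5)    # algorithm-marked %BSA

d    <- bsa_algo - bsa_expert                   # per-image difference
bias <- mean(d)                                 # mean difference (bias)
loa  <- bias + c(-1, 1) * 1.96 * sd(d)          # 95% limits of agreement

c(bias = bias, lower_LoA = loa[1], upper_LoA = loa[2])
```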

Overall performance is quite good, with a low mean absolute error. However, the LoA are very wide because of a few outliers where the algorithm produces obviously erroneous results with completely unrealistic markings of the skin. In practice, we envision the lay human using the algorithm simply flagging such output as unrealistic and referring to standard of care, as if the algorithm had never existed.

Given this intended use scenario, does it sound reasonable to have a lay human go through our preliminary data and throw out any algorithm outputs that are obviously unrealistic?

I would appreciate any recommendations on the best ways to power an image analysis algorithm study design for different use scenarios, based on preliminary data!

I would also appreciate pointers to literature on evaluating agreement that takes spatial information into account. The machinery seems to apply nicely to the outcome of surface area differences, but this does not, for example, differentiate between completely missing one of the lesions to be marked vs merely having wider/narrower borders on a correctly identified lesion. The Dice/Jaccard index has the same issue (a toy illustration of this is sketched below)!
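To make the Dice limitation concrete, here is a toy sketch with hypothetical binary masks: a prediction that misses the small lesion entirely and one that only narrows the lesion borders can end up with the same Dice score, even though they are clinically very different errors:

```r
# Toy illustration: identical Dice scores, very different spatial errors
# (hypothetical masks represented as logical vectors over 200 "pixels")
dice <- function(a, b) 2 * sum(a & b) / (sum(a) + sum(b))

truth <- rep(FALSE, 200)
truth[1:100]   <- TRUE   # large lesion: 100 pixels
truth[151:160] <- TRUE   # small lesion: 10 pixels

pred_miss   <- rep(FALSE, 200); pred_miss[1:100] <- TRUE               # misses the small lesion entirely
pred_narrow <- rep(FALSE, 200); pred_narrow[c(1:91, 151:159)] <- TRUE  # shrinks both lesion borders slightly

c(dice_missed_lesion = dice(truth, pred_miss),
  dice_narrow_border = dice(truth, pred_narrow))   # both ~0.95
```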

I don’t know about the spatial aspect, but I’d focus on mean absolute error so that extreme values don’t have a serious effect. If you still have a robustness problem, then median absolute error might be considered. For related material, see the observer agreement chapter in BBR.
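A quick sketch comparing the two summaries on the same hypothetical differences, showing how much less the median absolute error moves when a few grossly unrealistic outputs are present:

```r
# Mean vs. median absolute error on the same differences,
# with and without a few gross outliers (hypothetical values)
set.seed(2)
d_clean   <- rnorm(40, 0, 1.5)            # typical %BSA differences
d_outlier <- c(d_clean, 25, -30, 40)      # add three unrealistic outputs

summarise <- function(d) c(MAE = mean(abs(d)), MedAE = median(abs(d)))
rbind(clean = summarise(d_clean), with_outliers = summarise(d_outlier))
```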

Thank you Frank! I have been a long-time student of Chapter 16 of BBR! Is there an accepted method for power calculations on the mean absolute difference? In our case of comparing two methods, I believe the absolute difference is likely to follow a half-normal distribution.
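If the signed difference is roughly N(0, σ²), the absolute difference is half-normal with mean σ√(2/π) and variance σ²(1 − 2/π). Below is my own back-of-the-envelope, precision-based sample size sketch built on those moments (not an established method from the literature; all numbers are hypothetical):

```r
# Precision-based sample size for the mean absolute difference,
# assuming the signed difference D ~ N(0, sigma^2) so |D| is half-normal
sigma    <- 2                           # assumed SD of the signed %BSA difference
mean_abs <- sigma * sqrt(2 / pi)        # E|D|
sd_abs   <- sigma * sqrt(1 - 2 / pi)    # SD of |D|

margin <- 0.25                          # desired half-width of 95% CI for mean |D|, in %BSA
n      <- ceiling((qnorm(0.975) * sd_abs / margin)^2)
c(mean_abs = mean_abs, sd_abs = sd_abs, n = n)
```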

For anyone else following this thread, Zhouwen Liu from Vanderbilt has R software, including confidence limits for more complex cases (mean pairwise difference U-statistic):
https://biostat.app.vumc.org/wiki/Main/AnalysisOfObserverVariability


I don’t have a method for computing power, but power isn’t really so appropriate for this problem, which is more of an estimation problem. The method I fall back on is computing the confidence interval from one study, figuring out by what factor it needs to be shorter, squaring that factor, and multiplying the one study’s sample size by it.
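In other words, because the CI width scales roughly as 1/√n, a minimal sketch of that calculation (with hypothetical pilot numbers) looks like:

```r
# Scale the pilot sample size by the square of the desired CI shrinkage factor
# (hypothetical pilot numbers; CI width scales roughly as 1/sqrt(n))
n_pilot        <- 40     # images in the preliminary study
ci_width_pilot <- 6.0    # observed CI width, in %BSA
ci_width_goal  <- 3.0    # width wanted in the larger study

shrink_factor <- ci_width_pilot / ci_width_goal
n_needed      <- ceiling(n_pilot * shrink_factor^2)
n_needed                 # 160 in this example
```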
