Would really appreciate anyone’s advice on how to fairly estimate the performance of a measurement algorithm when a human in the loop will determine when it is applied.
Specifics:
We have preliminary data for a deep learning image analysis algorithm that helps providers estimate the fraction of abnormal skin in a high-fidelity, calibrated 3D photo by marking out the abnormal region and reporting the percentage of the photographed body surface area (%BSA) affected. We also have ground truth for the truly affected region from a highly trained human. Performance is measured simply as the difference between the algorithm’s %BSA and the human’s, with limits of agreement (LoA) per Bland-Altman (1986, Lancet), and we are considering the power calculation proposed by Lu (PMID 27838682) for a larger study.
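In case it helps ground the question, here is a minimal sketch of how we compute the Bland-Altman limits of agreement on the paired %BSA estimates (the data values below are made up purely for illustration; the last pair mimics an outlier-style reading):

```python
import numpy as np

def bland_altman_loa(algo_bsa, human_bsa):
    """Bland-Altman (1986) bias and 95% limits of agreement for paired %BSA estimates."""
    diffs = np.asarray(algo_bsa, float) - np.asarray(human_bsa, float)
    bias = diffs.mean()
    sd = diffs.std(ddof=1)  # sample SD of the paired differences
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Illustrative paired %BSA values (not real data)
algo  = [4.2, 7.9, 12.5, 3.1, 30.0]   # final value: an "unrealistic marking" case
human = [4.0, 8.3, 11.8, 3.5, 9.5]
bias, (lo, hi) = bland_altman_loa(algo, human)
```

Even with a low mean absolute error on most pairs, a single outlier like the last one inflates the SD of the differences and hence the width of the LoA.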
Overall performance is quite good, with a low mean absolute error. However, the LoA are very wide due to a few outliers in which the algorithm produces obviously erroneous results with completely unrealistic markings of the skin. In practice, we envision a lay human using the algorithm who would simply flag such output as unrealistic and refer to standard of care as if the algorithm had never existed.
Given this intended use scenario, does it sound reasonable to have a lay human go through our preliminary data and discard any algorithm outputs that are obviously unrealistic?
I’d also appreciate any recommendations on how best to power an image analysis algorithm study for different use scenarios, based on preliminary data!
Finally, I’d appreciate pointers to literature on evaluating agreement in a way that takes spatial information into account. The agreement machinery applies nicely to the surface-area-difference outcome, but that outcome does not, for example, differentiate between completely missing one of the lesions and merely drawing wider/narrower borders around a correctly identified lesion. The Dice/Jaccard index has the same issue!
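To make that last point concrete, here is a toy sketch (masks entirely made up) in which missing a small lesion outright and shrinking the border of a correctly identified lesion yield similar Dice scores, even though the errors are qualitatively very different:

```python
import numpy as np

def dice(a, b):
    """Dice coefficient between two boolean masks."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

# Ground truth: a 10x10 lesion and a separate 5x5 lesion on a 30x30 canvas
gt = np.zeros((30, 30), bool)
gt[2:12, 2:12] = True      # large lesion (100 px)
gt[20:25, 20:25] = True    # small lesion (25 px)

# Error mode 1: the small lesion is missed entirely
miss = np.zeros_like(gt)
miss[2:12, 2:12] = True

# Error mode 2: both lesions found, but the large lesion's border is drawn too narrow
narrow = np.zeros_like(gt)
narrow[3:11, 3:11] = True  # large lesion shrunk to 8x8
narrow[20:25, 20:25] = True

d_miss, d_narrow = dice(gt, miss), dice(gt, narrow)
```

Here both Dice values land in the same ballpark (roughly 0.83–0.89), so the index alone cannot distinguish a detection failure from a delineation error, which is exactly the distinction we care about.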