Multiple-rater multiple-case studies

Does anyone have recommendations for analysis methods for comparing average physician performance against a machine learning model in terms of the difference in F1 score or accuracy? The study design would be a multiple-rater, multiple-case study in which each case is reviewed by several (but only a subset of) physicians as well as by the machine learning model.

What little literature I could find covers AUROC analyses for this scenario (and I am not even sure I can get a difference out of those). As far as I can tell, this is because AUROC is what the FDA asks for when assessing new radiology/imaging methods, so other questions seem far less well investigated.

As far as I understand, naively calculating the metric of interest gives a reasonable point estimate, but confidence intervals and hypothesis tests are quite hard to get (and simply bootstrapping cases, as far as I can tell, does not reflect all the sources of variation). I have an idea for a fancy Bayesian random effects model (I can write it down as a frequentist model, but I seem to have no chance at all of fitting it via maximum likelihood), but I wanted to check that I am not missing some obvious approach here.
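To make the resampling concern concrete, here is a minimal sketch of what I mean by going beyond bootstrapping cases: resample cases and readers simultaneously, so that reader-to-reader variability enters the interval. The long-format layout and all column names (`case_id`, `reader_id`, `y_true`, `prob`) are made up for illustration; for the physician-minus-model difference one would compute both metrics inside each replicate.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def brier(y_true, y_prob):
    """Brier score: mean squared error of predicted probabilities."""
    return np.mean((y_prob - y_true) ** 2)

def two_way_bootstrap(df, metric=brier, n_boot=2000):
    """Resample cases and readers with replacement simultaneously.

    df: long format, one row per rating, with columns
        'case_id', 'reader_id', 'y_true', 'prob'.
    Returns an array of bootstrap replicates of the pooled metric.
    """
    cases = df["case_id"].unique()
    readers = df["reader_id"].unique()
    draws = np.empty(n_boot)
    for b in range(n_boot):
        boot_cases = pd.DataFrame(
            {"case_id": rng.choice(cases, size=len(cases), replace=True)})
        boot_readers = pd.DataFrame(
            {"reader_id": rng.choice(readers, size=len(readers), replace=True)})
        # Merging on duplicated keys replicates each rating with the right
        # multiplicity; ratings whose case or reader was not drawn drop out.
        boot = (boot_cases.merge(df, on="case_id")
                          .merge(boot_readers, on="reader_id"))
        draws[b] = metric(boot["y_true"].to_numpy(), boot["prob"].to_numpy())
    return draws
```

Even this sketch leaves me unsure whether the two resampling dimensions capture the case-by-reader interaction properly, which is why I keep coming back to the random effects formulation.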

Any suggestions or pointers would be appreciated.


F1 and accuracy are improper accuracy scoring rules, i.e., they are optimized by a bogus model. See here for more details.
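A toy illustration of the problem (my own sketch, not from the linked post): with a rare outcome, accuracy barely distinguishes perfectly calibrated risk predictions from a useless constant "no event" rule, while the Brier score separates them clearly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
p_true = rng.beta(1, 9, size=n)      # true risks, roughly 10% prevalence
y = rng.binomial(1, p_true)          # observed binary outcomes

calibrated = p_true                  # ideal probability model
constant = np.zeros(n)               # bogus model: always predict "no event"

def accuracy(y, p):                  # thresholds probabilities at 0.5
    return np.mean((p >= 0.5) == y)

def brier(y, p):                     # proper scoring rule
    return np.mean((p - y) ** 2)

print(accuracy(y, calibrated), accuracy(y, constant))  # nearly identical, ~0.90
print(brier(y, calibrated), brier(y, constant))        # ~0.08 vs ~0.10: calibrated wins
```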

The most flexible and interpretable method for observer agreement studies is based on U-statistics, using all possible pairs of different observers for inter-person comparisons and all possible pairs of observations from the same observer for intra-person comparisons. This is described in detail in the observer agreement chapter of BBR. However, I've only worked this out for scalar measures on individual patient data. For example, if you wanted to compare the Brier score (a proper scoring rule for binary Y), you could possibly do this on the per-patient contributions to the Brier score (the squared errors). Or you may be able to use this method on group-level data with some work.
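A minimal sketch of the per-patient contribution idea for the physician-vs-model contrast (my own simplification, not the full U-statistic machinery in BBR; all data-layout names are assumptions): pair each physician's squared error on a case with the model's squared error on the same case, and average the paired differences. The hard part, which the all-pairs variance has to handle, is that the same case appears under several physicians and the same physician across many cases.

```python
import numpy as np
import pandas as pd

def paired_brier_differences(ratings, model_probs):
    """Per-rating (physician minus model) squared-error differences.

    ratings: DataFrame with columns 'case_id', 'reader_id', 'prob', 'y_true',
             one row per physician rating of a case.
    model_probs: Series mapping case_id -> the model's predicted probability.
    """
    se_reader = (ratings["prob"] - ratings["y_true"]) ** 2
    se_model = (ratings["case_id"].map(model_probs) - ratings["y_true"]) ** 2
    return (se_reader - se_model).to_numpy()

# The plain mean of these differences is the naive estimate of the
# physician-minus-model Brier gap; a valid standard error must account
# for the clustering on case and on physician.
```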