Does anyone have recommendations for analysis methods to compare average physician performance against a machine learning model in terms of the difference in F1-score or accuracy? The study design is a multiple-reader multiple-case (MRMC) study: each case is reviewed by several physicians (but only a subset of the full reader pool) as well as by the machine learning model.

All the literature I could find, if anything, covers ROC/AUC analyses for this scenario (and I'm not even sure I can get a difference out of those). As far as I can tell, that's because ROC analysis is what the FDA asks for when assessing new radiology/imaging methods, so other questions seem much less well investigated.

As far as I understand, naively calculating the metric of interest gives roughly the right point estimate, but confidence intervals and hypothesis tests are quite hard to get (and simply bootstrapping cases, as far as I can tell, does not reflect all the sources of variation). I have an idea for a fairly elaborate Bayesian random effects model (I can write it down as a frequentist model, but seem to have no chance at all of fitting it via maximum likelihood), but wanted to check that I'm not missing some obvious approach.
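For concreteness, here is a minimal sketch of the naive case-level bootstrap I mean, on made-up data (all names and the data layout are hypothetical, not from the actual study). It resamples cases only, which is exactly why I suspect the interval is too narrow: it ignores which physicians happened to read each case.

```python
import random

# Hypothetical data: for each case, the true label, the model's prediction,
# and predictions from the subset of physicians who read that case.
random.seed(0)
cases = []
for _ in range(200):
    truth = random.random() < 0.3
    model = truth if random.random() < 0.85 else not truth
    readers = [truth if random.random() < 0.8 else not truth
               for _ in range(random.randint(2, 4))]
    cases.append((truth, model, readers))

def f1(pairs):
    """F1-score from (truth, prediction) pairs."""
    tp = sum(t and p for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def f1_difference(sample):
    """Model F1 minus pooled-physician F1 on a set of cases."""
    model_pairs = [(t, m) for t, m, _ in sample]
    reader_pairs = [(t, r) for t, _, rs in sample for r in rs]
    return f1(model_pairs) - f1(reader_pairs)

# Case-level bootstrap: resamples cases with replacement, so the interval
# reflects case variation but NOT reader variation -- the component I
# believe this approach misses.
diffs = sorted(
    f1_difference([random.choice(cases) for _ in range(len(cases))])
    for _ in range(2000)
)
ci = (diffs[49], diffs[1949])  # rough 95% percentile interval
```

A two-way bootstrap (resampling readers as well as cases) would be the obvious extension, but with only a subset of readers per case it's not clear to me that it targets the right variance either.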

Any suggestions or pointers would be appreciated.