Method comparison study: Multiple experts vs AI

I think one of the hindrances to improving analysis in papers / statistical practice is the lack of canonical examples of real experimental data with associated commentary about why certain methods were / were not used.

My work is on applying AI within clinical practice - so I thought I would see if the community could give me some advice.

In part this is motivated by being asked to follow this paper in a review of a manuscript - and to calculate the “Individual Equivalence Coefficient” - which to me seems overly complex.

Can the hivemind give me some hints?

I’ve put an RPubs page up here with some data.

Attached some data, and put a summary of issues here.

d_ai_new.rds (737 Bytes)
d_human_new.rds (5.1 KB)

Method comparison studies in medicine + AI

A common use for AI within medical imaging is to replace the part of reporting that involves experts making laborious manual measurements (e.g. size of a tumor, volume of an organ) from images (X-rays / CT / MRI / Ultrasound). When you observe what experts do, you quickly realize that there is a fair bit of variability in their clinical practice (they are measuring to subtly different boundaries, the boundaries are indistinct, or where they should measure is poorly defined in guidelines).

Consequently, for most AI vs expert measurement validation studies you end up with multiple expert reads on each image vs 1 AI measurement.

Despite this common experimental set-up for medical AI:

  1. There is remarkably little consistency in how papers report the performance of their model against multiple experts.
  2. There seem to be hundreds of different published methods extending Bland-Altman - but no advice on which to use.

I think you want three measures:

  1. A measure of the difference between AI and the expert consensus.
  2. A measure of the difference between the experts to help contextualize (1)
  3. A comparison of (1) and (2) in the hope:
  • that (1) is definitely better than (2)
  • or be sure that (1) is at least no worse than (2) (within some margin).
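
As a rough illustration of the three measures, here is a minimal sketch in base R on simulated data. Everything in it is an assumption for illustration: the simulated values, the choice of mean absolute error, the leave-one-out consensus (used so that the AI and each expert are judged against a reference built from the same number of readers), the image-level bootstrap, and the margin.

```r
# Hypothetical simulated data: 50 images, 6 expert reads and 1 AI read per image.
set.seed(1)
n_img <- 50
n_exp <- 6
truth <- rnorm(n_img, mean = 4.3, sd = 0.8)
human <- matrix(truth + rnorm(n_img * n_exp, sd = 0.30), nrow = n_img)  # rows = images
ai    <- truth + rnorm(n_img, sd = 0.20)

three_measures <- function(idx) {
  h <- human[idx, , drop = FALSE]
  a <- ai[idx]
  # Consensus of the remaining experts, leaving out one expert at a time
  loo <- (rowSums(h) - h) / (ncol(h) - 1)
  err_ai    <- mean(abs(a - loo))  # (1) AI vs expert consensus
  err_human <- mean(abs(h - loo))  # (2) each expert vs consensus of the others
  c(ai = err_ai, human = err_human, diff = err_ai - err_human)  # (3)
}

point <- three_measures(seq_len(n_img))

# Resample whole images so that reads on the same image stay together
boot <- replicate(2000, three_measures(sample(n_img, replace = TRUE)))
ci   <- apply(boot, 1, quantile, probs = c(0.025, 0.975))

margin <- 0.05  # hypothetical non-inferiority margin, in measurement units
non_inferior <- ci["97.5%", "diff"] < margin

print(round(rbind(estimate = point, ci), 3))
print(non_inferior)
```

If the upper end of the interval for `diff` is below zero, the AI is closer to the consensus than a typical expert; if it is below the margin, the AI is at least no worse within that margin.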

What not to do

  1. Correlation coefficient / ICC etc.
  2. Arbitrarily picking a number as a threshold and working out sens/spec.

Possible approaches

  1. Methods of Bland and Altman, with worked example here:
  • Only describes limits of agreement between the two methods - don’t get the three measures I would like.
  • IEC - Seems quite complex and fiddly - and I can’t find any public code for it - so would like some opinions before I try and implement it.
  • I like this, especially for the precision plot, but it doesn’t provide estimate and CI of the difference in precision.
  • Starts off well, but not sure how to extend into a method comparison study.
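
For reference, the basic Bland-Altman quantities (bias and 95% limits of agreement) can be sketched in base R as below. The data are simulated, and comparing the AI with the per-image mean of the experts is a simplifying assumption that ignores the replicate structure, which is part of why this alone does not give the three measures.

```r
# Hypothetical simulated data: 50 images, 6 expert reads and 1 AI read per image.
set.seed(3)
n_img <- 50
n_exp <- 6
truth <- rnorm(n_img, mean = 4.3, sd = 0.8)
human <- matrix(truth + rnorm(n_img * n_exp, sd = 0.30), nrow = n_img)
ai    <- truth + rnorm(n_img, sd = 0.20)

# Per-image difference between the AI and the mean of the expert reads
d    <- ai - rowMeans(human)
bias <- mean(d)
loa  <- bias + c(-1.96, 1.96) * sd(d)  # 95% limits of agreement

print(round(c(bias = bias, lower = loa[1], upper = loa[2]), 3))
```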

Data and code



I have been working on a few AI validation projects in the diagnostic imaging space and have been running into similar problems with varying levels of agreement among the “experts”.

I’ve been exploring some of the methods outlined in Zhou’s “Statistical Methods for Diagnostic Medicine” for the handling of imperfect gold standards (also see Dawid 1977 and Water 1988).

You can create a “latent class” under the assumption of conditional independence using the expectation maximization algorithm.
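
A minimal sketch of that idea in base R, assuming binary ratings (positive / negative), two latent classes, and conditional independence of raters given the latent class. The simulated data, the function name `em_latent_class`, and the majority-vote starting values are all hypothetical choices for illustration, not taken from the references above.

```r
# Hypothetical simulated data: 500 cases rated by 4 raters (1 = positive).
set.seed(42)
n <- 500
true_se <- c(0.90, 0.85, 0.80, 0.95)
true_sp <- c(0.92, 0.88, 0.90, 0.85)
d <- rbinom(n, 1, 0.3)                                   # latent true status
p_pos <- outer(d, true_se) + outer(1 - d, 1 - true_sp)   # P(rating = 1)
y <- matrix(rbinom(length(p_pos), 1, p_pos), nrow = n)

em_latent_class <- function(y, n_iter = 200, tol = 1e-8) {
  clamp <- function(p) pmin(pmax(p, 1e-6), 1 - 1e-6)     # avoid log(0)
  # Soft starting posteriors from the majority vote, so class 1 = "positive"
  w <- ifelse(rowMeans(y) > 0.5, 0.9, 0.1)
  for (it in seq_len(n_iter)) {
    # M-step: prevalence, sensitivity and specificity given the posteriors
    prev <- clamp(mean(w))
    se   <- clamp(colSums(w * y) / sum(w))
    sp   <- clamp(colSums((1 - w) * (1 - y)) / sum(1 - w))
    # E-step: posterior probability of class 1 for each case
    ll1 <- log(prev)     + y %*% log(se)     + (1 - y) %*% log(1 - se)
    ll0 <- log(1 - prev) + y %*% log(1 - sp) + (1 - y) %*% log(sp)
    w_new <- as.vector(1 / (1 + exp(ll0 - ll1)))
    converged <- max(abs(w_new - w)) < tol
    w <- w_new
    if (converged) break
  }
  list(prevalence = prev, sensitivity = se, specificity = sp, posterior = w)
}

fit <- em_latent_class(y)

# Resample cases and refit to see the spread of each rater's sensitivity
boot_se <- replicate(200, em_latent_class(y[sample(n, replace = TRUE), ])$sensitivity)
boot_ci <- apply(boot_se, 1, quantile, probs = c(0.025, 0.975))

print(round(fit$sensitivity, 3))
print(round(boot_ci, 3))
```

The conditional independence assumption matters here: if raters tend to make the same mistakes on the same difficult cases, the estimated sensitivities and specificities will be too optimistic.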

I’ve fiddled around with bootstrapping the EM estimates to see the overall spread of sensitivity and specificity for each of the raters. Here is an example:

Hi mshunshin

Once upon a time I wrote and hosted an app to help me understand method comparison better (see the ‘No reps’ and ‘reps’ tab).

I found the MethComp R package useful in this regard; the approaches there can be extended to more than 2 measurement methods.

I hope this is helpful, good luck

You may need to start with images that are 5-10 years old, when the true clinical outcome (cancer, death, pneumonia) has been established. Note that when you have experts review the cases (not the images), 5-10% will not have an established truth of clinical outcome. A fair number are fuzzy at best. Then take your ~90% with a true clinical outcome - the gold standard truth - and test the images against experts and AI. It is not infrequent for experts to disagree at the margins. But if you have ground truth in the clinical cases (not the images), you have a chance to detect this. Statistical methods, no matter how good, will never help you until you solve the ground truth problem.


@eamj Thanks for the direction. I have looked at the MethComp package and tried it out.

> head(d_methcomp%>%arrange(item))
# A tibble: 6 × 4
# Groups:   item [1]
  meth  item     repl     y
  <chr> <chr>   <int> <dbl>
1 ai    img-001     1  4.81
2 human img-001     2  4.53
3 human img-001     4  4.19
4 human img-001     5  4.04
5 human img-001     6  4.60
6 human img-001     7  4.15

and with the suggested model

> lme(y ~ meth + item,
+     random=list(item = pdIdent( ~ meth-1 )),
+     weights = varIdent( form = ~1 | meth ),
+     data=d_methcomp
+     )
Linear mixed-effects model fit by REML
  Data: d_methcomp 
  Log-restricted-likelihood: -489.4658
  Fixed: y ~ meth + item 
(Intercept)   methhuman itemimg-002 itemimg-003 itemimg-004 itemimg-005 itemimg-006 
 4.32773288 -0.06600693  0.88417259 -0.03237474 -1.31638503 -0.07865287 -0.78561175 
itemimg-007 itemimg-008 itemimg-009 itemimg-010 itemimg-011 itemimg-012 itemimg-013 
 0.84957061  0.01475476  0.11546577 -1.01228514  0.06149034 -0.53670715  0.87884802 
itemimg-014 itemimg-015 itemimg-016 itemimg-017 itemimg-018 itemimg-019 itemimg-020 
-0.16499688  0.85415626 -0.14669645  0.42124467 -1.37903863 -0.51299712 -0.93834730 
itemimg-021 itemimg-022 itemimg-023 itemimg-024 itemimg-025 itemimg-026 itemimg-027 
 0.64473509 -0.08180170  0.47431800 -0.40012037  0.97821558 -0.07318391  0.81090664 
itemimg-028 itemimg-029 itemimg-030 itemimg-031 itemimg-032 itemimg-033 itemimg-034 
-0.28933897  0.41430653 -0.60223291  0.48442269 -0.89387017 -0.32412920 -1.33625419 
itemimg-035 itemimg-036 itemimg-037 itemimg-038 itemimg-039 itemimg-040 itemimg-041 
 1.06660248  0.12757132  1.13667744  0.14641280  0.05633081 -0.23334604  0.47556763 
itemimg-042 itemimg-043 itemimg-044 itemimg-045 itemimg-046 itemimg-047 itemimg-048 
-1.03734473  1.31980072 -0.12618879  0.65694068 -0.79491924  0.27604629 -0.62978046 
itemimg-049 itemimg-050 
 0.53331925 -0.77510764 

Random effects:
 Formula: ~meth - 1 | item
 Structure: Multiple of an Identity
            methai  methhuman  Residual
StdDev: 0.05901417 0.05901417 0.2879305

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | meth 
 Parameter estimates:
      ai    human 
1.000000 1.726325 
Number of Observations: 668
Number of Groups: 50 

Perhaps my question is more about what estimands you would want for this experimental design (and how you can derive them from a model): 1) the error of the AI from the consensus of experts; 2) the error of a random expert; 3) the difference between the two, with an interval.

What about a Cohen’s kappa statistic? Did you think about it?

Also take a look at


Thanks - the one difference is that if I use the suggestion from: 16.2 Comparison of Measurements with a Standard - what modification should I make when the best standard is “defined” by the average of the human observers? This biases it against the AI.

Analyzing an average or a consensus will provide an upper bound on how good humans can be. Then the main analysis needs to include individual human assessments so that average absolute disagreements between AI and single human assessments can be estimated.
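
One way to read this, sketched in base R on simulated data (all values and variable names are hypothetical): compute the average absolute disagreement between the AI and each single human read, and put it next to the same quantity for pairs of humans and for the AI against the human average.

```r
# Hypothetical simulated data: 50 images, 6 expert reads and 1 AI read per image.
set.seed(4)
n_img <- 50
n_exp <- 6
truth <- rnorm(n_img, mean = 4.3, sd = 0.8)
human <- matrix(truth + rnorm(n_img * n_exp, sd = 0.30), nrow = n_img)
ai    <- truth + rnorm(n_img, sd = 0.20)

# AI vs the average of the humans (the "consensus" analysis)
ai_vs_consensus <- mean(abs(ai - rowMeans(human)))

# AI vs single human reads: |ai_i - human_ij| averaged over images and experts
ai_vs_human <- mean(abs(ai - human))

# Human vs human: |human_ij - human_ik| averaged over images and expert pairs
pairs <- combn(n_exp, 2)
human_vs_human <- mean(abs(human[, pairs[1, ]] - human[, pairs[2, ]]))

print(round(c(ai_vs_consensus = ai_vs_consensus,
              ai_vs_human     = ai_vs_human,
              human_vs_human  = human_vs_human), 3))
```

Within each image, the disagreement with the average can never exceed the average disagreement with the individual reads, so `ai_vs_consensus` is always the smaller of the two AI numbers. With unbalanced real data (not every expert reads every image), the matrix would contain missing values and the `mean()` calls would need `na.rm = TRUE`.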
