Comparisons of Physicians and LLMs with Technical Replicates

Summary of Question: How to compare performance of a large language model (LLM) and physicians on some medical task, using multiple technical replicates from the LLM?

I’ve noticed several studies comparing LLM and physician performance on medical licensing exams in which the authors use multiple repeat runs of the model (sometimes hundreds) for each question to capture variability in model responses to the same question. The authors then analyze all responses, including the hundreds of technical replicates from the LLM per question, as if they were independent observations. This violates the independence assumption of many statistical tests, so what is the preferred analysis?

I’m thinking of using a multilevel logistic regression model with random effects for individuals and questions. The outcome would be binary right vs. wrong. I would treat the LLM as just another individual, but one with hundreds of observations per question, whereas each human individual has only one observation per question. Any other thoughts?
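A minimal sketch of that multilevel model, using statsmodels' variational Bayes fitter for logistic regression with crossed random intercepts. The data here are simulated for illustration, and the column names (`correct`, `responder`, `question`, `is_llm`) are my own assumptions about the long-format layout, not anything from the studies discussed:

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)

# Simulated example: 20 questions, 10 physicians (1 answer each),
# and 1 LLM contributing 50 technical replicates per question.
Q, H, R = 20, 10, 50
q_eff = rng.normal(0, 1.0, Q)        # question difficulty
r_eff = rng.normal(0, 0.5, H + 1)    # responder ability (last entry = LLM)

rows = []
for q in range(Q):
    for h in range(H):
        p = 1 / (1 + np.exp(-(0.5 + q_eff[q] + r_eff[h])))
        rows.append(dict(correct=rng.binomial(1, p), responder=f"phys{h}",
                         question=f"q{q}", is_llm=0))
    p = 1 / (1 + np.exp(-(0.5 + q_eff[q] + r_eff[H])))
    for _ in range(R):  # LLM replicates share one responder ID
        rows.append(dict(correct=rng.binomial(1, p), responder="llm",
                         question=f"q{q}", is_llm=1))
data = pd.DataFrame(rows)

# Crossed random intercepts for responder and question; is_llm as a
# fixed effect comparing the LLM to the physician population.
model = BinomialBayesMixedGLM.from_formula(
    "correct ~ is_llm",
    {"responder": "0 + C(responder)", "question": "0 + C(question)"},
    data)
result = model.fit_vb()
print(result.summary())
```

Because all LLM replicates share one responder intercept, the replicates are treated as correlated repeats rather than independent observations. In R the analogous fit would be `lme4::glmer(correct ~ is_llm + (1 | responder) + (1 | question), family = binomial)`.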

That approach would probably work. What we used in BBR was the cluster bootstrap to account for such things.
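A minimal sketch of the cluster bootstrap for this setting: resample questions (the clusters) with replacement, keep every response to a sampled question together, and recompute the accuracy difference on each resample. The data shapes and accuracy-difference estimand are illustrative assumptions, not from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)

Q = 30                                        # questions (the clusters)
phys = rng.binomial(1, 0.70, size=(Q, 12))    # 12 physicians, 1 answer each
llm = rng.binomial(1, 0.75, size=(Q, 100))    # 100 LLM replicates per question

def acc_diff(q_idx):
    # Average within each question first, then across questions, so the
    # 100 replicates of one question act as one cluster, not 100
    # independent observations.
    return llm[q_idx].mean(axis=1).mean() - phys[q_idx].mean(axis=1).mean()

# Resample whole questions with replacement; all responses to a sampled
# question travel together, preserving the within-question dependence.
boot = np.array([acc_diff(rng.integers(0, Q, Q)) for _ in range(2000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"LLM - physician accuracy: {acc_diff(np.arange(Q)):.3f}, "
      f"95% percentile CI ({lo:.3f}, {hi:.3f})")
```

The resulting interval is generally wider than one computed by pooling all replicates as independent, which is exactly the correction the clustering is meant to provide.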
