Comparing the performance of diagnostic tests

I recently came across an interesting problem that, I think, can be solved with a mixed logistic model. The problem is as follows:
Two diagnostic tests were evaluated against the “gold standard” on the same sample, and we need to compare their performance in terms of sensitivity and specificity. As far as I understand, a single mixed model can be applied, but I can’t work out how it should be formulated.

My first thought was the following:

glmer(`T+` ~ test + (1 | `D+`/PID), data = data, family = "binomial")

But it seems to me that this option is not correct.

I am looking for opinions in order to understand how mixed models work.

It’s hard to give a complete answer without more information, but since the data appear to be paired, you can use McNemar’s test for dependent proportions (e.g., to compare sensitivities). The information on the Wikipedia page (in the Definition section, and the exact P-value calculation under Variations) is essentially the same as what I read in Statistical Methods in Diagnostic Medicine, 2nd Edition (2011) by Zhou, Obuchowski, and McClish; specifically Chapter 5.1.1 in the text.

If the data are clustered, it is slightly more complicated. To borrow an example of clustered data from their text: if you were studying polyps, such that the diagnostic unit of study was a single polyp, then the subject/patient constitutes a cluster.

Either way, I don’t think a mixed effects model is necessary, since it doesn’t look like you’re adjusting for any covariates.
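
For the simple paired (unclustered) case, the exact McNemar calculation can be sketched roughly as below. This is only an illustration under my own assumptions: the function name and the counts are hypothetical, and it handles a single 2x2 table (for sensitivity, restrict to gold-standard positives; for specificity, to gold-standard negatives).

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b = pairs where test A is positive and test B is negative,
    c = pairs where test A is negative and test B is positive.
    Under H0 each discordant pair is equally likely to fall in b or c,
    so b ~ Binomial(b + c, 0.5). Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Lower-tail probability of the smaller discordant count, doubled
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts among diseased patients (sensitivity)
p_value = mcnemar_exact(b=12, c=4)
print(f"exact McNemar p-value: {p_value:.4f}")
```

Note that this ignores clustering, so it would not be appropriate for the polyp-within-patient situation described above.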

Should this not be a simple issue of comparing two ROC curves and their areas?

ROC curves, sensitivity, and specificity are all poorly suited for the diagnostic problem. See the Diagnosis chapter in BBR.

Study designs for comparative diagnostic test accuracy: A methodological review and classification scheme - Journal of Clinical Epidemiology (

I agree and as you state in BBR:

The act of diagnosis requires that patients be placed in a binary category of either having or not having a certain disease. Accordingly, the diseases of particular concern for industrialized countries, such as type 2 diabetes, obesity, or depression, require that a somewhat arbitrary cut-point be chosen on a continuous scale of measurement (for example, a fasting glucose level > 6.9 mmol/L [> 125 mg/dL] for type 2 diabetes). These cut-points do not adequately reflect disease biology, may inappropriately treat patients on either side of the cut-point as 2 homogeneous risk groups, fail to incorporate other risk factors, and are invariable to patient preference.

But then we have the other issue - in order to proceed with treatment we need to make a diagnosis. So perhaps we should view this as making a decision rather than making a diagnosis - selection of a cut-point where diagnostic threshold probability is sufficient to allow decision making to proceed. The cutoff of >=7mmol/L for T2D, for example, is then seeking to classify patients into those where we should act in a specified way or not - calling this diabetes then allows definition of this “specified way” to proceed.

Perhaps then sensitivity should be viewed as Pr(action | benefit) and specificity as Pr(inaction | no benefit).

The cutoff of 7 for T2D is not backed up by any data anywhere. And the fact that actions need to be taken does not mean a threshold should be used. I asked a classroom of physicians why we have a threshold for diabetes and they said it’s required for billing purposes. Not for decision making.

What is needed is to act as if something is true without assuming it’s true. When I grab an umbrella in the morning after hearing a rainfall probability of 0.7, I don’t conclude that it’s going to rain. I just decide to act as if it will rain.
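
The umbrella reasoning can be written as a tiny expected-loss rule. This is a sketch with made-up costs (the function names and the numbers 1 and 5 are purely hypothetical): the probability stays a probability, and whether to act depends on the costs, which differ from person to person.

```python
def act_as_if(prob_event: float, cost_of_acting: float, loss_if_unprepared: float) -> bool:
    """Act when the expected loss of not acting exceeds the cost of acting."""
    return prob_event * loss_if_unprepared > cost_of_acting


def implied_threshold(cost_of_acting: float, loss_if_unprepared: float) -> float:
    """Probability above which acting is worthwhile for this cost/loss pair."""
    return cost_of_acting / loss_if_unprepared


# Hypothetical costs: carrying an umbrella costs 1 unit, getting soaked costs 5
print(act_as_if(0.7, cost_of_acting=1.0, loss_if_unprepared=5.0))  # act
print(act_as_if(0.1, cost_of_acting=1.0, loss_if_unprepared=5.0))  # do not act
print(implied_threshold(1.0, 5.0))
```

Any cut-off that appears here falls out of the costs for a particular decision maker; it is not a fixed property of the forecast or the measurement.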


Granted, fasting glucose indicates a continuum of risk increasing from, say, 5 mmol/L all the way to 11 mmol/L and beyond. So how should a clinician decide when to intervene without dichotomous thinking?

Ideally, that would be a function of individual patient characteristics and measurements. In a regulatory environment that was derived from statistical principles, the default objective function would be individualized ARR, where the physician uses models like the one described here:

I could see the use of a multi-objective function that incorporates patient preferences as well. There might be a need to trade off slight worsening on some outcomes to improve others, in the case of comorbidities. Multi-objective decision analysis has lots of interesting results.

I have written a paper that was motivated (HTE section) by Frank’s article that you have quoted. Such models may get us to the closest empirical estimate of the individual treatment effect (on whatever metric), but then what? Getting to that estimate and making a decision are two different things.

Like they always do. Physicians are smart in their clinical practice but often don’t choose analytical approaches that match their own practice. Cardiologists deciding whether to start a statin in a patient whose LDL cholesterol is just above a guideline’s threshold will rightfully factor in other risk factors, such as smoking and a sedentary lifestyle, before making a recommendation. Ideally the physician-patient interaction would be risk-of-outcome-based and would never use a threshold on any one risk factor. Borderline effects add up; gray zones need to be respected.

Okay, then you agree that eventually, at the last point after considering the data and other risk factors, a decision needs to be made; just not one based on a single threshold on one variable. I agree with that, and when I see a patient with an FBG of, say, 6.8 mmol/L on separate occasions, I certainly will not dismiss them as non-diabetic and do nothing. I will consider them to be on the pathway to the adverse consequences of hyperglycemia and, based on other considerations, may even take an action that would otherwise be taken at a much higher FBG level that meets the threshold definition of diabetes. You are quite right that physicians do not act based on thresholds alone, but thresholds seem to help decision making; perhaps they are a form of risk stratification in HTE parlance.

One of the many terrible things about thresholds is that I have personally witnessed physicians not dealing with pre-clinical worsening of parameters such as HbA1c or PSA. There may be more public health benefit from behavior modification/drugs before the clinical “threshold” is hit.

I agree with all of the above. McNemar’s test is a quick way to see whether the two tests differ from one another, but in a 2x2 universe it does not necessarily tell you which one is better. In my field, cancer (a hard dichotomy), you can use an AUC comparison as an obvious first-line reduction of the universe of models or tests, because we know how accurate the current clinician or the next-best test is in a general population. So, for example, models or tests with an AUC below 0.80 add nothing to current practice unless they offer an enormous improvement in subpopulations (e.g., small cell cancer), but we stated earlier that we are in the general application, so we avoid that rabbit hole.

As Frank states, and even in the case of lung cancer, clinicians assess risk as high, intermediate (“I really don’t know; damned if I do and damned if I don’t”), or low. High: get tissue. Low: imaging surveillance. Intermediate is the wild west: every doc for themselves, talking with the patient and together establishing how risk averse the two of them are in that specific case.

In this more realistic universe, a chi-square test comparing AUCs will say whether one differs from the other in discrimination, and a generalization of McNemar’s test to n x k will tell you whether the two differ in how they correctly or incorrectly reclassify individuals, but a third approach is then needed to determine which provides the best “clinically meaningful” improvement. For example, most AI in my field looks great because it performs extremely well and consistently on the easy cases, the stuff at the ends of the distribution. On the difficult ones, or those without a similar distribution in the original dataset, it fails abysmally, especially compared with clinicians. This then becomes a reclassification problem and, if possible, a comparative effectiveness question as well. For a use case of this approach, see Kammer et al 2022 AJCCM.

If you have to have a single number describing reclassification efficacy, the bias-corrected net reclassification index (Paynter 2013) will give you one number for how well each does overall in cases and, separately, in controls.
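
To make the cases/controls split concrete, here is a rough sketch of the plain category-based NRI components. To be clear, this is the uncorrected version, not the bias-corrected one cited above, and the function name, category coding (0 = low, 1 = intermediate, 2 = high), and data are all hypothetical.

```python
def nri_components(old_cat, new_cat, event):
    """Uncorrected category-based NRI, reported separately for cases and controls.

    old_cat / new_cat: ordinal risk categories under the old and new test
    event: 1 for cases (disease present), 0 for controls
    Cases: upward moves are good. Controls: downward moves are good.
    """
    up_case = down_case = n_case = 0
    up_ctrl = down_ctrl = n_ctrl = 0
    for old, new, is_case in zip(old_cat, new_cat, event):
        if is_case:
            n_case += 1
            up_case += new > old
            down_case += new < old
        else:
            n_ctrl += 1
            up_ctrl += new > old
            down_ctrl += new < old
    if n_case == 0 or n_ctrl == 0:
        raise ValueError("need at least one case and one control")
    nri_cases = (up_case - down_case) / n_case
    nri_controls = (down_ctrl - up_ctrl) / n_ctrl
    return nri_cases, nri_controls


# Hypothetical categories for 4 cases followed by 4 controls
old = [0, 1, 1, 2, 0, 1, 2, 1]
new = [1, 2, 1, 2, 0, 0, 1, 2]
event = [1, 1, 1, 1, 0, 0, 0, 0]
print(nri_components(old, new, event))
```

Keeping the two components separate, rather than summing them, avoids hiding a gain in cases behind a loss in controls (or the reverse).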

For most diseases a false positive and a false negative wouldn’t be equally weighted, so how does the AUC help? Isn’t Decision Curve Analysis far more illuminating?
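
As a minimal sketch of what I mean (function names and data are hypothetical), the net benefit at a threshold probability pt is TP/n - FP/n * pt/(1 - pt), so the relative weight of a false positive is set by the threshold rather than assumed equal to that of a false negative:

```python
def net_benefit(probs, outcomes, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(outcomes)
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Reference strategy: treat every patient regardless of predicted risk."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical predicted risks and true disease status
probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
outcomes = [1, 1, 1, 0, 0, 0]
for pt in (0.2, 0.5):
    print(pt, net_benefit(probs, outcomes, pt), net_benefit_treat_all(outcomes, pt))
```

A decision curve is just this quantity plotted over a range of plausible thresholds for each test or model, alongside the treat-all and treat-none (net benefit 0) strategies.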