Monte Carlo simulation for sensitivity and specificity determination

I am a veterinarian planning a diagnostic accuracy meta-analysis. Some papers report sensitivity and specificity, but many do not. They do, however, report the mean and SD of the biomarker in healthy and diseased animals. Is it possible to set a diagnostic threshold and then run a simulation with these parameters to calculate sensitivity and specificity? Many thanks for your statistical expertise.
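If one is willing to assume the biomarker is normally distributed within each group, and to treat the reported mean and SD as the true values, then sensitivity and specificity at a chosen threshold have closed forms. A Monte Carlo simulation only approximates them. Below is a minimal sketch (Python standard library only). The means, SDs, the threshold, and the convention that higher values mean "test positive" are all hypothetical.

```python
import random
from statistics import NormalDist

# Hypothetical summary statistics (NOT taken from any real study).
MEAN_D, SD_D = 8.0, 2.0   # diseased animals
MEAN_N, SD_N = 5.0, 1.5   # healthy animals
THRESHOLD = 6.5           # hypothetical cut-off; "positive" means value > THRESHOLD

def analytic_se_sp(mean_d, sd_d, mean_n, sd_n, c):
    """Se = P(X > c | diseased), Sp = P(X <= c | healthy), assuming normality."""
    se = 1.0 - NormalDist(mean_d, sd_d).cdf(c)
    sp = NormalDist(mean_n, sd_n).cdf(c)
    return se, sp

def monte_carlo_se_sp(mean_d, sd_d, mean_n, sd_n, c, n=100_000, seed=1):
    """Approximate the same two quantities by simulating n animals per group."""
    rng = random.Random(seed)
    se = sum(rng.gauss(mean_d, sd_d) > c for _ in range(n)) / n
    sp = sum(rng.gauss(mean_n, sd_n) <= c for _ in range(n)) / n
    return se, sp

exact = analytic_se_sp(MEAN_D, SD_D, MEAN_N, SD_N, THRESHOLD)
approx = monte_carlo_se_sp(MEAN_D, SD_D, MEAN_N, SD_N, THRESHOLD)
print(f"analytic    Se={exact[0]:.3f}  Sp={exact[1]:.3f}")
print(f"Monte Carlo Se={approx[0]:.3f}  Sp={approx[1]:.3f}")
```

Under these assumptions the simulation adds only Monte Carlo noise to the closed-form answer. It also ignores the sampling uncertainty in each paper's reported mean and SD. The normality assumption itself may be poor for biomarkers that are right-skewed.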


This is a good question. I think the statistically minded would avoid any type of threshold and recommend keeping the data on a continuous scale. Instead of a “diseased/not diseased” classification, the ideal method would output a probability of disease.
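The following sketch shows what "output a probability of disease" can mean here. It assumes hypothetical normal biomarker distributions in each group and an assumed prevalence (pretest probability), then uses Bayes' rule to turn a biomarker value into a probability instead of a yes/no call. `prob_disease` is a hypothetical helper written for this illustration.

```python
from statistics import NormalDist

# Hypothetical group distributions (normality is an assumption).
DISEASED = NormalDist(8.0, 2.0)
HEALTHY = NormalDist(5.0, 1.5)

def prob_disease(x, prevalence):
    """P(diseased | biomarker = x) via Bayes' rule with normal densities."""
    num = prevalence * DISEASED.pdf(x)
    den = num + (1.0 - prevalence) * HEALTHY.pdf(x)
    return num / den

for value in (4.0, 6.5, 9.0):
    print(value, round(prob_disease(value, prevalence=0.2), 3))
```

The same biomarker value gives a different probability at a different prevalence, which is information a single threshold discards.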

But the data you have available might have been improperly dichotomized, making it less useful.

If you haven’t done so already, check out Chapter 18 in Prof. Harrell’s Biostatistics for Biomedical Research (BBR; the first item at the link provided). He explains in mathematical detail why searching for “cut points” or thresholds is arbitrary, and why dichotomizing a continuous measurement wastes about a third of the information in the data even in the best case. The link lists many of the publications he has kindly made freely available on the web.

You might also find his section on sensitivity and specificity, and their problems, useful.

I have coincidentally been thinking about a similar problem, and I do not have any firm ideas yet. My intuition is to use a bootstrap: create synthetic, independent samples from the reported means/variances of the diseased and not diseased subjects in each study.

A probit regression model seems to be the most appropriate choice in your particular case, if I understand Prof. Harrell’s writings correctly.

I am assuming that the studies you have available give reasonable estimates of the hypothetical population values, and that this is an appropriate application of the “plug-in principle” described by Bradley Efron and Robert Tibshirani in An Introduction to the Bootstrap.
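Below is a minimal sketch of the synthetic-data idea for a single study. It treats the reported mean and SD as population values (the plug-in step), assumes normality, and fits a probit model P(disease | x) = Φ(a + b·x). The study summaries and synthetic sample size are hypothetical. `fit_probit` and `synthetic_study` are hand-written helpers for this illustration, not library functions; in practice a standard GLM routine would normally do the fitting.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def fit_probit(x, y, max_iter=50, tol=1e-10):
    """Fit P(y = 1 | x) = Phi(a + b*x) by Fisher scoring; returns (a, b)."""
    a = b = 0.0
    for _ in range(max_iter):
        u0 = u1 = i00 = i01 = i11 = 0.0   # score vector and 2x2 expected information
        for xi, yi in zip(x, y):
            eta = a + b * xi
            p = min(max(STD_NORMAL.cdf(eta), 1e-12), 1.0 - 1e-12)
            dens = STD_NORMAL.pdf(eta)
            r = dens * (yi - p) / (p * (1.0 - p))   # score contribution
            w = dens * dens / (p * (1.0 - p))       # information weight
            u0 += r
            u1 += r * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        da = (i11 * u0 - i01 * u1) / det   # (information)^-1 @ score
        db = (i00 * u1 - i01 * u0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b

def synthetic_study(mean_d, sd_d, mean_n, sd_n, n_per_group, rng):
    """Plug-in step: draw normal synthetic animals from the reported summaries."""
    x = [rng.gauss(mean_d, sd_d) for _ in range(n_per_group)]
    x += [rng.gauss(mean_n, sd_n) for _ in range(n_per_group)]
    y = [1] * n_per_group + [0] * n_per_group
    return x, y

rng = random.Random(2024)
# Hypothetical study summaries: diseased N(2, 1), not diseased N(0, 1).
x, y = synthetic_study(2.0, 1.0, 0.0, 1.0, n_per_group=500, rng=rng)
a, b = fit_probit(x, y)
print(f"intercept={a:.3f} slope={b:.3f} "
      f"P(disease | x=1.5)={STD_NORMAL.cdf(a + b * 1.5):.3f}")
```

Because the two synthetic groups have equal size, the fitted probabilities correspond to a 50% prevalence. For another target population, the group sizes (or the intercept) would need to reflect that population's prevalence.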

Suppose we have 5 studies that report 5 pairs of means/variances, where D is the reported mean/variance for diseased subjects and N is the reported mean/variance for subjects that are not diseased:
$$(D_{1}[\bar{x}, s^2],\ N_{1}[\bar{x}, s^2]),\ \ldots,\ (D_{5}[\bar{x}, s^2],\ N_{5}[\bar{x}, s^2])$$

If the number of studies is small, pair the data from $D_{1}$ with each of $N_{1}, \ldots, N_{5}$, create synthetic data for each group, pool it, and run a regression on the pooled synthetic data. Store this result. Repeat this for all possible pairs of diseased/not diseased summaries, then look at the distribution of the bootstrap regression estimates.

If the number of studies is too large (likely more than 10), just randomly sample pairs of diseased and not diseased summaries instead.
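The pairing scheme described above can be sketched as follows. When there are few studies, every $(D_i, N_j)$ combination is used. Otherwise a random sample of pairs is drawn. Each pair yields pooled normal synthetic data and a stored probit fit. The study summaries, the synthetic sample size, and the number of random pairs are hypothetical, and the cut-off of 10 studies is taken from the post. `fit_probit`, `synthetic_pair`, and `pairs_to_use` are hypothetical helpers written for this sketch.

```python
import itertools
import random
from statistics import NormalDist, mean, quantiles

STD_NORMAL = NormalDist()
MAX_STUDIES_EXHAUSTIVE = 10   # the post's rough cut-off for "too many studies"
N_RANDOM_PAIRS = 200          # hypothetical number of random pairs otherwise

def fit_probit(x, y, max_iter=50, tol=1e-10):
    """Fit P(y = 1 | x) = Phi(a + b*x) by Fisher scoring; returns (a, b)."""
    a = b = 0.0
    for _ in range(max_iter):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for xi, yi in zip(x, y):
            eta = a + b * xi
            p = min(max(STD_NORMAL.cdf(eta), 1e-12), 1.0 - 1e-12)
            dens = STD_NORMAL.pdf(eta)
            r = dens * (yi - p) / (p * (1.0 - p))
            w = dens * dens / (p * (1.0 - p))
            u0, u1 = u0 + r, u1 + r * xi
            i00, i01, i11 = i00 + w, i01 + w * xi, i11 + w * xi * xi
        det = i00 * i11 - i01 * i01
        da = (i11 * u0 - i01 * u1) / det
        db = (i00 * u1 - i01 * u0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b

def synthetic_pair(d_summary, n_summary, n_per_group, rng):
    """Draw normal synthetic animals from one (mean, SD) per group and pool them."""
    (mean_d, sd_d), (mean_n, sd_n) = d_summary, n_summary
    x = [rng.gauss(mean_d, sd_d) for _ in range(n_per_group)]
    x += [rng.gauss(mean_n, sd_n) for _ in range(n_per_group)]
    return x, [1] * n_per_group + [0] * n_per_group

def pairs_to_use(k_studies, rng):
    """Every (D_i, N_j) pair for few studies; otherwise a random sample of pairs."""
    if k_studies <= MAX_STUDIES_EXHAUSTIVE:
        return list(itertools.product(range(k_studies), repeat=2))
    return [(rng.randrange(k_studies), rng.randrange(k_studies))
            for _ in range(N_RANDOM_PAIRS)]

# Hypothetical (mean, SD) summaries for 5 studies.
D = [(2.0, 1.0), (2.4, 1.2), (1.8, 0.9), (2.2, 1.1), (2.6, 1.3)]   # diseased
N = [(0.0, 1.0), (0.3, 0.9), (-0.2, 1.1), (0.1, 1.0), (0.4, 1.2)]  # not diseased

rng = random.Random(7)
results = []
for i, j in pairs_to_use(len(D), rng):
    x, y = synthetic_pair(D[i], N[j], n_per_group=200, rng=rng)
    results.append(fit_probit(x, y))   # store (intercept, slope) for this pair

slopes = [b for _, b in results]
print(f"{len(results)} pairs; mean slope {mean(slopes):.3f}; "
      f"slope quartiles {[round(q, 3) for q in quantiles(slopes, n=4)]}")
```

The spread of the stored estimates mixes between-study differences with Monte Carlo noise from the synthetic sample size. By itself, it is not a confidence interval.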

Perhaps the experts here will reply now that I have bumped this thread. Is my intuition about the bootstrap reasonable here? If not, where did I go wrong?

Edit: Lecture by Dr. Harrell on diagnostic testing


Thank you for this excellent lecture. I also wonder whether it is feasible, or whether anyone has already done it, to run simulations of how various study-design biases affect diagnostic test accuracy studies, for example spectrum bias or imperfect reference standards.
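Below is a minimal sketch of what such a simulation could look like for spectrum bias. It assumes, hypothetically, that the biomarker in diseased animals depends on disease severity. Sensitivity at a fixed threshold then changes with the mix of mild and severe cases a study enrols, even though the test itself is unchanged. All distributions, proportions, and the threshold are made up for illustration.

```python
from statistics import NormalDist

THRESHOLD = 6.5                 # hypothetical cut-off; positive means value > THRESHOLD
MILD = NormalDist(6.5, 1.5)     # hypothetical biomarker distribution, mild cases
SEVERE = NormalDist(10.0, 2.0)  # hypothetical biomarker distribution, severe cases

def sensitivity(prop_severe, c=THRESHOLD):
    """Se = P(X > c | diseased) for a mixture of mild and severe cases."""
    se_severe = 1.0 - SEVERE.cdf(c)
    se_mild = 1.0 - MILD.cdf(c)
    return prop_severe * se_severe + (1.0 - prop_severe) * se_mild

# e.g. a first-opinion practice vs a referral hospital case mix (hypothetical)
for prop in (0.2, 0.5, 0.8):
    print(f"{prop:.0%} severe cases -> Se = {sensitivity(prop):.3f}")
```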

Imperfect reference standards are a big issue no matter how one approaches diagnostic research. But other issues are man-made. For example, workup (referral) bias requires complex adjustments to sensitivity and specificity, but it needs no adjustment when you directly estimate the probability of disease and condition on the covariates that caused the workup bias.
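The workup-bias point can be illustrated with a small simulation under made-up numbers. The reference standard is applied far more often to test-positive animals. Sensitivity computed among verified animals is then biased. The probability of disease given the test result, estimated among verified animals, matches the full population because verification here depends only on the variable being conditioned on (the test result itself). Prevalence, sensitivity, specificity, and verification probabilities are hypothetical, and `share` is a hypothetical helper.

```python
import random

rng = random.Random(11)
PREVALENCE, SE, SP = 0.3, 0.8, 0.9   # hypothetical true values
P_VERIFY = {1: 0.9, 0: 0.2}          # hypothetical: verification depends only on the test result

# Each animal: (diseased, test_positive, verified by the reference standard)
animals = []
for _ in range(100_000):
    diseased = rng.random() < PREVALENCE
    positive = int(rng.random() < (SE if diseased else 1 - SP))
    is_verified = rng.random() < P_VERIFY[positive]
    animals.append((diseased, positive, is_verified))

def share(rows, condition):
    """Proportion of rows for which condition(row) is true."""
    return sum(condition(r) for r in rows) / len(rows)

verified = [a for a in animals if a[2]]
naive_se = share([a for a in verified if a[0]], lambda a: a[1] == 1)
true_se = share([a for a in animals if a[0]], lambda a: a[1] == 1)
ppv_verified = share([a for a in verified if a[1] == 1], lambda a: a[0])
ppv_all = share([a for a in animals if a[1] == 1], lambda a: a[0])

print(f"sensitivity: verified-only {naive_se:.3f} vs full population {true_se:.3f}")
print(f"P(disease | positive): verified-only {ppv_verified:.3f} vs full population {ppv_all:.3f}")
```

In a real study the unverified animals' disease status is unknown, so only the verified-only quantities would be available. The simulation shows which of them can be trusted under this verification mechanism.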