I posted a question to Cross Validated that might be a better fit for datamethods. I’m planning a study that will evaluate a new method of identifying cases in need of treatment. Let’s say we’ll evaluate 100 people with the new method, and the sample consists of 35 true cases and 65 non-cases as judged by an existing gold standard.[1]

In practice, I might want to argue that the new method is only worth adopting if a metric like sensitivity (or specificity or accuracy) is greater than some threshold, e.g., 0.7.

If I ran the study and had 15 false positives and 10 false negatives, the point estimate of sensitivity would be 0.71 with a 95% CI of 0.56-0.86. So maybe I could say, “Under repeated sampling, the true value will be between 0.56 and 0.86 95% of the time”, but I’m wondering if there is a method to make a different statement.
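For concreteness, the point estimate and interval above can be reproduced with a standard Wald (normal-approximation) interval. The counts below (25 true positives out of 35 gold-standard cases) are implied by the numbers in the post; this is just one common way to get that CI, not necessarily the method the original software used:

```python
import math

# Counts implied by the example: 35 gold-standard cases, 10 false negatives
tp, fn = 25, 10
n = tp + fn                                   # 35 cases in total

sens = tp / n                                 # point estimate of sensitivity
se = math.sqrt(sens * (1 - sens) / n)         # standard error of a proportion
lo, hi = sens - 1.96 * se, sens + 1.96 * se   # 95% Wald interval

print(f"sensitivity = {sens:.2f}, 95% CI = {lo:.2f}-{hi:.2f}")
```

Note that the Wald interval behaves poorly when the proportion is near 0 or 1; Wilson or exact (Clopper–Pearson) intervals are usually preferred in practice.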

Given a point estimate of 0.71, I want to say that there is an X% probability that the sensitivity of the new method is at least 0.70 (the actual threshold will be set higher, but 0.70 seemed to work for this example). Is this possible?
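One standard way to get exactly this kind of statement is Bayesian: put a prior on sensitivity, update it with the observed counts, and read off the posterior probability that sensitivity exceeds 0.70. A minimal sketch, assuming a uniform Beta(1, 1) prior and the counts implied above (25 of 35 cases detected) — both the prior and the counts are illustrative assumptions:

```python
from scipy import stats

# Counts implied by the example: 25 true positives out of 35 gold-standard cases
tp, fn = 25, 10
a_prior, b_prior = 1, 1   # Beta(1, 1) = uniform prior (an assumption)

# Conjugate update: posterior for sensitivity is Beta(a_prior + tp, b_prior + fn)
posterior = stats.beta(a_prior + tp, b_prior + fn)

# Posterior probability that sensitivity is at least 0.70
p_above = posterior.sf(0.70)
print(f"P(sensitivity >= 0.70 | data) = {p_above:.2f}")
```

With a point estimate sitting almost exactly on the threshold, this probability lands near 0.5 — which itself illustrates why a point estimate of 0.71 is weak evidence that the true sensitivity exceeds 0.70.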

[1] The first part of the study looks a bit like a case-control design on the surface. The new method will be used to identify 50 cases and 50 non-cases. We will then check the “true” status of all 100 by assessing each with an existing gold standard. So we’ll end up with a confusion matrix of correct and incorrect labels. The sample for assessing sensitivity and specificity will not be a random sample, but we have other data on the prevalence of cases (10% defined by the gold standard).
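Because the sample is not random, predictive values should be computed from the external prevalence rather than the in-sample case fraction. A sketch using Bayes’ rule with the numbers in the thread (25/35 detected cases, 50/65 correctly labelled non-cases, 10% prevalence); the counts are implied by the example rather than stated outright:

```python
# Operating characteristics estimated from the 100-person sample
sens = 25 / 35   # 10 false negatives among 35 gold-standard cases
spec = 50 / 65   # 15 false positives among 65 gold-standard non-cases
prev = 0.10      # external prevalence, not the in-sample 35%

# Bayes' rule: re-weight sensitivity/specificity by the population prevalence
ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

The low PPV at 10% prevalence shows why sensitivity and specificity alone can be misleading for a screening decision.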

I think the sensitivity/specificity way of thinking gets in the way of understanding the diagnostic/predictive value of a marker, and it forces the marker to be binary and the outcome to be binary. Binary markers are actually very uncommon. I also advise that you spend significant time understanding and encoding what other information has to do with predicting the outcome. Then measure how much the new marker moves the information/improves predictions over and above existing information. This is exemplified here.

Thanks, @f2harrell. I’ve learned a lot from your writing about sensitivity/specificity and the hegemony of dichotomania. I might still be thinking about this the wrong way, but I still see value in the basic question.

We have a situation where clinicians make a decision to recommend treatment (yes/no) based on a clinical interview (depression, no biomarkers). Let’s call this the gold standard criteria/method for making this decision.

Now there is a different method that also results in a yes/no treatment recommendation based on “expert” opinion, but here the experts have less training but are more plentiful (a common “task shifting” paradigm in low-income countries where there might be 200,000 people for every 1 professional).

So, at the most basic level, I want to compare the recommendations from these lower-skilled providers to recommendations from the “professional” experts. I don’t want to take a frequentist approach. Instead, I’d rather be able to estimate the probability that sensitivity is above a threshold.

I think of this as the start of the investigation, not the end, so your main points are (I think) well taken.

Under repeated sampling, we would obtain many confidence intervals, of which 95% (in the case of alpha = 0.05) would contain the true mean. This is one of those intervals.

Why is this different? Consider rolling a die. Correct statement: “Assuming the die is fair, if I roll this die countless times, the probability of rolling a 6 is 1/6.”

Incorrect statement: “I rolled a 5. The probability that this is a 6 is 1/6.” No — after you have rolled the die and seen the result, the probability of a 6 is either 1 (you rolled a 6) or 0 (you rolled something else).

2. If the true mean were in the interval [a, b], then this sample mean would not differ significantly from that true mean. Unfortunately, we don’t know whether the true mean really is between a and b.
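The repeated-sampling interpretation can be checked by simulation: draw many samples from a binomial with a known “true” sensitivity, build a 95% Wald interval from each, and count how often the interval covers the truth. A sketch, with the true value 0.7 and n = 35 chosen to mirror the example earlier in the thread:

```python
import math
import random

random.seed(1)
true_sens, n, reps = 0.7, 35, 5000
covered = 0

for _ in range(reps):
    # Simulate one study: n cases, each detected with probability true_sens
    tp = sum(random.random() < true_sens for _ in range(n))
    p = tp / n
    se = math.sqrt(p * (1 - p) / n)
    # Does the 95% Wald interval from this sample cover the true value?
    if p - 1.96 * se <= true_sens <= p + 1.96 * se:
        covered += 1

coverage = covered / reps
print(f"Empirical coverage: {coverage:.3f}")  # near, but typically below, 0.95
```

At this sample size the Wald interval tends to undercover slightly, which is another reason to prefer Wilson or exact intervals for proportions.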