De-dichotomization of diagnostic testing for increased confidence

Hi all,

I’ve been thinking about the sensitivity/specificity issue and how some efforts are hampered by the fact that we are, generally, working in a low-prevalence situation. An issue that has been talked about a fair amount with respect to SARS-CoV-2 (“C19”) antibody testing is finding out who has antibodies, both in the general population and among healthcare workers specifically. Whether so-called immunity passports are a good or bad thing is not a part of this exploration, nor is the question of whether having antibodies actually confers immunity and for how long. I’m strictly thinking of the questions regarding how to reduce false positives and false negatives in these types of diagnostic tests.

Sensitivity vs Specificity
The fundamental problem, as I see it, is that sensitivity and specificity trade off each other. Given how the diagnostic problem is currently approached, as one goes up, the other must necessarily go down. Consider the following hypothetical example for an antibody test that has the ability to give quantitative results. There are two groups, each having 300 samples: a) A “does not have C19 antibodies” group called NoC19, and b) a “Does have C19 antibodies” group called YesC19. This quantitative test is run for the 600 samples, and the data is graphed below, using whatever units for the Y-axis would be appropriate:
So, the samples that are known to be C19 antibody free (red dots) are mostly around 0 units, but go up to almost 10,000 units. Conversely, samples known to have antibodies (green dots) range from about 8,000 to 30,000 units.

Now, in order to minimize false positives, one might choose the horizontal blue line as the line of demarcation when declaring a positive or negative status. Any (future) test of a sample with unknown status will be declared negative if it falls below the blue line, and positive if it is above the blue line. Unfortunately, this means that the YesC19 group includes 2 green points that are below the blue line, and thus declared negative. So, 0.67% of this sample would be false negatives.

Ok, so on the other side of the coin, the false negative problem could be solved by using the pink line as the line of demarcation instead of the blue line. This does successfully eliminate false negatives from this sample. However, it also causes 3 of the 300 red NoC19 points to be declared positive, for a false positive rate of 1%.
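To make the trade-off concrete, here is a minimal Python sketch of how the two error rates move as a single cutoff moves. The readings and the two cutoff values are invented for illustration (they are not the plotted data), and `error_rates` is a hypothetical helper name.

```python
def error_rates(neg_values, pos_values, cutoff):
    """Return (false_positive_rate, false_negative_rate) for one cutoff.

    A sample is declared positive when its reading is strictly above the cutoff.
    """
    false_pos = sum(1 for v in neg_values if v > cutoff)
    false_neg = sum(1 for v in pos_values if v <= cutoff)
    return false_pos / len(neg_values), false_neg / len(pos_values)


# Toy readings, invented for illustration only.
no_c19 = [0, 150, 400, 1200, 8500, 9700]
yes_c19 = [8200, 9100, 12000, 18000, 25000, 30000]

blue_cutoff = 10000  # assumed: just above the highest NoC19 reading
pink_cutoff = 8000   # assumed: just below the lowest YesC19 reading

print("blue:", error_rates(no_c19, yes_c19, blue_cutoff))
print("pink:", error_rates(no_c19, yes_c19, pink_cutoff))
```

With these toy numbers the higher cutoff removes the false positives but creates false negatives, and the lower cutoff does the reverse, which is the same pattern as in the 300-sample example.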

Proposal: De-dichotomization by adding an Undetermined bin
Given the current way that this type of diagnostic testing is done, there is no way to essentially eliminate both false positives and false negatives. I believe this can be addressed to a substantial degree by de-dichotomizing the analysis. At the very least, and what I propose here, is to add a 3rd category in which results are considered Undetermined. What if the following 3-part decision were the case?

  1. If a result is above the blue line, it is considered a Positive
  2. If a result is below the pink line, it is considered a Negative
  3. If a result is between the two lines, it is considered an Undetermined
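The three-part rule above could be sketched in Python roughly as follows. The function name and the two threshold values are assumptions for illustration; real limits would have to come from the actual assay data.

```python
def classify(value, lower, upper):
    """Three-way call: Positive above `upper`, Negative below `lower`,
    Undetermined for anything in between (boundaries included)."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if value > upper:
        return "Positive"
    if value < lower:
        return "Negative"
    return "Undetermined"


# Assumed thresholds: pink (lower) line and blue (upper) line.
PINK, BLUE = 8000, 10000

readings = [120, 7900, 8600, 9900, 10400, 26000]  # invented readings
calls = [classify(r, PINK, BLUE) for r in readings]
undetermined_share = calls.count("Undetermined") / len(calls)

print(calls)
print(f"Undetermined share: {undetermined_share:.0%}")
```

The share of Undetermined calls is the cost of the scheme, so it is worth reporting alongside the error rates.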

The advantage of this scheme is that it would produce no false positives and no false negatives, at least for the sample, and could severely reduce the frequency of false positives and negatives in future testing. The disadvantage (or maybe not?) is that a substantial chunk of tests might have to go into the Undetermined bin. How big of a proportion? That would have to be determined by the data from the actual technologies. Plus, since we’re working with samples, we’d want to account for the fact that the samples don’t capture the population of tests to be done in the future, so that uncertainty would have to be accounted for as well. I’m confident that there are experts who have sufficient scientific and statistical modeling expertise to be able to determine the best ways to derive an assumed distribution for both NoC19 and YesC19, and how to appropriately set limits.

So, what if, say, 10% of future test results would be expected to fall into the Undetermined bin. Would this be a worthwhile tradeoff? The real answer is, of course, “it depends,” but I’m optimistic that we could get sufficiently accurate tests without sacrificing too much into the Undetermined bin.

At the heart of this are the questions that everyone wants answers to beyond a reasonable doubt. They want to know:

IF a test tells me I have antibodies, then I can rely on it
IF a test tells me I do not have antibodies, then I can rely on it

In reality there’s no way to drive this to a true 100.0000% accuracy for both of these questions, so someone would need to decide what degree of confidence counts as “yes you can rely on it.” Let’s say that’s done by a panel of experts. Even if we had to put 20% of the tests into the Undetermined bin to achieve whatever accuracy limits are declared good enough, that is still 80% that would have an answer with an extremely high degree of confidence.

Bad data is worse than no data
So, what about this 20% (or so)? Above I put a question mark after stating that having a category of Undetermined cases was a disadvantage. Sure, the best thing would be to have a clear decision for every single test, but that’s not possible if you want to have a high confidence for every decision that you declare. One of the phrases I like is that bad data is worse than no data, because it masks the fact that you don’t know what you think you know. You know? Better to leave something unknown than be forced to label it simply because that’s your only option.

Another thing to consider is that perhaps the only thing that most of these Undetermined tests require is follow-up testing with a more expensive/resource intensive test that can tease out more reliable answers for these cases.

Finally, these Undetermined cases could be used for Failure Analysis as part of a Continuous Improvement Process. What was the root cause(s) for these samples not being clearly in the Yes or No bins? If we could get to root causes, we could improve the testing technologies based on those findings and further increase our accuracy.

What about non-quantitative tests? (POC)
So far I’ve only discussed testing with quantitative results. Could the same approach be used for POC tests that use lateral flow assays and give qualitative results, like a pregnancy test? Consider a device that has two runways instead of one:
a) The left runway is highly sensitive for the presence of antibodies. If this runway is unable to detect antibodies, then we can have a very high degree of confidence that there are no antibodies present. This corresponds to the pink line and below in the quantitative example.
b) The right runway is only lightly sensitive for the presence of antibodies. If this runway is able to detect antibodies, it is because there is a substantial amount in the sample, and we can have a very high degree of confidence that there are antibodies present. This corresponds to the blue line and above in the quantitative example.

What about the marginal case? If the left runway detects antibodies and the right runway does not detect antibodies, then the result is Undetermined, and the next step might be a lab-based test with greater precision.
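The two-runway readout described above amounts to a small decision table, sketched here in Python. The function and argument names are hypothetical, and the "Invalid" outcome for a physically inconsistent readout is my own assumption, not something the device concept specifies.

```python
def read_two_runway(sensitive_line, stringent_line):
    """Interpret a hypothetical two-runway lateral flow device.

    sensitive_line: True if the highly sensitive (left) runway shows a line.
    stringent_line: True if the lightly sensitive (right) runway shows a line.
    """
    if stringent_line and not sensitive_line:
        # Assumption: the stringent runway should never fire alone,
        # so treat this as a device fault rather than a result.
        return "Invalid"
    if stringent_line:
        return "Positive"
    if not sensitive_line:
        return "Negative"
    return "Undetermined"  # left fires, right does not: follow up in the lab


for left in (False, True):
    for right in (False, True):
        print(left, right, read_two_runway(left, right))
```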
I am not an expert in any of these areas, but I thought I had an interesting idea and would like your feedback. Thanks.


The necessity of an “intermediate zone” depends on the science. For COVID-19, to the best of my knowledge, there is no added value in saying, “Hmm, you’re in the middle”. Any clinician would probably declare the patient to be antibody negative because the risk of saying otherwise is far greater: presumed immunity, immunity passports, plasma donations, etc. These are all costs that hinge on the value of the model.

One immediate challenge is that we now have a double-optimization problem. The model must not only maximize sensitivity/specificity, but also leave as few results in the middle as possible: this is called stratification. About 20(?) years ago, Pencina proposed a whole bunch of metrics, net reclassification index, etc. for inspecting this. It fell flat because there was no general practical connection, and it depends on far more prespecified parameters for a classifier to be called “good”. If we view it as a general methodological challenge, an intermediate result is only worth acknowledging if it results in an intermediate action: repeat screening, performing a more costly gold standard diagnostic, administering a less harmful or risky treatment, or something else.

If we’re honest with ourselves, the thing people liked so much about AUC optimization is that you’re exonerated from having to quantify the relative costs of false positives and false negatives. With a nice AUC, you got away with saying, “Look I gave you a nice curve, now you decide where to cut it off.” It’s not good practice, it’s just pervasive.


My colleagues and I have written a paper, “Binary classification models with “Uncertain” predictions”, in which we analysed the inclusion of an intermediate category that we called Uncertain, though I prefer the term Undetermined as it is used here.

We have found that the inclusion of a third category may improve misclassification but not the sensitivity/specificity. Maybe our datasets were not representative.

In my experience people initially like the idea of the third category, but then we have a problem of defining false-positive and false-negative.

I fully support the idea of de-dichotomization, but have found that people are very suspicious about its implementation. Hopefully I am mistaken.

Looking forward to hearing other thoughts.


You ought to check out the literature for annual tuberculosis screenings for US healthcare workers. The Quantiferon and TSPOT tests are Interferon-gamma release ELISA assays, as are some of the better SARS-CoV-2 tests. Similar low prevalence, although the practical outcome is treatment, which is worth thinking about here - it’s cheap and not very risky to treat someone for latent TB, whereas the COVID tests could be about policy or about certifying people’s immunity.

Ordering a higher-precision test is great… if it exists. The testing protocol for HIV is usually a rapid POC test (maybe ELISA), with positives confirmed by Western blot, but that kind of confirmation didn’t seem to be a thing for latent tuberculosis. Another issue is the potential for between-test variability in the person’s immune response, which is a moving target. Oof.

Sorry I don’t have a specific paper to recommend, but I’ve been out of the field for years.


Hi Adam, I disagree that there’s no value of a system that tells some people that they’re in the middle. I see two things that I think have value:

  1. The value would be in the fact that the test scheme no longer groups ambiguous results with high-evidence results. Those people that get a “you tested positive” or “you tested negative” result could have much greater confidence that their result represents reality and isn’t leading them astray. (this is what I consider to be the most valuable)
  2. For the people in the middle, yes, absolutely agree, their actions should be those of the “no antibodies” group until such time as they are able to get a better test. I don’t know why you wouldn’t consider that valuable, though. It would let them know that they should get a better test, as opposed to those that test negative and have no reason to get a better test. To me, that represents some value.

Let me also put it like this. A colleague of mine with whom I’ve discussed this idea said:
“For me personally I’d rather take a test with a 20% chance of an inconclusive result and next to 0 chance of false +/-, than one with a 0% inconclusive rate but 1% false positive or negative.”

I feel exactly the same way. If you had to choose, which would you choose?

As far as the issue of a clinician declaring someone with an Undetermined result to be antibody negative: well, if they got an Undetermined result, then that’s what the clinician should tell them. They should absolutely not lie and tell them they got a negative result. If a test is Undetermined then it’s deceptive to declare it negative (or positive, obviously). If a clinician tried to do that to me, they would have a serious problem on their hands.

Frankly, the more I think about your response the more troubled I am. A test presents three possible results: Positive, Negative, Undetermined, and a patient gets an Undetermined result. “Any clinician would probably declare the patient to be antibody negative because…” Is it really the case that clinicians widely believe that it’s acceptable to lie to their patients in order to nudge them towards the appropriate behaviors? Did I misunderstand what you’re saying? I hope so, because if clinicians really do generally believe that, then wow that would be very disturbing. Personally I’ve always had good relationships with my doctors and never got a hint of that, but I’m just a patient so I could just be quite ignorant.

Thanks for the feedback. I appreciate it. I’m really curious, too, how you’d answer which test you’d rather take yourself, if you are inclined to answer. (and anyone else, please answer as well!)

Hi Damjan, thanks for the link to your paper. I’ve read the abstract, and it sounds like the same idea that I’ve had. Pretty cool.

Great, thanks for the recommendations, Matt.

I too am interested in this, as the institute I work in has set up a RT-PCR (so antigen, rather than antibody, but similar arguments apply) testing facility. Most of my stats experience is directed to basic biological science rather than medical diagnosis, but I’m quite uneasy that there’s a fixed threshold in the original protocol which is “determined using the receiver operator characteristic curves of the tested clinical samples”. I hesitate even to mention ROCs for a cohort in this forum.

I’m inclined to attempt some “probability of being positive” approach to avoid premature dichotomization - @f2harrell in the Diagnostic chapter of his wonderful BBR notes makes me think this might be a fruitful approach. In the RT-PCR case, I think there’s even more scope for improvement, as the ‘Ct’ that is used for the decision is a result of an (often implicit) curve-fitting procedure, so we (should) also have a per-patient goodness of fit characteristic (or a standard error on that Ct) - I have a lingering worry that failing to take this into account is leading to elevated false-negatives (Table 12 of that document from BGI I linked to earlier seems to have a higher incidence than I’d expect): if a growth-curve shows a good fit, but is not steep enough, then that may well be indicative of low (but non-zero) levels of antigen?

As I say, I’m not a medical statistician, so before I shout louder in my own institute about this, I thought I’d let this opinion be shot down here by the experts first - hope that’s ok?

Surprised that no one has brought up transposed conditionals yet which I think is the biggest issue with sens/spec values. These metrics tell you about the chance of the test being positive/negative given that the patient is positive/negative which is not very useful for clinical decision making. What you are interested in is P(patient being positive | test result).
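As a rough illustration of the transposed conditional, here is a short Python sketch that turns sensitivity and specificity into P(patient positive | test positive) and P(patient negative | test negative) via Bayes' rule. The 99% figures echo the roughly 1% error rates in the original example; the 5% prevalence is purely an assumed number, and `predictive_values` is a hypothetical helper.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a dichotomous test via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Assumed inputs: 99% sensitivity and specificity, 5% prevalence.
ppv, npv = predictive_values(0.99, 0.99, 0.05)
print(f"PPV = {ppv:.3f}, NPV = {npv:.4f}")
```

Under these assumptions roughly one in six positive calls would be wrong even though both sensitivity and specificity are 99%, which is the low-prevalence problem raised at the top of the thread.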
Then there is the issue of dichotomization of a continuous variable, which is obviously bad practice, but as far as I see, your proposal is to tri-chotomize it instead. This approach still suffers from the same problems of information loss.
The best way would be to use the raw values from the quantitative test to determine the posterior probability for the particular patient. These probabilities can be used afterwards appropriately in a separate decision making context, e.g. posterior probability of >99% getting a “passport”, 90-99% getting re-testing, etc. But this is a completely separate policy-level discussion which is now appropriately informed by the test statistics.
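A minimal Python sketch of that idea, under loud assumptions: the readings in each group are modelled as normal distributions with made-up parameters, the prevalence is an assumed 5%, and the names `posterior_positive` and `action` are hypothetical. A real analysis would estimate the distributions (and their uncertainty) from validated samples.

```python
from statistics import NormalDist

# Assumed, purely illustrative distributions of the raw reading in each group.
NEG = NormalDist(mu=2000, sigma=2500)
POS = NormalDist(mu=18000, sigma=5000)


def posterior_positive(value, prevalence, neg=NEG, pos=POS):
    """P(patient positive | raw reading), by Bayes' rule on the two densities."""
    num = prevalence * pos.pdf(value)
    den = num + (1 - prevalence) * neg.pdf(value)
    return num / den


def action(prob):
    """Policy step kept separate from the statistics (cutoffs from the post)."""
    if prob > 0.99:
        return "passport"
    if prob >= 0.90:
        return "re-test"
    return "other (not specified in the post)"


for reading in (2000, 9000, 20000):  # invented readings
    p = posterior_positive(reading, prevalence=0.05)
    print(reading, round(p, 4), action(p))
```

Keeping `action` apart from `posterior_positive` mirrors the point that the policy cutoffs are a separate discussion from the test statistics.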
As a side-note, all of this discussion presupposes a gold standard which could be used to determine if patients in the derivation cohort are truly negative or positive. I am not aware of any such method, unless you can recruit some “known negative” patients (known positives are not so difficult to find nowadays, however their antibody response could be variable)


That’s what I was struggling to convey, with my “probability of being positive” comment - it does seem absurd that no-one’s really doing that; I guess I’m just going to try to demand I get the data in my institute. The gold-standard for validation of antigen detection is, roughly, sequencing the sample against the viral genome along with clinical presentation. Antibody is substantially more problematic.