Interpreting a not quite negative result

Dear all,

I’m working on a subgroup analysis of an RCT for rapid identification of an important clinical feature. Sorry for being vague, but it’s sensitive (the main RCT has not yet been published). Essentially, there are two outcomes: the device states the patient has clinical state A (the severe phenotype), or clinical state B (the mild one).

                  Test 0   Test 1
    Reference 0      517        0
    Reference 1        0      191

As you can see, the test is very good: 100% sensitive and 100% specific.

Now, I know from clinical experience that the machine doesn’t always work, and that our baseline prevalence of the severe phenotype is around 24%. If we also include the situations where the machine doesn’t work, we get this table (where 0 = test didn’t work, 1 = test did work and is negative, 2 = test did work and is positive):

                  Test 0   Test 1   Test 2
    Reference 0      124      517        0
    Reference 1       13        0      191

Clearly, you can see that the post-test probability of the severe phenotype is reduced from ~24% pre-test to ~10% post-test when the instrument fails to give a result.
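For concreteness, a minimal sketch of that arithmetic in Python (the row/column labels are just my reading of the table above):

```python
# Counts from the table: rows = reference (mild / severe),
# columns = test outcome (failed, negative, positive).
table = {
    "mild":   {"failed": 124, "negative": 517, "positive": 0},
    "severe": {"failed": 13,  "negative": 0,   "positive": 191},
}

n_total  = sum(sum(row.values()) for row in table.values())       # 845
n_severe = sum(table["severe"].values())                          # 204
pre_test = n_severe / n_total                                     # ~0.24

n_failed  = table["mild"]["failed"] + table["severe"]["failed"]   # 137
post_test = table["severe"]["failed"] / n_failed                  # ~0.095

print(f"pre-test P(severe)                = {pre_test:.3f}")
print(f"post-test P(severe | test failed) = {post_test:.3f}")
```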

So, how do I report this? Clearly, I can generate an LR (in this case, a negative LR of just over 2), which might help clinicians, but I can’t see that generating a sensitivity or specificity table is of value here (we are only interested in when the test doesn’t perform; who cares what the sens/spec of instrument failure is?).

And if I am to generate a negative LR, how do I generate confidence intervals? By bootstrapping?
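For reference, this is the sort of thing I had in mind - a rough sketch only, treating "instrument failed" as a test result with an odds-based LR, with both a simple parametric bootstrap and the standard log-transform interval as a cross-check (the counts are my reading of the table above, and the exact figure obviously depends on how the LR is defined; happy to be corrected):

```python
import numpy as np

rng = np.random.default_rng(0)

# Failures among reference-positive (severe) and reference-negative (mild) patients.
fail_severe, n_severe = 13, 204
fail_mild,   n_mild   = 124, 641

def lr_failure(fs, fm):
    """LR of a 'test failed' result: P(fail | severe) / P(fail | mild)."""
    return (fs / n_severe) / (fm / n_mild)

point = lr_failure(fail_severe, fail_mild)   # ~0.33 with these counts and this definition

# Simple parametric bootstrap: resample the two failure counts as binomials.
boot = []
for _ in range(10_000):
    fs = rng.binomial(n_severe, fail_severe / n_severe)
    fm = rng.binomial(n_mild, fail_mild / n_mild)
    if fs > 0 and fm > 0:
        boot.append(lr_failure(fs, fm))
boot_ci = np.percentile(boot, [2.5, 97.5])

# Standard log-transform interval for a ratio of two proportions, as a cross-check.
se_log = np.sqrt(1 / fail_severe - 1 / n_severe + 1 / fail_mild - 1 / n_mild)
log_ci = np.exp(np.log(point) + np.array([-1.96, 1.96]) * se_log)

print(f"LR(test failed) = {point:.2f}")
print(f"bootstrap 95% CI = {boot_ci.round(2)}, log-method 95% CI = {log_ci.round(2)}")
```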

Interested in people’s thoughts on how to present this in a scientific and meaningful way.
Clearly, we have other predictors and subgroups within this dataset (it is an n ≈ 2,500, multi-centre RCT), and there is substantial variation in phenotype prevalence across them (e.g. the ICU has a very different level to the ED), although the negative LR does appear to hold across all settings.

Thanks in advance - interested clinician, bad statistician!

I hope you don’t mind; I re-labelled the data to keep consistency between the two methods.

[images: re-labelled contingency tables]

If the instrument fails in 137 of 845 attempts (~16% of the time), there are most likely some larger concerns than just examining the sensitivity and specificity. Of more concern is that the number of positives (defined by reference) within the group that failed does not align with the estimated prevalence in the population (13/137 ≈ 9.5% vs. ~24%).
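If it helps, a quick sketch of how far 13/137 sits from the expected 24% under a simple binomial assumption (counts taken from your table; Python with scipy):

```python
from scipy import stats

# 13 reference-positive (severe) cases among the 137 instrument failures,
# compared against an expected prevalence of ~24% (simple binomial assumption).
res = stats.binomtest(k=13, n=137, p=0.24, alternative="two-sided")
ci = res.proportion_ci(confidence_level=0.95, method="exact")

print(f"observed proportion      = {13 / 137:.3f}")
print(f"exact 95% CI             = ({ci.low:.3f}, {ci.high:.3f})")
print(f"two-sided p-value vs 24% = {res.pvalue:.4g}")
```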

You are correct that the prevalence of the population will affect your analysis (especially PPV and NPV). Confidence intervals might help here (binomial exact for sens/spec and prevalence-adjusted confidence intervals for PPV and NPV).
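As a rough sketch of what I mean (Python; the counts are my reading of your table, the 24% prevalence is assumed fixed, and the PPV/NPV "range" is just a crude plug-in of the sens/spec limits rather than a formal interval):

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint

# Completed tests only: 191 reference-positives, all test-positive;
# 517 reference-negatives, all test-negative.
tp, fn = 191, 0
tn, fp = 517, 0

sens = tp / (tp + fn)
spec = tn / (tn + fp)

# Exact (Clopper-Pearson) 95% intervals for sensitivity and specificity.
sens_ci = proportion_confint(tp, tp + fn, alpha=0.05, method="beta")
spec_ci = proportion_confint(tn, tn + fp, alpha=0.05, method="beta")

# Prevalence-adjusted PPV/NPV via Bayes' rule at an assumed 24% prevalence.
prev = 0.24
def ppv(se, sp): return se * prev / (se * prev + (1 - sp) * (1 - prev))
def npv(se, sp): return sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)

print(f"sensitivity = {sens:.3f}, exact 95% CI = {np.round(sens_ci, 3)}")
print(f"specificity = {spec:.3f}, exact 95% CI = {np.round(spec_ci, 3)}")
print(f"PPV @ 24% prevalence = {ppv(sens, spec):.3f} "
      f"(plug-in range {ppv(sens_ci[0], spec_ci[0]):.3f} to {ppv(sens_ci[1], spec_ci[1]):.3f})")
print(f"NPV @ 24% prevalence = {npv(sens, spec):.3f} "
      f"(plug-in range {npv(sens_ci[0], spec_ci[0]):.3f} to {npv(sens_ci[1], spec_ci[1]):.3f})")
```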

When I have run studies in the past and found perfect sensitivity and perfect specificity, I am usually skeptical (for example, the samples having been spiked seropositive rather than being real-world samples).

Statistical Methods in Diagnostic Medicine by Zhou, Obuchowski, and McClish might shed some light, although I just looked through it for good examples of a similar situation, with no such luck.

Perhaps, instead of trying to infer some type of sensitivity for the assay, you could just report what you observed: “Out of the 845 subjects in this subgroup, ~16% (137/845) failed by test method B. Of the 708 subjects with complete paired testing for reference method A and test method B, both methods found 191 positive results and 517 negative results. No discrepant results were observed in this sample.”

Thanks so much. I’m sorry I can’t be clearer - I’m actually analysing a subgroup of an RCT that is with the hanging committee of a journal, so I don’t want to prejudice that process at all.

Your comments are very sensible. I don’t think it makes sense to use sens and spec in this context.

To explain: this test is already in routine practice, so we are unlikely to stop using it. I’m just trying to see whether more information can be gained from instrument failure, which does appear to be the case. FH