Determine power to detect a misclassification event in rule-out pathways for myocardial infarction

The question concerns work we have recently conducted, looking at risk prediction using a serum biomarker in a study population presenting with chest pain to emergency departments. In clinical medicine, biomarkers such as cardiac troponin are used to classify patients into three categories:

  1. MI ruled-out
  2. MI possible
  3. MI ruled-in

Now let’s say, for the sake of argument, you have a study population of 1,000 patients with an MI prevalence of 17%. Across the entire cohort, MI is safely ruled out in 36% at first blood draw, ruled in in 12%, and ‘possible’ in the rest. The criterion for ‘safe’ rule-out is a negative predictive value (NPV) of ≥99.5% or a sensitivity of ≥99%. The cut-offs used for classification achieve these performance targets and misclassify no patient with MI as ‘rule-out’.
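To make the arithmetic concrete, here is a minimal sketch using only the proportions stated above (the rounding to whole patients is illustrative):

```python
# Worked arithmetic for the cohort described above.
n = 1000
prevalence = 0.17
n_mi = round(n * prevalence)               # 170 patients with MI
n_ruled_out = round(n * 0.36)              # 360 ruled out at first draw
n_ruled_in = round(n * 0.12)               # 120 ruled in
n_possible = n - n_ruled_out - n_ruled_in  # 520 in the 'possible' zone

# With zero MI patients in the rule-out group (no false negatives):
false_negatives = 0
npv = (n_ruled_out - false_negatives) / n_ruled_out  # observed NPV = 1.0
sensitivity = (n_mi - false_negatives) / n_mi        # observed sensitivity = 1.0
print(n_mi, n_ruled_out, n_possible, npv, sensitivity)
```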

We know from prior evidence that patients presenting very early to the ED might yield a ‘false negative’ result on first blood draw (this has to do with the release kinetics of cardiac biomarkers - if we test too early, the biomarker may not have had time to rise above detection/decision limits). 40% of patients in our cohort are such ‘early presenters’, i.e. they had their first blood draw within 3 hours of chest pain onset. None of those were misclassified (falsely ruled out) in our retrospective cohort study (indeed, none in the entire cohort were misclassified).

How would you go about describing the uncertainty surrounding this finding? Would you do a post-hoc power analysis to assess how likely it is you would have detected a change in NPV from 100% to e.g. 98%?
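One concrete way to frame that question is an exact calculation rather than a conventional power analysis. A sketch follows; note that the number of rule-out calls among the 400 early presenters is not stated above, so applying the overall 36% rule-out rate to them is purely an illustrative assumption:

```python
# If the true NPV among early presenters were 98% rather than 100%, how
# likely is it that we would have observed at least one false negative?
# Assumption (not from the data above): the overall 36% rule-out rate
# also applies within the 400 early presenters.
n_early_ruled_out = round(400 * 0.36)   # assumed: 144 rule-out calls

true_npv = 0.98
p_at_least_one_fn = 1 - true_npv ** n_early_ruled_out
print(f"P(>=1 false negative | true NPV = 98%): {p_at_least_one_fn:.3f}")

# The 'rule of three': with 0 events observed in n trials, an approximate
# one-sided 95% upper bound on the event rate is 3/n.
upper_bound_fn_rate = 3 / n_early_ruled_out
print(f"Approx. 95% upper bound on false-negative rate: {upper_bound_fn_rate:.3%}")
```

Under that assumption the chance of seeing at least one false negative would have been high (roughly 95%), which is one way of expressing how informative the observed zero is.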

It is the use of thresholds that makes such markers less clinically useful (e.g. by not having a gray zone), creates false positives, and makes test performance vary from population to population. Thresholds hinder progress, both clinically and in research. Relating continuous markers to the probability of a final positive diagnosis would be far better.


I know Frank, and I agree with you. Sadly, that’s not how clinical medicine operates at the moment. I was still interested in how you’d express the uncertainty surrounding a finding within a sub-population of your study cohort, hence my rather long explanation of the situation.

Consider the words of wisdom in these blog posts, and maybe the broader medical research community can move away from information-destroying practices.

The first link appears most relevant to your question.

For background, 2 additional articles are worth study, which can be found here:

Addendum: I suspect the medical practitioners are (correctly) concerned with the suspicion that these models are too complicated in practice, where decisions need to be made in a short period of time. A good question for the statisticians and decision theorists would be:

How can probabilistic decision making be adapted to critical situations where there is no time for the physician to consult a model (i.e. emergency medicine), but quick action needs to be taken?


Kia ora Tom

Frank et al … I continue to work on the clinicians to allow me to build prediction models without throwing away information. However, the reality for MI prediction in the ED is that there are a couple of biomarkers around that are, on their own, strongly predictive, so these methods of categorising patients into green, red and grey zones are remarkably effective.

Tom - So your NPV and sensitivity in your cohort are 100%? As bootstrapping won’t yield a confidence interval here, you could use something like the Wilson score method. I have a spreadsheet for that - email me and I’ll send it to you. Something else I have done is to create multiple (500+) synthetic data sets (using regression techniques to predict new values of biomarkers and outcomes); these can be used to create CIs too. However, for your purposes the Wilson score method is probably better.
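The Wilson score method John mentions can be sketched directly from its closed-form formula. Unlike the usual Wald interval, it does not collapse to zero width when the observed proportion is 0 or 1; the 144 rule-out calls below are an assumed illustrative count, not a figure from the study:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion.
    Gives a sensible (non-degenerate) interval even when the
    observed proportion is exactly 0 or 1."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# e.g. 144 rule-out calls, all correct (observed NPV = 100%):
lo, hi = wilson_ci(144, 144)
print(f"Wilson 95% CI for NPV: ({lo:.3f}, {hi:.3f})")  # lower bound ~0.974
```

So even with a perfect observed NPV, the interval honestly reports that the data are compatible with a true NPV as low as about 97.4% at this sample size.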

ps Tom - Sensitivity, or “sensitivity and NPV”? As I’ve argued elsewhere, we must never rely on NPV alone because there is a lot of variation in prevalence around the world.
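John’s point about prevalence can be shown in two lines: with sensitivity and specificity held fixed, NPV still moves with prevalence. The sensitivity and specificity values below are assumed for illustration, not taken from the study:

```python
# NPV as a function of prevalence, at fixed sensitivity and specificity.
def npv(sens, spec, prev):
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

sens, spec = 0.99, 0.40   # assumed illustrative values
for prev in (0.05, 0.17, 0.40):
    print(f"prevalence {prev:.0%}: NPV = {npv(sens, spec, prev):.4f}")
```

With these numbers the same test clears an NPV target of 99.5% at 5% prevalence but falls well short of it at 40% prevalence, which is exactly why NPV alone does not transport between populations.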

Thanks for your answer, John. You point out an important aspect here: the biomarker cut-offs are really used to operationalise the triage of patients with chest pain rather than to (truly) predict risk. To us biomarker nerds, this of course looks like a waste of valuable information, but the average clinician wants to know where a patient ought to be referred (inpatient vs outpatient management, for a start). COVID has only worsened the pressure on acute services to do this effectively.

As mentioned in my email, NPV and sensitivity are 100% due to the size (or lack thereof) of the validation cohort and, remarkably, no misclassifications (false negatives).
Your final point is important - reviewing my first post, I did indeed mean a sensitivity and NPV above those thresholds.

@R_cubed thanks for the links to these articles, I’m working my way through them as I’m writing this.

It’s not only a waste of information (requires more biomarkers to be measured to make up for the loss) but also it is easily shown that the thresholds are functions of other patient characteristics.

I don’t disagree, Frank! And that’s probably also one of the reasons why so many patients land in those grey zones of ‘indeterminate’ significance when algorithms are used with strict cutoff values - arguably, the 80-yr old patient with CKD is allowed to have a different troponin than the 35-yr old female with no prior history…


Thank you, Tom, for bringing up this issue.

There is a difference – not always fully appreciated – between evaluations of drugs and evaluations of medical tests, or test strategies.

In evaluations of drugs, the emphasis typically lies on the relative effect: drug versus comparator.

In evaluations of the clinical performance of medical tests, the emphasis typically lies on (pairs of) absolute proportions, such as the sensitivity and specificity, or the NPV and PPV, or NPV and overall proportion negatives (in rule-out tests).

The clinical population in which the drug/test will be applied typically differs from the study group, both in RCT of tests and in test accuracy studies.

I would therefore say that a report of an RCT is never fully informative for decision-making if it only reports the relative effect, or only the proportions, even if these are combined with expressions of statistical uncertainty.

In the evaluation of the troponin-based strategy to rule out MI, I would therefore like to see more information: the full (empirical) distribution in the study group of time to onset, the corresponding troponin results, and the corresponding (final) outcome. Preferably combined – if available and N is sufficient – with other baseline characteristics.

@Frank: when recommending a testing strategy, it is inevitable to define action thresholds. If combined with fully informative reporting, this does not hinder progress, as other action thresholds can be explored post-hoc (depending on study design).

When evaluating a test or testing strategy, I would also like to pre-specify the desired level of a key measure of performance (e.g. NPV 98%, or total failure rate, or total percentage negatives). Yes, I would also report the NPV with 95% confidence interval, but would not perform a post-hoc power analysis.


I’m having difficulty reconciling, in a rigorous statistical way, the above with the following insightful observation (also in the same post):

The clinical population in which the drug/test will be applied typically differs from the study group, both in RCT of tests and in test accuracy studies.

The participants in the RCTs will also differ from each other and are not sampled from any population, so arguments based on sample theory are out. But there is still valuable discriminant information in well done studies that can be useful.

Frank goes into detail in Chapter 18 of Biostatistics for Biomedical Research on how data-driven cut-points (aka decision thresholds) are doomed not to replicate. AFAICT, attempting to synthesize decisions instead of probabilities will add unnecessary statistical heterogeneity to any evidence synthesis of diagnostic studies, unless the signal is overwhelming. A study of the threads by @llynn on the problems with 30+ years of sepsis research would be valuable.

After much searching, I finally found this forest plot of 30 years of widely published sepsis studies. Credit also goes to Dr. Lawrence Lynn. This graphic really should be archived somewhere around here. A look shows nothing but heterogeneous control groups from the studies, centered around the null. The only improvement on the chart would be to list the studies in order of publication.

I concede that cut points for decision in practice are inevitable. But these are decisions based on cost/benefit (ie. utility) considerations, where values and assumptions should be explicit, with input from all stakeholders.

Utilities should not be smuggled in under the guise of “evidence synthesis”. This occurs when dichotomization of “objective” information is substituted for the raw information itself. That permits all sorts of gamesmanship that is well outside the domain of strict scientific assessment.

That task of determining decision thresholds is distinct from information collection and synthesis at the local study level or aggregate level, where mathematics and statistics can provide guidance.
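The separation argued for above can be made explicit with the standard expected-cost result from decision theory: the probability threshold for acting is a function of the stakeholders’ error costs alone, and is applied to the probability the study supplies. The cost ratio below is a hypothetical placeholder, not a clinical recommendation:

```python
# Expected-cost-minimising decision threshold: act (e.g. admit) when
# P(disease) >= cost_fp / (cost_fp + cost_fn). The costs are explicit,
# stakeholder-supplied utilities, kept separate from the evidence itself.
def decision_threshold(cost_false_positive, cost_false_negative):
    return cost_false_positive / (cost_false_positive + cost_false_negative)

# Hypothetical: missing an MI judged 99x as costly as an unnecessary admission.
print(decision_threshold(1, 99))   # 0.01 -> admit at P(MI) >= 1%
```

Changing the cost ratio changes the threshold without touching the underlying probability model, which is precisely the division of labour argued for above.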


What does diagnostic threshold mean? Deterministic and probabilistic considerations | SpringerLink