Interpretation of RCTs: Evidence vs. Hypothesis Generating

At least in clinical medicine, there’s a large problem: RCT findings are often left to the eye of the beholder. I am wondering whether anything has already been written that defines the appropriate way to interpret clinical trial results (primary outcomes, secondary outcomes, and secondary analyses). I tend to have the following dichotomization in mind when interpreting trial findings:

  1. Positive intervention with a clinically meaningful improvement in a patient outcome of importance.
    a. cost-effective intervention
    b. not cost-effective
  2. Positive intervention with minimal or no meaningful improvement in a patient outcome (not cost-effective).
  3. Absence of Evidence
    a. underpowered/low event rate
    b. large sample, but strong signal not detected. Unlikely to be clinically meaningful for future study
  4. Null (CI within clinically meaningless range) - rarely reported.
  5. Harmful
    a. Minimal clinical harm
    b. Serious clinical harm
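For illustration, a scheme like this could be crudely operationalized against a confidence interval for the treatment effect and a minimal clinically meaningful threshold (the function, names, and thresholds below are my own sketch, not a standard):

```python
def classify_result(ci_lo, ci_hi, mcid):
    """Crudely classify a trial's treatment-effect estimate (positive = benefit)
    from its confidence interval and a minimal clinically important difference.
    Illustrative sketch only; real interpretation needs clinical context."""
    if ci_lo >= mcid:
        return "positive, clinically meaningful"      # category 1
    if ci_lo > 0:
        return "positive, minimal/not meaningful"     # category 2
    if ci_hi < 0:
        return "harmful"                              # category 5
    if -mcid < ci_lo and ci_hi < mcid:
        return "null (CI within meaningless range)"   # category 4
    return "absence of evidence"                      # category 3

print(classify_result(0.05, 0.12, mcid=0.03))   # -> positive, clinically meaningful
print(classify_result(-0.05, 0.10, mcid=0.03))  # -> absence of evidence
```

Of course this ignores cost-effectiveness, event rates, and severity of harm, which is partly why I’m asking whether systematic written guidance exists.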

Next, I usually feel secondary results or post-hoc analyses will provide strong evidence, weak evidence, hypothesis-generating material for further evaluation, or no evidence. In an RCT setting, I tend to look for signals of statistical interaction that are biologically plausible in pre-specified sub-populations as weak evidence for further study.

My question is: has anything been written that provides systematic guidance on interpretation and the language used to describe trial results? Also, how does a frequentist or Bayesian framework influence our approach to the interpretation of secondary outcomes?

I accept the Greenland et al. Nature statement against dichotomization of statistical significance. But within an RCT framework, we need decision limits. If the primary outcome requires an a priori alpha of 0.05 and we measure a signal at p = 0.051, then we should not carve out a gray zone of acceptability. Bayesian adaptive designs obviously help with continued trial recruitment in the face of potentially promising results.
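To illustrate the kind of quantity an adaptive design monitors at an interim look, here is a minimal conjugate beta-binomial sketch; the interim counts and flat priors are invented for illustration:

```python
import random

random.seed(1)

# Hypothetical interim data (invented for illustration): events / patients
events_ctrl, n_ctrl = 60, 500
events_trt, n_trt = 45, 500

# Flat Beta(1, 1) priors; the conjugate update gives Beta(events + 1, non-events + 1).
# Monte Carlo estimate of the posterior probability that the treatment arm has a
# lower event rate -- the sort of quantity used to decide on continued recruitment.
draws = 20_000
wins = sum(
    random.betavariate(events_trt + 1, n_trt - events_trt + 1)
    < random.betavariate(events_ctrl + 1, n_ctrl - events_ctrl + 1)
    for _ in range(draws)
)
print(f"P(treatment rate < control rate | data) ~ {wins / draws:.2f}")
```

A posterior probability like this is continuous, so it avoids the p = 0.050 vs. p = 0.051 cliff, but the design still has to pre-specify a decision threshold for it.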

The inspiration for my post is the recent ISCHEMIA trial results, where one secondary outcome (of many) that was positive, quality of life (QOL), was emphasized. Additionally, there has been talk that the definitions of MI used were problematic, and that if we accounted for differences in MI severity we might be able to show meaningful improvements in MI (or in the composite of CV death/MI, despite CV death not being appreciably different between treatment arms). I consider all the above hypothesis generating.

For QOL, the study was unblinded, so bias is a concern when discussing symptom burden.

For the definition of MI, changing endpoint definitions and recombining them raises concern about post-hoc multiple testing: one might be able to shift thresholds enough to demonstrate a benefit for PCI. If a strong argument is made to revise methods that were flawed in the primary analytic plan, and doing so changes our measures of treatment effect, do we evaluate the data on a case-by-case basis, or should all such analyses be considered “hypothesis generating,” requiring further demonstration of validity? I think it’s helpful to have written guidance on how one should approach study findings and to use consistent language. Otherwise, we’re constantly inundated with personal preferences and spinning of results.

I’d appreciate others’ thoughts or references to consider.


FDA guidance documents can be pretty useful.


Great thoughts.

Part of the difficulty in appraising/interpreting studies for clinical context is the absence of a pre-defined minimal clinically important difference (MCID).

If any effect size greater than zero is sufficient to justify an intervention, then no study will ever be able to completely “rule out” possible benefit.

At some point, we have to say “improvement less than x% absolute or relative difference” means the studied intervention probably does not provide sufficient clinical benefit.

At the end of the day, if folks are interpreting RCTs without predefining the MCID, then expect the goal posts to continue to change if the results don’t go their way.

As for communicating study results, I would always start by asking “what is the MCID?” Is it one value? Do clinicians, patients, etc. have different MCIDs? How do we interpret the confidence interval and results given our clinical expectations?

At the end of the day, if we can’t define or agree upon an MCID, then we should not be surprised that results can be spun as desired.

( For reference, I made a similar point in a previous post in response to Dr. Mandrola comments on Orbita.)


RCT design papers should have a highlighted section on the MCID and the references behind that decision, especially since, as you say, this varies by patient based on their values. I am always struck by the number of older patients in clinic who don’t care about additional life-years and care more about outcomes related to QOL generally.


The ISCHEMIA design paper makes no mention of an MCID. There’s a power statement:

“To achieve this sample it was estimated that more than 10,000 participants would be enrolled, accounting for screen failures. The sample size was estimated to provide 90% power to detect a 15% relative reduction in the primary composite endpoint assuming the primary endpoint occurs within 4 years in 20% of the conservative strategy group and 17% of the invasive strategy group. An annual rate of CV death or MI in patients with at least moderate ischemia was estimated to be ~5% using data from the COURAGE trial and several observational stress imaging registries”

Is a 0.75% ARR/15% RR of a composite clinically meaningful?
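The quoted figures can be roughly sanity-checked with a two-proportion normal-approximation sample-size formula. This is a crude stand-in (the trial’s actual calculation was event-driven/time-to-event), but it shows where the 0.75%/year ARR comes from:

```python
from statistics import NormalDist

# Crude sanity check of the quoted power statement using a two-proportion
# normal-approximation formula. The trial's actual calculation was
# time-to-event, so this is only a ballpark, not a reconstruction.
p_cons, p_inv = 0.20, 0.17          # assumed 4-year event proportions
alpha, power = 0.05, 0.90

z_a = NormalDist().inv_cdf(1 - alpha / 2)
z_b = NormalDist().inv_cdf(power)
p_bar = (p_cons + p_inv) / 2

numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
             + z_b * (p_cons * (1 - p_cons) + p_inv * (1 - p_inv)) ** 0.5) ** 2
n_per_arm = numerator / (p_cons - p_inv) ** 2

arr = p_cons - p_inv
print(f"ARR = {arr:.3f} over 4 y (~{arr / 4:.2%} per year)")
print(f"RRR = {arr / p_cons:.0%}")
print(f"n   ~ {n_per_arm:.0f} per arm, {2 * n_per_arm:.0f} randomized total")
```

This lands around 3,500 per arm, i.e. ~7,000 randomized, consistent in order of magnitude with the >10,000 enrolled once screen failures are added.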


I suppose extrapolating the power calculation as a surrogate for the MCID is the best we can do for ISCHEMIA, but there are so many issues there. And as you rightly point out, is the “0.75% ARR/15% RR” composite meaningful? How does that translate into actual survival gains for patients?

It’s surprising that consideration of patient-specified values is not a standardized part of clinical trial design. I think it speaks volumes about the paternalism in medicine that we can spend $$$ looking for answers without first exploring whether they matter to our patients.

How many bits of information are you getting out of a trial when you pigeonhole its results in this way? I count 8 separate categories here. Max entropy for 8 categories is 1 byte. ISCHEMIA sounds pretty expensive; what was its cost per bit under this interpretive scheme?

I don’t think in terms of bytes; I’m not sure what you’re getting at. The trial was designed for one research question (1 byte?): does revascularization in moderate to severe ischemic disease (excluding left main stenosis) reduce mortality and MI? The secondary outcomes were numerous, with pre-specified sub-groups for interactions.


You had mentioned your concurrence with [1], I think, which mentions the surprisal S = -\log_2(P), which [2,3] elaborate more helpfully as a measure of the information that data provide against a hypothesis. (BTW, I carelessly inflated the informational content of your 8 categories; since they are mutually exclusive, they encode at most \log_2(8) = 3 bits of information.)
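As a worked example of the surprisal measure (the function name is mine):

```python
import math

def surprisal_bits(p):
    """Shannon surprisal of a p-value: bits of information against
    the test hypothesis (Greenland's S-value)."""
    return -math.log2(p)

# p = 0.05 carries ~4.3 bits against the null -- about as surprising as
# 4 fair coin tosses all landing heads; p = 0.051 is barely different.
for p in (0.05, 0.051, 0.005):
    print(f"p = {p}: S = {surprisal_bits(p):.2f} bits")
```

On this scale, the bright line between p = 0.05 and p = 0.051 that was discussed above is about 0.03 bits wide.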

It seems extraordinary to me that we extol these monster RCTs as “gold-standard” and what-not, when they yield (as generally interpreted by those who most celebrate them) only a few coin-flips’ worth of information.

Do you think this perspective undermines the interpretive scheme you have laid out?

  1. Amrhein V, Greenland S. Remove, rather than redefine, statistical significance. Nature Human Behaviour. September 2017. doi:10.1038/s41562-017-0224-0

  2. Greenland S. Invited Commentary: The Need for Cognitive Science in Methodology. Am J Epidemiol. 2017:1-7. doi:10.1093/aje/kwx259

  3. Chow ZR, Greenland S. Semantic and Cognitive Tools to Aid Statistical Inference: Replace Confidence and Significance by Compatibility and Surprise. arXiv:1909.08579 [q-bio, stat]. September 2019. Accessed October 1, 2019.
