At least in clinical medicine, there is a widespread problem of RCT findings being left to the eye of the beholder. I am wondering if anything has already been written that defines an appropriate way to interpret clinical trial results (primary outcomes, secondary outcomes, and secondary analyses). I tend to have the following categorization in my mind when interpreting trial findings:
- Positive intervention with a clinically meaningful improvement in a patient outcome of importance.
  a. cost-effective intervention
  b. not cost-effective
- Positive intervention with minimal or no meaningful improvement in a patient outcome (not cost-effective).
- Absence of Evidence
  a. underpowered / low event rate
  b. large sample, but no strong signal detected; unlikely to be clinically meaningful for future study
- Null (CI entirely within a clinically meaningless range), which is rarely reported.
- Harmful
  a. minimal clinical harm
  b. serious clinical harm
Next, I usually feel that secondary results or post-hoc analyses provide one of the following: strong evidence, weak evidence, hypothesis-generating signals for further evaluation, or no evidence. In an RCT setting, I tend to treat biologically plausible signals of statistical interaction in pre-specified subpopulations as weak evidence warranting further study.
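To illustrate the kind of interaction signal I mean, here is a minimal Python sketch (statsmodels) of a treatment-by-subgroup interaction test on simulated data. Everything here is hypothetical: the subgroup, effect sizes, and sample size are made up purely to show the mechanics, and even a small interaction p-value here would count only as weak evidence for further study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical simulated trial: names, effect sizes, and sample size
# are invented purely to illustrate the interaction test.
rng = np.random.default_rng(0)
n = 2000
treat = rng.integers(0, 2, n)      # 1 = intervention, 0 = control
subgroup = rng.integers(0, 2, n)   # pre-specified subgroup flag (e.g., diabetes)

# Assume the treatment effect on the log-odds scale is larger in the subgroup.
logit = -1.0 - 0.3 * treat + 0.2 * subgroup - 0.4 * treat * subgroup
event = rng.binomial(1, 1 / (1 + np.exp(-logit)))
df = pd.DataFrame({"event": event, "treat": treat, "subgroup": subgroup})

# 'treat * subgroup' expands to both main effects plus the interaction;
# the interaction term is the signal worth flagging, not proof of effect.
fit = smf.logit("event ~ treat * subgroup", data=df).fit(disp=False)
print(fit.params["treat:subgroup"], fit.pvalues["treat:subgroup"])
```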
My question: has anything been written that provides systematic guidance on the interpretation of trial results and on the language used to describe them? Also, how does a frequentist or a Bayesian framework influence our approach to interpreting secondary outcomes?
I accept the Greenland et al. statement in Nature against dichotomizing statistical significance. But within an RCT framework, we need decision limits. If the primary outcome requires an a priori alpha of 0.05 and we observe p = 0.051, then we should not carve out a gray zone of acceptability. Bayesian adaptive designs obviously help with continued trial recruitment in the face of potentially promising results.
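To make the frequentist/Bayesian contrast concrete, here is a minimal sketch of how a Bayesian summary reframes a result sitting at p = 0.051. All numbers are invented for illustration and are not from any real trial: a normal approximation to the likelihood for a log hazard ratio is combined with a skeptical prior centered on no effect, yielding graded posterior probabilities instead of a binary verdict.

```python
import numpy as np
from scipy import stats

# Hypothetical numbers chosen so the two-sided p-value is ~0.051 (z ~ 1.95).
log_hr_hat = -0.20   # observed log hazard ratio (invented)
se = 0.1026          # its standard error (invented)
p_two_sided = 2 * stats.norm.sf(abs(log_hr_hat / se))

# Skeptical normal prior on the log HR, centered at no effect.
prior_mean, prior_sd = 0.0, 0.35

# Conjugate normal update (normal likelihood approximation).
post_var = 1 / (1 / prior_sd**2 + 1 / se**2)
post_mean = post_var * (prior_mean / prior_sd**2 + log_hr_hat / se**2)
post_sd = np.sqrt(post_var)

# Posterior probability of any benefit (HR < 1) and of a
# "clinically meaningful" benefit (HR < 0.90, an arbitrary stand-in).
p_benefit = stats.norm.cdf(0, post_mean, post_sd)
p_meaningful = stats.norm.cdf(np.log(0.90), post_mean, post_sd)
print(f"p = {p_two_sided:.3f}, Pr(HR<1) = {p_benefit:.2f}, "
      f"Pr(HR<0.90) = {p_meaningful:.2f}")
```

On these invented numbers, Pr(HR < 1) is about 0.97 even though p just misses 0.05, while Pr(HR < 0.90), the stand-in for a clinically meaningful effect, is only about 0.79. That kind of graded statement seems closer to what secondary outcomes deserve than a significant/non-significant verdict.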
The inspiration for my post is the recent ISCHEMIA trial, in which one secondary outcome (of many) that was positive, quality of life (QOL), was emphasized. Additionally, there has been talk that the definitions of MI used were problematic, and that if we accounted for differences in MI severity we might be able to show a meaningful improvement in MI (or in the composite of CV death/MI, despite CV death not differing appreciably between treatment arms). I consider all of the above hypothesis generating.
For QOL, the study was unblinded, so bias is a concern when assessing symptom burden.
For the definition of MI, changing endpoint definitions and recombining them raises concerns about post-hoc multiple testing: one might be able to shift thresholds enough to demonstrate a benefit for PCI. If a strong argument is made that methods in the primary analytic plan were flawed and should be revised, and that revision changes our estimates of treatment effect, do we evaluate the data on a case-by-case basis, or should all such analyses be considered "hypothesis generating," requiring further demonstration of validity? I think it is helpful to have written guidance on how one should approach study findings, using consistent language; otherwise we are constantly inundated with personal preferences and spinning of results.
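As a crude illustration of the multiplicity problem, at a minimum one could adjust across the set of alternative endpoint definitions actually examined. The p-values below are invented; a serious approach would also account for the correlation among re-analyses of the same events (e.g., via permutation), and none of this rescues a post-hoc analysis from being hypothesis generating.

```python
from statsmodels.stats.multitest import multipletests

# Invented p-values from re-analyses under alternative MI definitions
# and recombined composites (illustrative only).
pvals = [0.03, 0.04, 0.08, 0.12, 0.20]
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for p, pa, r in zip(pvals, p_adj, reject):
    print(f"raw p = {p:.3f} -> Holm-adjusted p = {pa:.3f}, reject: {r}")
```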
I appreciate others' thoughts and any references to consider.