"Observed Power" and other "Power" Issues

@ADAlthousePhD thanks for noting that and tagging me about it. The study in question (“Effect of Escitalopram vs Placebo Treatment for Depression on Long-term Cardiac Outcomes in Patients With Acute Coronary Syndrome: A Randomized Clinical Trial”) used the observed hazard ratio (HR = 0.69) to calculate the power they had achieved to detect differences in long-term major adverse cardiac events (MACE). They concluded that they’d achieved 89.7% power, which is higher than the more conservative of their design-stage estimates:

“Assuming 2-sided tests, α = .05, and a follow-up sample size of 300, the expected power was 70% and 96% for detecting 10% and 15% group differences, respectively.”

^^^ That’s what they wrote in the methods section about MACE incidence as an outcome. Then they wrote this in the results section:

“A significant difference was found: composite MACE incidence was 40.9% (61/149) in the escitalopram group and 53.6% (81/151) in the placebo group (hazard ratio [HR], 0.69; 95% CI, 0.49-0.96; P = .03). The model assumption was met (Schoenfeld P = .48). The estimated statistical power to detect the observed difference in MACE incidence rates between the 2 groups was 89.7%.”

Here’s what I wrote in my letter to the editor, which ended up getting rejected (no surprise at all there).

Kim et al (1) present data from a randomized trial showing that treatment with escitalopram for 24 weeks lowered the risk of major adverse cardiac events (MACE) in depressed patients following recent acute coronary syndrome. The authors should be commended for exploring new approaches to reducing major cardiac events. However, in their paper, they seem to misunderstand the purpose of statistical power. In hypothesis testing, power is the probability of correctly rejecting the null hypothesis, given that the alternative hypothesis is true. (2) Power analyses are useful for designing studies but lose their utility after the study data have been analyzed. (3)
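In symbols (the standard formulation), with $\beta$ denoting the Type II error rate:

$$\text{power} = \Pr(\text{reject } H_0 \mid H_1 \text{ true}) = 1 - \beta$$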

In the design phase, Kim et al used data from the Korea Acute Myocardial Infarction Registry (KAMIR) study and expected the power of their study to be 70% and 96% for detecting 10% and 15% group differences, respectively, with a sample of 300 participants. This is an appropriate use of power analysis, often referred to as “design power.” It is based on the study design and beliefs about the true population structure. After completing the study and analyzing the data, the authors calculated how much power their test had, based on the observed hazard ratio (HR = 0.69). They estimated that their test had 89.7% power to detect the observed between-group difference in MACE incidence rates. This is referred to as “observed power” because it is calculated from the observed study data. (3) This form of power analysis is misleading.

One cannot know the true power of a statistical test because there is no way to be sure that the effect-size estimate from the study (HR = 0.69) is the true parameter. Therefore, one cannot be certain that the observed power (89.7%) is the true power of the test. Observed power can lull researchers into a false sense of confidence in their results. Furthermore, observed power, which is a one-to-one function of the p-value, yields no relevant information not already provided by the p-value. (3,4)
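A minimal sketch of that one-to-one mapping for a two-sided z-test: given only the p-value and α, the observed power is fully determined, so it cannot carry any new information.

```python
from scipy.stats import norm

def observed_power(p_value, alpha=0.05):
    """'Observed' (post hoc) power of a two-sided z-test, treating the
    observed effect as if it were the true effect. The result depends
    only on p_value and alpha; nothing else enters."""
    z_obs = norm.ppf(1 - p_value / 2)   # |z| implied by the two-sided p-value
    z_crit = norm.ppf(1 - alpha / 2)    # rejection threshold at level alpha
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

print(observed_power(0.05))  # ~0.50: a just-significant result has ~50% observed power
print(observed_power(0.03))  # ~0.58 under this z-test convention
```

The exact mapping depends on the test used (which is presumably why the figure above differs from the authors’ 89.7%), but for any given test the p-value pins observed power down completely, so reporting it alongside the p-value adds nothing.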

Statistical power is useful for designing future studies, but once the data from a study have been produced, it is the effect sizes, the p-values, and the confidence intervals that contain the insightful information. Observed power adds little to this and can be misleading. (3) If one wishes to see how uncertain one should be about a point estimate, one should look at the width of the confidence interval and the values it covers. (2)
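As a worked example of reading uncertainty directly off the interval: the reported HR and 95% CI alone are enough to recover the standard error of the log hazard ratio and, approximately, the p-value. This sketch assumes a log-symmetric (Wald-type) interval, as is conventional for Cox models.

```python
import numpy as np
from scipy.stats import norm

# Reported by Kim et al: HR = 0.69, 95% CI 0.49 to 0.96
log_hr = np.log(0.69)
se_log_hr = (np.log(0.96) - np.log(0.49)) / (2 * norm.ppf(0.975))  # ~0.172
z = log_hr / se_log_hr                                             # ~ -2.16
p = 2 * norm.sf(abs(z))                                            # ~0.03, matching the reported P value
print(se_log_hr, z, p)
```

The interval spans hazard ratios from roughly a 51% risk reduction down to a 4% reduction, which communicates the uncertainty about the estimate directly; no power calculation is needed.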

References:

  1. Kim J-M, Stewart R, Lee Y-S, et al. Effect of Escitalopram vs Placebo Treatment for Depression on Long-term Cardiac Outcomes in Patients With Acute Coronary Syndrome: A Randomized Clinical Trial. JAMA. 2018;320(4):350-358. doi:10.1001/jama.2018.9422

  2. Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337-350. doi:10.1007/s10654-016-0149-3

  3. Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat. 2001;55(1):19-24. doi:10.1198/000313001300339897

  4. Greenland S. Nonsignificance plus high power does not imply support for the null over the alternative. Ann Epidemiol. 2012;22(5):364-368. doi:10.1016/j.annepidem.2012.02.007
