@zad, I saw this comment of yours on Andrew Gelman’s website and wondered if you would be willing to discuss here:
“I submitted something similar to JAMA last month regarding a trial that investigated the effects of an antidepressant for cardiovascular disease outcomes. The authors made power calculations via the effect sizes observed in the study and concluded that they had achieved more power than they had planned for and therefore put more confidence in their results. Unfortunately, it got rejected, as do most letters to the editor in JAMA about methodology and statistics.”
Can you post the original article to which you refer, and maybe even an abridged version of your letter? I think it might make for a fertile discussion here about misconceptions of power calculations.
“Assuming 2-sided tests, α = .05, and a follow-up sample size of 300, the expected power was 70% and 96% for detecting 10% and 15% group differences, respectively.”
^^^ That’s what they wrote in the methods section for MACE incidence as an outcome. Then they wrote this in the results section,
“A significant difference was found: composite MACE incidence was 40.9% (61/149) in the escitalopram group and 53.6% (81/151) in the placebo group (hazard ratio [HR], 0.69; 95% CI, 0.49-0.96; P = .03). The model assumption was met (Schoenfeld P = .48). The estimated statistical power to detect the observed difference in MACE incidence rates between the 2 groups was 89.7%.”
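As an aside: it's not clear exactly how that 89.7% was computed (presumably from the survival model), but the general recipe for "observed power" is just to plug the observed results back into a power formula. Here's a rough sketch with the published event rates, using a two-proportion approximation of my own choosing, which is why it doesn't reproduce their exact figure:

```python
# Sketch: "observed power" for a two-proportion comparison, treating the
# published MACE rates (61/149 vs 81/151) as if they were the true rates.
# The paper's 89.7% presumably came from the survival model, so this
# approximation won't match it; the point is only that the number is a
# deterministic function of the observed data.
from scipy.stats import norm

def observed_power_two_prop(x1, n1, x2, n2, alpha=0.05):
    p1, p2 = x1 / n1, x2 / n2
    diff = abs(p2 - p1)
    pbar = (x1 + x2) / (n1 + n2)
    se_null = (pbar * (1 - pbar) * (1 / n1 + 1 / n2)) ** 0.5   # SE under H0
    se_alt = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5  # SE under H1
    z_crit = norm.ppf(1 - alpha / 2)
    # Probability a two-sided test rejects, if the observed rates were the truth
    return (norm.cdf((diff - z_crit * se_null) / se_alt)
            + norm.cdf((-diff - z_crit * se_null) / se_alt))

print(observed_power_two_prop(61, 149, 81, 151))  # ~0.60 under this approximation
```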
Here’s what I wrote in my letter to the editor which ended up getting rejected (no surprise at all there).
Kim et al (1) present the data from a randomized trial showing that treatment with escitalopram for 24 weeks lowered the risk of major adverse cardiac events (MACE) in depressed patients following recent acute coronary syndrome. The authors should be commended for exploring new approaches to reduce major cardiac events. However, in their paper, they seem to misunderstand the purpose of statistical power. In hypothesis testing, power is the probability of correctly rejecting the null hypothesis, given that the alternative hypothesis is true. (2) Power analyses are useful for designing studies but lose their utility after the study data have been analyzed. (3)
In the design phase, Kim et al used data from the Korea Acute Myocardial Infarction Registry (KAMIR) study and expected the power of their study to be 70% and 96% for detecting 10% and 15% group differences, respectively, with a sample of 300 participants. This is an appropriate use of power analysis, often referred to as “design power.” It is based on the study design and beliefs about the true population structure. After completing the study and analyzing the data, the authors calculated how much power their test had based on the hazard ratio (HR=0.69). They estimated that their test had 89.7% power to detect between-group differences in MACE incidence rates. This is referred to as “observed power” because it is calculated from the observed study data. (3) This form of power analysis is misleading.
One cannot know the true power of a statistical test because there is no way to be sure that the effect size estimate from the study (HR=0.69) is the true parameter. Therefore, one cannot be certain that the observed power (89.7%) is the true power of the test. Observed power can often mislead researchers into having a false sense of confidence in their results. Furthermore, observed power, which is a 1:1 function of the p-value, yields no relevant information not already provided by the p-value. (3,4)
Statistical power is useful for designing future studies, but once the data from a study have been produced, it is the effect sizes, the p-values, and the confidence intervals that contain insightful information. Observed power adds little to this and can be misleading. (3) If one wishes to see how uncertain one should be about a point estimate, one should look at the width of the confidence interval and the values it covers. (2)
References:
1. Kim J-M, Stewart R, Lee Y-S, et al. Effect of Escitalopram vs Placebo Treatment for Depression on Long-term Cardiac Outcomes in Patients With Acute Coronary Syndrome: A Randomized Clinical Trial. JAMA. 2018;320(4):350-358. doi:10.1001/jama.2018.9422
2. Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337-350. doi:10.1007/s10654-016-0149-3
3. Hoenig JM, Heisey DM. The Abuse of Power. Am Stat. 2001;55(1):19-24. doi:10.1198/000313001300339897
4. Greenland S. Nonsignificance plus high power does not imply support for the null over the alternative. Ann Epidemiol. 2012;22(5):364-368. doi:10.1016/j.annepidem.2012.02.007
I do hope @f2harrell will forgive my use of a humorous GIF in what is generally meant to be a serious place of discussion. If that’s problematic, Frank, I’m happy to remove it.
But, seriously, what the heck was the purpose of reporting that sentence? And how in the world did this get past JAMA statistical review? The non-acceptance of your letter is very much in character with @Sander’s post in another thread. That’s an absolutely absurd statement which has no place in a statistical results section, and (as you pointed out) it’s an example of the fallacy Daniel Lakens describes in his blog post on “observed power” (he’s not the first to make the point; it’s just the piece I reference when asked about the subject): observed power is a direct function of the p-value, so it provides no other relevant information.
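A minimal sketch of that last point, assuming a two-sided z-test at alpha = 0.05 as the stand-in (the exact transformation depends on the test, but the one-to-one relationship is general):

```python
# "Observed power" as a monotone function of the two-sided p-value for a
# z-test at alpha = 0.05: it carries no information the p-value doesn't.
from scipy.stats import norm

def observed_power_from_p(p, alpha=0.05):
    z_obs = norm.ppf(1 - p / 2)          # |z| implied by the two-sided p-value
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

for p in (0.20, 0.05, 0.01, 0.005):
    print(f"p = {p:.3f} -> observed power ~ {observed_power_from_p(p):.2f}")
# p = 0.200 -> observed power ~ 0.25
# p = 0.050 -> observed power ~ 0.50
# p = 0.010 -> observed power ~ 0.73
# p = 0.005 -> observed power ~ 0.80
```

Note that p = 0.05 corresponds to observed power of roughly 50%, which is why a “nonsignificant” result can never have high observed power.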
I am sorry that JAMA rejected your letter. It seems that this should be published or recorded somewhere. Any thoughts from the peanut gallery on where this could be published, and on the concept of “observed power” as reported in the literature?
@ADAlthousePhD, it’s kind of ridiculous. I really don’t know how a lot of things make it past review. I’ve submitted so many of these things (some have been accepted, but they take almost forever to publish), but I find that it’s probably a lot more helpful to publish the critique as a blog post, where researchers can easily access it and where the authors aren’t limited to a 300-400 word count and a handful of references.
We’ve had some discussion of this on Twitter recently (I believe you’ve participated) about the apparent futility of writing letters to the editor to correct methodological or statistical errors. It is humorous: occasionally, when one of us posts a Twitter thread about such a problem, we inevitably get put on blast by at least one member of the tone police asking why we didn’t write a polite letter to the editor first. This is why!
@zad, I think this would be well worth a Twitter thread of your own, if you care to share it. It should be pretty easy to break your letter down into some bite-sized tweets, and it would be great to share around the Twitter-sphere. It could also be an effective way to tag JAMA and tweak them a little bit for ignoring an obvious error.
I suppose another option would be to post it on https://arxiv.org, which would then be a permanent, linkable (and potentially referenced?) resource, even if not directly peer reviewed. I’ve been mulling a JAMA letter to the editor on a completely different manuscript and subject, of course expecting to be rejected as well…
Does anyone remember The Lazlo Letters by Lazlo Toth (Don Novello)? Maybe publishing the letter to JAMA, and email correspondence with the journal would be fun to read.
But more seriously I can imagine that a journal would routinely dismiss letters because they cost column space with little return. I think the letter by @zad is really nice. But the potential impact of a letter might not be sufficiently enticing for a journal. Perhaps if the letter can be recast as a commentary or otherwise serve a teaching role it might stand a better chance.
And, perhaps the JAMA statistical reviewers could be invited to join in a commentary or editorial reminding readers about hypothesis testing and statistical power and the limits of those tools.
And if JAMA does not like it, maybe the Annals of Internal Medicine, Lancet, BMJ, would appreciate the commentary.
Perhaps others have had a different experience, but my (brief) experience has generally been that journals do not like to publish direct responses and/or critiques of a specific article published in another journal. There are various reasons why this may be true.
Oddly enough, JAMA has been doing sort of a running series of articles carrying the banner “JAMA Guide to Statistics and Methods” (like this one):
So they aren’t averse to having methods discussions. It’s just disappointing that someone could write a brief, accurate letter describing a very specific problem in a published article, and have that simply brushed off.
…has this to say about post-hoc power analysis (p181):
A power analysis can help with the interpretation of study findings when statistically significant effects are not found. However, because the findings in the study by Koegelenberg et al1 were statistically significant, interpretation of a lack of significance was unnecessary. If no statistically significant difference in abstinence rates had been found, the authors could have noted that, “The study was sufficiently powered to have a high chance of detecting a difference of 14% in abstinence rates. Thus, any undetected difference is likely to be of little clinical benefit.”
So that is interesting, right? Certainly implies to me the topic could be updated.
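For contrast with “observed power,” here is roughly what the design-stage calculation the Guide has in mind looks like. The inputs below (a 40% control abstinence rate and 200 patients per arm) are hypothetical placeholders, not the actual numbers from the Koegelenberg trial:

```python
# Design ("a priori") power: chosen before seeing the data, based on assumed
# rates. The 40% control rate and n = 200 per arm are hypothetical
# placeholders, not the Koegelenberg trial's actual figures.
from scipy.stats import norm

def design_power_two_prop(p_control, p_treat, n_per_arm, alpha=0.05):
    diff = abs(p_treat - p_control)
    pbar = (p_control + p_treat) / 2
    se_null = (2 * pbar * (1 - pbar) / n_per_arm) ** 0.5
    se_alt = (p_control * (1 - p_control) / n_per_arm
              + p_treat * (1 - p_treat) / n_per_arm) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)
    return (norm.cdf((diff - z_crit * se_null) / se_alt)
            + norm.cdf((-diff - z_crit * se_null) / se_alt))

# Power to detect a 14-percentage-point difference (40% vs 54%):
print(design_power_two_prop(0.40, 0.54, n_per_arm=200))  # ~0.80
```

The difference is that everything here is specified before any data are collected; nothing is circular.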
Yeah, that inspired some chatter between a few of us here and on Twitter. Also worth noting, the surgeons wrote a reply to Gelman’s letter which included a few astonishingly arrogant and naive statements. My favorite was something about how “unique” surgical science is and why that made post hoc power calculations necessary in surgical papers. Really incredible stuff, haha.
Also in the response to Gelman’s letter was this gem:
Basically: “we don’t think we can do a priori power calculations, but we think all authors should include post hoc power in their articles.” Incredible.
Yes, statistical habits in surgery are a bit idiosyncratic. This statistician writing in the Journal of Investigative Surgery illustrates how to do a post hoc power calculation: ref. He concludes, “Therefore, 80% power was not achieved for this sample. Using statistical software, the estimated power of this test is 0.343. Even though 80% power was not achieved with this data …” It is strange how these habits are confined to certain disease areas; it suggests statisticians should migrate between subject areas more.
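That 0.343 is another illustration of the p-value correspondence discussed above. Assuming a two-sided z-test for the sake of illustration (the cited example presumably used a different test, so this is only an approximation), you can back out the p-value that a reported “observed power” of 0.343 implies:

```python
# Back out the two-sided z-test p-value implied by an "observed power" of
# 0.343. Approximate only -- the cited example presumably used a different
# test -- but it shows post hoc power and the p-value are the same number
# wearing different clothes.
from scipy.stats import norm
from scipy.optimize import brentq

def observed_power_from_p(p, alpha=0.05):
    z_obs = norm.ppf(1 - p / 2)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

# Find the p-value whose observed power equals 0.343
p_implied = brentq(lambda p: observed_power_from_p(p) - 0.343, 1e-6, 0.999)
print(round(p_implied, 3))  # ~0.12
```

So “80% power was not achieved” is just a roundabout way of restating that the test was not significant.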
So who hasn’t seen their letters to the editor rejected? Ignored is perhaps the better term.
For the study mentioned, I also had some statistical and substantive queries that, not surprisingly, also didn’t see the light of day. If you’re interested, here it is:
Kim et al (1) conclude that “treatment with escitalopram compared with placebo resulted in a lower risk of major adverse cardiac events” in patients treated for 24 weeks following an ACS. Perhaps the authors could comment on the effect size (31% reduction), which is almost twice the average size of other positive interventions in cardiology. Some comment would also be appreciated regarding the fact that benefit is only seen 4-5 years after the intervention has concluded. Finally, given that the MACE outcome was not listed in the registered protocol (https://clinicaltrials.gov/ct2/show/NCT00419471) as either a primary or secondary outcome, how can readers be assured that the results are not the product of “p-hacking”?
While it is certainly possible that this is an overestimate of the effect size from a small(ish) trial, I don’t like the idea of raising suspicion simply because the effect size is large. The authors presented a CI for their findings; we should all be smart enough to read it and know that the point estimate is just that: an estimate based on the observed data.
I’m not sure where you’re getting this. The KM curves for the MACE endpoint in Figure 2 have diverged by 2 years and continue to widen thereafter. Furthermore, given that this is an intervention targeted at treatment of depressed patients to improve cardiac outcomes, it seems plausible that the effects of the intervention would not emerge immediately, and that it may carry on for some time (since “depression” -> “cardiac outcomes” is not a connection that happens immediately, but is more likely due to chronic exposure to a number of things - depressed patients eating poorly, not exercising well, not sleeping well, and any number of other sequelae of depression).
Now, this would seem a legitimate concern if there were some evidence that switching the primary outcome turned a negative result into a positive one, but I think you may be digging for trouble where there is none (although I cannot be sure) and also using the “p-hacking” gun a little too liberally. The outcomes listed on that page were all depression-related outcomes. The authors did previously publish the findings for those outcomes, here:
This seems to be the trial’s primary results paper, with all of the outcomes listed in the clinicaltrials.gov record you found. The authors were not trying to obscure their primary results.
The additional paper with the MACE outcome appears to be a secondary paper from the same trial reporting the long-term cardiac outcomes. This is not uncommon; if you’ve gone to the trouble to recruit and enroll a trial cohort, and you’re able to continue following them and publishing longer-term outcomes, that seems to me a good use of resources.
Personally, I’d be more concerned about p-hacking if the original stated primary outcome had been something like “cardiovascular mortality” and showed no/little benefit but then there was a benefit on a composite MACE outcome. That wasn’t the case here. The authors reported all of their results related to depressive symptoms and anxiety (the trial’s primary and secondary outcomes that appeared in the initial protocol). I see no major problem with subsequently reporting the cardiac outcomes (since it was a population of ACS patients that were enrolled…)
It is excellent practice to compare the results obtained with the assumptions made at the planning stage. However, as others have rightly pointed out, to re-assess power retrospectively by taking the observed effect size as the true value is a total fiction. In fact, posterior power is best regarded as merely a rescaling of the p-value. For example, assume that the study was designed to have 80% power to detect a specified size of difference using a test at the conventional 2-sided 5% alpha level. Then, if the parameter estimates from the data are just what was assumed in advance, we expect to end up with z or t around 2.8, not 2, because the planned power was 80%, not 50%. Or, if the eventual test is chi-square, this would be around 8 instead of 4. A retrospective power of 80% simply means that p was about 0.005.
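A quick check of that arithmetic (1.96 and 0.84 being the standard normal quantiles for two-sided alpha = 0.05 and 80% power):

```python
# Verify: with design power of 80% at two-sided alpha = 0.05, hitting the
# planning assumptions exactly gives z ~ 2.8, chi-square ~ 8, p ~ 0.005.
from scipy.stats import norm

z_alpha = norm.ppf(0.975)             # 1.96
z_power = norm.ppf(0.80)              # 0.84
z_expected = z_alpha + z_power        # ~2.80, the z we expect if the design
                                      # assumptions hold exactly
p_expected = 2 * norm.sf(z_expected)  # ~0.005, not 0.05
chisq_expected = z_expected ** 2      # ~7.8, vs the 3.84 cutoff at the 5% level
print(z_expected, p_expected, chisq_expected)
```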