This has been a great thread, and I think @ADAlthousePhD and @zad have really brought up some great points. My question relates to non-significant findings. An example is a study I recently completed. It was retrospective and based on a large, fixed sample. Is it reasonable to calculate power using sample size, alpha, and what I thought was the most clinically reasonable effect size prior to running the analysis? The reasoning is that I keep hearing, “oh you didn’t find a difference, well you are underpowered to see a difference.” I would love to hear what you all think about this.

# "Observed Power" and other "Power" Issues

Tim - that’s a great question.

@zad and I actually submitted a letter responding to that surgical journal article, and it addresses this to some extent. The letter seems to still be under consideration, but I will post a few pruned excerpts which may help crystallize this for you:

"Consider a parallel-group randomized trial with a very large sample size of 100,000 participants (exactly 50,000 patients per treatment arm). Suppose that 25,000 participants (exactly 50%) experience the primary outcome in the arm receiving standard of care. Suppose that 24,995 participants (exactly 49.99%) experience the primary outcome in the arm receiving experimental therapy. This primary comparison will return a non-significant result (p=0.98) suggesting that there is no difference between the treatment groups. Before looking ahead, we can all agree that 100,000 participants with an event rate of 50% seems like it would be very well-powered, yes?

We perform a post-hoc power calculation as prescribed, using the study sample size and observed effect size and see that the study has only 5% statistical power. Ah-ha! Maybe the null result isn’t evidence of no effect – after all, the study was simply “underpowered” to detect an effect…or was it?

Was this study *actually* underpowered to detect a *clinically relevant* effect? Of course not. A trial with 100,000 participants would have 88.5% power even if the true effect only reduced the event rate from 50% with standard of care to 49% with experimental therapy (a mere 1% absolute reduction).

A more sensible alternative: instead of conducting a post-hoc power calculation based on *observed* effect size, the authors could show what a power calculation would have looked like with their chosen sample size and some small-to-moderate effect size that would be notable enough to adopt the therapy in clinical practice (essentially, the “most clinically reasonable effect size” that you mention here, @TFeend ).

How would that look in practice? Let’s now consider a trial with 1,000 patients (exactly 500 per treatment arm). In the standard-of-care arm, suppose that 250 participants (exactly 50%) experienced the primary outcome. In the experimental arm, suppose that 249 participants (49.8%) experienced the primary outcome. Again, this comparison will return a non-significant result (p=0.95) suggesting that there is no difference between the treatment groups. Again, a post-hoc power calculation based upon the *observed* effect size will show that the study had extremely low statistical power (about 5%).

This information is virtually useless; a more useful alternative would be computing statistical power that the sample size would have for some clinically relevant effect size. Suppose that we would want to adopt this therapy if it reduced the true event rate from 50% to 40%. A hypothetical trial with 1,000 participants (500 per arm) would have 89% power to detect such a difference, and therefore we may conclude that this initial trial was not “underpowered” to detect a *clinically relevant* effect – the trial was in fact reasonably well-powered to detect such an effect."
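The arithmetic in these excerpts is easy to check. Here is a minimal sketch (my own, not from the letter), using the usual normal approximation for a two-sided two-proportion z-test, that reproduces the ~5% “observed power” figure as well as the 88.5% and 89% figures:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_prop(p1, p2, n_per_arm):
    """Approximate power of a two-sided two-sample proportion z-test
    at alpha = 0.05, using the normal approximation."""
    z_crit = 1.959964  # two-sided critical value for alpha = 0.05
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z = abs(p1 - p2) / se
    return phi(z - z_crit) + phi(-z - z_crit)

# "Observed power" from the observed effect (50.00% vs 49.99%, n = 50,000/arm)
print(power_two_prop(0.50, 0.4999, 50_000))  # ~0.05 -- essentially just alpha
# Power for the clinically relevant 1% absolute reduction
print(power_two_prop(0.50, 0.49, 50_000))    # ~0.885
# The 1,000-patient example: power for a 50% -> 40% reduction
print(power_two_prop(0.50, 0.40, 500))       # ~0.89
```

Note that the “observed power” of a just-barely-null result is always close to alpha, which is one way to see why computing it adds no information beyond the p-value itself.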

Hopefully that is helpful.

I also want to return to the point that power calculations are made using other wrong assumptions, such as overestimating average event probabilities and underestimating the variance of a continuous Y. There are many reasons that power is merely a pre-study issue, if it is even useful then.

I do appreciate those concerns. In the stuck-between-a-rock-and-a-hard-place position that Tim (and others, including myself) may find ourselves in - a paper returned with a reviewer comment that tries to dismiss a finding of “no relationship” by stating “your study must be underpowered, please include a post-hoc power calculation” - do you consider the approach suggested above at all reasonable?

I can appreciate the hand-wavy nature of power calculations before the study as well, but surely you see some value in estimating the number of patients needed for something - even if not a dichotomous decision rule, the ability to estimate an effect within some amount of precision? I am often concerned about the accuracy in the power calculations that I provide to collaborators, but I would prefer that they be based on *something* rather than simply recruiting whatever number of patients they think is enough (they’d almost always underestimate, I suspect).

I really like this example. It helps me (a non-expert) better understand the concepts being discussed. How similar is this to doing a post-hoc (conditional) equivalence test with a minimally important difference?

Yes I do find it reasonable, especially if you footnote it with a caveat or two. Even better would be to stick with a compatibility interval.
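To make the compatibility-interval suggestion concrete, here is a small sketch (mine, not the posters’; a simple Wald interval, applied to the 1,000-patient numbers from the letter excerpt above):

```python
from math import sqrt

def wald_ci_diff(x1, n1, x2, n2, z=1.959964):
    """95% Wald compatibility (confidence) interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z * se, d + z * se

# 250/500 events with standard of care vs 249/500 with experimental therapy
lo, hi = wald_ci_diff(250, 500, 249, 500)
print(f"95% CI for the risk difference: ({lo:.3f}, {hi:.3f})")
```

The interval runs from roughly -0.06 to +0.06: it excludes the 10-point reduction deemed clinically relevant, so the trial is directly informative about that effect size despite the non-significant p-value, and no power calculation is needed to say so.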

I still believe this paper by Hoenig & Heisey is one of the best on statistical power:

https://doi.org/10.1198/000313001300339897

along with this paper by @Sander

My suggestion is also to always look at the compatibility intervals! That would be a far more fruitful effort than trying to determine whether your study was underpowered. It is often difficult to know whether a nonsignificant finding reflects a truly negligible difference or an underpowered study. One way to reduce your uncertainty about whether the test hypothesis is really true is equivalence testing (if you must test hypotheses); alternatively, you could plan studies based on precision, so that the computed compatibility interval (or a function of it) is closely clustered around certain values.
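As a sketch of what the equivalence-testing route might look like (my own illustration, not from any post here): the two one-sided tests (TOST) procedure applied to the 1,000-patient example, with an assumed equivalence margin of ±5 percentage points.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def tost_two_prop(x1, n1, x2, n2, margin):
    """TOST for a difference in proportions: equivalence is declared when
    both one-sided p-values are small. Returns the larger p-value."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    p_upper = phi((d - margin) / se)        # H0: p1 - p2 >= +margin
    p_lower = 1.0 - phi((d + margin) / se)  # H0: p1 - p2 <= -margin
    return max(p_upper, p_lower)

# 250/500 vs 249/500 with an equivalence margin of +/- 0.05
print(tost_two_prop(250, 500, 249, 500, 0.05))  # ~0.065
```

With this (assumed) margin the TOST p-value is about 0.06, so the data fall just short of declaring equivalence at the 5% level - a far more informative summary than “p = 0.95, possibly underpowered.”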

Unfortunately, it seems that the issue of post-hoc power / “observed” power just won’t die. Apologies to those that follow me on Twitter who are already aware of this, as I’ve been hammering this article for the last day or two, but a group of surgeons have been promoting the nonsensical idea that all studies which report negative results should include a post-hoc power calculation using the observed effect size, an idea which the statistical community has written a number of articles about over the years:

I have provided a lengthy reply at PubPeer:

https://pubpeer.com/publications/4399282A80691D9421B497E8316CF6#2

If you read this article and are as troubled as I am, I encourage you to also leave a comment at PubPeer. This article is wrong about nearly everything it says (I am fine with the initial premise that “absence of evidence is not evidence of absence” and with giving more thought to power/sample size; however, the entire applied portion is wrong, as is the subsequent recommendation that all studies be reported with an “observed” power, for reasons laid out in my PubPeer comments).

My PubPeer reply references several excellent resources which you may be interested in should you ever be pushed on this issue of “observed” power in a completed study.

I am so glad you are fighting this battle! And your point on Twitter - that a paper all about statistics did not get statistical peer review - is really concerning.

I actually emailed the Statistical Editor of the journal directly to ask about this, and he told me that he had not been sent this paper for review and did not know about it until I showed it to him, which confirms that a clinical medicine journal published an article entirely about the practice of statistics without getting any statistical input. It’s the equivalent of Statistics in Medicine publishing a paper about surgical technique without getting any surgeons to review the paper.

EDIT: also, the reason I am hopeful more people will leave PubPeer comments is my fear that the journal will think this is just another nuanced debate which requires “both sides” to have their say, when this is not a “both sides” issue. I do not want the editor to say “Thanks for raising your concerns, we will let you publish a nicely worded 500 word letter summarizing them” and leave this exchange back and forth in the literature. The paper is flat wrong, it’s misleading, and it should be removed from the literature. Otherwise, it may lead scores of subsequent surgery papers that show minimal benefit of a treatment to cite this paper, perform the misguided calculation of “observed power,” and say “We cannot rule out the lack of treatment effect; our study had an observed power of just 15 percent. A larger study may have shown benefit” (which they do not realize is basically the same as saying “if the treatment worked better, we would have shown that it worked”).

I added a brief comment. Thanks for this!

Thank you. The more people that chime in at PubPeer, the greater chance the editor will be inclined to realize what a flawed paper this is and that the field will be best served if it is removed from the literature. I urge all of you to consider leaving a comment with your interpretation of the article.

I realize the point is targeted toward clinical or general medical journals, but I’d also give a shout-out to Epidemiology, which has a lovely policy of considering the publication of letters that were rejected elsewhere.

The current editor (Tim Lash) mentioned this on twitter some time ago: https://twitter.com/timothylash/status/979335099920670720

Please realize that @EpidemiologyLWW is willing to consider letters that were rejected by editors of other journals, a policy first announced by founding editor @ken_rothman in 1993. Rejected letters should be submitted with evidence of the original journal’s rejection.

That is a great policy. In the case of particularly obstinate journals, well worth considering.

In the case of the post-hoc power article that I have posted about most recently, retraction (rather than an exchange of letters) is what’s desperately needed. Even if our letter(s) pointing out the flaws get published, the original paper will remain in the literature and people unaware of the flaws will continue to read it and think “Gee, what a great idea!”

I totally agree regarding the post-hoc power article you have been (thankfully) bringing attention to. That deserves retracting and I hope that is the end result.

I should have been more clear that I was suggesting submitting to Epid as a potential alternative for the letter that @zad had rejected by JAMA. Gotta work on my tagging skillz.

No problem at all. I know what you meant, and am thankful you brought it to attention, as I may utilize in the future if journals demur on publishing letters that point out major flaws in their publications. I do hope to persuade editors in some cases that “look, publishing a letter would be better for my CV, but a retraction would be better for science.”

Apologies that we are a little off-topic on “Observed Power.” Thus ends this threadjack.

I’d like to offer a challenge that might be helpful in reaching your true audience: surgeons. I cannot believe this issue is so complex as to require such a lengthy and elaborate discussion. I have to think that you could build a Shiny app, and make a 140-second (tweetable) video that demolishes *post hoc* power for all time. (This might make a nice project for a graduate student, who could even gain some social-media exposure out of it, plus the opportunity to develop and demonstrate statistical teaching/communication/outreach skills.)

Even surgeons and surgeons-in-training will have time for a 140-second video, especially if it criticizes some of their prominent colleagues. As an example of what’s possible in this medium, under its severe constraints, consider this 127-second video produced by yours truly.

In some ways, this has already been done (maybe not the “video” part, but blog posts with a single Figure/simulation that explain it). The problem - at least insofar as I can see - is that understanding that single Figure or simulation still requires an understanding of what statistical power is in the first place, and absent that understanding, no single Figure or simulation gets it across. So I do not believe that I can meet this goal, or at least I am not easily capable of it:

I would be more than delighted if you are able to do this. I honestly don’t think that I can.

The surgeons’ interpretation is clearly wrong, but they’ve also been told that in myriad different ways by at least four different replies to their initial publication, and the response was essentially “Thanks, but we’re still right.” If you’ve got a better way to convey this, by all means, go for it.

I guess that’s throwing down the gauntlet, then, and I’ll have to look into it properly. Would you offer a link to the single, most concise—even if ‘incomplete’—treatment you’ve seen so far?