It’s amazing how statisticians can spend 20 years writing papers that explain this, and yet the fiction of a retrospective power calculation using the observed effect size lives on in clinical journals as a thing people think is not only useful, but necessary. Even after it’s been explained to them…the response came and showed that they had learned nothing:
“We fully understand that P value and post hoc power based on observed effect size are mathematically redundant; however, we would point out that being redundant is not the same as being incorrect. As such, we believe that post hoc power is not wrong, but instead a necessary first assistant in interpreting results.”
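The redundancy conceded in that reply can be made concrete. The following is a minimal sketch, assuming a two-sided z test at alpha = 0.05 and assuming the observed effect is plugged in as if it were the true effect; the function name `observed_power_from_p` is made up for illustration and does not come from the paper or the reply.

```python
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def observed_power_from_p(p_value, alpha=0.05):
    """'Observed' power of a two-sided z test, computed from the p value alone.

    Assumption: the observed effect is treated as the true effect, which is
    what a post hoc power calculation based on the observed effect size does.
    """
    z_obs = _norm.inv_cdf(1 - p_value / 2)  # |z| implied by the p value
    z_crit = _norm.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    # Probability of landing in either rejection region if |z| = z_obs were true
    return _norm.cdf(z_obs - z_crit) + _norm.cdf(-z_obs - z_crit)


for p in (0.01, 0.05, 0.20, 0.50, 0.98):
    print(f"p = {p:.2f} -> observed power = {observed_power_from_p(p):.2f}")
```

Under these assumptions, a p value of exactly 0.05 maps to an observed power of about 50%, and every non-significant p value maps to something lower, so the calculation cannot tell the reader anything the p value has not already said.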
@zad and I have just submitted a letter in reply to the surgeon’s double-down. If the journal accepts, I will link; if the journal declines, I will post the full content here, possibly with a guest post on someone’s blog as well.
I’ll have to disagree a bit. Non-plausibility of effect size can be a reason to cast doubt. As John Ioannidis published years ago, don’t trust an odds ratio above 3.0 in a genetic study.
I can’t agree entirely. If you agree that what matters is the false positive risk (FPR) rather than the p value, the estimation of the false positive risk depends on (your best estimate of) the power of the experiment. See Fig 1 in https://arxiv.org/abs/1802.04888 (in press, American Statistician).
(I guess for consistency, power should be redefined in terms of FPR = 0.05 rather than p = 0.05, but that is for another day.)
While agreeing in principle with most of what’s been written here, there is also the question of what happens in practice. It’s easy to examine this by simulation. Take the example in section 5.1 of my 2017 paper.
If one looks at a lot of simulated t tests, with theoretical power 0.78, and looks only at those which come out with 0.045 < p < 0.05 (i.e. we use the p-equals case), then the theoretical likelihood ratio in favour of H1 is 2.76, so the false positive risk for prior odds = 1 is 1/(1 + 2.76) = 26.6%.
In real life, we don’t know the true effect size etc., but if, for each simulated experiment, we calculate the observed likelihood ratio from the observed effect size and standard deviation in that experiment, the likelihood ratio comes out not at 2.76, but at 3.64 (corresponding to a false positive risk of 1/(1 + 3.64) = 21.6%). Although this is not the same as the theoretical result, it’s close enough that the interpretation would be much the same in practice.
So calculation based on the observed results, though wrong in principle, would be close enough in practice (in this case) to give a sensible result.
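The arithmetic behind those two false positive risks can be sketched as below. This is only the conversion from a likelihood ratio to a false positive risk, not the simulation itself; the function name `false_positive_risk` and the general `prior_odds` argument are assumptions added for illustration (the post uses prior odds = 1).

```python
def false_positive_risk(likelihood_ratio, prior_odds=1.0):
    """False positive risk from a likelihood ratio in favour of H1.

    posterior odds (H1:H0) = likelihood ratio * prior odds (H1:H0)
    FPR = P(H0 | data) = 1 / (1 + posterior odds)
    """
    posterior_odds = likelihood_ratio * prior_odds
    return 1.0 / (1.0 + posterior_odds)


# Theoretical likelihood ratio from the post
print(round(false_positive_risk(2.76), 3))  # 0.266
# Likelihood ratio computed from the observed effect size and SD
print(round(false_positive_risk(3.64), 3))  # 0.216
```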
This has been a great thread, and I think @ADAlthousePhD and @zad have really brought up some great points. My question relates to non-significant findings. An example is a study I recently completed. It was retrospective and based on a large, fixed sample. Is it reasonable to calculate power using sample size, alpha, and what I thought was the most clinically reasonable effect size prior to running the analysis? The reasoning is that I keep hearing, “oh you didn’t find a difference, well you are underpowered to see a difference.” I would love to hear what you all think about this.
@zad and I actually submitted a letter in response to that surgical journal which addresses this to some extent. The letter seems to still be under consideration, but I will post a few pruned excerpts which may help crystallize this for you:
"Consider a parallel-group randomized trial with a very large sample size of 100,000 participants (exactly 50,000 patients per treatment arm). Suppose that 25,000 participants (exactly 50%) experience the primary outcome in the arm receiving standard of care. Suppose that 24,995 participants (exactly 49.99%) experience the primary outcome in the arm receiving experimental therapy. This primary comparison will return a non-significant result (p=0.98) suggesting that there is no difference between the treatment groups. Before looking ahead, we can all agree that 100,000 participants with an event rate of 50% seems like it would be very well-powered, yes?
We perform a post-hoc power calculation as prescribed, using the study sample size and observed effect size and see that the study has only 5% statistical power. Ah-ha! Maybe the null result isn’t evidence of no effect – after all, the study was simply “underpowered” to detect an effect…or was it?
Was this study actually underpowered to detect a clinically relevant effect? Of course not. A trial with 100,000 participants would have 88.5% power if the true effect even reduced the event rate from 50% with standard of care to 49% with experimental therapy (a mere 1% absolute reduction).
A more sensible alternative: instead of conducting a post-hoc power calculation based on observed effect size, the authors could show what a power calculation would have looked like with their chosen sample size and some small-to-moderate effect size that would be notable enough to adopt the therapy in clinical practice (essentially, the “most clinically reasonable effect size” that you mention here, @TFeend ).
How would that look in practice? Let’s now consider a trial with 1,000 patients (exactly 500 per treatment arm). In the standard of care, suppose that 250 participants (exactly 50%) experienced the primary outcome. In the experimental arm, suppose that 249 participants (49.8%) experienced the primary outcome. Again, this comparison will return a non-significant result (p=0.95) suggesting that there is no difference between the treatment groups. Again, a post-hoc power calculation based upon the observed effect size will show that the study had extremely low statistical power (about 5%).
This information is virtually useless; a more useful alternative would be computing statistical power that the sample size would have for some clinically relevant effect size. Suppose that we would want to adopt this therapy if it reduced the true event rate from 50% to 40%. A hypothetical trial with 1,000 participants (500 per arm) would have 89% power to detect such a difference, and therefore we may conclude that this initial trial was not “underpowered” to detect a clinically relevant effect – the trial was in fact reasonably well-powered to detect such an effect."
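For anyone who wants to check the numbers in these excerpts, here is a rough sketch using the normal approximation for a two-sided comparison of two proportions with equal arms and alpha = 0.05. The function names are hypothetical, and the letter does not say which software or formula was used, so small differences from other tools (continuity corrections, arcsine transformations) are to be expected.

```python
from math import sqrt
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def two_prop_p_value(events_1, events_2, n_per_arm):
    """Two-sided p value from a pooled z test of two proportions (equal arms)."""
    p1, p2 = events_1 / n_per_arm, events_2 / n_per_arm
    pooled = (events_1 + events_2) / (2 * n_per_arm)
    se = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    z = abs(p1 - p2) / se
    return 2 * (1 - _norm.cdf(z))


def two_prop_power(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z test to detect p1 versus p2."""
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    shift = abs(p1 - p2) / se
    return _norm.cdf(shift - z_crit) + _norm.cdf(-shift - z_crit)


# Trial of 100,000 (50,000 per arm)
print(two_prop_p_value(25_000, 24_995, 50_000))  # observed comparison
print(two_prop_power(0.50, 0.4999, 50_000))      # "observed" power
print(two_prop_power(0.50, 0.49, 50_000))        # power for 50% vs 49%

# Trial of 1,000 (500 per arm)
print(two_prop_p_value(250, 249, 500))           # observed comparison
print(two_prop_power(0.50, 0.498, 500))          # "observed" power
print(two_prop_power(0.50, 0.40, 500))           # power for 50% vs 40%
```

The only thing that changes between the "observed" power and the useful power calculation is the effect size that gets plugged in: the observed difference in the first case, a clinically relevant difference chosen on substantive grounds in the second.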
I also want to return to the point that power calculations are made using other wrong assumptions, such as overestimating average event probabilities and underestimating variance of continuous Y. There are many reasons that power is merely a pre-study issue, if even good then.
I do appreciate those concerns. In the between-a-rock-and-a-hard-place position that Tim (and others, including myself) may find ourselves in - a paper returned with a reviewer comment that wishes to dismiss a finding of “no relationship” by stating “your study must be underpowered, please include a post-hoc power calculation” - do you consider the approach suggested above at all reasonable?
I can appreciate the hand-wavy nature of power calculations before the study as well, but surely you see some value in estimating the number of patients needed for something - even if not a dichotomous decision rule, the ability to estimate an effect within some amount of precision? I am often concerned about the accuracy in the power calculations that I provide to collaborators, but I would prefer that they be based on something rather than simply recruiting whatever number of patients they think is enough (they’d almost always underestimate, I suspect).
I really like this example. Helps me (non-expert) to better understand the concepts being discussed. How similar is this to doing a post-hoc (conditional) equivalence test with a minimally important difference?
My suggestion is also to always look to the compatibility intervals! That would be a far more fruitful effort than trying to determine whether your study was underpowered or not. It is often difficult to know whether you have a nonsignificant finding because the difference truly is indistinguishable from zero or because your study was underpowered. One way to reduce your uncertainty about whether the test hypothesis is really true is equivalence testing (if you must test hypotheses); alternatively, you could plan studies based on precision, so that the computed compatibility interval (or a function of it) is closely clustered around certain values.
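As a small illustration of that suggestion, here is a sketch that computes a 95% compatibility interval for the risk difference in the 1,000-patient example quoted earlier in the thread (250/500 versus 249/500). It uses a simple Wald interval, which is an assumption made here for brevity; the function name `risk_difference_interval` is hypothetical, and other interval methods (e.g. Newcombe) would give slightly different limits.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_interval(events_exp, n_exp, events_std, n_std, level=0.95):
    """Wald compatibility interval for the risk difference (experimental - standard)."""
    p_exp, p_std = events_exp / n_exp, events_std / n_std
    diff = p_exp - p_std
    se = sqrt(p_exp * (1 - p_exp) / n_exp + p_std * (1 - p_std) / n_std)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return diff - z * se, diff + z * se


lower, upper = risk_difference_interval(249, 500, 250, 500)
print(f"risk difference 95% interval: {lower:.3f} to {upper:.3f}")
```

With these inputs the interval runs from roughly a 6.4 percentage point reduction to a 6.0 percentage point increase, so a 10 percentage point reduction (50% to 40%) lies outside it. That speaks directly to which effect sizes the data are compatible with, and no power calculation is needed to read it.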
Unfortunately, it seems that the issue of post-hoc power / “observed” power just won’t die. Apologies to those that follow me on Twitter who are already aware of this, as I’ve been hammering this article for the last day or two, but a group of surgeons have been promoting the nonsensical idea that all studies which report negative results should include a post-hoc power calculation using the observed effect size, an idea which the statistical community has written a number of articles about over the years:
If you read this article and are as troubled as I am, I encourage you to also leave a comment at PubPeer. This article is wrong about nearly everything it says (the initial premise that “absence of evidence is not evidence of absence” and giving more thought to power/sample size, I am fine with; however, the entire applied portion is wrong as well as the subsequent recommendation that all studies be reported with an “observed” power, for reasons laid out in my PubPeer comments).
My PubPeer reply references several excellent resources which you may be interested in should you ever be pushed on this issue of “observed” power in a completed study:
I am so glad you are fighting this battle! And your point on Twitter, that a paper all about statistics did not get statistical peer review, is really concerning.
I actually emailed the Statistical Editor of the journal directly to ask about this, and he told me that he had not been sent this paper for review and did not know about it until I showed it to him, which confirms that a clinical medicine journal published an article entirely about the practice of statistics without getting any statistical input. It’s the equivalent of Statistics in Medicine publishing a paper about surgical technique without getting any surgeons to review the paper.
EDIT: also, the reason I am hopeful more people will leave PubPeer comments is my fear that the journal will think this is just another nuanced debate which requires “both sides” to have their say, when this is not a “both sides” issue. I do not want the editor to say “Thanks for raising your concerns, we will let you publish a nicely worded 500 word letter summarizing them” and leave this exchange back and forth in the literature. The paper is flat wrong, it’s misleading, and it should be removed from the literature. Otherwise, it may lead to scores of subsequent surgery papers that showed minimal benefits of a treatment citing this paper, performing the misguided calculation of “observed power” and saying “We cannot rule out the lack of treatment effect; our study had an observed power of just 15 percent. A larger study may have shown benefit” (which they do not realize is basically the same as saying “if the treatment worked better, we would have shown that it worked”)
Thank you. The more people that chime in at PubPeer, the greater chance the editor will be inclined to realize what a flawed paper this is and that the field will be best served if it is removed from the literature. I urge all of you to consider leaving a comment with your interpretation of the article.
I realize the point is targeted toward clinical or general medical journals, but I’d also give a shout-out to Epidemiology, which has a lovely policy of considering the publication of letters that were rejected elsewhere.
Please realize that @EpidemiologyLWW is willing to consider letters that were rejected by editors of other journals, a policy first announced by founding editor @ken_rothman in 1993. Rejected letters should be submitted with evidence of the original journal’s rejection.
That is a great policy. In the case of particularly obstinate journals, well worth considering.
In the case of the post-hoc power article that I have posted about most recently, retraction (rather than an exchange of letters) is what’s desperately needed. Even if our letter(s) pointing out the flaws get published, the original paper will remain in the literature and people unaware of the flaws will continue to read it and think “Gee, what a great idea!”
I totally agree regarding the post-hoc power article you have been (thankfully) bringing attention to. That deserves retracting and I hope that is the end result.
I should have been more clear that I was suggesting submitting to Epid as a potential alternative for the letter that @zad had rejected by JAMA. Gotta work on my tagging skillz.