"Observed Power" and other "Power" Issues




It’s amazing how statisticians can spend 20 years writing papers that explain this, and yet the fiction of a retrospective power calculation using the observed effect size lives on in clinical journals as a thing people think is not only useful, but necessary. Even after it’s been explained to them…the response came and showed that they had learned nothing:


“We fully understand that P value and post hoc power based on observed effect size are mathematically redundant; however, we would point out that being redundant is not the same as being incorrect. As such, we believe that post hoc power is not wrong, but instead a necessary first assistant in interpreting results.”


@zad and I have just submitted a letter in reply to the surgeon’s double-down. If the journal accepts, I will link; if the journal declines, I will post the full content here, possibly with a guest post on someone’s blog as well.


I’ll have to disagree a bit. Non-plausibility of effect size can be a reason to cast doubt. As John Ioannidis published years ago, don’t trust an odds ratio above 3.0 in a genetic study.


I can’t agree entirely. If you agree that what matters is the false positive risk (FPR) rather than the p value, the estimation of the false positive risk depends on (your best estimate of) the power of the experiment. See Fig 1 in https://arxiv.org/abs/1802.04888 (in press, American Statistician).
(I guess for consistency, power should be redefined in terms of FPR = 0.05 rather than p = 0.05, but that is for another day.)


While agreeing in principle with most of what’s been written here, there is also the question of what happens in practice. It’s easy to examine this by simulation. Take the example in section 5.1 of my 2017 paper.
If one looks at a lot of simulated t tests, with theoretical power 0.78, and looks only at those which come out with 0.045 < p < 0.05 (i.e. we use the p-equals case) then the theoretical likelihood ratio in favour of H1 is 2.76, so the false positive risk for prior odds = 1 is 1/(1 + 2.76) = 26.6%.
In real life, we don’t know the true effect size etc., but if, for each simulated experiment, we calculate the observed likelihood ratio from the observed effect size and standard deviation in that experiment, the likelihood ratio comes out not at 2.76, but at 3.64 (corresponding to a false positive risk of 1/(1 + 3.64) = 21.6%). Although this is not the same as the theoretical result, it’s close enough that the interpretation would be much the same in practice.

So a calculation based on the observed results, though wrong in principle, would be close enough in practice (in this case) to give a sensible result.
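
The arithmetic in this post can be checked in a few lines. This is only a minimal sketch, assuming the relation used above (false positive risk = 1 / (1 + likelihood ratio × prior odds)). The function name is hypothetical, and the two likelihood ratios are simply the values quoted in the post, not recomputed from a simulation.

```python
def false_positive_risk(likelihood_ratio, prior_odds=1.0):
    """False positive risk implied by a likelihood ratio in favour of H1.

    prior_odds is the prior odds of H1 versus H0 (1.0 means 50:50).
    """
    posterior_odds = likelihood_ratio * prior_odds
    return 1.0 / (1.0 + posterior_odds)


# Likelihood ratios quoted in the post (theoretical vs. observed).
for label, lr in [("theoretical", 2.76), ("observed", 3.64)]:
    print(f"{label}: LR = {lr}, FPR = {false_positive_risk(lr):.1%}")
```

With prior odds of 1 this reproduces the 26.6% and 21.6% figures; changing `prior_odds` shows how strongly the result depends on that assumption.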


This has been a great thread, and I think @ADAlthousePhD and @zad have really brought up some great points. My question relates to non-significant findings. An example is a study I recently completed. It was retrospective and based on a large, fixed sample. Is it reasonable to calculate power using sample size, alpha, and what I thought was the most clinically reasonable effect size prior to running the analysis? The reasoning is that I keep hearing, “oh you didn’t find a difference, well you are underpowered to see a difference.” I would love to hear what you all think about this.


Tim - that’s a great question.

@zad and I actually submitted a letter in response to that surgical journal which addresses this to some extent. The letter seems to still be under consideration, but I will post a few pruned excerpts which may help crystallize this for you:

"Consider a parallel-group randomized trial with a very large sample size of 100,000 participants (exactly 50,000 patients per treatment arm). Suppose that 25,000 participants (exactly 50%) experience the primary outcome in the arm receiving standard of care. Suppose that 24,995 participants (exactly 49.99%) experience the primary outcome in the arm receiving experimental therapy. This primary comparison will return a non-significant result (p=0.98) suggesting that there is no difference between the treatment groups. Before looking ahead, we can all agree that 100,000 participants with an event rate of 50% seems like it would be very well-powered, yes?

We perform a post-hoc power calculation as prescribed, using the study sample size and observed effect size and see that the study has only 5% statistical power. Ah-ha! Maybe the null result isn’t evidence of no effect – after all, the study was simply “underpowered” to detect an effect…or was it?

Was this study actually underpowered to detect a clinically relevant effect? Of course not. A trial with 100,000 participants would have 88.5% power if the true effect even reduced the event rate from 50% with standard of care to 49% with experimental therapy (a mere 1% absolute reduction).

A more sensible alternative: instead of conducting a post-hoc power calculation based on observed effect size, the authors could show what a power calculation would have looked like with their chosen sample size and some small-to-moderate effect size that would be notable enough to adopt the therapy in clinical practice (essentially, the “most clinically reasonable effect size” that you mention here, @TFeend).

How would that look in practice? Let’s now consider a trial with 1,000 patients (exactly 500 per treatment arm). In the standard of care, suppose that 250 participants (exactly 50%) experienced the primary outcome. In the experimental arm, suppose that 249 participants (49.8%) experienced the primary outcome. Again, this comparison will return a non-significant result (p=0.95) suggesting that there is no difference between the treatment groups. Again, a post-hoc power calculation based upon the observed effect size will show that the study had extremely low statistical power (about 5%).

This information is virtually useless; a more useful alternative would be computing statistical power that the sample size would have for some clinically relevant effect size. Suppose that we would want to adopt this therapy if it reduced the true event rate from 50% to 40%. A hypothetical trial with 1,000 participants (500 per arm) would have 89% power to detect such a difference, and therefore we may conclude that this initial trial was not “underpowered” to detect a clinically relevant effect – the trial was in fact reasonably well-powered to detect such an effect."
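
For anyone who wants to check the figures in the excerpt, here is a rough sketch using only the Python standard library. It assumes a two-sided z-test for two proportions with a normal approximation and an unpooled standard error. The letter does not say which method was used, so that choice, like the function name, is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treated, n_per_arm, alpha=0.05):
    """Approximate two-sided power of a z-test comparing two proportions."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Unpooled standard error of the difference in proportions.
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treated * (1 - p_treated) / n_per_arm)
    z_effect = abs(p_control - p_treated) / se
    # Probability of rejecting in either tail.
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)


scenarios = [
    ("100,000 pts, observed effect (50% vs 49.99%)", 0.50, 0.4999, 50_000),
    ("100,000 pts, relevant effect (50% vs 49%)", 0.50, 0.49, 50_000),
    ("1,000 pts, observed effect (50% vs 49.8%)", 0.50, 0.498, 500),
    ("1,000 pts, relevant effect (50% vs 40%)", 0.50, 0.40, 500),
]
for label, p_c, p_t, n in scenarios:
    print(f"{label}: power = {power_two_proportions(p_c, p_t, n):.1%}")
```

Under these assumptions, the two "observed effect" rows come out at roughly 5% and the two "relevant effect" rows at roughly 88.5% and 89%, matching the excerpt. Power computed at the observed effect is close to alpha whenever the observed difference is close to zero, whatever the sample size.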

Hopefully that is helpful.


I also want to return to the point that power calculations are made using other wrong assumptions, such as overestimating average event probabilities and underestimating variance of continuous Y. There are many reasons that power is merely a pre-study issue, if even good then.


I do appreciate those concerns. In the stuck-between-rock-and-hard place that Tim (and others, including myself) may find ourselves in - a paper returned with a reviewer comment that wishes to dismiss a finding of “no relationship” by stating “your study must be underpowered, please include a post-hoc power calculation” - do you consider the approach suggested above at all reasonable?

I can appreciate the hand-wavy nature of power calculations before the study as well, but surely you see some value in estimating the number of patients needed for something - even if not a dichotomous decision rule, the ability to estimate an effect within some amount of precision? I am often concerned about the accuracy in the power calculations that I provide to collaborators, but I would prefer that they be based on something rather than simply recruiting whatever number of patients they think is enough (they’d almost always underestimate, I suspect).


I really like this example. Helps me (non-expert) to better understand the concepts being discussed. How similar is this to doing a post-hoc (conditional) equivalence test with a minimally important difference?


Yes I do find it reasonable, especially if you footnote it with a caveat or two. Even better would be to stick with a compatibility interval.


I still believe this is one of the best papers on statistical power by Hoenig & Heisey


along with this paper by @Sander


My suggestion is also to always look to the compatibility intervals! That would be a far more fruitful effort than trying to determine whether your study was underpowered or not. It is often difficult to know whether the reason you have a nonsignificant finding is because of a true indistinguishable difference or because your study was underpowered. A way to reduce your uncertainty about whether the test hypothesis is really true is by utilizing equivalence testing (if you must test hypotheses), or you could attempt to plan studies based on precision so that the computed compatibility interval (or a function of it) is closely clustered around certain values.
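
As a rough illustration of the compatibility-interval approach, the sketch below computes a Wald interval for the risk difference in the 1,000-patient example from earlier in the thread (250/500 vs 249/500). The Wald method and the function name are assumptions chosen for simplicity; other interval methods would give slightly different limits. The 10-point reduction used as the clinically relevant effect is the one from that same example.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_interval(events_control, n_control,
                             events_treated, n_treated, level=0.95):
    """Wald compatibility interval for the risk difference (treated - control)."""
    p_c = events_control / n_control
    p_t = events_treated / n_treated
    diff = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treated)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


diff, lower, upper = risk_difference_interval(250, 500, 249, 500)
print(f"risk difference = {diff:.3f}, 95% interval = ({lower:.3f}, {upper:.3f})")

# Clinically relevant effect from the example: 50% -> 40%, i.e. -0.10.
relevant_reduction = -0.10
print("interval excludes a 10-point reduction:", lower > relevant_reduction)
```

Under these assumptions the interval runs from about -0.064 to 0.060. It excludes a 10-point reduction, which says directly what the "was it underpowered?" question is trying to get at.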