"Observed Power" and other "Power" Issues

So who hasn’t seen their letters to the editor rejected, or perhaps more accurately, ignored?
For the study mentioned, I also had some statistical and substantive queries that, not surprisingly, also never saw the light of day. If you’re interested, here it is:

Kim (1) concludes that “treatment with escitalopram compared with placebo resulted in a lower risk of major adverse cardiac events” in patients treated for 24 weeks following an ACS. Perhaps the authors could comment on the effect size (a 31% reduction), which is almost twice the average size of other positive interventions in cardiology. Some comment would also be appreciated on the fact that the benefit is only seen 4-5 years after the intervention concluded. Finally, given that the MACE outcome was not listed in the registered protocol (https://clinicaltrials.gov/ct2/show/NCT00419471) as either a primary or secondary outcome, how can readers be assured that the results are not a product of “p-hacking”?


While it is certainly possible that this is an overestimate of the effect size from a small(ish) trial, I don’t like the idea of raising suspicion simply because the effect size is large. The authors presented a CI for their findings; we should all be smart enough to read it and know that the point estimate is just that - an estimate based on the observed data.

I’m not sure where you’re getting this. The KM curves for the MACE endpoint in Figure 2 have diverged by 2 years and continue to widen thereafter. Furthermore, given that this is an intervention targeted at treatment of depressed patients to improve cardiac outcomes, it seems plausible that the effects of the intervention would not emerge immediately, and that it may carry on for some time (since “depression” -> “cardiac outcomes” is not a connection that happens immediately, but is more likely due to chronic exposure to a number of things - depressed patients eating poorly, not exercising well, not sleeping well, and any number of other sequelae of depression).

Now this would seem a legitimate concern if there were some evidence of primary-outcome switching that turned a negative result into a positive one, but I think you may be digging for trouble where there is none (although I cannot be sure), and also using the “p-hacking” gun a little too liberally. The outcomes listed on that page were all depression-related outcomes. The authors previously published the findings for those outcomes here:

https://www.psychiatrist.com/jcp/article/Pages/2015/v76n01/v76n0110.aspx

This seems to be the trial’s primary results paper, with all of the outcomes listed in the clinicaltrials.gov record you found. The authors were not trying to obscure their primary results.

The additional paper with the MACE outcome appears to be a secondary paper from the same trial reporting the long-term cardiac outcomes. This is not uncommon; if you’ve gone to the trouble to recruit and enroll a trial cohort, and you’re able to continue following them and publishing longer-term outcomes, that seems to me a good use of resources.

Personally, I’d be more concerned about p-hacking if the original stated primary outcome had been something like “cardiovascular mortality” and showed no/little benefit but then there was a benefit on a composite MACE outcome. That wasn’t the case here. The authors reported all of their results related to depressive symptoms and anxiety (the trial’s primary and secondary outcomes that appeared in the initial protocol). I see no major problem with subsequently reporting the cardiac outcomes (since it was a population of ACS patients that were enrolled…)


@ADAlthousePhD I am actually quite shocked by their response to Gelman’s LTE…

I may reply to the reply.

It is excellent practice to compare the results obtained with the assumptions made at the planning stage. However, as others have rightly pointed out, to re-assess power retrospectively by taking the observed effect size as the true value is a total fiction. In fact, posterior power is best regarded as merely a rescaling of the p-value. For example, assume that the study was designed to have an 80% power to detect a specified size of difference using a test at the conventional 2-sided 5% alpha level. Then, if the parameter estimates from the data are just what was assumed in advance, we expect to end up with z or t around 2.8, not 2 - because the planned power was 80%, not 50%. Or, if the eventual test is chi-square, this would be around 8, instead of 4. A retrospective power of 80% simply means that p was about 0.005.
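To see the rescaling numerically, here is a minimal sketch for a two-sided z-test (the same logic carries over, approximately, to t and chi-square):

```python
# "Observed" power is just a transformation of the p-value.
# Two-sided z-test at alpha = 0.05; power is computed by plugging the
# observed z back in as if it were the true (noncentrality) value.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)               # ~1.96

def observed_power(z_obs):
    """Power of the two-sided test if the true effect equaled the observed one."""
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

def two_sided_p(z_obs):
    return 2 * norm.sf(abs(z_obs))

# A result that just reaches p = 0.05 (z ~ 1.96) always has ~50% observed power:
print(observed_power(1.96))                    # ~0.50

# An observed power of 80% corresponds to z ~ 1.96 + 0.84 = 2.80, i.e. p ~ 0.005:
z_80 = z_crit + norm.ppf(0.80)
print(z_80, two_sided_p(z_80))                 # ~2.80, ~0.005
```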


Correct.

It’s amazing how statisticians can spend 20 years writing papers explaining this, and yet the fiction of a retrospective power calculation using the observed effect size lives on in clinical journals as something people think is not only useful, but necessary. Even after it was explained to them, the response came back and showed that they had learned nothing:

https://insights.ovid.com/pubmed?pmid=29979247

“We fully understand that P value and post hoc power based on observed effect size are mathematically redundant; however, we would point out that being redundant is not the same as being incorrect. As such, we believe that post hoc power is not wrong, but instead a necessary first assistant in interpreting results.”


@zad and I have just submitted a letter in reply to the surgeon’s double-down. If the journal accepts, I will link; if the journal declines, I will post the full content here, possibly with a guest post on someone’s blog as well.

I’ll have to disagree a bit. Implausibility of the effect size can be a reason to cast doubt. As John Ioannidis published years ago, don’t trust an odds ratio above 3.0 in a genetic study.


I can’t agree entirely. If you agree that what matters is the false positive risk (FPR) rather than the p value, then the estimation of the false positive risk depends on (your best estimate of) the power of the experiment. See Fig 1 in https://arxiv.org/abs/1802.04888 (in press, American Statistician).
(I guess for consistency, power should be redefined in terms of FPR = 0.05 rather than p = 0.05, but that is for another day.)
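For the simpler “p ≤ α” reading, the way the power estimate enters can be seen in a few lines (a sketch only; the p-equals treatment in the paper gives larger values):

```python
# How the (assumed) power enters a false positive risk calculation, using the
# simpler "p <= alpha" interpretation; the p-equals case gives larger values.
def false_positive_risk(alpha, power, prior_prob_real):
    """P(H0 true | significant result) under the p <= alpha interpretation."""
    sig_and_null = alpha * (1 - prior_prob_real)
    sig_and_real = power * prior_prob_real
    return sig_and_null / (sig_and_null + sig_and_real)

# Prior odds of 1 (prior probability 0.5):
print(false_positive_risk(0.05, 0.80, 0.5))    # ~0.06 with 80% power
print(false_positive_risk(0.05, 0.20, 0.5))    # ~0.20 with 20% power
```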

While agreeing in principle with most of what’s been written here, there is also the question of what happens in practice. It’s easy to examine this by simulation. Take the example in section 5.1 of my 2017 paper.
If one looks at a lot of simulated t tests with theoretical power 0.78, and looks only at those which come out with 0.045 < p < 0.05 (i.e. we use the p-equals case), then the theoretical likelihood ratio in favour of H1 is 2.76, so the false positive risk for prior odds = 1 is 1/(1 + 2.76) = 26.6%.
In real life we don’t know the true effect size, etc., but if, for each simulated experiment, we calculate the likelihood ratio from the observed effect size and standard deviation, it comes out not at 2.76 but at 3.64 (corresponding to a false positive risk of 1/(1 + 3.64) = 21.6%). Although this is not the same as the theoretical result, it’s close enough that the interpretation would be much the same in practice.

So a calculation based on the observed results, though wrong in principle, would be close enough in practice (in this case) to give a sensible result.
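For anyone who wants to try this, here is a rough sketch of that kind of simulation, assuming a two-sample t-test with 16 per group and a true difference of one SD (which gives theoretical power of about 0.78); the likelihood ratios come from the central and noncentral t densities:

```python
# Rough sketch: simulate two-sample t-tests, keep those with 0.045 < p < 0.05
# (the "p-equals" window), and compare the likelihood ratio computed with the
# true effect size to the one computed with each observed effect size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, true_d, nsim = 16, 1.0, 100_000             # 16 per group, true difference = 1 SD
df = 2 * n - 2

x = rng.normal(0.0, 1.0, (nsim, n))
y = rng.normal(true_d, 1.0, (nsim, n))
sp = np.sqrt((x.var(axis=1, ddof=1) + y.var(axis=1, ddof=1)) / 2)   # pooled SD
t_obs = (y.mean(axis=1) - x.mean(axis=1)) / (sp * np.sqrt(2 / n))
p = 2 * stats.t.sf(np.abs(t_obs), df)

def lr_p_equals(t, ncp):
    """Likelihood ratio (H1 vs H0) for the observed two-sided t statistic."""
    f1 = stats.nct.pdf(np.abs(t), df, ncp) + stats.nct.pdf(-np.abs(t), df, ncp)
    f0 = 2 * stats.t.pdf(np.abs(t), df)
    return f1 / f0

keep = (p > 0.045) & (p < 0.05)
d_obs = (y.mean(axis=1) - x.mean(axis=1)) / sp                      # observed effect size
lr_true = lr_p_equals(stats.t.ppf(1 - 0.0475 / 2, df), true_d * np.sqrt(n / 2))
lr_obs = lr_p_equals(t_obs[keep], d_obs[keep] * np.sqrt(n / 2))

print(f"LR with the true effect size:      ~{lr_true:.2f}")        # ~2.8 (cf. 2.76 above)
print(f"mean LR with observed effect size: ~{lr_obs.mean():.2f}")  # ~3.6 (cf. 3.64 above)
```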

This has been a great thread, and I think @ADAlthousePhD and @zad have really brought up some great points. My question relates to non-significant findings. An example is a study I recently completed. It was retrospective and based on a large, fixed sample. Is it reasonable to calculate power using the sample size, alpha, and what I thought, prior to running the analysis, was the most clinically reasonable effect size? The reason I ask is that I keep hearing, “oh you didn’t find a difference, well you are underpowered to see a difference.” I would love to hear what you all think about this.


Tim - that’s a great question.

@zad and I actually submitted a letter in response to that surgical journal which addresses this to some extent. The letter seems to still be under consideration, but I will post a few pruned excerpts which may help crystallize this for you:

"Consider a parallel-group randomized trial with a very large sample size of 100,000 participants (exactly 50,000 patients per treatment arm). Suppose that 25,000 participants (exactly 50%) experience the primary outcome in the arm receiving standard of care. Suppose that 24,995 participants (exactly 49.99%) experience the primary outcome in the arm receiving experimental therapy. This primary comparison will return a non-significant result (p=0.98) suggesting that there is no difference between the treatment groups. Before looking ahead, we can all agree that 100,000 participants with an event rate of 50% seems like it would be very well-powered, yes?

We perform a post-hoc power calculation as prescribed, using the study sample size and observed effect size and see that the study has only 5% statistical power. Ah-ha! Maybe the null result isn’t evidence of no effect – after all, the study was simply “underpowered” to detect an effect…or was it?

Was this study actually underpowered to detect a clinically relevant effect? Of course not. A trial with 100,000 participants would have 88.5% power even if the true effect were only a reduction in the event rate from 50% with standard of care to 49% with experimental therapy (a mere 1% absolute reduction).

A more sensible alternative: instead of conducting a post-hoc power calculation based on observed effect size, the authors could show what a power calculation would have looked like with their chosen sample size and some small-to-moderate effect size that would be notable enough to adopt the therapy in clinical practice (essentially, the “most clinically reasonable effect size” that you mention here, @TFeend ).

How would that look in practice? Let’s now consider a trial with 1,000 patients (exactly 500 per treatment arm). In the standard of care, suppose that 250 participants (exactly 50%) experienced the primary outcome. In the experimental arm, suppose that 249 participants (49.8%) experienced the primary outcome. Again, this comparison will return a non-significant result (p=0.95) suggesting that there is no difference between the treatment groups. Again, a post-hoc power calculation based upon the observed effect size will show that the study had extremely low statistical power (about 5%).

This information is virtually useless; a more useful alternative would be computing statistical power that the sample size would have for some clinically relevant effect size. Suppose that we would want to adopt this therapy if it reduced the true event rate from 50% to 40%. A hypothetical trial with 1,000 participants (500 per arm) would have 89% power to detect such a difference, and therefore we may conclude that this initial trial was not “underpowered” to detect a clinically relevant effect – the trial was in fact reasonably well-powered to detect such an effect."
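For anyone who wants to check those figures, here’s a quick sketch using the usual normal-approximation power formula for two proportions (not necessarily the exact method behind the letter’s numbers, but it reproduces them):

```python
# Normal-approximation power for a two-sided test comparing two proportions,
# used to check the numbers quoted in the excerpt above.
from math import sqrt
from scipy.stats import norm

def power_two_proportions(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test for p1 vs p2 with n per arm."""
    z_crit = norm.ppf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z_effect = abs(p1 - p2) / se
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)

# 100,000-patient trial: "observed power" for the observed 50% vs 49.99% ...
print(power_two_proportions(0.50, 0.4999, 50_000))   # ~0.05
# ... versus power for a modest but real 1% absolute reduction
print(power_two_proportions(0.50, 0.49, 50_000))     # ~0.885

# 1,000-patient trial: "observed power" for 50% vs 49.8% ...
print(power_two_proportions(0.500, 0.498, 500))      # ~0.05
# ... versus power for a clinically relevant 50% -> 40% reduction
print(power_two_proportions(0.50, 0.40, 500))        # ~0.89
```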

Hopefully that is helpful.


I also want to return to the point that power calculations are made using other wrong assumptions, such as overestimating average event probabilities and underestimating variance of continuous Y. There are many reasons that power is merely a pre-study issue, if even good then.
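As a quick illustration, with made-up numbers, of how much those assumptions matter (a sketch using statsmodels’ power routines):

```python
# How planned power evaporates when design assumptions are optimistic: an
# overestimated control event rate, or an underestimated SD of a continuous Y.
from statsmodels.stats.power import NormalIndPower, TTestIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Binary outcome: planned for 30% vs 20% events; ~146 per arm gives 80% power ...
h_planned = proportion_effectsize(0.30, 0.20)
n_arm = NormalIndPower().solve_power(effect_size=h_planned, alpha=0.05, power=0.80)
print(round(n_arm))                                                           # ~146 per arm
# ... but if the control event rate is really 15% (same relative effect),
# the same trial has only ~25% power.
h_actual = proportion_effectsize(0.15, 0.10)
print(NormalIndPower().power(effect_size=h_actual, nobs1=n_arm, alpha=0.05))  # ~0.25

# Continuous outcome: planned for a 5-point difference with SD 10 (d = 0.5),
# 64 per arm gives ~80% power; if the SD is really 15 (d = 0.33), power is ~46%.
print(TTestIndPower().power(effect_size=5 / 15, nobs1=64, alpha=0.05))        # ~0.46
```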


I do appreciate those concerns. In the stuck-between-a-rock-and-a-hard-place situation that Tim (and others, including myself) may find ourselves in - a paper returned with a reviewer comment that wishes to dismiss a finding of “no relationship” by stating “your study must be underpowered, please include a post-hoc power calculation” - do you consider the approach suggested above at all reasonable?

I can appreciate the hand-wavy nature of power calculations before the study as well, but surely you see some value in estimating the number of patients needed for something - if not a dichotomous decision rule, then the ability to estimate an effect within some amount of precision? I am often concerned about the accuracy of the power calculations that I provide to collaborators, but I would prefer that they be based on something rather than simply recruiting whatever number of patients they think is enough (they’d almost always underestimate, I suspect).
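On the precision point, here is a toy sketch of what such a calculation could look like (the event rates and margin are made up):

```python
# Planning for precision rather than a dichotomous power target: choose n per
# arm so the 95% CI for a risk difference is no wider than a chosen margin.
from math import ceil
from scipy.stats import norm

def n_per_arm_for_margin(p1, p2, margin, conf=0.95):
    """Smallest n per arm so the Wald CI half-width for p1 - p2 is <= margin."""
    z = norm.ppf(1 - (1 - conf) / 2)
    return ceil((z / margin) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)))

# e.g. anticipating event rates around 50% vs 45% and wanting the risk
# difference estimated to within +/- 5 percentage points:
print(n_per_arm_for_margin(0.50, 0.45, 0.05))   # ~765 per arm
```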


I really like this example. Helps me (non-expert) to better understand the concepts being discussed. How similar is this to doing a post-hoc (conditional) equivalence test with a minimally important difference?

Yes I do find it reasonable, especially if you footnote it with a caveat or two. Even better would be to stick with a compatibility interval.

I still believe this paper by Hoenig & Heisey is one of the best on statistical power:

https://doi.org/10.1198/000313001300339897

along with this paper by @Sander


My suggestion is also to always look at the compatibility intervals! That would be a far more fruitful effort than trying to determine whether your study was underpowered or not. It is often difficult to know whether a nonsignificant finding reflects a truly negligible difference or an underpowered study. A way to reduce your uncertainty about whether the test hypothesis is really true is to use equivalence testing (if you must test hypotheses), or you could attempt to plan studies based on precision, so that the computed compatibility interval (or a function of it) is closely clustered around certain values.
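If it helps make the equivalence-testing suggestion concrete, here is a minimal two one-sided tests (TOST) sketch for a difference in means; the equivalence margin is purely hypothetical and would need to be set on clinical grounds in advance:

```python
# Minimal TOST (two one-sided tests) sketch for a difference in two means,
# with a hypothetical equivalence margin "delta".
import numpy as np
from scipy import stats

def tost_ind(x, y, delta):
    """TOST for equivalence of two independent means within (-delta, +delta)."""
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    df = nx + ny - 2
    p_lower = stats.t.sf((diff + delta) / se, df)      # H0: diff <= -delta
    p_upper = stats.t.cdf((diff - delta) / se, df)     # H0: diff >= +delta
    return diff, max(p_lower, p_upper)                 # equivalence if max p < alpha

rng = np.random.default_rng(0)
x = rng.normal(10.0, 2.0, 80)
y = rng.normal(10.2, 2.0, 80)
diff, p = tost_ind(x, y, delta=1.0)                    # margin of 1 unit (hypothetical)
print(f"difference = {diff:.2f}, TOST p = {p:.4f}")    # small p -> within the margin
```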


Unfortunately, it seems that the issue of post-hoc power / “observed” power just won’t die. Apologies to those that follow me on Twitter who are already aware of this, as I’ve been hammering this article for the last day or two, but a group of surgeons have been promoting the nonsensical idea that all studies which report negative results should include a post-hoc power calculation using the observed effect size, an idea which the statistical community has written a number of articles about over the years:

I have provided a lengthy reply at PubPeer:

https://pubpeer.com/publications/4399282A80691D9421B497E8316CF6#2

If you read this article and are as troubled as I am, I encourage you to also leave a comment at PubPeer. This article is wrong about nearly everything it says (I am fine with the initial premise that “absence of evidence is not evidence of absence” and with giving more thought to power/sample size; however, the entire applied portion is wrong, as is the subsequent recommendation that all studies be reported with an “observed” power, for reasons laid out in my PubPeer comments).

My PubPeer reply references several excellent resources which you may be interested in should you ever be pushed on this issue of “observed” power in a completed study:


I am so glad you are fighting this battle! And your point on Twitter, that a paper all about statistics did not get statistical peer review, is really concerning.


I actually emailed the Statistical Editor of the journal directly to ask about this, and he told me that he had not been sent this paper for review and did not know about it until I showed it to him, which confirms that a clinical medicine journal published an article entirely about the practice of statistics without getting any statistical input. It’s the equivalent of Statistics in Medicine publishing a paper about surgical technique without getting any surgeons to review the paper.

EDIT: also, the reason I am hopeful more people will leave PubPeer comments is my fear that the journal will think this is just another nuanced debate which requires “both sides” to have their say, when this is not a “both sides” issue. I do not want the editor to say “Thanks for raising your concerns, we will let you publish a nicely worded 500 word letter summarizing them” and leave this exchange back and forth in the literature. The paper is flat wrong, it’s misleading, and it should be removed from the literature. Otherwise, it may lead to scores of subsequent surgery papers that showed minimal benefits of a treatment citing this paper, performing the misguided calculation of “observed power” and saying “We cannot rule out the lack of treatment effect; our study had an observed power of just 15 percent. A larger study may have shown benefit” (which they do not realize is basically the same as saying “if the treatment worked better, we would have shown that it worked”)
