How to interpret an observed effect size that is smaller than the expected effect size but still statistically significant

sample-size
power

#1

I read a recent clinical trial where the researchers powered their study (80%) for an expected effect size of 4% absolute risk reduction. With the same sample size, they, however, report an absolute risk reduction of 3%, which was statistically significant. I am unable to interpret this finding. What does this say about the study’s power calculation? Was it under-powered/over-powered? I am struggling to grasp this finding. Thank you.


#2

This is a good question and one that I see often from clinician-researchers. I’ll be back to write a detailed reply tomorrow morning if no one else has covered it by then.


#3

I’m sure Andrew will add much more. My quick take is that power is a pre-study concept and is used for initial planning. Assumptions that go into power calculations are seldom true. We do look at planned and final sample sizes to make sure the patient recruitment was as intense as promised. But for the final interpretation I’d zero in on the uncertainty interval for the main effect of interest.


#4

Would it be fair to say that, in trying “to interpret this finding”, @Madhu is asking questions about the (frequentist) 20% of ‘alternative universes’ where that study (per the authors’ estimate) would have yielded non-significance? That type of question seems to me irrelevant to the true task of scientific interpretation.


#5

What I am trying to ask is: if the researchers powered their study for a 4% effect size, how are they getting a statistically significant 3% effect with the same sample size? One reason could be that the assumptions that went into the power calculation (such as the variance) did not hold. But I am not sure if this is the complete picture.


#6

Detailed answer coming. Working on it now.


#7

@Madhu, this is a good question, and I hope you won’t be offended by my saying that the way you’ve phrased it indicates a basic misunderstanding of the concept of statistical power (and it’s a misunderstanding that many clinicians and researchers share, so it is well worth taking the time to explain here).

Contrary to popular belief, stating that your study is powered to detect a certain effect size (IMO, this is poor terminology, for reasons that will be explained below) does not mean that the study is only going to show a “statistically significant” difference if the effect size exceeds that number.

To explain why, we need to go into a few of the mechanics of power and sample size calculations.

First things first: the 4% does not refer to the absolute difference that needs to be observed in the study to return a statistically significant difference. It refers to the “true” effect size that we assumed in designing the experiment.

Let’s suppose that I work with a population of patients where I can reasonably state that mortality is about 8% for patients treated with current standard of care (based on historical data). That does not mean exactly 8% of every possible subset of patients in this population will die. I might enroll 100 consecutive patients and see 9 deaths. I might enroll another 100 consecutive patients and see 6 deaths. In study design, we are using that 8% to stand in for a population-level parameter, whereas the mortality observed in any particular sample is a statistic. More on this in a moment.

Now, further suppose that I have an experimental therapy that I believe is capable of reducing mortality in this population. Of course, I cannot know precisely how much it will reduce mortality by without running a fairly large and extensive study. Lucky for me, someone with some money believes in this compound and says that they’ll give me a few million bucks to study this new drug. If I can prove that my drug saves lives then we will go get FDA approval and, once successful, the company will proceed to collect many millions of dollars, in addition to feeling good that we have helped save patients’ lives.

BUT, since my wealthy benefactor only wants to spend as much money as needed to prove that the drug is effective, they ask me to estimate how many patients the study will need to prove that the drug saves lives.

I think about this. Even as optimistic as I am about my new drug, I know that my drug will not likely reduce that mortality rate from 8 percent all the way to zero. I think my drug is really good, though, so I decide to design my study under the assumption that I can cut the mortality rate in half, from 8 percent to 4 percent. Hopefully we can all agree that is a clinically relevant effect size that we would not want to miss, if one has a therapy that can cut mortality in half.

I perform a power calculation based on the following assumptions/inputs:

Alpha=0.05
Control Group Mortality = 8%
Treated Group Mortality = 4%

Finally, I state that I want my study to have “80% power”

But what exactly does that mean?

The “80% power” number means that, under the conditions outlined above, 80% of the trials designed with sample size X would reject the null hypothesis at the specified alpha level of 0.05.

However, this does not require that the observed effect size within the trial be at least as large as the population effect size used in the power calculation. We are testing against a point null hypothesis that the effect is zero; statistical significance is not tied to beating the control group by a specific margin, but rather to showing that the observed treatment effect is incompatible with the null hypothesis of no effect. The calculations you see here are based on assumptions about an infinite possible population of patients, since we know that repeated trials are going to yield slightly different results. For example, if the “true” numbers of 8% control group mortality and 4% treated group mortality are correct, I may still enroll 100 patients into the control group and observe 6 deaths, and I might also enroll 100 patients into the treated group and observe 6 deaths. If I simply stopped at that point, I would likely conclude that the treatment is ineffective, even though that is not a terribly implausible result for the first 200 patients under the conditions outlined above.

The power calculation is just a vehicle to help us estimate how many patients we’ll need to enroll to be fairly confident (discussion for another time: why are so many studies designed with 80% power?) that a “null” result did not occur because we simply didn’t enroll enough patients.

Returning to my hypothetical example above: with an assumed control group mortality of 8% and powering based on a reduction to 4%, the study would require about 550 subjects per arm (1100 subjects total) to have 80% power (for the moment let’s ignore attrition, dropout, etc).
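For anyone who wants to check the 550-per-arm figure, here is a minimal sketch using the usual normal-approximation formula for comparing two proportions. The function name `n_per_arm` is my own, and the formula (unpooled variances, two-sided alpha, no continuity correction) is an assumption on my part; dedicated sample-size software may differ by a few patients.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided test of two proportions.

    Normal approximation with unpooled variances; illustrative only.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    delta = p_control - p_treated                  # assumed "true" risk difference
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / delta ** 2)


n = n_per_arm(0.08, 0.04)
print(n)  # 550
```

Note that the 4% enters only as the assumed population difference `delta`; nothing in the formula says the observed difference must reach 4% for the test to reject.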

Now suppose I have carried out the experiment and the results are in! I enrolled 550 patients per treatment arm, and we did in fact observe 8% mortality (44 deaths) in the control group, with roughly a 3% absolute reduction to about 5% mortality (27 deaths) in the treated group. A chi-squared test for the two proportions returns a p-value of 0.036, significant at my chosen alpha level of 0.05.
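The test on these counts can be reproduced with a short sketch. I am assuming a pooled two-proportion z-test here, which gives the same p-value as an uncorrected Pearson chi-squared test on the 2x2 table; the function name is hypothetical. The result should land close to the p-value quoted above, though the third decimal depends on which variant of the test is used, and a continuity correction would give a somewhat larger p-value.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test.

    Equivalent to Pearson's chi-squared test on the 2x2 table
    without continuity correction.
    """
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)   # common rate under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# 44 deaths out of 550 controls versus 27 deaths out of 550 treated patients
p_value = two_proportion_p_value(44, 550, 27, 550)
print(f"p = {p_value:.3f}")
```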

Note that this matches the conditions you outlined above: the power calculation was based on an estimated 4% absolute reduction in risk, but the observed trial results showed a 3% reduction in risk. The final result was statistically significant. Why? Because the power calculation doesn’t imply that the difference in the trial has to be equal to or larger than the estimated effect size used for the power calculation. We know that the observed results in any given sample are sometimes going to show a larger or smaller difference. A primary objective of the power calculation is to ensure that we have recruited enough patients to be reasonably confident that a “null” result is in fact due to a null effect and not merely due to sampling variability (again, with caveats about whether 80% power is in fact the right threshold for this…)
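A small simulation can make this concrete. The sketch below assumes the “true” mortality rates really are 8% and 4%, repeatedly simulates trials of 550 patients per arm, and applies a pooled two-proportion z-test to each one. The seed, the number of simulated trials, and all names are my own choices for illustration. If the reasoning above is right, roughly 80% of the simulated trials should be significant, and a sizeable share of all trials should be significant despite showing an observed reduction below 4%.

```python
import random
from math import sqrt

N_PER_ARM = 550
P_CONTROL, P_TREATED = 0.08, 0.04   # assumed "true" population mortality rates
Z_CRIT = 1.959964                   # two-sided critical value for alpha = 0.05
N_TRIALS = 2000                     # number of simulated trials (arbitrary)


def simulate_trial(rng):
    """Simulate one trial; return (observed risk reduction, significant?)."""
    deaths_c = sum(rng.random() < P_CONTROL for _ in range(N_PER_ARM))
    deaths_t = sum(rng.random() < P_TREATED for _ in range(N_PER_ARM))
    arr = (deaths_c - deaths_t) / N_PER_ARM
    pooled = (deaths_c + deaths_t) / (2 * N_PER_ARM)
    se = sqrt(pooled * (1 - pooled) * 2 / N_PER_ARM)
    significant = se > 0 and abs(arr) / se >= Z_CRIT
    return arr, significant


rng = random.Random(2024)  # fixed seed so the run is repeatable
results = [simulate_trial(rng) for _ in range(N_TRIALS)]

# Share of trials that reject the null: an estimate of the power
power_estimate = sum(sig for _, sig in results) / N_TRIALS
# Share of trials that are significant with an observed reduction below 4%
sig_but_smaller = sum(sig and 0 < arr < 0.04 for arr, sig in results) / N_TRIALS

print(f"estimated power: {power_estimate:.2f}")
print(f"significant with observed ARR below 4%: {sig_but_smaller:.2f}")
```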

Would love feedback from other experts on this. Power and sample size calculations have always been an incredibly challenging concept for me, and it took years to feel that I understood them reasonably well, much less that I could explain them to others.


#8

What would you say is the relationship between study power and test sensitivity? What conceptual differences exist between interpreting the result (given in the published study) yielded by a study design with “80% power to detect 4% ARR”, and interpreting the findings (in a given patient) from a physical exam maneuver with 80% sensitivity for such-and-such a diagnosis?


#9

This is the difference between the effect you power at (says something about H1, the alternative hypothesis) and the effect at which your test is significant (says something about H0, the null hypothesis). I call the latter the “minimal detectable difference”, or MDD.

The 4% is the effect you power at, i.e. if the effect in the population is 4% (your H1), then 80% of trials based on samples of size $n$ (= your sample size) from that population will be statistically significant at $\alpha = 0.05$.

However, this does not say much about when your result is “just” significant. The trial will be significant (2-sided) if the absolute value of your observed z-score is $\ge z_{1-\alpha/2}$, i.e. $|\hat \delta| / SE(\hat \delta) \ge z_{1-\alpha/2}$, with $\hat \delta$ the estimated effect size and $SE(\hat \delta)$ its standard error. From this we get that $|\hat \delta| \ge z_{1-\alpha/2} SE(\hat \delta)$, and the right-hand side is your MDD. This computation is based on the null. The power calculation, or rather the sample size $n$ derived from the alternative hypothesis, enters through the computation of the standard error of the effect estimate. I interpret the MDD as the critical value of the test on the scale of the effect of interest.
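To put a number on the MDD, here is a rough sketch that borrows the hypothetical design from earlier in the thread (8% versus 4% mortality, 550 patients per arm). It multiplies the critical z value by the standard error of the risk difference. Computing the standard error from the planning rates is my simplification; at analysis time it would come from the observed data, so the threshold shifts a little.

```python
from math import sqrt
from statistics import NormalDist


def minimal_detectable_difference(p_control, p_treated, n_per_arm, alpha=0.05):
    """Smallest observed risk difference that is 'just' significant (two-sided).

    The standard error is computed from the planning assumptions,
    which is only an approximation of the SE seen at analysis time.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treated * (1 - p_treated) / n_per_arm)
    return z_crit * se


mdd = minimal_detectable_difference(0.08, 0.04, 550)
print(f"MDD = {mdd:.3f}")  # MDD = 0.028
```

So under these assumptions the trial is powered at a 4% difference, but an observed difference of about 2.8 percentage points is already enough for significance, which is why an observed 3% can be significant.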

In fact, when planning a trial we focus the discussion with our clinicians on the MDD (“What effect will still be clinically relevant?”), not the effect we power the study at. This helps avoid the situation above, where clinicians were of the opinion that the effect we power the study at needs to be observed in order to be significant, while in fact significance is also possible for smaller effects.


#10

This is a relevant question and some very good answers were already given. I’d like to add an aspect that is, in my opinion, a main reason for the confusion that leads to questions like yours:

A bit of background: power analysis is part of Neymanian A/B testing. This is a behavioral method, aiming to minimize an expected loss or to maximize the expected win in situations where we need to decide between two distinctly different alternatives (A and B) based on noisy data. This is achieved by keeping upper limits on statistical (not real!) error rates, which need to be set by the researcher according to some sensible loss function. As we researchers usually have no idea about any sensible loss function, we usually stick to crude rules of thumb, hoping we don’t do too badly too often. We all know these commonly chosen error rates as 0.05 for the type-I error (deciding for B in a world where A is true) and 0.2 for the type-II error (deciding for A in a world where B is true). The testing procedure is an acceptance procedure, not a rejection procedure (as the Fisherian significance test is). It is about behaving either as if A were true or as if B were true. At no point is anything claimed about the truth or likelihood of either alternative. This is outside of what we can handle with frequentist methods.

Now to the key point: The estimates are not inferential. They are only “benchmarks based on the data”. They are used internally by the testing procedures (in both significance tests and A/B tests, as both are based on the same math but on completely different philosophies!). The A/B test just tells you how you should behave. It’s not even saying that your data is “significant”. This is all captured by the desired error rates that determine the experimental design. This is all already given at the time of data collection, and the entire test is just to “calculate” whether this data should then lead us to accept A or to accept B. This has nothing to do with an actual estimate of the effect size based on the sample data. The whole testing procedure was developed just because it is impossible, based only on sampled data, to draw any conclusion about the “real” or “true” effect with frequentist methods (it would require a Bayesian interpretation, which renders the “objective” interpretation of probability obsolete; tests were developed just to avoid this!). Tests are not about estimation. They are only about data, and the rest are assumptions we (should) consider sensible/reasonable.

Hence, although quite commonly done, it makes little sense to report both the decision of a test and an estimate. Giving a proper estimate necessarily needs a Bayesian approach and a subjective understanding of probability, which is a no-go for tests.

The source of this confusion seems to lie in the ambiguous use of “estimate”. It can mean a sample statistic (as used in the tests), and as such it doesn’t provide any information about what “we know about” that effect. But it can also represent a value we think is “real”, that is, we believe the “real parameter” actually has this value (within some margin of error, for sure). For large samples in relatively simple problems and under seemingly reasonable assumptions, a carefully derived “real” estimate will be very similar to the corresponding sample statistic. I think this has led to “estimate” also being used for sample statistics and has blurred the distinction between them.


#11

Interesting discussion! I was also struggling with this question a while back and while searching for an answer I found the following paper which cleared a few things up for me: https://www.ncbi.nlm.nih.gov/pubmed/28720618
The focus is on non-significant results but they also discuss significant results in relation to a defined clinically meaningful effect.


#12

This is a much more direct way of saying what it took me about 1,000 extra words to say. Well done.

I’ll merely add that much of the confusion (IMO) comes from people struggling to grasp the difference between the population effect and the observed effect in the trial. Hence my long-winded explanation which attempts to illuminate that in more detail.


#13

Interesting discussion. Along similar lines, how would you interpret the trial below?

Original power calculation assumptions:
Control event rate: 13%
Relative reduction: 20%
Alpha: 0.05
Power: 80%

Observed results:
Control event rate: 4%
Effect size HR 0.59 (95% CI 0.41 to 0.84; P=0.004).

I am more interested in the false discovery rate given that the observed control event rate is way less than projected. The point estimate of the effect size is too good to be true. If the P-value was not significant, many would have brushed this study off saying it was underpowered given overly optimistic control event rate assumptions in the planning phase of the trial.


#14

I am curious to see other people’s answer, but here are my thoughts on 2 ways to answer this:

  1. We can’t really know the FDR of a single study; only after repeating a study multiple times could we get an idea. If the effect size is really that big, it should be easy to repeat and test.

  2. One could try to assess the probability of a false discovery using a Bayes factor. Given your statement that the effect size is “too good to be true”, we could reasonably assume the pre-study odds are low, and so the post-study probability of a false discovery is high. However, the devil is always in the choice of pre-study odds, which is difficult and contentious given the need to include clinical views.
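To illustrate how strongly the second point depends on the pre-study odds, here is a deliberately crude sketch. It is not a Bayes factor analysis: it treats the result as a binary “significant / not significant” outcome and applies Bayes’ rule with an assumed power and alpha. The prior probabilities are made-up values for illustration, and the 80% power is the planned figure, which may not reflect the trial as it was actually run.

```python
def false_positive_probability(prior_prob, power=0.80, alpha=0.05):
    """Probability that a significant result is a false positive.

    prior_prob is the assumed pre-study probability that the effect is real.
    Treats significance as a binary event; a crude simplification.
    """
    true_positives = power * prior_prob
    false_positives = alpha * (1 - prior_prob)
    return false_positives / (true_positives + false_positives)


# Hypothetical pre-study probabilities, from optimistic to skeptical
for prior in (0.5, 0.2, 0.05):
    fpp = false_positive_probability(prior)
    print(f"prior = {prior:.2f} -> false-positive probability = {fpp:.2f}")
```

With these made-up priors the answer moves from a few percent to more than half, which is exactly why the choice of pre-study odds is the contentious part.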


#15

With what Raj wrote in mind, a formal Bayesian analysis should be done, using a skeptical prior and getting a posterior probability of efficacy. But since I believe power is a pre-study characteristic, if you’re not Bayesian you have to go with the confidence interval Sripal provided. And to get such a confidence interval the study would have to have been quite huge with such a low control event rate.
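The “quite huge” remark can be checked on the back of an envelope. The sketch below recovers the standard error of the log hazard ratio from the reported 95% CI (0.41 to 0.84) and then applies the common approximation that, with 1:1 allocation, the variance of the log hazard ratio is about 4 divided by the total number of events. The final step, converting events to patients, treats the hazard ratio as a risk ratio and uses the 4% control event rate, so it is only a crude order-of-magnitude guess.

```python
from math import log
from statistics import NormalDist


def implied_events(ci_lower, ci_upper, level=0.95):
    """Back out SE(log HR) and the approximate total number of events from a CI.

    Uses var(log HR) ~ 4 / events, an approximation for 1:1 allocation.
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se_log_hr = (log(ci_upper) - log(ci_lower)) / (2 * z)
    events = 4 / se_log_hr ** 2
    return se_log_hr, events


se_log_hr, events = implied_events(0.41, 0.84)

# Crude conversion to patients: 4% control event rate, treated rate ~ 0.59 * 4%
average_event_rate = (0.04 + 0.59 * 0.04) / 2
approx_patients = events / average_event_rate

print(f"SE(log HR) = {se_log_hr:.2f}, events = {events:.0f}, "
      f"patients = {approx_patients:.0f}")
```

Under these assumptions the interval implies on the order of 120 events, and therefore a few thousand patients.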


#16

I often use this explanation: If the real effect was RR=0.5, then about half the time your point estimate would be over 0.5 and half the time under. So if you design the trial to give significance exactly when the point estimate is under 0.5, you get 50% power. To make it higher, you must allow less impressive point estimates to be significant.
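Here is a small numeric sketch of that explanation. It assumes the estimated log relative risk is normally distributed around the true value, with a standard error of 0.2 that I picked purely for illustration. Power is then the probability that the point estimate falls below the significance threshold.

```python
from math import exp, log
from statistics import NormalDist

TRUE_RR = 0.5     # assumed true relative risk
SE_LOG_RR = 0.2   # hypothetical standard error of the log relative risk

# Sampling distribution of the estimated log(RR) under the assumed truth
sampling_dist = NormalDist(mu=log(TRUE_RR), sigma=SE_LOG_RR)


def power_for_threshold(threshold_rr):
    """Power if the trial is significant exactly when the estimate is below threshold_rr."""
    return sampling_dist.cdf(log(threshold_rr))


def threshold_for_power(power):
    """Largest point estimate that must still count as significant for a given power."""
    return exp(sampling_dist.inv_cdf(power))


print(power_for_threshold(0.5))            # 0.5
print(round(threshold_for_power(0.80), 2)) # 0.59
```

In a real design the threshold is fixed by alpha and the standard error, and it is the sample size (and hence the standard error) that gets chosen so that the threshold lands where 80% power requires.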


#17

That is a very sensible and probably easier to follow explanation. Thanks!