Decades of underpowered RCTs

One of the most frustrating problems in medical science is the construction of the incomplete bridge. No other industry could build an incomplete bridge and then simply say “one problem with our bridge is that we did not have enough steel to complete it”.

This is ubiquitous and always ends with a call for another trial, leaving the clinician with no answer.

The important question which could have been answered by this RCT remains unanswered. Why can’t this be fixed?

https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2835888

2 Likes

Yes, it’s a problem. The traditional way of thinking about trials is that they should be independent and not incorporate information that we already have.

Scott Berry talks about this issue (among others) here

also this datamethods thread

Scott makes the point that when a 1000-patient trial is suggestive but inconclusive (“fails to show significance”), the usual reaction is to do a 2000-patient trial rather than add maybe a few hundred patients to the existing evidence. At best this is a waste of resources.
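
A minimal simulation sketch of that resource point (mine, not from Scott's talk or the linked thread): it assumes a hypothetical binary endpoint with control and treatment event rates of 0.30 and 0.24, and compares the chance of a conclusive result from a brand-new 2000-patient trial against a pooled analysis of the existing 1000 patients plus 400 new ones, with no penalty or down-weighting applied to the earlier data.

```python
# Minimal sketch: power of a fresh 2000-patient trial vs. a pooled analysis
# of 1000 existing + 400 new patients. Event rates and alpha are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p_control, p_treat, alpha = 0.30, 0.24, 0.05

def two_prop_p(x1, n1, x2, n2):
    """Two-sided two-proportion z-test p-value (vectorized over simulations)."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return 2 * stats.norm.sf(np.abs(z))

def power(n_per_arm, n_sims=10_000):
    """Simulated probability of p < alpha at the given per-arm size."""
    x_c = rng.binomial(n_per_arm, p_control, n_sims)
    x_t = rng.binomial(n_per_arm, p_treat, n_sims)
    return np.mean(two_prop_p(x_c, n_per_arm, x_t, n_per_arm) < alpha)

print("fresh 2000-patient trial (2000 new patients): power ~", power(1000))
print("pooled 1000 existing + 400 new patients:      power ~", power(700))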

5 Likes

Totally agree, and also it is unclear exactly when it is justified to do a second trial. I’d rather see one really good sequential trial (with N as a random variable) than the mess we have now. A suggested Bayesian strategy that meets multiple clinical goals is here.
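
For illustration of N as a random variable, here is a minimal sketch of a sequential two-arm design with a binary endpoint; it is illustrative only and is not the strategy in the linked proposal. The event rates, priors, look schedule, and stopping thresholds are all hypothetical.

```python
# Minimal sketch of a sequential design in which N is a random variable:
# check the posterior probability of benefit every 50 patients per arm and
# stop early for efficacy or futility. All parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
p_control, p_treat = 0.30, 0.22        # hypothetical true event rates
eff_cut, fut_cut = 0.975, 0.10         # posterior probability thresholds
looks = range(50, 1001, 50)            # per-arm sample sizes at each look

def prob_treat_better(x_t, n_t, x_c, n_c, draws=5_000):
    """Posterior P(p_treat < p_control) under independent Beta(1,1) priors."""
    pt = rng.beta(1 + x_t, 1 + n_t - x_t, draws)
    pc = rng.beta(1 + x_c, 1 + n_c - x_c, draws)
    return np.mean(pt < pc)

def one_trial():
    """Run one simulated trial; return (per-arm N at stopping, reason)."""
    c = rng.binomial(1, p_control, 1000)
    t = rng.binomial(1, p_treat, 1000)
    for n in looks:
        pr = prob_treat_better(t[:n].sum(), n, c[:n].sum(), n)
        if pr > eff_cut:
            return n, "efficacy"
        if pr < fut_cut:
            return n, "futility"
    return 1000, "max N reached"

results = [one_trial() for _ in range(100)]
print("first few realized (N, reason):", results[:5])
print("mean per-arm N at stopping:", np.mean([n for n, _ in results]))
```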

3 Likes

Hi,

I have a number of comments, on this general issue and on the study being referenced specifically.

First, in a general context, underpowered studies are problematic not only because, in a frequentist setting, failure to reject the null leaves the conclusion ostensibly indefinite, but also because, when the null is rejected, there is a reasonable probability of overestimating the treatment effect, leading to overly optimistic expectations.
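
The overestimation point is easy to demonstrate by simulation. A minimal sketch, assuming a hypothetical continuous outcome with a true standardized effect of 0.2 and 50 subjects per arm (roughly 17% power): conditioning on p < 0.05 substantially inflates the average estimated effect.

```python
# Minimal simulation sketch of the overestimation ("winner's curse") point:
# in an underpowered design, selecting on p < 0.05 inflates the average
# estimated effect. The true effect, SD, and sample size are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_effect, sd, n_per_arm, n_sims = 0.2, 1.0, 50, 20_000   # ~17% power

estimates, pvalues = [], []
for _ in range(n_sims):
    control = rng.normal(0.0, sd, n_per_arm)
    treated = rng.normal(true_effect, sd, n_per_arm)
    _, p = stats.ttest_ind(treated, control)
    estimates.append(treated.mean() - control.mean())
    pvalues.append(p)

estimates, pvalues = np.array(estimates), np.array(pvalues)
sig = pvalues < 0.05
print("true effect:                     ", true_effect)
print("mean estimate over all trials:   ", round(estimates.mean(), 3))
print("mean estimate when p < 0.05 only:", round(estimates[sig].mean(), 3))
```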

In the latter case, there was a 2013 article in Nature Reviews Neuroscience by Button et al that discusses this:

Power failure: why small sample size undermines the reliability of neuroscience

To quote from the opening paragraph in the above paper:

It has been claimed and demonstrated that many (and possibly most) of the conclusions drawn from biomedical research are probably false. A central cause for this important problem is that researchers must publish in order to succeed, and publishing is a highly competitive enterprise, with certain kinds of findings more likely to be published than others. Research that produces novel results, statistically significant results (that is, typically p < 0.05) and seemingly ‘clean’ results is more likely to be published. As a consequence, researchers have strong incentives to engage in research practices that make their findings publishable quickly, even if those practices reduce the likelihood that the findings reflect a true (that is, non-null) effect.

Of course, there is Doug Altman’s original 1994 BMJ editorial:

The scandal of poor medical research

and Richard Smith’s 2014 follow up:

Medical research—still a scandal

On the particular study being referenced here:

I am curious as to why they considered this a phase II study, given the prior research that was available and that the study design was more along the lines of a phase III confirmatory design.

I do not really see anything in the study design that I would label as phase II: the target sample size is above that of typical two-arm phase II designs, there is no dose-finding component, and they used a two-sided hypothesis where a one-sided hypothesis would be more common, which would increase the effective power and allow a smaller sample size for a phase II study.
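
On the one-sided versus two-sided point, a quick back-of-envelope sketch using the usual normal approximation for a two-proportion comparison at 80% power; the event rates (0.40 vs 0.20) and alpha = 0.05 are hypothetical and are not taken from the study under discussion.

```python
# Rough sketch: per-arm sample size for a two-proportion z-test at 80% power,
# two-sided vs. one-sided alpha. Event rates and alpha are hypothetical.
from scipy import stats

def n_per_arm(p1, p2, alpha, power, sides):
    """Approximate per-arm n for a z-test of two proportions."""
    z_alpha = stats.norm.ppf(1 - alpha / sides)
    z_power = stats.norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2

print("two-sided alpha = 0.05:", round(n_per_arm(0.40, 0.20, 0.05, 0.80, 2)), "per arm")
print("one-sided alpha = 0.05:", round(n_per_arm(0.40, 0.20, 0.05, 0.80, 1)), "per arm")
```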

The only parameter I see that is suggestive of a phase II design is the lower a priori power of 0.8.

Notwithstanding adaptive study designs, a single phase II study is intended to provide sufficient evidence to enable the decision by relevant parties to proceed or not to proceed with a larger phase III, confirmatory study. In essence, there is an a priori expectation of possibly needing more research given the findings of the phase II study.

I do not see that context here, in that their inference (and the associated commentary) regarding the need for a larger, multi-site study appears to be the result of their failure to meet their target enrollment rather than of a priori intent.

To their credit, they did perform a priori power/sample size estimates, albeit they had problems, as noted, fulfilling their target enrollment, which can be a common challenge. They enrolled only 81 of the target 150 subjects, and the actual attrition rate was 26% versus the anticipated 20%.

Further, the treatment effect observed in the truncated cohort was smaller than that hypothesized for even their smallest target sample of 84 analyzable subjects (106 enrolled, allowing for 20% attrition): a 22% absolute reduction was hypothesized versus the 13.5% observed.

The associated commentary would seem to suggest that even the observed effect is still clinically meaningful, albeit I would defer to clinical subject matter experts on that.

As we know, the premise of just needing more subjects (e.g. the notion of “trending towards significance”), which presumes that the additional results would look the same as those in the available cohort, is highly problematic.
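
To put a rough number on that premise, a back-of-envelope sketch (normal approximation, two-proportion comparison) of the per-arm sample size needed for 80% power under the hypothesized 22% absolute reduction versus the 13.5% reduction actually observed. The 45% control event rate and two-sided alpha of 0.05 are my assumptions, not figures from the trial.

```python
# Rough sketch of why "just add more subjects" is a big ask here: per-arm
# sample size for 80% power under the hypothesized vs. observed absolute
# reductions. The 45% control rate and alpha = 0.05 are assumptions.
from scipy import stats

def n_per_arm(p_control, p_treat, alpha=0.05, power=0.80):
    """Approximate per-arm n for a two-sided z-test of two proportions."""
    z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return z ** 2 * variance / (p_control - p_treat) ** 2

print("hypothesized 22% absolute reduction:", round(n_per_arm(0.45, 0.23)), "per arm")
print("observed 13.5% absolute reduction:  ", round(n_per_arm(0.45, 0.315)), "per arm")
```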

So this particular study was, to my eyes, problematic on a number of fronts, and one wonders, perhaps in hindsight, and presuming that they knew they had enrollment problems well in advance of stopping the study, what other options were considered (e.g. expanding the number of study sites) to try to meet the original enrollment goals.

That all being said, I agree with the premise of this post, that there are problems generally speaking, and as Button et al noted, many driven by the need to publish.

5 Likes

“..not incorporate information that we already have.”
What would your reference be for this? I don’t see this in clinical trials or in any books on DOE.

Some of it, in my experience, is justified by the study team, IRB, and FDA on the grounds that accrual to a very large trial is not feasible, furthered by the hope that the signal for meaningful benefit will be large.

This is quite an optimistic approach to study design, which could be improved by allowing early stopping if the effect is really what the study leaders dreamed of, while continuing if all that is emerging is clinical significance.

1 Like

There’s a lot of that - what we might call feasibility studies! Clinical trials are fairly often terminated for lack of efficacy, which is the responsibility of the data monitoring committees. Another rationale for continuing can be that some participants have responses deemed clinically meaningful, for indications with unmet needs. To be clear, I’m not advocating for any position here, just passing on what I’ve observed as an advocate (in response to an important issue raised).

1 Like

Agreed. On a related note, I estimate that the large number of studies that went to completion with p > 0.05 could have stopped early for futility at less than 1/3 of the final sample size.

2 Likes

An important question that could be directed to the leadership of the CIRB (NCI centralized IRB).

A late-arriving thought (common for me nowadays): when the primary endpoint is time to progression or PFS, it can take many months, even years, after full accrual for some indications (such as the indolent lymphomas) to compare outcomes.

Probably better to evaluate this concern on a trial by trial basis.

A more coherent approach: decide how much information can be borrowed between the treatment effect on progression and the treatment effect on death, require a power of 0.5 to detect a death effect on its own, then borrow the agreed-upon information in a Bayesian analysis as proposed here.
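
For intuition only, here is a minimal normal-approximation sketch of borrowing between endpoints; it is not the specific proposal linked above. The posterior for the mortality log hazard ratio combines its own imprecise estimate with a prior centered on the progression effect, whose width encodes how much borrowing was agreed upon. All of the numbers (hazard ratios, standard errors, borrowing SD) are hypothetical.

```python
# Minimal normal-approximation sketch of borrowing a well-estimated PFS effect
# into the analysis of a poorly-estimated mortality effect. All numbers are
# hypothetical; this is illustrative, not the linked proposal.
import numpy as np

# observed log hazard ratios and standard errors (hypothetical)
lhr_prog,  se_prog  = np.log(0.70), 0.10   # progression/PFS effect, well estimated
lhr_death, se_death = np.log(0.85), 0.25   # death effect, imprecise on its own

# agreed-upon borrowing: prior for the death effect centered at the progression
# effect; a smaller sd_borrow means more borrowing
sd_borrow = 0.15
prior_mean, prior_var = lhr_prog, se_prog**2 + sd_borrow**2

# conjugate normal-normal update for the death effect
post_var  = 1 / (1 / prior_var + 1 / se_death**2)
post_mean = post_var * (prior_mean / prior_var + lhr_death / se_death**2)

print("death HR, own data only:  %.2f" % np.exp(lhr_death))
print("death HR, with borrowing: %.2f (posterior SD %.2f)"
      % (np.exp(post_mean), np.sqrt(post_var)))
```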

To Frank’s comment, I generally agree, albeit the reality is that an experienced statistician needs to be involved in the study design and execution to implement a priori interim stopping rules, whether frequentist (e.g. using conditional power) or Bayesian (e.g. using predicted probability of success), and then an experienced DMC/DSMB needs to be convened, with reasonable pre-specified thresholds, to engage in decision making during closed (executive) sessions when presented with the data.

Of course, the budget for the study via whatever the funding entity is, industry or otherwise, still needs to be sufficient to cover the full costs of the study, including the DMC/DSMB, should it go to completion, presuming proper a priori power/sample size estimates were done, which is yet another issue…

Over the past 20 years, give or take, where I have sat on monitoring committees for IDE/PII/PIII trials, formal pre-specified interim stopping rules (safety, efficacy, futility) have been present only in a minority of them, albeit there have been interim sample size adjustments in some.

Without going into details, in one example of a PIII pivotal trial with formal interim stopping rules, with 3 pre-defined interim analysis time points and conditional power thresholds, and in the absence of clear safety concerns, the trend in conditional power was not favorable for the study over the multiple interim analyses. However, the strict threshold for stopping the study for futility (CP = 0.10) was never crossed, even though it dropped below 0.20 by the second interim analysis, at which time enrollment was still underway.

In the end, the study went to full enrollment and completion but failed the primary endpoint. In hindsight, the study probably could have been stopped for futility earlier, perhaps with a different decision framework.

In post-study discussions, there were considerations of how different decisions might have been made, in the setting of a very strict threshold of CP = 0.10 for a futility decision, and where the disposition was to “give the trial every chance to succeed”.

I would also note that generally, recommendations made by the DMC/DSMB to the sponsor are not binding per the committee charter, thus still leaving some discretion on the part of the sponsor.

One option would have been to provide a pre-defined communication channel to sponsor-related persons who were not directly involved in the conduct of the study, so as not to bias the conduct of the study through knowledge of interim findings and concerns.

If that had been present, I suspect that when the CP dropped below 0.2, there would have been preliminary communications via that channel regarding the study. A decision could have been made to stop at that time, or, again in the absence of safety concerns, to affirm that the sponsor wanted to continue unless and until the CP dropped to 0.10.

I suspect that even when experienced personnel are available, both in the design and in the conduct of the study, stopping rules for futility are still often ill-defined, and probably depend upon the specific details of the study and the disease being treated, in terms of the motivations to continue the study even when the outlook may not be favorable during interim analyses.

2 Likes

Great points Marc. In the Bayesian vein, faster stopping results from having many more data looks and also from using not a null point but a trivial treatment effect as the point of reference. For example, a high posterior probability that the treatment effect is less than trivial will result in earlier stopping than a high posterior probability that it is < 0.

Minor point: It is simpler to not use predicted probability of success but just to use P(inefficacy) as a futility criterion as exemplified here.
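
A small numeric sketch of the reference-point idea (illustrative only): for a normal posterior with benefit coded as positive, P(effect < trivial) always exceeds P(effect < 0), so a futility rule based on the trivial reference crosses a given threshold at a smaller sample size. The posterior mean, data SD, trivial threshold, and look sizes below are hypothetical.

```python
# Illustrative only: P(effect < trivial) vs. P(effect < 0) for a flat-prior
# normal posterior at a few interim sample sizes. All numbers are hypothetical.
from scipy import stats

post_mean, sigma, trivial = 0.05, 1.0, 0.10   # observed mean effect, data SD, trivial threshold

for n in (100, 200, 400, 800):
    post_sd = sigma / n ** 0.5
    p_zero    = stats.norm.cdf((0.0 - post_mean) / post_sd)       # P(effect < 0)
    p_trivial = stats.norm.cdf((trivial - post_mean) / post_sd)   # P(effect < trivial)
    print(f"n = {n:4d}  P(effect < 0) = {p_zero:.2f}  P(effect < trivial) = {p_trivial:.2f}")
```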

The budget point is especially good, and leads to consideration of platform trials for creating real opportunities for aggressive futility stopping, because with platforms you have other trials to switch to.

4 Likes

Thank you. Would the IRB not have authority to suspend the study based on the DMC/DSMB recommendation, as a means to protect the participants from harm?

1 Like

Yes, the IRB would have the authority to either suspend or terminate a study, given the facts being reported to them by the sponsor. The same would be the case for the FDA, in the setting of a regulatory submission.

If a DMC/DSMB has been convened for the study and they have made a formal recommendation to stop the study, the sponsor is obligated to investigate that recommendation and respond back to the DMC/DSMB in a timely manner.

Recall that the DMC/DSMB recommendation is not binding on the sponsor, albeit, if the sponsor elects to reject the recommendation, they need to document the rationale for that decision and report it back to the DMC/DSMB. In that case, the DMC/DSMB members would then each have to decide, given the facts on the ground, whether they wish to continue in their role or resign.

If the DMC/DSMB recommendation is based upon a clear safety concern, especially in the setting of unexpected SAEs that can be linked to the study treatment, the sponsor would have an obligation to report that to the relevant regulatory bodies in a timely manner, in addition to the individual SAEs already being reported separately as per regulations. That would kick-start the formal discussions vis-a-vis the suspension and/or termination of the study.

If the recommendation by the DMC/DSMB is based purely upon a poor trajectory of enrollment, in the absence of the clear safety concerns noted above, and the sponsor has already engaged in various attempts to improve enrollment, then a recommendation may be made to stop the study because the risk/benefit consideration is no longer favorable.

An important context is that the role of the DMC/DSMB is advisory to the sponsor. In general, under the current guidelines by the FDA from 2006 for the convening of a DMC/DSMB, only under the most extraordinary circumstances would there be any direct communications between a DMC/DSMB and the IRB/FDA. Thus, the sponsor is the primary intermediary in that communication channel and the obligation is on them to facilitate communications to the regulatory bodies as I reference above.

The FDA did release new draft guidance last year (2024), and some of the changes in the document address the parameters around communications between the FDA and the DMC/DSMB. If the draft becomes final in time, it could provide some additional parameters under which there may be more direct communications between the IRB/FDA and a DMC/DSMB as circumstances warrant.

4 Likes

Just curious … how could such a notion be any more sound than post-hoc power? (Indeed, loosely translating ‘interim’ or ‘up-to-now’ as ad hoc, the term ‘ad hoc power’ seems apt.)

Perhaps this originates primarily from legal/regulatory considerations:


https://www.fda.gov/media/172166/download

1 Like

A key differentiation here with formal interim stopping rules, as opposed to purely post-hoc power after the study is completed, is that these parameters are all pre-specified in the approved study protocol. That includes the timing of the interim analyses, the method(s) by which conditional power would be estimated, and the thresholds for interim decision making, such as futility and perhaps interim sample size re-estimation.

In essence, you are pre-specifying interim looks at the data to assess the likelihood that the study will, in the frequentist context, reject the null at study end. If the probability of that outcome drops below a pre-defined threshold (a conditional power of usually 0.1 or 0.2), it would be futile to continue the study and expose more human beings to the risks of the study. These decisions would typically be made while enrollment is still actively underway, allowing the DMC/DSMB to recommend stopping further enrollment at that point.

In general, conditional power would be estimated using a dataset with parameters based upon conditional assumptions relative to the endpoints, commonly:

  1. Take the observed endpoint data already available at the time of the interim analysis, combine that with synthetic endpoint data where the assumption is that the subject endpoints yet to be observed will be consistent with the original pre-study assumptions.

  2. Take the observed endpoint data already available at the time of the interim analysis, combine that with synthetic endpoint data where the assumption is that the subject endpoints yet to be observed will be consistent with the already observed data.

In both cases, the observed endpoint data would have been already adjudicated by the study clinical events committee (CEC) such that these events would not be subject to further changes or dispute.

Depending upon the direction and magnitude of the treatment effect in the observed endpoint data versus the pre-study assumptions, the two approaches above can lead to optimistic versus conservative views of the probability of study success and enable the DMC/DSMB to consider the totality of the information to facilitate their decision to continue or to stop the study.
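
A minimal simulation sketch of the two conventions just described, for a hypothetical two-arm trial with a binary endpoint at an interim look: (1) remaining subjects follow the pre-study assumptions, (2) remaining subjects follow the data observed so far. All rates, counts, and sample sizes are made up, and a simple two-proportion z-test stands in for whatever the actual primary analysis would be.

```python
# Minimal sketch of conditional power under the two conventions above.
# All rates, counts, and sample sizes are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_final = 300                      # planned subjects per arm
n_obs = 150                        # per-arm subjects observed at the interim
design_rates = (0.40, 0.25)        # (control, treatment) rates assumed pre-study
obs_events = (55, 48)              # interim event counts (control, treatment)

def rejection_rate(x_c, x_t, n, alpha=0.05):
    """Fraction of simulated final datasets where a two-sided z-test rejects."""
    p_pool = (x_c + x_t) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (x_c / n - x_t / n) / se
    return np.mean(2 * stats.norm.sf(np.abs(z)) < alpha)

def conditional_power(future_rates, n_sims=10_000):
    """Simulate the not-yet-observed subjects under the given event rates."""
    n_remaining = n_final - n_obs
    x_c = obs_events[0] + rng.binomial(n_remaining, future_rates[0], n_sims)
    x_t = obs_events[1] + rng.binomial(n_remaining, future_rates[1], n_sims)
    return rejection_rate(x_c, x_t, n_final)

current_trend = (obs_events[0] / n_obs, obs_events[1] / n_obs)
print("CP, future data per design assumptions:", conditional_power(design_rates))
print("CP, future data per current trend:     ", conditional_power(current_trend))
```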

In the setting of pre-defined sample size re-estimation, there may be additional conditional power thresholds, for example a “Promising Zone”, where the study protocol would allow for a pre-defined increase in the study sample size, up to a pre-defined maximum value. In essence, you might increase the study sample size to retain the desired pre-study power assumptions, given that the study would then be likely to reject the null, rather than risking the study ending with an inconclusive result. This would generally be in the context where the treatment effect size likely to be observed in the study may be smaller than hypothesized pre-study, perhaps because that assumption was optimistic, but the interim data show a treatment effect size that is still clinically meaningful.

There may be other “zones” defined in this setting, such as “Favorable” where the conditional power is still high enough to not yet warrant an increase in sample size (e.g. >0.8), and “Unfavorable” where the conditional power (e.g. <0.4) is lower than the Promising Zone (e.g. 0.4 - 0.8), such that you would not increase the sample size but it is still above the futility threshold such that you would not yet stop the study enrollment.
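
As a toy illustration of the zone logic (not an actual protocol rule), the sketch below maps an interim conditional power value to a recommended action; in a real design the promising-zone branch would recompute the smallest sample size that restores the target conditional power rather than simply jumping to a cap. The boundaries (0.10, 0.40, 0.80) and the cap are hypothetical placeholders for pre-specified protocol values.

```python
# Toy zone logic: map interim conditional power to an action. Boundaries and
# the sample size cap are hypothetical placeholders, not protocol values.
def zone_decision(cp, n_planned, n_max):
    if cp < 0.10:
        return "futility: recommend stopping enrollment", n_planned
    if cp < 0.40:
        return "unfavorable: continue, no sample size increase", n_planned
    if cp < 0.80:
        return "promising: increase sample size (capped)", n_max
    return "favorable: continue as planned", n_planned

for cp in (0.05, 0.25, 0.60, 0.90):
    action, n = zone_decision(cp, n_planned=300, n_max=450)
    print(f"CP = {cp:.2f} -> {action} (target N per arm = {n})")
```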

All of the above add potential complexity and costs to the study, as well as considerations of the impact on type I error and the need for adjustments. Thus, these issues would all be addressed during the study design phase and, as noted, all pre-defined in the approved study protocol.

2 Likes

Ah, so it’s because there remains a forward-looking component to ‘conditional power’ (manifested in synthetic/simulated future data) that the notion can still make sense?

In essence, that is correct. :grinning_face:

1 Like