Is Repeated Bayesian Interim Analysis Consequence-Free?

Johannes_Schwenke · August 20, 2025, 10:56am

Being still new to Bayesian analysis, and currently focusing more and more on platform trials and interim analyses, I thought this was interesting. Interested to hear others’ thoughts on the paper!

Abstract:

Interim analyses are commonly used in clinical trials to enable early stopping for efficacy, futility, or safety. While their impact on frequentist operating characteristics is well studied and broadly understood, the effect of repeated Bayesian interim analyses - when conducted without appropriate multiplicity adjustment - remains an area of active debate. In this article, we provide both theoretical justification and numerical evidence illustrating how such analyses affect key inferential properties, including bias, mean squared error, the coverage probability of posterior credible intervals, false discovery rate, familywise error rate, and power. Our findings demonstrate that Bayesian interim analyses can significantly alter a trial’s operating characteristics, even when the prior used for Bayesian inference is correctly specified and aligned with the data-generating process. Extensive simulation studies, covering a variety of endpoints, trial designs (single-arm and two-arm randomized controlled trials), and scenarios with both correctly specified and misspecified priors, support theoretical insights. Collectively, these results underscore the necessity of appropriate adjustment, thoughtful prior specification, and comprehensive evaluation to ensure valid and reliable inference in Bayesian adaptive trial designs.

R_cubed · August 20, 2025, 1:28pm

I haven’t read the entire paper yet, but the criteria that Frequentists and Bayesians use are related yet fundamentally different, and can’t be unified except in special cases. The former are concerned with repeated sampling, while the latter are concerned with a specific finite sample, and with coherence. This is more of a problem with hypothesis testing itself than with estimation, where Frequentists and Bayesians have little to disagree upon.

I hope to read the papers over the weekend and maybe they will give some more subtle insights. For now, I think the following items are relevant:

Royall defends a likelihood approach to statistical evidence in this paper.

Royall, R. (2000). On the Probability of Observing Misleading Statistical Evidence. Journal of the American Statistical Association, 95(451), 760–768. https://doi.org/10.2307/2669456

Greenland, S. (2023). Divergence versus decision P-values: A distinction worth making in theory and keeping in practice: Or, how divergence P-values measure evidence even when decision P-values do not. Scand J Statist, 50(1), 54–88. https://doi.org/10.1111/sjos.12625

It is commonly assumed that realizations of such decision P-values always correspond to divergence P-values. But this need not be so: Decision P-values can violate intuitive single-sample coherence criteria where divergence P-values do not. It is thus argued that divergence and decision P-values should be carefully distinguished in teaching, and that divergence P-values are the relevant choice when the analysis goal is to summarize evidence rather than implement a decision rule.

Pavlos_Msaouel · August 20, 2025, 2:23pm

For context, the main authors here are highly experienced in Bayesian trial design, which is why they know when and why to stress-test these approaches. I have personally developed, led and completed >5 investigator-initiated clinical trials with a Bayesian design, all with interim analyses, and many with the help of these authors. Two more such trials are ongoing or in development and will include Bayesian interim analyses.

The article is in alignment, e.g., with this post by @Stephen noting that Bayesian designs do not eliminate multiplicity concerns. There is no free lunch. But there is another key nuance: type I error control (for those who care about it) is not affected by simply reviewing the accumulating trial data – we can do that for example when auditing data quality. It is affected only when we plan to act on it. For example, if our primary objective is to refute the null that two therapies have no difference in survival, then an interim analysis that could stop the trial early for efficacy on this endpoint would affect type I error control because it would give us extra shots at claiming superiority. But if the interim analyses are for futility checks (or other purposes like sample size re-estimation, response adaptive randomization etc) then there is no penalty for multiplicity. Which is why all the interim analyses for my trials are for these purposes and not to establish early efficacy. With that in mind, notice what the paper focuses on

This does not mean of course that I am personally against all Bayesian interim analyses for early efficacy on the primary endpoint. Our recent investigation of the value that Bayesian interim analyses would have had in 230 randomized two-arm parallel oncology phase III trials suggested that early stopping rules informed by such analyses (either for early efficacy or futility) could have saved a median of $851 million in cumulative total (as high as $1.54 billion or as low as $614 million), or an average of $6.7 million per trial. In theory most Bayesian modeling could be done in a frequentist manner, but Bayes is a more natural tool for this purpose, which is why even those phase III oncology RCTs that used frequentist interim analyses could have saved $4.6 million per trial if they had used Bayes.

This is not surprising. The excellent trial design podcast “In the Interim…” regularly describes in an accessible manner the value of interim monitoring in trials (see the “Spending alpha” episode from a frequentist perspective) highlighting the natural use of Bayesian modeling for this purpose.

f2harrell · August 20, 2025, 3:19pm

The paper mixes two ideas, which creates a great deal of confusion: choice of priors, and sequential assessment. These ideas need to be separated. Differences of opinion about priors will have a similar effect whether talking about interim analyses or final analyses. The paper would have been clearer had it concentrated on sequential analysis only. When doing that it becomes clear that in the Bayesian paradigm, analyzing data at n=200 in a study that planned to enroll 200 patients gives exactly the same evidence as an early look at n=200 for a planned study of n=800. Posterior probabilities do not change interpretation because of stopping rules.

The authors wrote the manuscript from a standpoint that somehow “blesses” the planned final sample size, giving it special meaning that it never has. One researcher’s interim analysis is another person’s final analysis. All statisticians know that fixed target sample sizes are always derived using voodoo and are arbitrary in almost every way except for budgeting purposes.

Any study of the effect of multiple testing in the Bayesian world needs to respect Bayesian operating characteristics, discussed in detail here. For example, when considering the Bayesian operating characteristic for an assessment of treatment superiority, compute the posterior probability of superiority and stop the moment it exceeds a threshold such as 0.95. Then reveal the actual efficacy used to generate the data and see if the efficacy was positive. Among the simulated studies that stopped with sufficient evidence for efficacy, the proportion in which the true data-generating efficacy was positive is THE Bayesian operating characteristic for this situation. You will find, as I’ve simulated in the link given above, that this operating characteristic is perfectly preserved with infinitely many data looks. For example, if the decision threshold is P(efficacy > 0 | data, prior) > 0.95, you may find that the average probability of efficacy at the moment of study stopping is 0.96. If the sampling prior equals the analysis prior, you’ll find that the proportion of trials with efficacy > 0 is exactly 0.96. The only relevant and interesting study of operating characteristics is what happens to decision accuracy when there is a mismatch between sampling (simulation) prior and analysis prior. You may find that the accuracy goes from 0.96 to 0.93, for example, validating the excellent performance of the Bayesian decision rule even under prior mismatch.
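The simulation described above can be sketched in a few lines of Python. This is not the code from the linked simulation. It is a minimal illustration under assumptions chosen only for brevity: a normal-normal model with known unit variance, the same N(0, 0.5^2) prior for sampling and analysis, a look after every observation up to n=200, and hypothetical function names.

```python
import math
import random

def post_prob_positive(total, n, prior_sd):
    """P(effect > 0 | data) for a N(0, prior_sd^2) prior and N(effect, 1) data."""
    precision = 1.0 / prior_sd**2 + n          # posterior precision
    post_mean = total / precision              # prior mean is 0, sigma is 1
    z = post_mean * math.sqrt(precision)       # posterior mean / posterior sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_calibration(n_trials=2000, max_n=200, threshold=0.95,
                         prior_sd=0.5, seed=1):
    rng = random.Random(seed)
    probs_at_stop, truth_at_stop = [], []
    for _ in range(n_trials):
        effect = rng.gauss(0.0, prior_sd)      # draw the truth from the sampling prior
        total = 0.0
        for n in range(1, max_n + 1):          # look after every observation
            total += rng.gauss(effect, 1.0)
            p = post_prob_positive(total, n, prior_sd)
            if p > threshold:                  # stop for efficacy
                probs_at_stop.append(p)
                truth_at_stop.append(effect > 0)   # reveal the truth
                break
    k = len(probs_at_stop)
    return sum(probs_at_stop) / k, sum(truth_at_stop) / k, k

if __name__ == "__main__":
    mean_post, prop_true, k = simulate_calibration()
    print(f"stopped trials: {k}")
    print(f"mean posterior probability at stopping: {mean_post:.3f}")
    print(f"proportion with truly positive effect:  {prop_true:.3f}")
```

With matched priors the two printed proportions should agree up to simulation error, despite the look after every observation.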

The authors of the manuscript made a critical decision that Bayes is interested in hypothesis testing. It is not. Bayes deals with the probability that an assertion is true. Assertions are almost always about intervals of unknown treatment effects and have nothing to do with single points. This has everything to do with how you simulate the performance of a decision rule.

Bayes has nothing to do with \alpha. Any attempt to make it deal with \alpha will result in:

- a hard-to-interpret mess
- a frequentist procedure and not a Bayesian one
- violation of the likelihood principle
- an increase in \alpha due to having the opportunity to make a decision that was not actually made
- much more complex simulations needed
- stopping much later for futility, resulting in wasted time and money
- most importantly, a reduction in the probability that made decisions are correct

It is extremely important to note that P(made decision is correct) is an unconditional probability in the sense that it does not condition on any unobservable quantity such as the true treatment effect. It is the accuracy of the decision in the field, not an assessment of what might happen if the treatment were to be ignorable. \alpha is a pre-study concept that makes no use of real data in any way.

Johannes_Schwenke · August 21, 2025, 3:16pm

@Pavlos_Msaouel Thanks for the nice resources. I’m an avid listener of In the Interim as well, and try to read pretty much everything from Stephen.

Yes, of course. I found @f2harrell’s approach to interim analyses by defining a threshold for relevant benefit, and then terminating if P(no relevant benefit) is very high, eye-opening. But I haven’t found a use case in my research so far, though hopefully I will.

@f2harrell I’ll have to ponder your approach for a while, as it’s very foreign to my more frequentist mind.

timdisher · August 21, 2025, 4:22pm

Interesting article. Still processing and likely will for some time, but I’m a little surprised at some of the findings under the true data generating process. I do get a bit of a weird feeling in my gut re: how errors are defined and how they create conditional probabilities that lead to calibrated estimates not giving calibrated decisions, which I just have difficulty wrapping my head around. I’m more used to seeing these studies simulated in the way @f2harrell has laid out and then potentially repeated under a standard fixed-parameter frequentist approach to assess frequentist operating characteristics. It would be interesting to hear from the authors more about why it might make sense to try to get frequentist-style errors in an otherwise Bayesian simulation, given the conditioning leads to some weird findings.

f2harrell · August 21, 2025, 4:26pm

Details with a comprehensive example are at 9 Bayesian Clinical Trial Design – Introduction to Bayes for Evaluating Treatments

Weird findings is correct. It is essential to evaluate Bayes on what Bayes is trying to do. Otherwise it’s like saying to the winner of a 100 meter race that his victory doesn’t count unless he also wins another race by running backwards.

kertviele · August 25, 2025, 2:21pm

First off, many thanks to everyone listening to the “In the Interim” podcast, on behalf of all of Berry! Suggestions for topics welcome!

I agree there are side points to the paper regarding prior mismatch, and their metrics may not be my metrics, but I’m ok with the main message of the paper that a Bayesian selecting a trial design can and should value fixed trials and trials with interims differently. This is a question of perspective. If I am viewing fixed existing data (analysis of a completed experiment, reading a research paper), then I no longer care about the experimental design. The data is the data. If I am a sponsor selecting a trial design, then different trial designs have different probabilities of generating informative data. This paper is written from that perspective. You could change out the metrics as you prefer and get similar conclusions.

I posted this analogy on LinkedIn. Consider two lottery tickets:
A pays $0 with probability 90%, $1000 with probability 10%
B pays $0 with probability 50%, $1000 with probability 50%

After the drawing, if the two tickets have the same payout, you don’t care which ticket you have. A paying $1000 is the same as B paying $1000. This is similar to a Bayesian viewing data identically regardless of the design that generated it. However, prior to the drawing, you should certainly view the tickets differently, and strongly prefer B. This is similar to how a Bayesian should view different experiments as having different probabilities of generating informative data (or lower costs, or whatever utilities matter).

It can’t be the case that the only criterion for a design is “we conduct a valid Bayesian procedure with the final data”. That would be satisfied using N=1 (or even 0!). We need our designs to produce information that allows us to treat patients better. And different designs have different value in doing that. If those different designs happen to get to the same place, then yes, patients get the same value. But in selecting a design we have to consider that some designs are more likely to get to a better place than other designs. I also posted a simple quantitative example of this on LinkedIn.

Pavlos_Msaouel · August 25, 2025, 2:35pm

The topic of this thread would make an interesting podcast

f2harrell · August 25, 2025, 4:28pm

Hi Kert! I’m really glad you have joined datamethods.

I think there are two classes of Bayesian operating characteristics, and I think you have combined them in your description. The first is the correctness of decisions, for any direction (efficacy, harm, etc.) of decision. That is P(decision is correct | data, prior). The second is the Bayesian power, i.e., sensitivity of the Bayesian design to detect a relevant treatment effect, i.e., P(P(effect > epsilon | data, prior’) > 0.95) where prior’ is the subspace of the sampling (simulation) prior that corresponds to benefits not to miss. This will typically be the uncertainty distribution for the MCID. Dealing with Bayesian power is what prevents your N=1 example from happening.
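As a rough illustration of Bayesian power, the sketch below estimates P(P(effect > epsilon | data, prior) > 0.95) by simulation for a fixed sample size. Everything specific here is an assumption made for illustration: normal data with unit variance, a N(0, 0.5^2) analysis prior, epsilon = 0.1, a hypothetical N(0.4, 0.05^2) distribution standing in for the "benefits not to miss" part of the sampling prior, and the function names.

```python
import math
import random

def post_prob_exceeds(total, n, epsilon, prior_sd=0.5):
    """P(effect > epsilon | data) for a N(0, prior_sd^2) prior and N(effect, 1) data."""
    precision = 1.0 / prior_sd**2 + n
    post_mean = total / precision
    z = (post_mean - epsilon) * math.sqrt(precision)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bayesian_power(n, epsilon=0.1, mcid_mean=0.4, mcid_sd=0.05,
                   threshold=0.95, n_sims=2000, seed=2):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Draw an effect from the assumed "benefit not to miss" distribution.
        effect = rng.gauss(mcid_mean, mcid_sd)
        # The sum of n independent N(effect, 1) observations is N(n*effect, n).
        total = rng.gauss(n * effect, math.sqrt(n))
        if post_prob_exceeds(total, n, epsilon) > threshold:
            hits += 1
    return hits / n_sims

if __name__ == "__main__":
    for n in (1, 50, 200):
        print(f"n={n:3d}  Bayesian power ~ {bayesian_power(n):.3f}")
```

Under these assumptions a design with n=1 has essentially no Bayesian power, which is the sense in which a power requirement rules out the N=1 design.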

My interpretation of your lottery ticket example is that you are emphasizing the role of tendencies when choosing a design. This is an excellent point because before we have any data collected, and when we don’t have excellent data from a previous experiment, we must rely on tendencies. This is neither a Bayesian nor a frequentist idea, but the particular tendencies that you model will differ by paradigm. In your discussion about \alpha and \beta you are entertaining the notion that tendencies can be based on frequentist ideas, which rely heavily on unobservables. This is the most common approach but I do not adopt this approach. We can instead consider tendencies of purely Bayesian quantities and avoid unobservables as I describe below.

Perhaps a more compelling example than the lottery example, for just the present purpose, would be along the lines of this NFL problem which I’ll modify for illustration. Suppose that you had two goals: to be correct about your team winning and to not spend too much time watching the game. One person bases his decision on the score at the end of the half. A second person continuously assesses the score and the current probability of ultimate victory, all throughout the first half. Both persons will make accurate decisions. The second person will make the decision more quickly on average. Continuous assessment by person 2 does not modify any relevant operating characteristics except for the expected sample size.
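A small simulation can mimic the two observers: one design looks only once at n=200, the other looks after every observation. This is a sketch under illustrative assumptions (normal data with unit variance, a N(0, 0.5^2) prior used for both sampling and analysis, a 0.95 threshold, hypothetical function names), not a reproduction of the linked NFL example.

```python
import math
import random

def post_prob_positive(total, n, prior_sd=0.5):
    """P(effect > 0 | data) for a N(0, prior_sd^2) prior and N(effect, 1) data."""
    precision = 1.0 / prior_sd**2 + n
    z = (total / precision) * math.sqrt(precision)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def run_design(look_times, n_trials=2000, max_n=200, threshold=0.95,
               prior_sd=0.5, seed=3):
    """Return (accuracy among efficacy decisions, mean sample size)."""
    rng = random.Random(seed)
    looks = set(look_times)
    sizes, correct = [], []
    for _ in range(n_trials):
        effect = rng.gauss(0.0, prior_sd)
        total, used = 0.0, max_n               # run to max_n unless stopped
        for n in range(1, max_n + 1):
            total += rng.gauss(effect, 1.0)
            if n in looks and post_prob_positive(total, n, prior_sd) > threshold:
                correct.append(effect > 0)     # was the efficacy decision right?
                used = n
                break
        sizes.append(used)
    return sum(correct) / len(correct), sum(sizes) / len(sizes)

if __name__ == "__main__":
    acc_single, n_single = run_design([200])
    acc_cont, n_cont = run_design(range(1, 201))
    print(f"single look: accuracy {acc_single:.3f}, mean n {n_single:.1f}")
    print(f"continuous:  accuracy {acc_cont:.3f}, mean n {n_cont:.1f}")
```

In this setup both designs keep the accuracy of efficacy decisions above the 0.95 threshold on average, while the continuously monitored design uses fewer observations on average.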

A Bayesian can select an experimental design using strictly Bayesian operating characteristics, and obtain excellent operating characteristics that matter. An example here shows how all goals in a set of goals can be achieved simultaneously, including correctness of decisions, Bayesian power, precision, and low expected sample size.

I need to elaborate on P(decision is correct | data, prior). The actual posterior probability computed in this situation is the probability the treatment doesn’t work, whether or not we conclude that it does. Writing it out fully, what we are trying to prevent is concluding that the treatment effect is more than trivial (with triviality threshold epsilon) when the treatment effect is really < epsilon. The posterior error probability is P(effect < epsilon | data, analysis prior) when the effect is drawn from the sampling (simulation) prior, which may deviate from the analysis prior. Better notation is P(effect < epsilon | data, analysis prior, sampling prior), meaning that it is computed assuming the pre-specified analysis prior, while the universe of treatment effects we are sampling from is specified by the sampling prior. The purpose of this kind of fully pre-specified sensitivity analysis is to show that the study’s conclusions are not too sensitive to the choice of analysis prior when they shouldn’t be.
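The distinction between analysis prior and sampling prior can be made concrete with a small simulation that compares the error probability claimed by the posterior with the error rate actually realized. This is only a sketch: the normal model with unit variance, epsilon = 0.1, looks every 10 observations up to n=300, the specific priors, and the function names are all assumptions made for illustration.

```python
import math
import random

def prob_effect_below(total, n, epsilon, prior_mean, prior_sd):
    """P(effect < epsilon | data, analysis prior) for N(effect, 1) data."""
    precision = 1.0 / prior_sd**2 + n
    post_mean = (prior_mean / prior_sd**2 + total) / precision
    z = (epsilon - post_mean) * math.sqrt(precision)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def error_rates(sampling, analysis, epsilon=0.1, look_every=10, max_n=300,
                max_error=0.05, n_trials=2000, seed=4):
    """sampling and analysis are (mean, sd) pairs of normal priors.

    Returns (mean claimed error probability, realized error rate,
    number of trials that declared non-trivial efficacy).
    """
    rng = random.Random(seed)
    claimed, wrong = [], []
    for _ in range(n_trials):
        effect = rng.gauss(*sampling)          # truth comes from the sampling prior
        total = 0.0
        for n in range(look_every, max_n + 1, look_every):
            # Sum of look_every new observations, each N(effect, 1).
            total += rng.gauss(look_every * effect, math.sqrt(look_every))
            p_err = prob_effect_below(total, n, epsilon, *analysis)
            if p_err < max_error:              # declare non-trivial efficacy
                claimed.append(p_err)
                wrong.append(effect < epsilon) # reveal the truth
                break
    k = len(claimed)
    return sum(claimed) / k, sum(wrong) / k, k

if __name__ == "__main__":
    print("matched:   ", error_rates((0.0, 0.5), (0.0, 0.5)))
    print("mismatched:", error_rates((0.0, 0.3), (0.0, 0.5)))
```

When the two priors match, the claimed and realized error rates should agree up to simulation error; the mismatched run shows how far they drift apart for one particular, arbitrarily chosen mismatch.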

None of this is affected by multiple looks other than the fact that you need to carefully specify which conditions need to hold with which probabilities as shown in the example linked above. In that example, I put a minimum N requirement before looking for evidence of non-trivial efficacy. This requirement essentially says that “if we stop early with evidence for efficacy we must be able to have a reasonable precision (e.g., width of posterior distribution) of the amount of treatment benefit.” If we stop early for inefficacy, this design jettisons any ability to well-estimate just how poorly the treatment performs.

Big picture: Excellent designs are important to Bayesians, and they dictate how often we look at the data, make adaptations, etc. The goal of having an excellent pre-specified design does not change which operating characteristics we need to use.

Bayesians do not profit from sampling distributions except possibly in one situation: simulating tendencies of specific experimental designs. For that we have to simulate repeated experiments (1,000 to 10,000 of them) and run the Bayesian procedure to compute the probabilities of correct decisions, Bayesian power, expected sample size, etc. How the simulations are done is markedly different from the frequentist approach because of not relying on unobservables during the analysis. Bayes doesn’t ask for the probability of asserting an effect were the effect to magically be zero, i.e., when any assertion of effect is by definition wrong. Instead the sampling prior specifies the universe of effects and each simulated trial has its data created for a different effect size. Then the Bayesian no-unobservables analysis is run and then we finally reveal the effect in play to judge correctness of a decision based on posterior probability.

Instead of testing null hypotheses, the goal of Bayesian posterior inference is to uncover the hidden truth that generated the data, no matter what that truth was. An excellent example of the Bayesian “uncover the data generating mechanism as much as the data and prior will allow” philosophy is in the successful application of Bayes in neuroscience. It is easy to see which brain region is associated with a stimulus to the left big toe. To uncover which stimulus created a certain brain activation requires a reversal, with Bayes’ rule to the rescue.