Being still new to Bayesian analysis, and currently focusing more and more on platform trials and interim analyses, I thought this was interesting. Interested to hear others' thoughts on the paper!
Abstract:
Interim analyses are commonly used in clinical trials to enable early stopping for efficacy, futility, or safety. While their impact on frequentist operating characteristics is well studied and broadly understood, the effect of repeated Bayesian interim analyses - when conducted without appropriate multiplicity adjustment - remains an area of active debate. In this article, we provide both theoretical justification and numerical evidence illustrating how such analyses affect key inferential properties, including bias, mean squared error, the coverage probability of posterior credible intervals, false discovery rate, familywise error rate, and power. Our findings demonstrate that Bayesian interim analyses can significantly alter a trial's operating characteristics, even when the prior used for Bayesian inference is correctly specified and aligned with the data-generating process. Extensive simulation studies, covering a variety of endpoints, trial designs (single-arm and two-arm randomized controlled trials), and scenarios with both correctly specified and misspecified priors, support theoretical insights. Collectively, these results underscore the necessity of appropriate adjustment, thoughtful prior specification, and comprehensive evaluation to ensure valid and reliable inference in Bayesian adaptive trial designs.
I haven't read the entire paper yet, but the fundamental criteria that Frequentists and Bayesians use are related yet fundamentally different, and can't be unified except in special cases. The former are concerned with repeated sampling, while the latter are concerned with a specific finite sample, and with coherence. This is more of a problem with hypothesis testing itself than with estimation, where Frequentists and Bayesians have little to disagree upon.
I hope to read the papers over the weekend and maybe they will give some more subtle insights. For now, I think the following items are relevant:
Royall defends a likelihood approach to statistical evidence in this paper.
Royall, R. (2000). On the Probability of Observing Misleading Statistical Evidence. Journal of the American Statistical Association, 95(451), 760–768. https://doi.org/10.2307/2669456
Greenland, S. (2023). Divergence versus decision P-values: A distinction worth making in theory and keeping in practice: Or, how divergence P-values measure evidence even when decision P-values do not. Scand J Statist, 50(1), 54–88. https://doi.org/10.1111/sjos.12625
It is commonly assumed that realizations of such decision P-values always correspond to divergence P-values. But this need not be so: Decision P-values can violate intuitive single-sample coherence criteria where divergence P-values do not. It is thus argued that divergence and decision P-values should be carefully distinguished in teaching, and that divergence P-values are the relevant choice when the analysis goal is to summarize evidence rather than implement a decision rule.
For context, the main authors here are highly experienced in Bayesian trial design, which is why they know when and why to stress-test these approaches. I have personally developed, led and completed >5 investigator-initiated clinical trials with a Bayesian design, all with interim analyses, and many with the help of these authors. Two more such trials are ongoing or in development and will include Bayesian interim analyses.
The article is in alignment, e.g., with this post by @Stephen noting that Bayesian designs do not eliminate multiplicity concerns. There is no free lunch. But there is another key nuance: type I error control (for those that care about it) is not affected by simply reviewing the accumulating trial data; we can do that, for example, when auditing data quality. It is affected only when we plan to act on it. For example, if the goal for our primary endpoint is to refute the null that two therapies have no survival difference, then having an interim analysis that could stop the trial early for efficacy on this endpoint would affect type I error control because it would give us extra shots at claiming superiority. But if the interim analyses are for futility checks (or other purposes like sample size re-estimation, response-adaptive randomization, etc.) then there is no penalty for multiplicity. Which is why all the interim analyses for my trials are for these purposes and not to establish early efficacy. With that in mind, notice what the paper focuses on.
This does not mean, of course, that I am personally against all Bayesian interim analyses for early efficacy on the primary endpoint. Our recent investigation of the value that Bayesian interim analyses would have had in 230 randomized two-arm parallel oncology phase III trials suggested that early stopping rules informed by such analyses (either for early efficacy or futility) could have saved a median of $851 million cumulative total (up to $1.54 billion or as low as $614 million), or an average of $6.7 million per trial. In theory most Bayesian modeling could be done in a frequentist manner, but Bayes is a more natural tool for this purpose, which is why even those phase III oncology RCTs that used frequentist interim analyses could have saved $4.6 million per trial if they had used Bayes.
This is not surprising. The excellent trial design podcast "In the Interim…" regularly describes in an accessible manner the value of interim monitoring in trials (see the "Spending alpha" episode from a frequentist perspective), highlighting the natural use of Bayesian modeling for this purpose.
The paper mixes two ideas, which creates a great deal of confusion: choice of priors, and sequential assessment. These ideas need to be separated. Differences of opinion about priors will have a similar effect whether talking about interim analyses or final analyses. The paper would be clearer had it concentrated on sequential analysis only. When doing that it becomes clear that in the Bayesian paradigm, analyzing data at n=200 in a study that planned to enroll 200 patients gives exactly the same evidence as an early look at n=200 for a planned study of n=800. Posterior probabilities do not change interpretation because of stopping rules.
The authors wrote the manuscript from a standpoint that somehow "blesses" the planned final sample size, giving it special meaning that it never has. One researcher's interim analysis is another person's final analysis. All statisticians know that fixed target sample sizes are always derived using voodoo and are arbitrary in almost every way except for budgeting purposes.
Any study of the effect of multiple testing in the Bayesian world needs to respect Bayesian operating characteristics, discussed in detail here. For example, when considering the Bayesian operating characteristic for an assessment of treatment superiority, compute the posterior probability of superiority and stop the moment it exceeds a threshold such as 0.95. Then reveal the actual efficacy used to generate the data and see if the efficacy was positive. Among the simulated studies that stopped with sufficient evidence for efficacy, the proportion in which the true data-generating efficacy was positive is THE Bayesian operating characteristic for this situation. You will find, as I've simulated in the link given above, that this operating characteristic is perfectly preserved with infinitely many data looks. For example, if the decision threshold is P(efficacy > 0 | data, prior) > 0.95 you may find that the average probability of efficacy at the moment of study stopping is 0.96. If the sampling prior equals the analysis prior, you'll find that the proportion of trials with efficacy > 0 is also 0.96 (up to simulation error). The only relevant and interesting study of operating characteristics is what happens to decision accuracy when there is a mismatch between sampling (simulation) prior and analysis prior. You may find that the accuracy goes from 0.96 to 0.93, for example, validating the excellent performance of the Bayesian decision rule even under prior mismatch.
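For readers who want to try this kind of check themselves, here is a minimal sketch of the simulation described above. It is not the code behind the linked simulations. It assumes a normal outcome with known unit variance, a normal prior centered at zero, and a look after every 10 observations; the function name and all parameter values are made up for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_stopped_trials(n_trials=4000, n_max=200, step=10,
                            sampling_tau=0.5, analysis_tau=0.5,
                            threshold=0.95, seed=1):
    """Return (mean posterior prob at stopping, proportion truly positive, n stopped)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    probs_at_stop, truly_positive = [], []
    for _ in range(n_trials):
        # True effect for this trial, drawn from the sampling (simulation) prior.
        mu = rng.gauss(0.0, sampling_tau)
        total, n = 0.0, 0
        while n < n_max:
            # Sum of `step` new N(mu, 1) observations, drawn in one go.
            total += rng.gauss(step * mu, sqrt(step))
            n += step
            # Conjugate normal posterior under the analysis prior N(0, analysis_tau^2).
            precision = 1.0 / analysis_tau**2 + n
            post_mean = total / precision
            p_eff = std_normal.cdf(post_mean * sqrt(precision))  # P(mu > 0 | data)
            if p_eff > threshold:
                # Stop for efficacy; only now "reveal" whether mu was positive.
                probs_at_stop.append(p_eff)
                truly_positive.append(1.0 if mu > 0 else 0.0)
                break
    return mean(probs_at_stop), mean(truly_positive), len(probs_at_stop)


avg_prob, prop_positive, n_stopped = simulate_stopped_trials()
print(f"stopped: {n_stopped}, mean P(eff>0) at stop: {avg_prob:.3f}, "
      f"proportion with true effect > 0: {prop_positive:.3f}")
```

With matching priors the two printed numbers should agree up to Monte Carlo error. Setting `sampling_tau` and `analysis_tau` to different values is one simple way to explore the prior-mismatch question.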
The authors of the manuscript made a critical decision that Bayes is interested in hypothesis testing. It is not. Bayes deals with the probability that an assertion is true. Assertions are almost always about intervals of unknown treatment effects and have nothing to do with single points. This has everything to do with how you simulate the performance of a decision rule.
Bayes has nothing to do with \alpha. Any attempt to make it deal with \alpha will result in:
a hard-to-interpret mess
a frequentist procedure and not a Bayesian one
violation of the likelihood principle
\alpha will increase due to having the opportunity to make a decision that was not actually made
much more complex simulations needed
stopping much later for futility, resulting in wasted time and money
most importantly, a reduction in the probability that made decisions are correct
It is extremely important to note that P(made decision is correct) is an unconditional probability in the sense that it does not condition on any unobservable quantity such as the true treatment effect. It is the accuracy of the decision in the field, not an assessment of what might happen if the treatment were to be ignorable. \alpha is a pre-study concept that makes no use of real data in any way.
@Pavlos_Msaouel Thanks for the nice resources. I'm an avid listener of In the Interim as well, and try to read pretty much everything from Stephen.
Yes, of course. I found @f2harrell's approach to interim analyses by defining a threshold for relevant benefit, and then terminating if P(no relevant benefit) is very high, eye-opening. But I haven't found a use case in my research so far, though hopefully I will.
@f2harrell I'll have to ponder your approach for a while, as it's very foreign to my more frequentist mind.
Interesting article. Still processing and likely will for some time, but I'm a little surprised at some of the findings under the true data generating process. I do get a bit of a weird feeling in my gut re: how errors are defined and how they create conditional probabilities that lead to calibrated estimates not giving calibrated decisions, which I just have difficulty wrapping my head around. I'm more used to seeing these studies simulated in the way @f2harrell has laid out and then potentially repeated under a standard fixed-parameter frequentist approach to assess frequentist operating characteristics. It would be interesting to hear from the authors more about why it might make sense to try to get frequentist-style errors in an otherwise Bayesian simulation, given the conditioning leads to some weird findings.
Weird findings is correct. It is essential to evaluate Bayes on what Bayes is trying to do. Otherwise it's like saying to the winner of a 100 meter race that his victory doesn't count unless he also wins another race by running backwards.
First off, many thanks to everyone listening to the "In the Interim" podcast, on behalf of all of Berry! Suggestions for topics welcome!
I agree there are side points to the paper regarding prior mismatch, and their metrics may not be my metrics, but I'm ok with the main message of the paper that a Bayesian selecting a trial design can and should value fixed trials and trials with interims differently. This is a question of perspective. If I am viewing fixed existing data (analysis of a completed experiment, reading a research paper), then I no longer care about the experimental design. The data is the data. If I am a sponsor selecting a trial design, then different trial designs have different probabilities of generating informative data. This paper is written from that perspective. You could change out the metrics as you prefer and get similar conclusions.
I posted this analogy on LinkedIn:
consider two lottery tickets
A pays $0 with probability 90%, $1000 wp 10%
B pays $0 with probability 50%, $1000 wp 50%
After the drawing, if the two tickets have the same payout, you don't care which ticket you have. A paying $1000 is the same as B paying $1000. Similar to a Bayesian viewing data identically regardless of the design that generated it. However, prior to the drawing, you should certainly view the tickets differently, and strongly prefer B. Similar to how a Bayesian should view different experiments as having different probabilities of generating informative data (or lower costs, or whatever utilities matter).
It can't be the case that the only criterion for a design is "we conduct a valid Bayesian procedure with the final data". That would be satisfied using N=1 (or even 0!). We need our designs to produce information that allows us to treat patients better. And different designs have different value in doing that. If those different designs happen to get to the same place, then yes, patients get the same value. But in selecting a design we have to consider that some designs are more likely to get to a better place than other designs. I also posted a simple quantitative example of this on LinkedIn.
Hi Kert! I'm really glad you have joined datamethods.
I think there are two classes of Bayesian operating characteristics, and I think you have combined them in your description. The first is the correctness of decisions, for any direction (efficacy, harm, etc.) of decision. That is P(decision is correct | data, prior). The second is the Bayesian power, i.e., sensitivity of the Bayesian design to detect a relevant treatment effect, i.e., P(P(effect > epsilon | data, prior') > 0.95) where prior' is the subspace of the sampling (simulation) prior that corresponds to benefits not to miss. This will typically be the uncertainty distribution for the MCID. Dealing with Bayesian power is what prevents your N=1 example from happening.
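One way to read the Bayesian power definition above is sketched below. The assumptions are mine: a normal outcome with unit variance, a fixed sample size, a sampling prior N(0, tau^2) truncated to effects at or above a hypothetical MCID of 0.3, and the untruncated N(0, tau^2) as the analysis prior. The names `bayesian_power`, `mcid`, and `epsilon` are illustrative only.

```python
import random
from math import sqrt
from statistics import NormalDist


def bayesian_power(n, mcid=0.3, epsilon=0.0, tau=0.5,
                   threshold=0.95, n_sims=2000, seed=2):
    """Proportion of trials with P(effect > epsilon | data) > threshold,
    when true effects come from the 'benefits not to miss' part of the prior."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(n_sims):
        # Rejection sampling: sampling prior restricted to mu >= mcid.
        mu = rng.gauss(0.0, tau)
        while mu < mcid:
            mu = rng.gauss(0.0, tau)
        ybar = rng.gauss(mu, 1.0 / sqrt(n))      # observed mean of n N(mu, 1) values
        precision = 1.0 / tau**2 + n             # posterior precision
        post_mean = n * ybar / precision
        post_sd = sqrt(1.0 / precision)
        p_benefit = 1.0 - std_normal.cdf((epsilon - post_mean) / post_sd)
        if p_benefit > threshold:
            hits += 1
    return hits / n_sims


for n in (1, 50, 200):
    print(n, bayesian_power(n))
```

Under these assumptions a design with N=1 runs a perfectly valid Bayesian analysis but has essentially no Bayesian power, which is the point being made about why power has to enter design selection.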
My interpretation of your lottery ticket example is that you are emphasizing the role of tendencies when choosing a design. This is an excellent point because before we have any data collected, and when we don't have excellent data from a previous experiment, we must rely on tendencies. This is neither a Bayesian nor a frequentist idea, but the particular tendencies that you model will differ by paradigm. In your discussion about \alpha and \beta you are entertaining the notion that tendencies can be based on frequentist ideas, which rely heavily on unobservables. This is the most common approach but I do not adopt this approach. We can instead consider tendencies of purely Bayesian quantities and avoid unobservables as I describe below.
Perhaps a more compelling example than the lottery example, for just the present purpose, would be along the lines of this NFL problem which I'll modify for illustration. Suppose that you had two goals: to be correct about your team winning and to not spend too much time watching the game. One person bases his decision on the score at the end of the half. A second person continuously assesses the score and the current probability of ultimate victory, all throughout the first half. Both persons will make accurate decisions. The second person will make the decision more quickly on average. Continuous assessment by person 2 does not modify any relevant operating characteristics except for the expected sample size.
A Bayesian can select an experimental design using strictly Bayesian operating characteristics, and obtain excellent operating characteristics that matter. An example here shows how all goals in a set of goals can be achieved simultaneously, including correctness of decisions, Bayesian power, precision, and low expected sample size.
I need to elaborate on P(decision is correct | data, prior). The actual posterior probability computed in this situation is the probability the treatment doesn't work, whether or not we conclude that it does. Writing it out fully, what we are trying to prevent is concluding that the treatment effect is more than trivial (with triviality threshold epsilon) when the treatment effect is really < epsilon. The posterior error probability is P(effect < epsilon | data, analysis prior) when epsilon is drawn from the sampling (simulation) prior which may deviate from the analysis prior. Better notation is P(effect < epsilon | data, analysis prior, sampling prior), meaning that it is computed assuming the pre-specified analysis prior, while the universe of treatment effects we are sampling from is specified by the sampling prior. The purpose of this kind of fully pre-specified sensitivity analysis is to show that the study's conclusions are not too sensitive to the choice of analysis prior when they shouldn't be.
None of this is affected by multiple looks other than the fact that you need to carefully specify which conditions need to hold with which probabilities as shown in the example linked above. In that example, I put a minimum N requirement before looking for evidence of non-trivial efficacy. This requirement essentially says that "if we stop early with evidence for efficacy we must be able to have a reasonable precision (e.g., width of posterior distribution) of the amount of treatment benefit." If we stop early for inefficacy, this design jettisons any ability to well-estimate just how poor the treatment performs.
Big picture: Excellent designs are important to Bayesians, and they dictate how often we look at the data, make adaptations, etc. The goal of having an excellent pre-specified design does not change which operating characteristics we need to use.
Bayesians do not profit from sampling distributions except possibly in one situation: simulating tendencies of specific experimental designs. For that we have to simulate repeated (using 1000 - 10000) experiments and run the Bayesian procedure to compute the probabilities of correct decisions, Bayesian power, expected sample size, etc. How the simulations are done is markedly different from the frequentist approach because of not relying on unobservables during the analysis. Bayes doesn't ask for the probability of asserting an effect were the effect to magically be zero, i.e., when any assertion of effect is by definition wrong. Instead the sampling prior specifies the universe of effects and each simulated trial has its data created for a different effect size. Then the Bayesian no-unobservables analysis is run and then we finally reveal the effect in play to judge correctness of a decision based on posterior probability.
Instead of testing null hypotheses, the goal of Bayesian posterior inference is to uncover the hidden truth that generated the data, no matter what that truth was. An excellent example of the Bayesian "uncover the data generating mechanism as much as the data and prior will allow" philosophy is in the successful application of Bayes in neuroscience. It is easy to see which brain region is associated with a stimulus to the left big toe. To uncover which stimulus created a certain brain activation requires a reversal, with Bayes' rule to the rescue.
Consistently with the current literature, we confirmed that group-sequential Bayesian trials are not immune to Type I error rate inflation. Moreover, we showed that Bayesian operating characteristics such as the risk of erroneous conclusions and the informative value of an efficacy conclusion degrade as the number of interim analyses increases and are sensitive to prior divergence between stakeholders. While it is possible to control the frequentist operating characteristics of Bayesian trials with appropriate methods, it is not possible to guarantee such control for Bayesian operating characteristics because they are prior-dependent and different stakeholders may reasonably adopt different priors.
We consider trials with sequential enrollment and interim analyses at pre-specified sample sizes n_1, …, n_K, with a final analysis performed after N = 500 efficacy measures have been observed. We consider both a single final analysis and group-sequential designs, with K ∈ {1, 2, 5, 10, 20, 50, 100, 250, 500} the number of analyses.
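To get a feel for the frequentist side of this setup, the sketch below simulates trials with 500 observations under a true effect of exactly zero and checks, at equally spaced looks, whether an unadjusted one-sided threshold of 0.05 is crossed. This is not the paper's code. The normal outcome with unit variance, the equal spacing, and the function name are assumptions for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n_looks, n_total=500, e=0.05, n_sims=2000, seed=3):
    """Chance of crossing an unadjusted one-sided threshold at any look when mu = 0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - e)   # about 1.645 for e = 0.05
    step = n_total // n_looks                # observations added between looks
    false_positives = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _look in range(n_looks):
            total += rng.gauss(0.0, sqrt(step))   # sum of `step` N(0, 1) values
            n += step
            if total / sqrt(n) > z_crit:          # z-statistic at this look
                false_positives += 1
                break
    return false_positives / n_sims


for k in (1, 2, 5, 10, 20, 50):
    print(k, false_positive_rate(k))
```

The rate conditional on a zero effect grows with the number of looks, which is the frequentist quantity the quoted passage is about. Whether that quantity is the right yardstick for a Bayesian design is exactly what is being debated in this thread.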
This is an interesting coincidence, as I had just watched a video Scott Berry gave on this topic. I strongly suggest people interested in this watch it.
My initial take is that this is an entirely unrealistic simulation of a Bayesian trial. Perhaps @f2harrell could give an opinion. Even likelihoodists know that the likelihood can be misleading at small sample sizes. You can see this "Type I inflation" in their Figure I. Royall prefers the term "misleading evidence", and there is a bound (1/k) on the probability that the likelihood ratio exceeds a threshold k, thereby providing evidence that H > 0 when H \leq 0. That bound is closely related to the Tippett minimal p value test sometimes used in meta-analysis.
As Scott discussed in the video, sequential trials described as having an \alpha = 0.05 are almost always tested using 1 tail at 0.025. I take the authors at their word, and when they simulated a trial at 0.05, they used a 1 tail test (i.e., Z = 1.64), which doesn't represent practice as those who design studies describe them.
One thing that concerns me is the use of the p-value to compute the posterior, under certain assumptions. In Table 2, the authors state the following:
e_i : The frequentist concludes efficacy if p-value < e_i.
The Bayesian concludes efficacy if P(\mu > 0 | X_1 ... X_k) > 1 - e_i
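The two rules above coincide in one special case that is easy to check numerically: a normal mean with known variance and a flat prior, where P(\mu > 0 | data) equals one minus the one-sided p-value. The sketch below assumes that model (I am not claiming it is exactly what the authors computed) and also shows how a skeptical normal prior centered at zero pulls the posterior probability down. Function names are illustrative.

```python
from math import inf, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_sided_p(ybar, n, sigma=1.0):
    """One-sided p-value for H0: mu <= 0, normal data with known sigma."""
    z = ybar * sqrt(n) / sigma
    return 1.0 - STD_NORMAL.cdf(z)


def posterior_prob_positive(ybar, n, sigma=1.0, tau=inf):
    """P(mu > 0 | data) under a N(0, tau^2) prior; tau = inf means a flat prior."""
    prior_precision = 0.0 if tau == inf else 1.0 / tau**2
    precision = prior_precision + n / sigma**2
    post_mean = (n * ybar / sigma**2) / precision
    return STD_NORMAL.cdf(post_mean * sqrt(precision))


ybar, n = 0.2, 50
print("1 - p            :", 1.0 - one_sided_p(ybar, n))
print("flat prior       :", posterior_prob_positive(ybar, n))
print("skeptical tau=0.2:", posterior_prob_positive(ybar, n, tau=0.2))
```

With the flat prior the first two numbers match. With the skeptical prior the same data give a smaller posterior probability, so a smaller p-value is needed to clear the same posterior threshold.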
Having thought about this in the context of meta-analysis, I am familiar with their assumptions and the calculations. I discussed them in a number of threads (see below).
Interim p-values are almost invariably much lower than the 0.05 level used in their simulations. These adjustments to the p-value are entirely consistent with Bayes theorem.
Their assumptions essentially translate to a prior on the probit scale, instead of doing formal likelihood calculations. On the probit scale, Kulinskaya, Morgenthaler, and Staudte acknowledge that the observed Z score has a random component with a standard error of \pm 1.
I'd like to know what the actual likelihood looked like in those simulations.
Contrast this to a likelihood curve in this post. I'd bet a fairly large number of false discoveries in their simulations would have relatively flat likelihood functions that would not persuade anyone to stop the trial. Their "early stopping" is a function of how they compute the posterior, which is related to all of the flaws of p-values.
The following paper describes my intuition regarding their computation of the Bayesian posterior via a transformation of the p-value.
Wouter Kager, Ronald Meester (2026) On the relation between likelihood ratios and p-values for testing success probabilities of Bernoulli trials
It is well known that there is no direct one-to-one relation between p-values and likelihood ratios or Bayes factors, since their relation crucially involves the sample size n. We investigate their (asymptotic) relation in a coin-tossing context where the hypotheses of interest address the success probability of the coin, and where detailed computations are possible. This leads to useful insights in the nature of p-values and likelihood ratios. Our results imply, for instance, that under mild conditions, a p-value of 0.05 cannot correspond to a likelihood ratio larger than 7.5, for any hypothesis versus a null hypothesis that the success probability has a specific value. We also show it is unlikely one can obtain a large likelihood ratio by tossing a fair coin until the number of heads deviates from the mean by several standard deviations.
The Bayes factor bound 1 / (-e \times p \times \ln(p)) is the best case Bayes factor. For p = 0.05, the best case Bayes factor is no more than about 2.5 to 1 in favor of the alternative over the null.
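The arithmetic behind the "about 2.5 to 1" figure is quick to verify. The sketch below evaluates the bound for a few p-values; the function name is mine, and the restriction to p < 1/e is the usual validity condition for this bound, which is often attributed to Sellke, Bayarri, and Berger.

```python
from math import e, log


def bayes_factor_bound(p):
    """Upper bound 1 / (-e * p * ln p) on the Bayes factor against the null."""
    if not 0.0 < p < 1.0 / e:
        raise ValueError("the bound is only used for 0 < p < 1/e")
    return 1.0 / (-e * p * log(p))


for p in (0.05, 0.01, 0.005):
    print(p, round(bayes_factor_bound(p), 2))
```

For p = 0.05 this gives roughly 2.46, matching the figure quoted above.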
I re-read more closely a section of the paper. They used different variances to give weight to the prior, so my initial criticism (that their analysis was based on a naive uninformative prior) was mistaken.
I still maintain their "Bayesian computation" suffers from confusing the hypothesis H > 0 vs H \leq 0 with H = 0 vs H > 0. That is also discussed in the Kager and Meester paper.
They note:
Here, for u > 0, we recognize Φ(−u) as the 1-sided p-value for testing the null hypothesis H0. So we have found a one-to-one relation, which does not involve n, between the p-value and a likelihood ratio, after all. But this particular likelihood ratio does not consider the null hypothesis at all, and provides a rather unexpected Bayesian interpretation of the p-value.
It is odd to compute the equivalent of a LR for the sign, then argue that this is a Type 1 error by conditioning on an exact value of 0, when strictly speaking, the probability the true parameter is exactly 0, is itself 0.
The relevant alternative hypothesis for a one-sided p value is the parameter on the opposite side of the null that has the same absolute value as the observed estimate, i.e., what is the probability of seeing the observed sign when our estimate comes from a model where the true value has the opposite sign?
It is very easy to confuse these closely related hypotheses. I've done it countless times when working through calculations.
In a Bayesian interpretation, the p-value merely provides a lower bound on the probability that the observed sign is wrong.
Jeff Blume's intro on likelihood functions is also valuable:
Recall that the coverage probability of a CI can change even when the data, and the likelihood function, do not. The natural interpretation of an unadjusted CI as the "collection of hypotheses that are best supported by the data" is justified by the Law of Likelihood.
I suspect something similar is happening here, except that rejections are occurring at likelihood ratios close to the bound, where they might still give misleading evidence.
He continues:
The universal bound applies to both fixed sample size and sequential study designs. In a sequential study, the probability of observing misleading evidence does increase with the number of looks at the data. However, the amount by which it increases shrinks to zero as the sample size grows. As a result, the overall probability can remain bounded (Blume 2007). The probability of observing misleading evidence in a fixed sample size study is less than the corresponding probability in a sequential study, and both are less than 1/k. This is shown below, where the subscript on P denotes the "true" hypothesis.
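The ordering described in that passage can be checked with a small simulation. The sketch below assumes normal data with unit variance generated under \mu = 0, and counts how often the likelihood ratio for a hypothetical alternative \mu = 0.3 over \mu = 0 reaches k = 8, either at a single final look or at any of 200 looks. All names and parameter values are illustrative assumptions.

```python
import random
from math import log, sqrt


def prob_misleading(k=8.0, delta=0.3, n_total=200, n_looks=1,
                    n_sims=4000, seed=5):
    """P(LR for mu=delta vs mu=0 reaches k at some look) when mu = 0 is true."""
    rng = random.Random(seed)
    step = n_total // n_looks
    log_k = log(k)
    misleading = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _look in range(n_looks):
            total += rng.gauss(0.0, sqrt(step))   # data generated under mu = 0
            n += step
            # Log likelihood ratio for N(delta, 1) versus N(0, 1) after n observations.
            log_lr = delta * total - n * delta**2 / 2.0
            if log_lr >= log_k:
                misleading += 1
                break
    return misleading / n_sims


print("fixed n     :", prob_misleading(n_looks=1))
print("200 looks   :", prob_misleading(n_looks=200))
print("bound 1/k   :", 1.0 / 8.0)
```

Looking after every observation raises the probability of misleading evidence compared with a single look, but in this setup it stays below 1/k.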
Some interesting links on likelihood theory
Here is a nice primer on likelihood methods for epidemiologists with @Sander as one of the listed authors:
Some relevant threads:
The Kass paper on data translated likelihoods would be helpful to those working through this paper.
The logic error being made by authors who claim to show that Bayesian sequential procedures have problems with operating characteristics, besides avoiding decision theory, is the assumption that a Bayesian wants to also be a frequentist. In other words, they assume that a Bayesian should have a keen interest in the probability that something might have happened at another time in another dataset, instead of being happy with the probability that the treatment works given the current data. I think of a crazy analogy in the TV show The Office where at a night out party one team beat the other in a contest and the other team (led by the boss) then says that to get the prize the initially winning team has to also throw a shoe over the roof of the building the farthest.
Bringing in irrelevant considerations can make any decision rule look bad. And why is it that frequentists are never demanded to demonstrate excellent Bayesian operating characteristics?
But aren't they also evaluating Bayesian operating characteristics? Is there a concrete issue with the way they are evaluating / simulating these Bayesian OCs?
No! Bayesian OC = correctness of decision under a possibly mismatched prior. Has nothing to do with \alpha. See simple true Bayesian simulations in the link I provided. One example shows the extreme conservatism of frequentist group sequential methods, which make decisions far, far too late.
P(correct decision) is an unconditional (except on the data) probability. \alpha is a conditional probability made worse by conditioning on the unknowable.
I think this sums up what I was struggling to put into words. Surprise under the H=0 is not relevant when you degrade the procedure into merely giving information regarding the sign.
Their "Bayesian computation" of the posterior is based upon a one-sided p-value. The competitor in their formulation is not 0, but the value that has the same magnitude as the observed estimate, but the opposite sign.
Their critique is based upon the magnitude of the difference from 0. Information on the sign of the effect does not rule out that the actual estimate is close to zero, so it isn't relevant.
Regarding magnitude of a group difference, the pre-study skepticism about the effect is a concept that is independent of the number of looks and stopping rule. So you apply the same skepticism (prior) at any point in a sequential trial. It will draw back the posterior median/pseudomedian/mean/mode by the right amount, with more pullback at early looks. Bias upon repeated sampling is not at the center of evidence interpretation.
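The "more pullback at early looks" behavior is easy to see in the conjugate normal case. The sketch below assumes a N(0, prior_sd^2) skeptical prior and known outcome variance; the function name and the specific numbers are hypothetical.

```python
def skeptical_posterior_mean(ybar, n, sigma=1.0, prior_sd=0.25):
    """Posterior mean and shrinkage factor for a N(0, prior_sd^2) prior.

    The posterior mean is shrink * ybar, where shrink = n / (n + sigma^2 / prior_sd^2).
    """
    shrink = n / (n + sigma**2 / prior_sd**2)
    return shrink * ybar, shrink


# Same observed mean difference at an early and a late look.
for n in (20, 50, 200, 800):
    post_mean, shrink = skeptical_posterior_mean(0.5, n)
    print(f"n={n:4d}  shrinkage={shrink:.3f}  posterior mean={post_mean:.3f}")
```

The same prior is applied at every look, and the data simply outweigh it more as n grows, so no look-dependent adjustment is involved.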
[quote="f2harrell, post:17, topic:28342"]
So you apply the same skepticism (prior) at any point in a sequential trial. It will draw back the posterior median/pseudomedian/mean/mode by the right amount, with more pullback at early looks. [/quote]
This is what I found when I did similar computations a long time ago. Translating any skeptical prior into a statistic computed under a normal model always leads to needing the observed data to fall into a percentile much lower than 0.05. I could not understand their conclusion, despite their formulas looking correct.
It appears people who come to this conclusion (ie. Bayesian trials have problems with early stopping) are confusing 2 different but related questions.
Their conclusion is based on a hodgepodge mixture of frequentist and Bayesian ideas, so it's hard for anyone to understand. Bottom line: If you're going to be Bayesian, be Bayesian.