Interim Analysis & "Alpha Spending" in RCTs using Bayesian Analyses



Some folks already know - I’m a trained frequentist statistician who has become progressively more Bayesian in philosophy, though my technical knowledge is still very much a work in progress, and many of the trials I work on are analyzed using frequentist methods.

I would like to start proposing Bayesian methods (as I acquire the relevant knowledge, where appropriate) but one thing that I have a difficult time communicating to others is the alpha-spending penalty for multiple interim looks required by frequentist analyses and why that does not apply to interim analyses in Bayesian statistical approaches.

I feel like I understand why this is true, but I really struggle to communicate or explain it well. Would a few card-carrying Bayesian trialists be willing to explain this in relatively lay language so I can see if one of those phrasings “clicks” a little bit more with me and, hopefully, my future collaborators?


This is an excellent question and an area where we need to create more educational material. The best way to explain the irrelevance of multiplicity adjustments for sequential Bayesian analysis is simple simulations as done here, and the R code provided there exposes what is going on.

If the “population of unknown treatment effects”, a.k.a. the prior distribution, is the same one used for the analyses, then Bayesian posterior probabilities computed at the moment of stopping the trial are perfectly calibrated. In other words, if you simulate unknown efficacies under a certain prior and use that prior also for the analysis, all is well. If the analysis prior is more conservative than that, all is also well. All a Bayesian has to show is that posterior probabilities are well calibrated to the decision maker. The stopping rule and data-look schedule are irrelevant.

You’ll see a nice byproduct of Bayes in that simulation: if you stop early for efficacy, the posterior mean is perfectly calibrated. Contrast that with the frequentist point estimate, which is biased away from the null when you stop early.

The code exposes another very important point. Frequentist null hypothesis testing is concerned with controlling how often you assert an effect when none exists. In other words, if there is no effect, how often do we get a signal inconsistent with that one value? Bayesian analysis is concerned with attempting to recover the unknown efficacy generating the data, no matter what amount of efficacy exists.

Perhaps the following point should be communicated to collaborators before the points above. Though often worried about, type I error probabilities \alpha, like power, are really pre-study concepts, not “how do we interpret results at the end” concepts. And less controversially, \alpha is not a false positive probability at all. This confusion has prevented many investigators (and statisticians) from seeing things clearly. \alpha is a conditional probability (conditional on the effect being exactly zero). What event or condition is it a probability of? It is the probability of making an assertion of efficacy; it is not the probability that the treatment doesn’t work. Another way of saying this is that \alpha is a probability about data, not a probability about a parameter.

So in the task of trying to preserve \alpha, researchers and statisticians have lost sight of the fact that type I errors are not really errors in the sense we need them to be; they are just probabilities of assertions. Because of that, Bayesian analysis does not need to be concerned with them.


A building block of frequentism is the comparison of the data to a hypothetical sampling distribution. If you look at your data after every 100 new data points and then decide to go further or to stop, you should compare those data to a sampling distribution built from countless hypothetical samples drawn in that specific way; you can’t just use default p-values. Bayesian analysis does not have this problem because no comparison is made with a hypothetical distribution resulting from repeated sampling.
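The size of that problem is easy to demonstrate with a small simulation (a Python sketch; the look schedule and names are illustrative): under a true null, apply an unadjusted two-sided z-test at each interim look and count how often any look rejects.

```python
import random

def any_rejection_rate(n_sims=4000, looks=(100, 200, 300, 400, 500),
                       z_crit=1.96, seed=2):
    """Under a true null (mean 0, unit variance), test at each interim look
    with an unadjusted two-sided z-test and record whether ANY look
    rejects at the nominal 5% level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for look in looks:
            while n < look:
                total += rng.gauss(0.0, 1.0)
                n += 1
            z = total / n ** 0.5          # z-statistic at this look
            if abs(z) > z_crit:           # naive, unadjusted test
                rejections += 1
                break
    return rejections / n_sims
```

With five looks the overall rejection rate comes out well above the nominal 5% (around 14% in setups like this one), which is exactly the inflation that alpha-spending schemes are designed to remove.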

If you read “error control” as “minimizing the risk of drawing wrong conclusions from the data”, in Bayesian statistics the choice of the prior plays that role, I’d say.


That’s a wonderful explanation. I love it.

The only problem is, our collective quantitative literacy in medicine is not very strong, and I fear it would be lost on the folks I am trying to explain it to. But I will definitely keep this in mind and try to figure out a way to spell this explanation out even more clearly.


What do people think about this sports analogy? The frequentist-sampling approach to making a decision about whether to turn off the TV when watching an American football game (because the team that is ahead in the score is very likely to win the game) could proceed as follows. Your team is ahead by 10 points with 13 minutes left in the game. Of all the games where a team lost or was tied, what proportion of them had that team ahead by at least 10 points with 13 minutes to go? You decide to keep watching, and there is still a 10-point difference with 11 minutes to go. Now consider the proportion of games where a team lost or was tied in which it was ahead by at least 10 at either 13 or 11 minutes to go.

The frequentist football watcher must compute proportions that involve multiple looks because he is not very interested in the probability that his team will win. He is interested in long-run operating characteristics over many games, and his notion of an “error” is to proclaim, at any point during a game that his team ultimately tied or lost, “My team will win. The game is decided.” The long-run operating characteristic he wishes to control is the proportion of games he watches (at all) in which his team ultimately did not win but was ahead by 10 points at any of the times he might have looked.

Contrast that with the Bayes-based decision. Compute the proportion of times a team ahead by 10 with 13 minutes to go ultimately won the game. Two minutes later, compute this proportion again. Compute it as often as you like. Whenever you decide to turn off the TV, you just consider the last computed proportion, which completely overrides all previously computed proportions. The only way a Bayesian can cheat in such a situation is to dislike the current probability, ignore it, and revert to a previously calculated probability even though new information is available.
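The football logic can be mimicked with a deliberately crude Monte Carlo model (an assumption purely for illustration: the lead moves by -3, 0, or +3 points in each remaining minute, with equal probability):

```python
import random

def win_prob(lead, minutes_left, n_sims=20000, seed=7):
    """Estimate P(current leader wins) given the lead and minutes remaining,
    under a toy random-walk scoring model (not real football data)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        cur = lead
        for _ in range(minutes_left):
            cur += rng.choice((-3, 0, 3))   # toy per-minute score swing
        wins += cur > 0
    return wins / n_sims
```

Whenever you recompute, only the current lead and the time remaining enter the calculation; the estimate with 2 minutes left simply supersedes the one made at 13 minutes, with no penalty for having looked before.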

I would love to find a simpler example.

Much of this is about forward conditioning vs. backwards time- or information-conditioning. Backwards conditioning (condition on the effect being exactly zero and compute data probabilities) needs to consider “what might have happened”, whereas forward conditioning considers “what did happen” (the current score in the game). And the frequentist approach must deal with intentions. When computing \alpha in a sequential trial, one adjusts for intended looks, even when a look ended up not being taken (e.g., a data monitoring committee meeting canceled because of a snowstorm). I consulted on a study where “success” was based on taking a type I error penalty for one interim data look even though there was no time to do the look because of faster-than-anticipated patient enrollment.


I think the challenge with a game metaphor is that, for the Bayesian football fan, the chance of a “win” at the conclusion of the game is still a random quantity. The end of the game and the declaration of a “winner” are somewhat arbitrary: if the game were extended indefinitely, you could still accumulate evidence, the tide could turn, etc.

This limitation really underscores philosophical differences in the approach to clinical trials, and issues beyond sequential design! If I power my frequentist study for 1,000 patients and get an HR of 0.85 (p > 0.05), re-estimating the sample size makes calculating the p-value very difficult, because I only decided I needed more patients as a result of my significance test.

The Bayesian has no problem saying that their posterior would only get more precise by rolling in more and more patients (provided IID assumptions are met… which is rarely the case in big, long trials). The “end” of a trial is to a Bayesian just a snapshot of belief, and need not generalize to other studies. For a Bayesian’s decision theoretic framework, it suffices to say, “The Pats being very likely to win is a good enough victory for me.” Which, if the study/game were performed/played again, would not necessarily replicate.

This is an ethical issue too: companies seeking statistical significance by just running bigger and bigger trials. Sure, if a drug is harmful, you hopefully know sooner rather than later. But randomizing patients while waffling in pursuit of statistical significance (whether you are Bayesian or frequentist) means you are probabilistically denying patients access to tried-and-true treatments with known risk profiles. For instance, in a study of bananas as analgesics, if there weren’t equipoise, I’d be mad and say give me the damn aspirin!

For those reasons, I prefer the rigidity of formal stopping rules and alpha spending.


You could define (ethical) stopping rules in a Bayesian design as well, I’d say. For instance: “I will enroll 300 patients in this trial, unless I know enough after fewer patients, which is the case when my highest posterior density interval meets certain requirements.”


Maybe I’m misinterpreting something here, but why do you believe that Bayesian designs don’t have formal stopping rules, or that they would lead to bigger and bigger trials?

I’ve cited this example a few times but the DAWN trial was fully Bayesian and had a series of pre-planned interim looks (it terminated at the first possible look for efficacy):


I didn’t say that Bayesians don’t have formal stopping rules. To me, though, it is a desirable trait of frequentist statistics that it forces us to point out to non-statisticians that the p-value’s interpretation changes unpredictably as a result of unplanned modifications to the study design; formal rules enhance interpretation. It looks like DAWN employed stochastic curtailment, which I think Dan Gillen and Scott Emerson have derived for frequentist trials as well (seeking citation). I operate so infrequently (never) in the context of Bayesian designs that I know of no equivalent statement to encourage compliance with prespecified analyses. If you have convincing language around this point, I would like to hear it.

The one exception I make is the monitoring of patient/participant/trial safety, typically done by DMCs or DSMBs, which as I see it carries a minor cost in type II error for the benefit of protecting patients.


I don’t see that point. At the end of the game there is a definite known gold-standard winner, unless there is a tie.


I suppose what he’s getting at is that the end result of the game (a win or loss) is not a definitive statement about the broader truth of “which team is better” over the infinity of possible games these teams could play, only a reflection of who scored more points in those 60 minutes. That has some parallels with a drug trial: at the end of the trial, we are left with a “probability” that Drug A is better than Drug B based on the data observed during the trial, not a known truth that Drug A is better than Drug B over the next infinity of patients who may be treated.

I’d love for Adam to clarify further, though.


That would make sense. The question of which team is truly better is a distinctly separate long-run question. I’m restricting this to the question of which team will win a single game that is underway.


Just want to comment on

> And the frequentist approach must deal with intentions.

The “must” is too strong. Some ways of doing frequentist inference do depend on intentions but it is not necessary. The “alpha-spending function” approach of Lan and DeMets is specifically designed to permit unplanned interim looks.
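For readers unfamiliar with it, the Lan-DeMets idea is to prespecify a function of the information fraction t giving the cumulative type I error allowed by time t; each look then spends the increment since the last look, whenever it happens to occur. A sketch of the O'Brien-Fleming-like spending function, alpha*(t) = 2(1 - Phi(z_{1-alpha/2}/sqrt(t))) for overall alpha = 0.05 (the formula is standard; the function names below are illustrative):

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def obf_spending(t):
    """Cumulative alpha spent by information fraction t under the
    O'Brien-Fleming-like spending function of Lan & DeMets, for an
    overall two-sided alpha of 0.05."""
    z = 1.959963984540054          # z_{0.975} = Phi^{-1}(0.975)
    return 2.0 * (1.0 - Phi(z / math.sqrt(t)))

# Alpha spent by each planned look; the increments between successive
# values are what each interim test may use.
fractions = [0.25, 0.5, 0.75, 1.0]
spent = [obf_spending(t) for t in fractions]
```

Almost nothing is spent at early information fractions (about 9e-5 at t = 0.25), so early stopping requires extreme evidence, and because t is just the observed fraction of information, the looks need not be fixed in advance.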

Tangentially, when it comes to p-values in trials with interim looks, the problem arises of mathematically operationalizing the meaning of “extreme” in “probability of a result as or more extreme than the observed result under the null”. In their text Statistical Monitoring of Clinical Trials: A Unified Approach, Proschan et al. recommend an ordering on outcomes called “stagewise ordering” precisely because it depends only on what has occurred in the past and not on what might or might not occur in the future, so intentions do not matter. Other orderings that do depend on intentions can’t be used for trials that use alpha spending and unplanned interim looks, but stagewise ordering can.

Confidence intervals under stagewise ordering are super-duper weird though. They’d make a great example for a Morey-style “no confidence in confidence” critique. (In fact I’m in the middle of writing a blog post that touches on this issue.)


You make some good points, and the incoherence between tests and confidence intervals in this setting has always been disturbing. But I want to make a finer point. Unplanned interim looks can be accounted for, but not so much planned looks that don’t happen, as in my example. One reason is that for product registration with the FDA, etc., you specify the critical region for the final statistical test at study end, and the critical value depends on planned interim looks, actualized or not. When a look doesn’t happen, the \alpha spent by study end is less than the originally specified amount. So in that sense intentions must be respected. A pure Bayesian approach needs none of that.


I find it a highly UNdesirable characteristic of the frequentist method that conclusions depend on the intentions of the researcher and on things that could have happened, but didn’t.

Saves a lot of mind reading :grin:


How about this? A sample contains by definition only information about that sample. You can only learn from data if you have something to contrast the data with, or to combine it with. Bayesian statistics combines the data with prior knowledge; the data are then used to update that knowledge. The prior is an important starting point, so it should be well explained and justified.

Frequentist statistics uses a procedure to contrast the data with. A low p-value means that the data are unlikely compared with the data one would have obtained by sampling according to the procedure. Die example: rolling thirty 6s in a row is unlikely if my die is fair. If I rolled the die countless times, this would not happen very often. So I compare my actual rolls with hypothetical rolls and conclude that my actual rolls are unlikely if my die is fair.
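The die arithmetic, for concreteness, is a one-liner:

```python
# Probability of thirty 6s in a row under the "fair die" hypothesis:
# each roll shows a 6 with probability 1/6, and the rolls are independent.
p = (1 / 6) ** 30
print(f"P(thirty 6s in a row | fair die) = {p:.2e}")
```

The point of the comparison is not this tiny number itself, but that it is computed against the hypothetical population of rolls the procedure would generate.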

The procedure (the hypothetical die rolls) is very important here, because it is the starting point of the comparison. Therefore, you cannot mess with the procedure. Part of the procedure that p-values are based on is: no data peeking!

Hmmm… not sure if this is easier actually :wink: But anyway, it might be useful for some readers.

Feel free to point out mistakes, if I made them.