Bayesian vs. Frequentist Statements About Treatment Efficacy

This is a place for discussing Bayesian vs. Frequentist Statements About Treatment Efficacy.


These comments were made using disqus on an old blog platform. Most of them were made between 2017 and 2021.

Ahmed: Frank, thanks for your posts and lectures; I depend on your lectures as main references. I have a question, please: what do you think about credible intervals for one treatment group? Is that possible? I did that in R as below with different types of prior and I did not find a significant difference.


library(rstanarm) # provides stan_glm() and normal()

pi = 0.2 # event rate
n= c(30, 50, 100)
res = NULL
for (i in 1:length(n)) {

event = rep(1, n[i])
n0 = round(n[i]*pi,0)
event[sample(n[i],n0)] = 0
mydata = data.frame(event = event, trt = rep(1, length(event)))

glm_prior <- normal(location = 0.3, scale = 0.2, autoscale = FALSE )

model.or <- stan_glm(event~trt, data = mydata,
family = "binomial",
prior = glm_prior,
prior_intercept = glm_prior,
mean_PPD = FALSE,
seed = 123450,
refresh = 0)
post =

post = apply(post, 2, function(x) exp(x))

ci.beta.l = round(apply(post, 2, function(x) quantile(x, 0.025) ), 2)
ci.beta.u = round(apply(post, 2, function(x) quantile(x, 0.975) ), 2)

res = rbind(res, c(n[i], LCI = ci.beta.l, UCI = ci.beta.u))
}


> res

[1,] 30 1.18 2.36
[2,] 50 1.32 2.58
[3,] 100 1.61 2.90

I noticed there is no significant difference as the sample size increases. Do you agree with me?

Frank Harrell: Perhaps it’s best if you describe what you are trying to show, instead of me trying to infer it from the code. Note that in a two-treatment comparison, a credible interval for a parameter for one treatment group is not very relevant; we need the credible interval for the difference between groups. Or better, a series of posterior probabilities for the difference being greater than delta for an array of deltas.
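A minimal sketch of what such a series of posterior probabilities could look like, assuming a two-arm trial with a binary outcome and independent Beta(1, 1) priors. The counts, the priors, and the list of deltas are all hypothetical choices made for illustration, not values from this thread.

```r
# Hypothetical two-arm trial with a binary outcome (all counts are made up).
set.seed(1)
n_a <- 100; events_a <- 30   # control arm
n_b <- 100; events_b <- 20   # treated arm

# Assumption: independent Beta(1, 1) priors, so each posterior is also a Beta.
n_draws <- 20000
p_a <- rbeta(n_draws, 1 + events_a, 1 + n_a - events_a)
p_b <- rbeta(n_draws, 1 + events_b, 1 + n_b - events_b)

# Risk difference (control minus treated); positive values favour treatment.
risk_diff <- p_a - p_b

# 95% credible interval for the difference between groups.
ci <- quantile(risk_diff, c(0.025, 0.975))

# Posterior probability that the difference exceeds each delta.
deltas <- c(0, 0.02, 0.05, 0.10)
post_prob <- sapply(deltas, function(d) mean(risk_diff > d))

print(ci)
print(data.frame(delta = deltas, post_prob = post_prob))
```

Each row of the printed table answers a different clinical question, such as "what is the probability of any benefit?" versus "what is the probability of a benefit of at least 5 percentage points?".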

Ahmed: I want to show that the credible intervals for one treatment group under different types of prior with non-zero mean are close to each other. Is that right? And is the process that I followed for that correct?

FH: No, that approach would apply only for a single-arm study. You need to do a comparative study, i.e., show the posterior distribution for the difference between treatments.

I just published a new blog article about sequential testing, with the simulation.

If you assume one value, you are choosing a prior with all its mass at one point. That fully works, but it is a strange state of knowledge about the truth. I’ll be posting a new blog article with simple simulations showing that you can compute posterior probabilities infinitely often with no downside. It will be around 2017-10-16.

Donald Williams: Interesting post. I do simulations with Bayes both ways: 1) assume one value; or 2) draw from prior and then generate data.

That said, could you post some code regarding your previous comments on peeking many times?

FH: Thanks for continuing the discussion. The practical way of talking about the single value of an unknown parameter is that it requires you, when doing a simulation for example, to know that one value. Bayesians operate on the opinion that this is presumptuous. So you can think of having a prior distribution as having a nice way to admit what we don’t know and don’t have access to.

I respectfully disagree with all of your third paragraph. The capturing of prior evidence about a parameter in fact must not be influenced by the stopping rule, and the Bayesian approach has no place (nor method) to put the sample space in any of the calculations. The beauty of Bayes is that when you calculate probabilities that are forward in time and information flow, these probabilities are sufficient and simply interpreted, unmolested by multiplicity, etc. Multiplicity problems result from the use of probabilities like type I error that are backwards in time and information flow, so you have to envision “what might have been.”

Stephen Martin: Thank you for replying; I disagree that thinking a parameter takes on one value makes one a frequentist. When I generate data via a DGP that I create, there is A fixed value. The fixed parameter is indeed “the” parameter responsible for the data generating procedure. The parameter does not change, even if the analysis does not depend on a specification of a fixed parameter (which frequentist analyses do).

That is not to say parameters are unable to change across time or subpopulations, but in the context of a simulation when I know the DGP responsible for generating the observations, there is indeed a fixed parameter, not a set of parameters randomly drawn from an urn at any given time.

And to say Bayesian methods do not have anything to do with the sample space is the likelihood principle that I think is too often thrown around. There are stopping rules that can restrict observable sets of data, in which case the likelihood should account for that (to the degree that the likelihood is a description of probabilistic generation of observations). Alternatively, a stopping rule can alter the prior probability (or parameter process) of a parameter, and this is apparent when you include a stopping rule in Bayes’ theorem itself; at some point, the prior can become conditional on the stopping rule, and a proper treatment requires the inclusion of a prior density that combats how the stopping rule changes the prior probability of a parameter. This isn’t a violation of the likelihood principle, but rather proper modeling of a DGP under certain stopping rules.

However, I agree that the interpretation of a posterior does not change, but that isn’t really as important to me. This is a similar issue to BFs: the interpretation of a BF is independent of a stopping rule, but the probability of a decision made from a BF can certainly change as a result of stopping rules. And the latter is more important to me: if the stopping rule can alter the probability of a decision, then stopping rules are nevertheless important to me as a Bayesian. It’s just due to collecting data until a sequence of events matches one prior predictive distribution better than another, then stopping; this still results in altered decision rates. Same with continuous posteriors: if you observe until a set of events satisfies a condition, the probability of meeting that condition is different than if you do not have such a condition. The consequences are affected by the stopping rule, even if the interpretation is not.

FH: Your opinion is completely consistent with someone who wishes the unknown parameter to be able to take on exactly one value, i.e., a frequentist. If you really want to believe that then you should not spend any time on this Bayesian stuff. On the other hand, Bayesians describe unknowns with distributions. You don’t have to agree with that. But it tends to solve a lot of problems and allow us to represent uncertainty in a reasonable way.

Your last paragraph doesn’t follow for the Bayesian approach. As shown in papers by Berry and others, the likelihood principle used by the Bayes machine has no place for the stopping rule and must ignore it. It would be improper to incorporate modifications of the data space into the Bayesian model. Bayes has nothing to do with the sample space. Bayes is unaffected by the stopping rule (except for the Bayesian power being affected, i.e., the probability that you’ll achieve a certain high level of posterior probability of efficacy), and the Bayesian interpretation is unaffected by the stopping rule. Further, if you don’t accept that the true parameter values being simulated from should follow a whole distribution, then choose a prior that puts mass at only one point or at a few points. Infinitely many looks at the data will still yield valid posterior probabilities at any moment.

To put this another way, whatever representation you want to make for the unknown parameter, when used as a prior, will result in a perfectly calibrated posterior probability at any moment, assuming the posterior used the same prior that you simulated from.

Stephen Martin: “The average posterior probability at the moment of stopping was 0.96 which exactly equalled the proportion of the 10,000 simulations in which the true efficacy was positive.”

I’ve seen people use this logic in the past. Vary whether some statement A is true; simulate optional stopping in Bayes; the proportion of replicates for which the stopping rule was met matches the probability that A is true.

However, that doesn’t sit right with me. If the true state of the universe is A, then A is just simply… true. That sort of assumes that the universe randomly generates whether A is true or not, with some fixed probability. 80% of the time, A is true; 20% of the time, it’s not. Is that not a weird assumption to you?

I love Bayesian stats, don’t get me wrong, but I don’t buy that Bayesian models are unaffected by stopping rules. Seems to only be true if you assume truth is drawn from an urn, and I can’t back that. In simulations where there is a Truth, period, and you’re simulating from that true state, optional stopping does affect expected rates of decisions. Bayesian models can account for that (by modifying the likelihood to include how optional stopping can modify the observable data space for some stopping rule and N; by modifying the priors), but it seems like those who say “Bayes is unaffected by stopping” either mean “Bayesian interpretation is unaffected” [true] or “decision rates match the proportions for which those decisions are true” [questionable, assumes truth is drawn by universe ala an urn problem].

Would love to hear your thoughts though.

FH: Spiegelhalter book

You’ve raised a lot of important points and I may not get to all of them right away. The reasons I don’t care (and no one doing treatment studies should care) about type I errors are detailed here. What we need is the probability that the treatment is effective (in an efficacy study), which is just one minus the posterior probability that it is ineffective. We need evidence for the one study at hand, and long-run error rates are not relevant for that. Regarding 100 DVs, the logical point in the logic flow for inserting skepticism about any one of them is the prior probability, not by putting skepticism on data such that the way you view one DV is influenced by the way you viewed the other.

Concerning the sensitivity analysis, I see some sense in that. But in a regulated environment we are more likely to need to get the prior agreed upon jointly by the sponsor and the regulator.

I’m honestly having trouble seeing how the frequentist approach augments a Bayesian analysis. I was a frequentist for about 20 years and was educated only in the frequentist paradigm, so no one can claim I didn’t give it a fair shot.

Concerning equivalence tests, I think we got off on the wrong foot in envisioning equivalence (which should be ‘similarity’) as something you test rather than something you estimate. A posterior probability is a direct evidential quantity and is something we estimate.

Yes I should have qualified that by the mean.

I’ll do that along with graphical output in a future blog, probably in a couple of weeks. It’s very simple - one-sample problem, in R.

Unknown: Could you post your simulation code?

Daniel Lakens: In a Frequentist paradigm, and using the idea that anything that lowers SBP by more than 3 mmHg is 0.8 is meaningful, the appropriate test to discuss is an equivalence test (such as TOST). Using TOST and Bayes, you can easily perform power analyses to design an informative study, control the Type 1 error rate (which you don’t care about, but if I were a patient receiving a drug you worked on, I would care about), and then you can still add the posterior probability. I don’t know why you wouldn’t at least raise the bar in your Frequentist criticism a little bit. The p > 0.05 so no effect fallacy is easy to criticize, but also trivial. I’d much rather read your criticism on equivalence testing, and learn something less trivial.

Your report of the posterior probabilities is also not very attractive. You can easily calculate your posterior, but you can’t compute mine. And for me, my posterior is what matters - I don’t care about what you believe. So a better report would contain a sensitivity analysis, plotting posteriors across a range of priors. Do you agree, or not? That obviously changes the conclusion a bit.

Overall, it is my strong conviction that you lose more by ignoring Frequentist stats than you gain. As long as you make correct inferences (from a Neyman-Pearson perspective) you complement your research, especially when designing studies, at almost no cost (because you will end with your posterior anyway). Bayesian stats has limitations, and Frequentist stats has limitations, but there is nothing preventing you from embracing the relative strengths of both approaches. Saying ‘I don’t care about error rates’ is your right, but you should expect a decent proportion of readers to care about it. Alternatively, you can discuss how you would in practice deal with situations where error control matters - e.g., exploring 100 DVs, and reporting the one with the highest posterior is perfectly fine in Bayesian stats, but I see no guidelines on how to prevent massive amounts of misleading information if people work like this.
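A sensitivity analysis across a range of priors could be sketched as follows. This is only an illustration under stated assumptions: a normal approximation for the estimated mean SBP difference, zero-centred normal priors of varying width, and a hypothetical estimate and standard error that do not come from any study discussed here.

```r
# Hypothetical trial summary: mean SBP difference (B minus A) and its standard error.
est <- -2     # mmHg, made up for illustration
se  <- 4.5    # mmHg, made up for illustration

# Assumption: zero-centred normal priors, from very skeptical to nearly flat.
prior_sds <- c(1, 2, 5, 10, 50)

# Conjugate normal-normal update: precisions add, means are precision-weighted.
post_var  <- 1 / (1 / prior_sds^2 + 1 / se^2)
post_mean <- post_var * est / se^2

# Posterior probability that B lowers mean SBP at all, and by more than 3 mmHg.
prob_lower <- pnorm(0,  mean = post_mean, sd = sqrt(post_var))
prob_gt3   <- pnorm(-3, mean = post_mean, sd = sqrt(post_var))

sens <- data.frame(prior_sd = prior_sds,
                   post_mean = round(post_mean, 2),
                   prob_lower = round(prob_lower, 3),
                   prob_gt3 = round(prob_gt3, 3))
print(sens)
```

Reading down the table shows how far the posterior probabilities move as the prior becomes less skeptical, which lets readers locate the row closest to their own prior.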

Georg: ‘Assuming prior distribution p1 for the mean difference of SBP, the probability that SBP with treatment B is lower than treatment A is 0.67.’
Clinical researchers will appreciate this as they now think that their study proves that ‘2 out of 3 patients will benefit from treatment B’ (despite the insignificant result). Correct me if I am wrong, but I guess the statement should deal with mean SBP.

Unknown: A useful post, thank you. Can you suggest some readings/references for someone new to Bayesian methodology that is specific to the clinical trials context?

FH: Thanks for your comments. Scott Berry’s statement was in the context of computing frequentist type I error, which I do not care about. I think that the simulation you described did not calculate the needed probability. You don’t want the proportion of time the posterior probabilities crossed a threshold. That would be related to Bayesian power. What we want is to determine whether the posterior probability at the moment of stopping is well calibrated. I ran one simulation of 10,000 clinical trials with 400 looks at the data (one look after each new subject is added) with a rule to stop when the posterior prob. exceeds 0.95. The average posterior probability at the moment of stopping was 0.96 which exactly equalled the proportion of the 10,000 simulations in which the true efficacy was positive.
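A rough sketch of a calibration check of this kind, not the original simulation code. It assumes a one-sample normal model with known standard deviation 1, a N(0, 0.5^2) prior that is used both to draw the true effects and to analyse the data, and fewer trials (2,000) to keep it quick. Calibration is summarised among the trials that crossed the 0.95 threshold, which is one possible reading of "at the moment of stopping".

```r
set.seed(2017)
n_trials  <- 2000    # fewer than 10,000, to keep the sketch fast
max_n     <- 400     # one look after each new subject
prior_sd  <- 0.5     # assumption: prior SD for the true mean effect
threshold <- 0.95

# True effects are drawn from the same prior that the analysis uses.
mu_true   <- rnorm(n_trials, 0, prior_sd)
stop_prob <- numeric(n_trials)
crossed   <- logical(n_trials)
n_seq     <- seq_len(max_n)
post_prec <- 1 / prior_sd^2 + n_seq   # posterior precision after each subject

for (i in seq_len(n_trials)) {
  y <- rnorm(max_n, mu_true[i], 1)
  post_mean <- cumsum(y) / post_prec
  # P(mu > 0 | data so far) under the normal posterior.
  post_prob <- pnorm(post_mean * sqrt(post_prec))
  hit <- which(post_prob > threshold)
  crossed[i] <- length(hit) > 0
  k <- if (crossed[i]) hit[1] else max_n   # stop at first crossing, else at max_n
  stop_prob[i] <- post_prob[k]
}

# Calibration among trials that stopped for efficacy.
calib_stopped <- c(mean_post_prob = mean(stop_prob[crossed]),
                   prop_truly_positive = mean(mu_true[crossed] > 0))
# Calibration over all trials, at whatever point each one ended.
calib_all <- c(mean_post_prob = mean(stop_prob),
               prop_truly_positive = mean(mu_true > 0))
print(calib_stopped)
print(calib_all)
```

If the posterior is calibrated, the two numbers in each pair should agree up to simulation error, regardless of how many looks were taken.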

There are no multiplicity problems with Bayes. Multiplicity comes from the chances you give data to be more extreme (relevant in the frequentist world) not the chances you give assertions to be true.

Unknown: If I understand the last sentence correctly, then I do not agree with it.

I remember several years ago hearing something similar (but I think I misunderstood at that point, did not hear enough) and tried a test. I simulated a set of 100 values from a binomial (actually Bernoulli) with proportion 0.5 and starting with the 10th observation computed a posterior distribution given a uniform prior and binomial likelihood. From the posterior I calculated the probability that the true proportion was less than 0.5 and had the simulation stop and report the posterior if that probability was less than 0.05. Then I ran this whole process a bunch of times (at least 1,000 but I don’t remember exactly). When I looked at all 100 draws from each simulation then the proportion of posterior probabilities less than 0.05 was about 5%, but if I let the simulation stop early, then the proportion was about 14%. Researcher degrees of freedom and the garden of forking paths can affect Bayesian analysis as well.
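The test described above can be reconstructed roughly as follows. This is a sketch from the verbal description only: the true proportion is held fixed at 0.5 rather than drawn from the prior, the number of repetitions (2,000) and the seed are arbitrary, and the variable names are made up.

```r
set.seed(42)
n_sims     <- 2000
n_obs      <- 100
first_look <- 10
cutoff     <- 0.05
looks      <- first_look:n_obs

crossed_any   <- logical(n_sims)   # posterior prob fell below cutoff at any look
crossed_final <- logical(n_sims)   # posterior prob below cutoff at n = 100 only

for (i in seq_len(n_sims)) {
  x <- rbinom(n_obs, 1, 0.5)       # Bernoulli data with true proportion 0.5
  s <- cumsum(x)[looks]            # successes observed at each look
  # Uniform Beta(1, 1) prior + binomial likelihood gives a Beta posterior.
  prob_below <- pbeta(0.5, 1 + s, 1 + looks - s)   # P(p < 0.5 | data)
  crossed_any[i]   <- any(prob_below < cutoff)
  crossed_final[i] <- prob_below[length(looks)] < cutoff
}

fixed_rate <- mean(crossed_final)  # analysing all 100 draws once
early_rate <- mean(crossed_any)    # allowing a stop at any look from 10 to 100
print(c(fixed_rate = fixed_rate, early_rate = early_rate))
```

Because the final look is one of the looks, `early_rate` can never be smaller than `fixed_rate`; how much larger it is reflects the effect of optional stopping when the truth is fixed at a single value.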

Later I saw a presentation by Scott Berry (and later read his book on Bayesian adaptive clinical trials) where he recommended using simulations to choose an appropriate prior and probability cut-off on the posterior to give desired properties of the trial.

The best option for multiple looks at the data with possible early stopping is to use the simulations to choose the prior and stopping rule. At least the Bayesian analysis should honestly report the number of actual and potential looks at the data.