Bayesian vs. Frequentist Statements About Treatment Efficacy

This is a place for discussing Bayesian vs. Frequentist Statements About Treatment Efficacy.

Archive

These comments were made using disqus on an old blog platform. Most of them were made between 2017 and 2021.

Ahmed: Frank, thanks for your posts and lectures; I depend on your lectures as my main references. I have a question, please: what do you think about credible intervals for one treatment group? Is that possible? I did that in R as below with different types of priors, and I did not find a significant difference.

library(rstanarm)

# Note: `pi` here is the fraction of responses set to 0 (it shadows R's built-in pi)
pi  <- 0.2
n   <- c(30, 50, 100)
res <- NULL

for (i in seq_along(n)) {

  # Simulate a single-arm binary outcome: all 1s, then a fraction pi set to 0
  event <- rep(1, n[i])
  n0 <- round(n[i] * pi, 0)
  event[sample(n[i], n0)] <- 0
  # trt is constant (all 1s), so its coefficient is identified only through its prior
  mydata <- data.frame(event = event, trt = rep(1, length(event)))

  # The same informative normal prior for the intercept and the coefficient
  glm_prior <- normal(location = 0.3, scale = 0.2, autoscale = FALSE)

  model.or <- stan_glm(event ~ trt, data = mydata,
                       family = "binomial",
                       prior = glm_prior,
                       prior_intercept = glm_prior,
                       QR = FALSE,
                       mean_PPD = FALSE,
                       seed = 123450,
                       chains = 10,
                       refresh = 0)

  # Posterior draws, exponentiated to the odds-ratio scale
  post <- as.data.frame(model.or)
  post <- apply(post, 2, exp)

  # 95% equal-tailed credible limits for each parameter
  ci.beta.l <- round(apply(post, 2, quantile, 0.025), 2)
  ci.beta.u <- round(apply(post, 2, quantile, 0.975), 2)

  res <- rbind(res, c(n[i], LCI = ci.beta.l, UCI = ci.beta.u))
}

> res

        n  LCI  UCI
[1,]   30 1.18 2.36
[2,]   50 1.32 2.58
[3,]  100 1.61 2.90

I noticed there is no significant difference as the sample size increases. Do you agree with me?

Frank Harrell: Perhaps it's best if you describe what you are trying to show, instead of me trying to infer it from the code. Note that in a two-treatment comparison, a credible interval for a parameter from one treatment group is not very relevant; we need the credible interval for the difference between groups. Or better, a series of posterior probabilities that the difference is greater than delta, for an array of deltas.
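A minimal R sketch of this last suggestion, assuming posterior draws of the between-treatment difference are already available; the draws below are simulated from a hypothetical normal posterior purely for illustration and are not taken from any model in this thread.

# Hypothetical posterior draws of the treatment difference; in practice these
# would come from a fitted Bayesian model.
set.seed(1)
post_diff <- rnorm(4000, mean = 0.4, sd = 0.25)
# Posterior probability that the difference exceeds delta, for an array of deltas
deltas <- seq(0, 1, by = 0.1)
p_gt   <- sapply(deltas, function(d) mean(post_diff > d))
round(data.frame(delta = deltas, P_diff_gt_delta = p_gt), 3)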

Ahmed: I want to show that the credible intervals for one treatment group under different priors with non-zero means are close to each other. Is that right? And is the process I followed for that correct?

FH: No, that approach would apply only for a single-arm study. You need to do a comparative study, i.e., show the posterior distribution for the difference between treatments.

I just published a new blog article about sequential testing, with the simulation.

If you assume one value, you are choosing a prior with all its mass at one point. That fully works, but it represents a strange state of knowledge about the truth. I'll be posting a new blog article with simple simulations showing that you can compute posterior probabilities infinitely often with no downside. It should appear around 2017-10-16.

Donald Williams: Interesting post. I do simulations with Bayes both ways: 1) assume one value; or 2) draw from prior and then generate data.

That said, could you post some code regarding your previous comments on peeking many times?

FH: Thanks for continuing the discussion. The practical implication of assuming a single value for an unknown parameter is that it requires you, when doing a simulation for example, to know that one value. Bayesians operate on the opinion that this is presumptuous. So you can think of having a prior distribution as a nice way to admit what we don't know and don't have access to.

I respectfully disagree with all of your third paragraph. The capturing of prior evidence about a parameter in fact must not be influenced by the stopping rule, and the Bayesian approach has no place (nor method) to put the sample space in any of the calculations. The beauty of Bayes is that when you calculate probabilities that are forward in time and information flow, these probabilities are sufficient and simply interpreted, unmolested by multiplicity, etc. Multiplicity problems result from the use of probabilities like type I error that are backwards in time and information flow, so you have to envision “what might have been.”

Stephen Martin: Thank you for replying; I disagree that thinking a parameter takes on one value makes one a frequentist. When I generate data via a DGP that I create, there is A fixed value. The fixed parameter is indeed “the” parameter responsible for the data generating procedure. The parameter does not change, even if the analysis does not depend on a specification of a fixed parameter (which frequentist analyses do).

That is not to say parameters are unable to change across time or subpopulations, but in the context of a simulation when I know the DGP responsible for generating the observations, there is indeed a fixed parameter, not a set of parameters randomly drawn from an urn at any given time.

And to say Bayesian methods do not have anything to do with the sample space is an appeal to the likelihood principle that I think is too often thrown around. There are stopping rules that can restrict the observable sets of data, in which case the likelihood should account for that (to the degree that the likelihood is a description of the probabilistic generation of observations). Alternatively, a stopping rule can alter the prior probability (or parameter process) of a parameter, and this is apparent when you include a stopping rule in Bayes' theorem itself; at some point, the prior can become conditional on the stopping rule, and a proper treatment requires the inclusion of a prior density that counteracts how the stopping rule changes the prior probability of a parameter. This isn't a violation of the likelihood principle, but rather proper modeling of a DGP under certain stopping rules.

However, I agree that the interpretation of a posterior does not change, but that isn’t really as important to me. This is a similar issue to BFs — The interpretation of a BF is independent of a stopping rule, but the probability of a decision made from a BF can certainly change as a result of stopping rules. And the latter is more important to me — If the stopping rule can alter the probability of a decision, then stopping rules are nevertheless important to me as a Bayesian. It’s just due to collecting data until a sequence of events matches one prior predictive distribution better than another, then stopping — This still results in altered decision rates. Same with continuous posteriors — If you observe until a set of events satisfies a condition, the probability of meeting that condition is different than if you do not have such a condition. The consequences are affected by the stopping rule, even if the interpretation does not.

FH: Your opinion is completely consistent with someone who wishes the unknown parameter to be able to take on exactly one value, i.e., a frequentist. If you really want to believe that then you should not spend any time on this Bayesian stuff. On the other hand, Bayesians describe unknowns with distributions. You don’t have to agree with that. But it tends to solve a lot of problems and allow us to represent uncertainty in a reasonable way.

Your last paragraph doesn't follow for the Bayesian approach. As shown in papers by Berry and others, the likelihood principle used by the Bayes machine has no place for the stopping rule and must ignore it. It would be improper to incorporate modifications of the data space into the Bayesian model. Bayes has nothing to do with the sample space. Bayes is unaffected by the stopping rule (except for the Bayesian power being affected, i.e., the probability that you'll achieve a certain high level of posterior probability of efficacy), and the Bayesian interpretation is unaffected by the stopping rule. Further, if you don't accept that the true parameter values being simulated from should follow a whole distribution, then choose a prior that puts mass at only one point or at a few points. Infinitely many looks at the data will still yield valid posterior probabilities at any moment.

To put this another way, whatever representation you want to make for the unknown parameter, when used as a prior, will result in a perfectly calibrated posterior probability at any moment, assuming the posterior used the same prior that you simulated from.

Stephen Martin: “The average posterior probability at the moment of stopping was 0.96 which exactly equalled the proportion of the 10,000 simulations in which the true efficacy was positive.”

I’ve seen people use this logic in the past. Vary whether some statement A is true; simulate optional stopping in Bayes; the proportion of replicates for which the stopping rule was met matches the probability that A is true.

However, that doesn’t sit right with me. If the true state of the universe is A, then A is just simply… true. That sort of assumes that the universe randomly generates whether A is true or not, with some fixed probability. 80% of the time, A is true; 20% of the time, it’s not. Is that not a weird assumption to you?

I love Bayesian stats, don't get me wrong, but I don't buy that Bayesian models are unaffected by stopping rules. That seems to be true only if you assume truth is drawn from an urn, and I can't back that. In simulations where there is a Truth, period, and you're simulating from that true state, optional stopping does affect expected rates of decisions. Bayesian models can account for that (by modifying the likelihood to include how optional stopping can modify the observable data space for some stopping rule and N; by modifying the priors), but it seems like those who say "Bayes is unaffected by stopping" either mean "Bayesian interpretation is unaffected" [true] or "decision rates match the proportions for which those decisions are true" [questionable, assumes truth is drawn by the universe a la an urn problem].

Would love to hear your thoughts though.

FH: The Spiegelhalter book (Bayesian Approaches to Clinical Trials and Health-Care Evaluation).

You've raised a lot of important points and I may not get to all of them right away. The reasons I don't care (and no one doing treatment studies should care) about type I errors are detailed here. What we need is the probability that the treatment is ineffective (in an efficacy study), which is just one minus the posterior probability that it is effective. We need evidence for the one study at hand, and long-run error rates are not relevant for that. Regarding 100 DVs, the logical point for inserting skepticism about any one of them is the prior probability, not by putting skepticism on the data such that the way you view one DV is influenced by the way you viewed the others.

Concerning the sensitivity analysis, I see some sense in that. But in a regulated environment we are more likely to need to get the prior agreed upon jointly by the sponsor and the regulator.

I’m honestly having trouble seeing how the frequentist approach augments a Bayesian analysis. I was a frequentist for about 20 years and was educated only in the frequentist paradigm, so no one can claim I didn’t give it a fair shot.

Concerning equivalence tests, I think we got off on the wrong foot in envisioning equivalence (which should be 'similarity') as something you test rather than something you estimate. A posterior probability is a direct evidential quantity and is something we estimate.

Yes, I should have qualified that as referring to the mean.

I’ll do that along with graphical output in a future blog, probably in a couple of weeks. It’s very simple - one-sample problem, in R.

Unknown: Could you post your simulation code?

Daniel Lakens: In a Frequentist paradigm, and using the idea that a statement such as 'the probability of lowering SBP by more than 3 mmHg is 0.8' is meaningful, the appropriate test to discuss is an equivalence test (such as TOST). Using TOST and Bayes, you can easily perform power analyses to design an informative study, control the Type 1 error rate (which you don't care about, but if I were a patient receiving a drug you worked on, I would care about it), and then you can still add the posterior probability. I don't know why you wouldn't at least raise the bar in your Frequentist criticism a little bit. The 'p > 0.05 so no effect' fallacy is easy to criticize, but also trivial. I'd much rather read your criticism of equivalence testing, and learn something less trivial.

Your report of the posterior probabilities is also not very attractive. You can easily calculate your posterior, but you can't compute mine. And for me, my posterior is what matters - I don't care about what you believe. So a better report would contain a sensitivity analysis, plotting posteriors across a range of priors. Do you agree, or not? That obviously changes the conclusion a bit.

Overall, it is my strong conviction that you lose more by ignoring Frequentist stats than you gain. As long as you make correct inferences (from a Neyman-Pearson perspective), you complement your research, especially when designing studies, at almost no cost (because you will end with your posterior anyway). Bayesian stats has limitations, and Frequentist stats has limitations, but there is nothing preventing you from embracing the relative strengths of both approaches. Saying 'I don't care about error rates' is your right, but you should expect a decent proportion of readers to care about it. Alternatively, you can discuss how you would deal in practice with situations where error control matters - e.g., exploring 100 DVs and reporting the one with the highest posterior is perfectly fine in Bayesian stats, but I see no guidelines on how to prevent massive amounts of misleading information if people work like this.

Georg: ‘Assuming prior distribution p1 for the mean difference of SBP, the probability that SBP with treatment B is lower than treatment A is 0.67.’
Clinical researchers will appreciate this as they now think that their study proves that ‘2 out of 3 patients will benefit from treatment B’ (despite the insignificant result). Correct me if I am wrong, but I guess the statement should deal with mean SBP.

Unknown: A useful post, thank you. Can you suggest some readings/references for someone new to Bayesian methodology that is specific to the clinical trials context?

FH: Thanks for your comments. Scott Berry's statement was in the context of computing frequentist type I error, which I do not care about. I think that the simulation you described did not calculate the needed probability. You don't want the proportion of time the posterior probabilities crossed a threshold. That would be related to Bayesian power. What we want is to determine whether the posterior probability at the moment of stopping is well calibrated. I ran one simulation of 10,000 clinical trials with 400 looks at the data (one look after each new subject is added) with a rule to stop when the posterior prob. exceeds 0.95. The average posterior probability at the moment of stopping was 0.96, which exactly equalled the proportion of the 10,000 simulations in which the true efficacy was positive.
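A minimal R sketch in the spirit of the simulation just described, using a one-sample normal model with known SD and a conjugate normal prior; these modeling choices and all settings below are assumptions for illustration and are not the original simulation code.

# True effects are drawn from the same prior the analysis uses, the posterior
# is updated after every observation, and the trial stops when
# P(mu > 0 | data) exceeds 0.95.
set.seed(2)
nsim <- 2000; nmax <- 400
prior_mean <- 0; prior_sd <- 0.5; sigma <- 1
stop_prob <- numeric(nsim); truth_pos <- logical(nsim); stopped <- logical(nsim)
for (s in 1:nsim) {
  mu <- rnorm(1, prior_mean, prior_sd)   # true effect drawn from the prior
  y  <- rnorm(nmax, mu, sigma)
  cs <- cumsum(y)
  for (n in 1:nmax) {
    post_var  <- 1 / (1 / prior_sd^2 + n / sigma^2)
    post_mean <- post_var * (prior_mean / prior_sd^2 + cs[n] / sigma^2)
    p_eff     <- 1 - pnorm(0, post_mean, sqrt(post_var))  # P(mu > 0 | data)
    if (p_eff > 0.95) {
      stop_prob[s] <- p_eff; truth_pos[s] <- mu > 0; stopped[s] <- TRUE
      break
    }
  }
}
# Calibration check among the trials that stopped early: the two numbers should be close.
mean(stop_prob[stopped])   # average posterior probability at the moment of stopping
mean(truth_pos[stopped])   # proportion of those trials whose true effect is positive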

There are no multiplicity problems with Bayes. Multiplicity comes from the chances you give data to be more extreme (relevant in the frequentist world) not the chances you give assertions to be true.

Unknown: If I understand the last sentence correctly, then I do not agree with it.

I remember several years ago hearing something similar (but I think I misunderstood at that point, did not hear enough) and tried a test. I simulated a set of 100 values from a binomial (actually Bernoulli) distribution with proportion 0.5 and, starting with the 10th observation, computed a posterior distribution given a uniform prior and binomial likelihood. From the posterior I calculated the probability that the true proportion was less than 0.5, and had the simulation stop and report the posterior if that probability was less than 0.05. Then I ran this whole process a bunch of times (at least 1,000, but I don't remember exactly). When I looked only at the full set of 100 draws from each simulation, the proportion of posterior probabilities less than 0.05 was about 5%, but if I let the simulation stop early, the proportion was about 14%. Researcher degrees of freedom and the garden of forking paths can affect Bayesian analysis as well.
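A minimal R sketch of the simulation described in this comment (an assumed reconstruction, not the commenter's original code), using a Beta(1, 1) prior so that the posterior after n observations with k successes is Beta(1 + k, 1 + n - k).

set.seed(3)
nsim <- 5000; nmax <- 100
stopped_early  <- logical(nsim)
flagged_at_end <- logical(nsim)
for (s in 1:nsim) {
  y  <- rbinom(nmax, 1, 0.5)   # true proportion is exactly 0.5
  cs <- cumsum(y)
  for (n in 10:nmax) {
    # P(true proportion < 0.5 | first n observations)
    if (pbeta(0.5, 1 + cs[n], 1 + n - cs[n]) < 0.05) { stopped_early[s] <- TRUE; break }
  }
  flagged_at_end[s] <- pbeta(0.5, 1 + cs[nmax], 1 + nmax - cs[nmax]) < 0.05
}
mean(flagged_at_end)   # roughly 5% when looking only at n = 100
mean(stopped_early)    # noticeably higher with a look after every observation, as the comment reports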

Later I saw a presentation by Scott Berry (and later read his book on Bayesian adaptive clinical trials) where he recommended using simulations to choose an appropriate prior and probability cut-off on the posterior to give desired properties of the trial.

The best option for multiple looks at the data with possible early stopping is to use the simulations to choose the prior and stopping rule. At least the Bayesian analysis should honestly report the number of actual and potential looks at the data.

Are these statements about Bayesian vs. frequentist statistics correct?

“Bayesian statistics diverge from frequentist methods by eliminating the conventional reliance on type I and type II error rates. This means that there is no fixed “alpha” level determining significance nor a beta reflecting the power of a test. Instead, it quantifies uncertainty continuously and allows direct probability statements about parameters.”

“Bayesian designs do not eliminate type 1 and type 2 errors. Type 1 and type 2 errors are not bound to a specific design but to a decision of efficacious versus non-efficacious. In that regard Bayesian designs also have type 1 and type 2 errors; they are just not evaluated and are ignored. In addition, when using an informative prior and Bayesian decision criteria, the type 1 error is usually just not controlled, especially under data-prior conflict.”

The first paragraph is correct, but not the second. Regarding the second, that would only be true if using a confusing hybrid Bayesian-frequentist design. \alpha and \beta only exist for Bayes if imposed by some higher authority. The first paragraph would be better written by explaining that the heart of Bayes is computing probabilities of different events than the ones the frequentist approach computes probabilities for. Bayes does not use unobservables and is all about avoiding the problem of the transposed conditional. Bayes computes fundamentally different conditional probabilities.

By transposing the conditional to compute P(data more extreme than those observed), frequentist p-values are distorted by all kinds of contextual problems and what-ifs that are irrelevant to interpreting the data at hand.
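A toy numerical contrast between the two conditionals, with data values assumed purely for illustration.

x <- 14; n <- 20   # hypothetical data: 14 successes in 20 trials
# Frequentist direction: P(data as or more extreme | null value p = 0.5), one-sided
sum(dbinom(x:n, n, 0.5))
# Bayesian direction: P(p > 0.5 | the observed data), under a uniform Beta(1, 1) prior
1 - pbeta(0.5, 1 + x, 1 + n - x)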


Thanks!
The first paragraph is our statement in a review manuscript about “Novel Approaches to Clinical Trial Designs,” where we advocate more frequent use of Bayesian methods. The second paragraph is the reviewer’s comments.
I am grateful for any good advice on solving this “situation” :grinning:

Much of this confusion comes from conflating pre-data error probabilities (which characterize the properties of the data-collection procedure, i.e., the experimental design) with the assessment of the actual evidential value of the collected data.

Being a post-data frequentist remains a problem that doesn’t seem to have been noticed in much of medical statistics.

Goutis, C., & Casella, G. (1995). Frequentist post-data inference. International Statistical Review/Revue Internationale de Statistique, 325-344. https://www.jstor.org/stable/1403483

The end result of an experiment is an inference, which is typically made after the data have been seen (a post-data inference). Classical frequency theory has evolved around pre-data inferences, those that can be made in the planning stages of an experiment, before data are collected. Such pre-data inferences are often not reasonable as post-data inferences, leaving a frequentist with no inference conditional on the observed data. We review the various methodologies that have been suggested for frequentist post-data inference, and show how recent results have given us a very reasonable methodology. We also discuss how the pre-data/post-data distinction fits in with, and subsumes, the Bayesian/frequentist distinction.

James Berger, who literally wrote the book on modern decision theory, worked out methods that bring Frequentists and Bayesians closer after the data are observed, but unfortunately I have never seen anyone use them.

A Bayesian analogy to the pre-data Type I / Type II error notion would be Jeff Blume’s concept of misleading evidence.

These probabilities often require the use of Bayes' theorem in order to be computed, and that presents special problems. Once data are observed, it is the false discovery rates that are the relevant assessments of uncertainty. The original frequency properties of the study design - the error rates - are no longer relevant. Failure to distinguish between these evidential metrics leads to circular reasoning and irresolvable confusion about the interpretation of results as statistical evidence.
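A small numerical sketch of a post-data false discovery rate computed with Bayes' theorem from pre-data operating characteristics; the prior probability, power, and alpha below are assumptions chosen only for illustration.

prior_effect <- 0.10   # assumed prior probability that the treatment truly works
power        <- 0.80   # assumed P(significant result | real effect)
alpha        <- 0.05   # assumed P(significant result | no effect)
p_sig <- power * prior_effect + alpha * (1 - prior_effect)
fdr   <- alpha * (1 - prior_effect) / p_sig   # P(no effect | significant result)
fdr   # about 0.36 with these inputs, far from the nominal 0.05 error rate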


This is like telling Sir Laurence Olivier that he's not good enough because he's not a gourmet chef.
