Multiplicity Adjustments in Bayesian Analysis

I’ve been trying to understand the Bayesian approach to multiple comparison adjustments, and I’ve made a mess of it.
I wanted to ask for your perspective on this. For example, I have recently been reading the results of the KEYNOTE-181 trial, published in abstract form at ASCO 2019.

This study had three coprimary endpoints. According to the authors, the P-value required for each contrast was P<0.0075 for the ITT population, P<0.0075 for the subgroup with squamous tumors, and P<0.0084 for tumors with an elevated biomarker level (CPS >10). Under this approach, the study is considered negative for ITT (P=0.056), negative for squamous tumors (P=0.0095), and positive for the biomarker-selected population with CPS >10 (P=0.0074). This is complicated to interpret, given that there are non-squamous tumors in which the effect seems minor but which nevertheless have a high biomarker level. A Bayesian perspective might therefore help in understanding the results.
However, I don’t know how to approach the issue of the triple comparison from a Bayesian standpoint. Could you offer any insights that a clinician like me could understand?


I go into this in detail here. Briefly, there is no need for, nor method of, multiplicity adjustment with Bayes for situations like the one you described. In a nutshell,

  • Rules of evidence dictate that direct evidence for assertion A can be assessed without reference to assertion B. Example: the assessment of guilt of a new criminal suspect does not need to be discounted because there was a previous suspect.
  • Multiplicity adjustment in the frequentist domain comes from giving data more chances to be extreme. Bayesian posterior probabilities do not involve chances for data to be more extreme. They involve the chances for unknown effects to have given values. The latter is in the parameter space unlike frequentism which operates in the data space (sample space). Constraints on chances for effects to be extreme come from prior distributions, and you might have three independent prior densities for three endpoints.
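To make the second bullet concrete, here is a minimal simulation (all numbers are mine, purely illustrative) of where frequentist multiplicity comes from: with three true-null endpoints each tested at α = 0.05, the data get three chances to be extreme.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sims, k, alpha = 100_000, 3, 0.05

# k independent z-tests per trial, all under the null (no effect anywhere)
z = rng.standard_normal((n_sims, k))
p = 2 * norm.sf(np.abs(z))               # two-sided p-values

# Chance that at least one of the k tests is "significant"
familywise = (p < alpha).any(axis=1).mean()
print(familywise)                        # close to 1 - 0.95**3 ≈ 0.143
```

The inflation is a property of the sample space (how often data look extreme under repeated sampling), which is why it has no counterpart for posterior probabilities, which live in the parameter space.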

And if the study were to be critically reanalyzed using a Bayesian approach, would any special consideration have to be given to prior selection? Could a weakly informative generic prior be used, e.g., normal(0, 1) in all three cases?


I have seen at least two main ways to view priors: as convenient tools for applying regularization, or as a true representation of your prior beliefs.

Personally I think if you really want to commit to taking a Bayesian approach then priors need to be selected to match with your expectations of a given problem. A normal(0, 1) prior might not make sense for a model of the effect of beta blockers on blood pressure.

The trick with the Bayesian multiple-testing simulations I’ve often seen is that treatment effects are simulated in a way consistent with the Bayesian idea of parameters being random, and the prior for the model is then matched to this exactly. The idea is that this shows the posterior will always be the “correct” combination of prior and likelihood. This is why Bayesians don’t really need to prove things via simulation: as long as the priors and likelihood are right, the posterior is an accurate representation of current belief. This will (probably) not be very comforting to frequentists, who are typically interested in error control, so you’ll find you are talking at cross purposes a lot of the time.
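That simulation logic can be sketched in a few lines. Here is a toy conjugate normal-normal version (prior SD, data SD, and counts all invented) showing that when effects really are drawn from the prior used in the analysis, posterior probabilities are calibrated, with no multiplicity correction anywhere:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
tau, s, n = 1.0, 0.5, 200_000            # prior SD, data SD, number of effects

theta = rng.normal(0, tau, n)            # true effects drawn from the prior
y = rng.normal(theta, s)                 # one noisy estimate per effect

# Conjugate normal-normal posterior for each effect (prior mean 0)
post_var = 1 / (1 / tau**2 + 1 / s**2)
post_mean = post_var * y / s**2
p_pos = norm.sf(0, post_mean, np.sqrt(post_var))   # P(theta > 0 | y)

# Among effects asserted positive with prob > 0.95, how often is theta > 0?
sel = p_pos > 0.95
print(p_pos[sel].mean(), (theta[sel] > 0).mean())  # the two agree
```

The agreement holds whether you examine 3 effects or 200,000, which is the sense in which the posterior needs no adjustment for how many things you looked at.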

Also note that both frequentists and Bayesians can incorporate prior information via model structure. It might make sense to treat treatment effects as exchangeable via a random effect if, for example, everything is measured on the same scale and you think the treatment works similarly for each outcome. Sander Greenland has a nice paper on this from a frequentist perspective, I believe (not specifically for multiple outcomes, but on the general use of multilevel models for regularization).


I was thinking that subtleties with probability are very interesting. The probability that subject A is guilty does not change the probability that B could be the true killer. However, asking for one forensic genetic test is not the same as asking for a thousand. In relation to this, I can follow at least 3 approaches: 1) I can evaluate two endpoints in the same population; 2) I can divide a population into two subgroups and perform the same hypothesis test twice; 3) given that the 2 subgroups are different, I can propose different hypotheses, for example assuming different therapeutic effects, as if they were different trials done at the same time.
The question, following the metaphor of the killer, would be: should I approach multiplicity adjustments in the same way in all three cases?
And the calculation of statistical power?

I think your example is making things unnecessarily complicated. But let me discuss a fairly complicated example that may shed some light. Suppose that one had 100,000 candidate SNPs in a genome-wide association study. For each SNP compute the Bayesian posterior probability that the odds ratio against a certain outcome variable Y exceeds 1.25. Then

  • find the subset of SNPs for which the posterior probability > 0.97 and consider them very likely to have a non-trivial association with Y
  • find the subset of SNPs for which the posterior probability < 0.03 and consider them very likely to have a trivial or no association with Y
  • label the remainder as “we don’t know”

The third subset will be huge and the experiment will not have shed much light, if N is not huge. The first two subsets are likely to be small. But you can interpret all the SNPs without direct reference to each other.
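A rough numerical sketch of that triage (the sparse “biology”, effect sizes, standard error, and skeptical prior below are all invented for illustration) under a small-N scenario:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n_snps = 100_000
cut = np.log(1.25)                 # threshold on the log odds ratio scale

# Hypothetical truth: 1% of SNPs carry a real effect, the rest are null
signal = rng.random(n_snps) < 0.01
beta = np.where(signal, rng.normal(0.4, 0.2, n_snps), 0.0)
se = 0.5                           # large standard error, i.e. N is not huge
b_hat = rng.normal(beta, se)

# Per-SNP conjugate normal posterior under a skeptical N(0, 0.25^2) prior
tau = 0.25
post_var = 1 / (1 / tau**2 + 1 / se**2)
post_mean = post_var * b_hat / se**2
p_big = norm.sf(cut, post_mean, np.sqrt(post_var))   # P(OR > 1.25 | data)

likely = (p_big > 0.97).sum()
trivial = (p_big < 0.03).sum()
unsure = n_snps - likely - trivial
print(likely, trivial, unsure)     # the "we don't know" set dominates
```

With this little information per SNP, almost every SNP lands in the third subset, exactly the "experiment has not shed much light" situation, yet each posterior probability is interpretable on its own.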


I find this very interesting, but I’m not sure I’ve got it right. My interpretation is that the use of specific or skeptical priors limits the most extreme estimates. Therefore, Bayesian statistics only mitigates the problem of multiplicity if such priors are used. If neutral priors are used, the problem of multiple comparisons would be the same; after all, the result would be very similar to the frequentist one.
Is this correct, and does it correspond to what you meant? If so, the question would be to discern to what extent a skeptical prior is able to influence the probability of randomly finding extreme effect estimates. Is that it?

No, you are confusing the sample space with the parameter space. Priors limit the unknown parameter values, not the data. Data extremes are irrelevant to Bayes; parameter extremes are irrelevant to frequentism. Frequentist multiplicity is solely about data extremes.


@f2harrell, I want to verify that I’m understanding you. I have a project with n ≈ 400 patients looking at the risk of medication toxicity Y (dichotomous event) against several clinical covariates X and a list of 50 candidate SNPs Gj for j = 1…50, each with 3 indexed levels (homozygous wild type, heterozygous, homozygous risk allele), curated from the literature. Assume I ran Bayesian logistic regressions of the form logit(Y) = b0 + b1X + b2Gj. For computational convenience and the sake of example, this is not a hierarchical model but rather the same model run 50 times, with 1 SNP in each model.

My question: Is a prior of b2 ~ N(0, 0.5) appropriate for this situation? From the literature, I don’t expect a large effect size, but since the SNPs are drawn from the literature, I also feel justified in being skeptical, but not extremely skeptical, about whether they could have an effect. I’d also like to contrast this with an alternative candidate approach: suppose several hundred SNPs were drawn from the relevant pharmacological pathway for the drug. Would a more skeptical prior be appropriate, something like N(0, 0.1)? Would the same apply at the scale of the whole genome?
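For intuition, here is what those two candidate priors on the log odds ratio b2 imply on the odds ratio scale (just prior quantiles, nothing trial-specific; the "large effect" cutoff of OR = 2 is my choice):

```python
import numpy as np
from scipy.stats import norm

summary = {}
for sd in (0.5, 0.1):
    # Central 95% prior interval for the odds ratio exp(b2)
    lo, hi = np.exp(norm.ppf([0.025, 0.975], loc=0, scale=sd))
    # Prior probability of a "large" effect, OR outside [1/2, 2]
    p_large = 2 * norm.sf(np.log(2), loc=0, scale=sd)
    summary[sd] = (lo, hi, p_large)
    print(f"N(0, {sd}): 95% of prior ORs in ({lo:.2f}, {hi:.2f}), "
          f"P(OR outside [0.5, 2]) = {p_large:.3f}")
```

N(0, 0.5) keeps most prior ORs between about 0.4 and 2.7, while N(0, 0.1) concentrates them between roughly 0.8 and 1.2, i.e., it nearly rules out clinically large effects a priori.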


Thanks for the question. It may be better to do a unified analysis with all 50 candidate SNPs. But if you do separate analyses, you are in effect assuming that the 50 priors are independent.

But to your point, there are two issues going on:

  • multiplicity such as discussed at the top of this topic, which does not really apply to Bayes since you can have evidence about each SNP standing on its own
  • getting the model right and being properly skeptical

On the second bullet, “getting the model right” might mean getting the “population” of SNP effects to be what we expect from biology. If you think this “population” is sparse, e.g., you think there is a small number of SNPs with decent sized effects, then you would specify in the Bayesian model a shrinkage prior like the horseshoe prior and analyze all 50 jointly.

If you did separate analyses I suppose you could have a N(0, 0.1) prior for each SNP’s log odds ratio.
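To see how skeptical N(0, 0.1) really is, here is an approximate normal-normal calculation (the estimate and standard error are hypothetical numbers of my own) combining a single-SNP frequentist result with that prior:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical single-SNP result: log OR estimate 0.6 (OR ≈ 1.82), SE 0.35
b_hat, se = 0.6, 0.35
tau = 0.1                                 # skeptical N(0, 0.1^2) prior

# Normal approximation to the posterior of the log odds ratio
post_var = 1 / (1 / tau**2 + 1 / se**2)
post_mean = post_var * b_hat / se**2
p_pos = norm.sf(0, post_mean, np.sqrt(post_var))   # P(OR > 1 | data)

print(np.exp(post_mean), p_pos)   # OR shrunk from ~1.82 toward ~1.05
```

The apparent OR of 1.82 is pulled almost all the way back to 1, and P(OR > 1) is only about 0.68: with this prior, only very precise data can move the posterior far from the null.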


This new COVID trial is interesting: we’d probably have to look at the SAP to see exactly what they did, but it’s an ordinal composite as the primary endpoint, Bayes used, and an interim analysis with no adjustment? I’d think they’ve been reading this message board, but they didn’t use proportional odds modelling.


Odd that the paper described statistical analysis methods for secondary endpoints but not for the primary! Also odd to see “Professor” in the author list.


I don’t understand Bayesian methods well enough to say why, but when I read this paper, some things feel a bit “off” (?). I’d be interested to hear some expert stats opinions and to read other papers using Bayesian methods that the authors have published.

There are no time-oriented multiplicities in a Bayesian approach; problems only surface when you use a hybrid Bayesian-frequentist attack or a frequentist analysis. What do you think is off?

Deborah Mayo discussed it in a recent blog post: Should Bayesian Clinical Trialists Wear Error Statistical Hats? (i) | Error Statistics Philosophy

There have been a number of threads on the Bayesian perspective on multiple comparison adjustments. This particular one stands out for its dialogue between Frank and Sander Greenland. The entire thread (starting with the first post) is worth study.

An interesting observation: properly adjusted p-values will have the same properties as likelihood ratios, in that the probability of a misleading assertion decreases as the information increases. Royall gives the universal bound on the probability of observing a likelihood ratio of at least k favoring B when A is true: it is at most \frac{1}{k}.
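Royall’s bound is easy to check by simulation. A toy version with two simple hypotheses (A: X ~ N(0, 1), B: X ~ N(1, 1), k = 8; all choices mine):

```python
import numpy as np

rng = np.random.default_rng(3)
k, n = 8, 1_000_000

# Data generated under A; the rival B shifts the mean to 1
x = rng.standard_normal(n)
log_lr = x - 0.5                   # log( f_B(x) / f_A(x) ) for N(1,1) vs N(0,1)

p_misleading = (log_lr >= np.log(k)).mean()
print(p_misleading, 1 / k)         # P(LR >= k under A) never exceeds 1/k
```

For these particular hypotheses the realized probability is far below the universal 1/k bound, which holds for any pair of simple hypotheses.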

One of the first meta-analytic procedures reported was Tippett’s minimum p-value method, expressed as \alpha^* = 1-(1- \alpha)^\frac{1}{k}, where \alpha is the pre-specified Type I error of the combination procedure and \alpha^* is the threshold applied when looking at a single study.

If the smallest p-value in the data set (assuming the tests are independent) is less than the \alpha^* derived from Tippett’s equation, the default assumption that there was no signal in any study is rejected.

Unfortunately, the researcher can only conclude there was at least one study that detected an effect. Other nonparametric combination methods have been developed that are a bit more useful, but that is for another thread.
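Tippett’s threshold is simple enough to verify directly; a quick simulation (k = 10 independent null studies, α = 0.05, numbers mine) confirms that the overall Type I error is preserved:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, k = 0.05, 10

# Per-study threshold from Tippett's equation
alpha_star = 1 - (1 - alpha) ** (1 / k)

# Under the global null, independent p-values are uniform on (0, 1)
p = rng.random((200_000, k))
reject = (p.min(axis=1) < alpha_star).mean()
print(alpha_star, reject)          # reject stays close to alpha
```

Note how small the per-study threshold becomes (about 0.005 for k = 10), which is the price of controlling the family-wise error.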

This blog post is based on the assumption that “error probabilities” are necessary, with no justification provided. It is telling that what Mayo describes is not an error probability at all. That is, it is not the probability of making an error or the probability that an assertion is erroneous. It is nothing more than the probability that you will make an assertion of efficacy when the treatment is ignorable. Her approach is inconsistent with optimal Bayes decisions, which nowhere use assertion probabilities. Misleading nomenclature leads readers of Mayo to unclear thoughts, IMHO.


To clarify, I’m not implying anything nefarious when I say “off” (poor word choice). The study was clearly a huge undertaking, which is commendable, especially under very difficult conditions. And since the numbers in the trial look favourable for fluvoxamine, it’s important that the results be clearly understood by clinicians. I think what I’m trying to articulate is that even after reading the study a couple of times, I’m not quite sure, clinically, what to do with the results.

Maybe the problem is that I’m just not experienced in reading Bayesian analyses (most likely). But if that’s the case, then I’m probably not the only MD who will struggle. Or maybe part of the problem is the choice of the primary endpoint (see below) (?)

Here are a few study excerpts with questions below each:

“We applied a Bayesian framework for our primary outcome analysis and a frequentist approach for all sensitivity analyses and secondary outcomes.”

Q: Is this a common approach? Are there any potential problems that arise from using mixed methods?

“Posterior efficacy of fluvoxamine for the primary outcome is calculated using the beta-binomial model for event rates assuming informed priors based on the observational data for both placebo and fluvoxamine…”

Q: Under the heading “Comparison with Prior Evidence,” the authors mention a previous small US RCT which recorded few outcomes of interest, as well as an observational study from France. I can’t figure out how/whether they used the results of these trials to demonstrate “informed” priors or what “informed” priors even means in this trial (?) And if they did use results of such a small RCT to inform their prior, was this a good idea?

“Based on the beta binomial model, there was evidence of a benefit of fluvoxamine reducing hospitalization or observation in an emergency room for greater than six hours due to COVID-19 (Relative Risk [RR]: 0.71; 95% Bayesian Credible Interval {BCI}: 0.54-0.93)…The probability that the event rate was lower in the fluvoxamine group compared to placebo was 99.4% for the ITT population…”

Q: How would I present this finding to a patient if the overall “positive” trial result is partly being driven by a reduction in how long they might have to spend in an ER rather than the chance that fluvoxamine will help them to avoid hospitalization or death? What phrasing would I use when talking to the patient? Ideally, I’d like to be able to say: “If I treat you (an ostensibly “high risk” patient) with fluvoxamine early in your infection course, the chance that I will be able to reduce your risk of death or hospitalization by more than “X” percent is “Y”. Does the current study presentation permit such an interpretation?

Q: Re Table 3- to what extent is the “ER visit for at least 6 hours” component of the composite primary outcome driving the overall trial result? If ER visit >6 hours is driving the result, what is the clinical significance of this finding?


You raise a lot of excellent points. To address just some of them:

This is common and unfortunate. It is due only to the statisticians running out of time.

Whenever an informative non-skeptical prior is used, there must be a lot of justification. Use of observational data is especially problematic unless it’s only used to inform non-treatment-effect parameters.

Compound endpoints need to “break the ties” by treating more severe outcomes as worse than less consequential ones. But no matter how compound endpoints are handled, it is important for the investigators to shed light on all component outcomes. Because the sample size will not be adequate for the more infrequent outcomes, it is often necessary to borrow information across outcomes. For example, one can use the partial proportional odds model with a skeptical prior for the amount by which the treatment affects mortality differently than it affects nonfatal outcomes.
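As a sketch of how an ordinal model “breaks the ties”, here is a toy proportional odds calculation (the four outcome levels, intercepts, and treatment OR of 0.7 are all invented, not from the trial):

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical 4-level ordinal outcome: 0=home, 1=ER >6h, 2=hospitalized, 3=dead
alpha = np.array([-0.5, -1.5, -3.0])   # control-arm intercepts for P(Y >= j)
beta = np.log(0.7)                     # assumed common treatment log OR

def cell_probs(shift):
    ge = expit(alpha + shift)          # P(Y >= 1), P(Y >= 2), P(Y >= 3)
    return -np.diff(np.concatenate(([1.0], ge, [0.0])))

p_control = cell_probs(0.0)
p_treated = cell_probs(beta)
print(p_control.round(3), p_treated.round(3))
```

A single odds ratio shifts every cutoff at once, so the sparse “death” category borrows strength from the common ones; a partial proportional odds model would add a second, skeptically-priored parameter letting the mortality effect depart from that common OR.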


Just FYI, the results were presented here: they don’t get to fluvoxamine until near the end (the slides are not numbered).
