Multiplicity Adjustments in Bayesian Analysis

I’ve been trying to understand the Bayesian approach to multiple comparison adjustments, and I’ve made a mess of it.
I wanted to ask for your perspective on this. For example, I have recently been reading the results of the KEYNOTE-181 trial, published in abstract form at ASCO 2019.

This study had three coprimary endpoints. According to the authors, the P-value required for each contrast was P<0.0075 for ITT, P<0.0075 for the subgroup with squamous tumors, and P<0.0084 for tumors with an elevated biomarker level (CPS >10). Under this approach, the study is considered negative for ITT (P=0.056), negative for squamous tumors (P=0.0095), and positive for the biomarker-selected population with CPS >10 (P=0.0074). This is complicated to interpret, given that there are non-squamous tumors in which the effect seems smaller but which nonetheless have a high biomarker level. A Bayesian perspective might therefore help in understanding the results.
However, I don’t know how to approach the issue of the triple comparison from the Bayesian standpoint. Could you offer any insights that a clinician like me could understand?


I go into this in detail here. Briefly, there is no need for, nor method of, multiplicity adjustment with Bayes in situations like the one you described. In a nutshell,

  • Rules of evidence dictate that direct evidence for assertion A can be assessed without reference to assertion B. Example: the assessment of guilt of a new criminal suspect does not need to be discounted because there was a previous suspect.
  • Multiplicity adjustment in the frequentist domain comes from giving data more chances to be extreme. Bayesian posterior probabilities do not involve chances for data to be more extreme. They involve the chances for unknown effects to have given values. The latter is in the parameter space unlike frequentism which operates in the data space (sample space). Constraints on chances for effects to be extreme come from prior distributions, and you might have three independent prior densities for three endpoints.
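To make the second bullet concrete, here is a minimal sketch (my own illustration, not from the trial; the effect estimates, standard errors, and the normal(0, 1) prior are all hypothetical) of computing a posterior probability of benefit for each endpoint separately. Each endpoint's probability depends only on its own prior and its own data, with no reference to the other endpoints:

```python
from math import sqrt
from statistics import NormalDist

def posterior_prob_benefit(est, se, prior_mean=0.0, prior_sd=1.0):
    """Posterior P(effect < 0 | data) under a normal prior and a normal
    approximation to the likelihood. Here the effect is a log hazard
    ratio, so negative values favor treatment."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = 1.0 / se**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_mean * prior_prec + est * data_prec)
    return NormalDist(post_mean, sqrt(post_var)).cdf(0.0)

# Three endpoints, each analyzed on its own: hypothetical
# (estimate, standard error) pairs on the log hazard ratio scale.
for name, est, se in [("ITT", -0.15, 0.08),
                      ("squamous", -0.25, 0.10),
                      ("CPS>10", -0.30, 0.11)]:
    print(name, round(posterior_prob_benefit(est, se), 3))
```

Note that nothing in the calculation for one endpoint "uses up" evidence for another; the only thing constraining extreme effects is each endpoint's prior.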

And if the study were to be critically reanalyzed using a Bayesian approach, would any special consideration have to be given to prior selection? Could a weakly informative generic prior be used, e.g., normal(0, 1) in all three cases?


I have seen at least two main ways to view priors: convenient tools for applying regularization, or a true representation of your prior beliefs.

Personally I think if you really want to commit to taking a Bayesian approach then priors need to be selected to match with your expectations of a given problem. A normal(0, 1) prior might not make sense for a model of the effect of beta blockers on blood pressure.
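To illustrate why the outcome scale matters, here is a small sketch (my own hypothetical numbers, not from the thread): if the effect of a beta blocker on systolic blood pressure is measured in mmHg, a normal(0, 1) prior all but rules out clinically realistic effects, while a wider prior does not:

```python
from statistics import NormalDist

def prior_mass_beyond(threshold, prior_sd):
    """Prior probability that |effect| exceeds `threshold`
    under a Normal(0, prior_sd) prior."""
    return 2 * (1 - NormalDist(0, prior_sd).cdf(threshold))

# A Normal(0, 1) prior on an mmHg scale says a blood pressure change
# of even 5 mmHg is essentially impossible a priori:
print(prior_mass_beyond(5, prior_sd=1))   # ≈ 5.7e-07: far too skeptical
print(prior_mass_beyond(5, prior_sd=10))  # ≈ 0.62: more plausible
```

The same prior that is reasonable for a standardized or log-odds effect can be absurd on a raw clinical scale, which is the point about matching priors to expectations for the given problem.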

The trick with the Bayesian multiple-testing simulations I’ve often seen is that treatment effects are simulated in a way consistent with the Bayesian idea of parameters being random, and the prior for the model is then matched to this exactly. The idea is that this shows the posterior will always be the “correct” combination of prior and likelihood. This is why Bayesians don’t really need to prove things via simulation: as long as the priors and likelihood are right, the posterior is an accurate representation of current belief. This will (probably) not be very comforting to frequentists, who are typically interested in error control, so you’ll find you are talking at cross purposes a lot of the time.

Also note that both frequentists and Bayesians can incorporate prior information via model structure. It might make sense to treat treatment effects as exchangeable via a random effect if, for example, everything is measured on the same scale and you think treatment works similarly for each outcome. I believe Sander Greenland has a nice paper on this from a frequentist perspective (not specifically for multiple outcomes, but on the general use of multilevel models for regularization).


I was thinking that these subtleties of probability are very interesting. The probability that suspect A is guilty does not change the probability that B could be the true killer. However, asking for one forensic genetic test is not the same as asking for a thousand. In relation to this, I can follow at least three approaches: 1) evaluate two endpoints in the same population; 2) divide a population into two subgroups and perform the same hypothesis test twice; 3) given that the two subgroups are different, propose different hypotheses, for example assuming a different therapeutic effect in each, as if they were different trials done at the same time.
The question, following the metaphor of the killer, would be: should I approach multiplicity adjustment in the same way in all three cases?
And the calculation of statistical power?

I think your example is making things unnecessarily complicated. But let me discuss a fairly complicated example that may shed some light. Suppose that one had 100,000 candidate SNPs in a genome-wide association study. For each SNP compute the Bayesian posterior probability that the odds ratio against a certain outcome variable Y exceeds 1.25. Then

  • find the subset of SNPs for which the posterior probability is > 0.97 and consider them very likely to have a non-trivial association with Y
  • find the subset of SNPs for which the posterior probability is < 0.03 and consider them very likely to have a trivial or no association with Y
  • label the remainders as “we don’t know”

If N is not huge, the third subset will be large and the experiment will not have shed much light. The first two subsets are likely to be small. But you can interpret all the SNPs without direct reference to each other.
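The triage described above can be sketched as follows (the SNP names and posterior probabilities are invented for illustration; in practice each probability would come from a Bayesian model for that SNP):

```python
def triage_snps(post_probs, hi=0.97, lo=0.03):
    """Partition SNPs by posterior P(odds ratio > 1.25),
    each SNP judged on its own evidence."""
    likely, unlikely, unknown = [], [], []
    for snp, p in post_probs.items():
        if p > hi:
            likely.append(snp)       # very likely non-trivial association
        elif p < lo:
            unlikely.append(snp)     # very likely trivial/no association
        else:
            unknown.append(snp)      # "we don't know"
    return likely, unlikely, unknown

# Hypothetical posterior probabilities for a handful of SNPs:
probs = {"rs0001": 0.99, "rs0002": 0.55, "rs0003": 0.01, "rs0004": 0.80}
likely, unlikely, unknown = triage_snps(probs)
print(likely)    # ['rs0001']
print(unlikely)  # ['rs0003']
print(unknown)   # ['rs0002', 'rs0004']
```

Nothing here adjusts one SNP's probability for the presence of the other 99,999; the classification thresholds operate on each posterior probability alone.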


I find this very interesting, but I’m not sure I have understood it correctly. My interpretation is that the use of specific or skeptical priors limits the most extreme estimates. Therefore, Bayesian statistics would only mitigate the problem of multiplicity if such priors are used; if neutral priors are used, the problem of multiple comparisons would remain, and the result would be very similar to the frequentist one.
Is this correct, and does it correspond to what you meant? If so, the question would be to what extent a skeptical prior can influence the probability of finding extreme effect estimates by chance. Is that it?

No, you are confusing the sample space with the parameter space. Priors limit the unknown parameter values, not the data. Data extremes are irrelevant to Bayes; parameter extremes are irrelevant to frequentism. Frequentist multiplicity is solely about data extremes.

@f2harrell, I want to verify that I’m understanding you. I have a project with n ≈ 400 patients looking at the risk of medication toxicity Y (dichotomous event) against several clinical covariates X and a list of 50 candidate SNPs Gj, j = 1…50, each with 3 indexed levels (homozygous wild type, heterozygous, homozygous risk allele) curated from the literature. Assume I ran Bayesian logistic regressions of the form logit(Pr(Y=1)) = b0 + b1X + b2Gj. For computational convenience and the sake of example, this is not a hierarchical model but rather the same model run 50 times, with one SNP in each model.

My question: Is a prior of b2 ~ N(0, 0.5) appropriate for this situation? From the literature, I don’t expect a large effect size, but since the SNPs are drawn from the literature, I also feel justified in being skeptical, but not extremely skeptical, about whether they could have an effect. I’d also like to contrast this with an alternative approach in which several hundred SNPs are drawn from the relevant pharmacological pathway for the drug. Would a more skeptical prior be appropriate, something like N(0, 0.1)? Would the same apply at the scale of the whole genome?


Thanks for the question. It may be better to do a unified analysis with all 50 candidate SNPs. But if you do separate analyses you are always assuming in effect that the 50 priors are independent.

But to your point, there are two issues going on:

  • multiplicity such as discussed at the top of this topic, which does not really apply to Bayes since you can have evidence about each SNP standing on its own
  • getting the model right and being properly skeptical

On the second bullet, “getting the model right” might mean getting the “population” of SNP effects to be what we expect from biology. If you think this “population” is sparse, e.g., you think there is a small number of SNPs with decent sized effects, then you would specify in the Bayesian model a shrinkage prior like the horseshoe prior and analyze all 50 jointly.
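To see what a sparsity prior encodes, here is a rough prior-predictive sketch (my own illustration, not a fitted model; the global scale tau = 0.1 is an arbitrary choice) comparing draws from a horseshoe prior with draws from a single normal prior. The horseshoe piles most of its mass very near zero while keeping heavy tails for the few real effects:

```python
import math
import random

random.seed(42)

def horseshoe_draw(tau=0.1):
    """One prior draw of a coefficient under the horseshoe:
    lambda ~ half-Cauchy(0, 1), beta ~ Normal(0, tau * lambda)."""
    lam = abs(math.tan(math.pi * (random.random() - 0.5)))  # half-Cauchy(0, 1)
    return random.gauss(0.0, tau * lam)

def frac_near_zero(xs, eps=0.05):
    return sum(abs(x) < eps for x in xs) / len(xs)

def frac_large(xs, cut=2.0):
    return sum(abs(x) > cut for x in xs) / len(xs)

draws_hs = [horseshoe_draw() for _ in range(100_000)]
draws_norm = [random.gauss(0.0, 0.5) for _ in range(100_000)]

# The horseshoe has both more mass near zero AND more mass far in the
# tails than a normal(0, 0.5) prior: "mostly nothing, plus a few
# decent-sized effects".
print(frac_near_zero(draws_hs), frac_near_zero(draws_norm))
print(frac_large(draws_hs), frac_large(draws_norm))
```

That shape, rather than any multiplicity correction, is what expresses the biological expectation of sparsity when all 50 SNPs are analyzed jointly.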

If you did separate analyses I suppose you could have a N(0, 0.1) prior for each SNP’s log odds ratio.
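One way to judge whether N(0, 0.5) or N(0, 0.1) matches your skepticism is to look at the range of odds ratios each prior considers plausible (a quick sketch; it assumes b2 is interpreted as a log odds ratio):

```python
from math import exp
from statistics import NormalDist

def implied_or_interval(prior_sd, mass=0.95):
    """Central interval of odds ratios implied by a Normal(0, prior_sd)
    prior on the log odds ratio scale."""
    z = NormalDist().inv_cdf(0.5 + mass / 2)
    return exp(-z * prior_sd), exp(z * prior_sd)

# N(0, 0.5) places 95% of its mass on ORs of roughly 0.38 to 2.66;
# N(0, 0.1) on roughly 0.82 to 1.22.
for sd in (0.5, 0.1):
    lo, hi = implied_or_interval(sd)
    print(f"N(0, {sd}): 95% of prior mass on OR in ({lo:.2f}, {hi:.2f})")
```

So N(0, 0.1) says in advance that no single SNP changes the odds of toxicity by much more than about 20%, which is the kind of statement you would want to defend on biological grounds.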
