Basic References for Bayes Methodologies a Journal Can Use

I was recently appointed as an Associate Editor of a specialty journal. I am in a position wherein I may be able to influence the journal’s editorial policies. I’d like to move this journal away from frequentist data analyses toward Bayes methodologies.

I’d like to be able to tell authors something like “Please convert the statistical comparison you did between 2 groups of continuous data from a t-test to the Bayes equivalent. Report the main outcome as a probability distribution for that outcome.”

I’m seeking advice on how realistic this may be. If it is, give further advice on how to implement this. Is there a source I can refer authors to for them to learn about Bayes? It is now easy to write Bayesian analysis programs using AI. If analysts do this, how can one ensure that the analyses are done and interpreted properly?

This may seem like a tall order, but I may have a chance to swing the pendulum in at least one discipline away from frequentist toward Bayes Statistical analysis.

A further challenge is that most of the authors for this journal have limited access to statisticians. Despite this, there are substantial resources available online that can assist authors and editors in accomplishing all this.

Thanks

Ed

4 Likes

This wouldn’t be a Neurosurgery journal, by any chance? :wink:

1 Like

Good question, particularly since Bayesian workflows are indeed no longer considered that exotic. There are already some nice guidelines available in journals such as here and contemporary discussions, e.g., here. Many users who start familiarizing themselves with Bayes use the free point-and-click software JASP which includes nice tutorials on Bayesian equivalents of various tests, including t-tests (see also pdf guide for students).

Of course, it is hard to beat R and Stan when it comes to Bayes, for which case Richard McElreath’s book and courses are an excellent introductory resource with the book’s second edition onwards including also an excellent introduction to applied causal inference using Bayes.

I would be careful with large language model tools. They can generate runnable Bayesian code, but they often hallucinate or misspecify models, and the code they produce requires careful review and tweaking.

2 Likes

I’d add a suggestion for John Kruschke’s work – he has a very accessible textbook (Doing Bayesian Data Analysis) that covers a lot of the “what is the Bayesian equivalent of this test” and has also written some useful reporting guidelines (Bayesian Analysis Reporting Guidelines | Nature Human Behaviour)

5 Likes

I’d add these two books, which I’ve found useful (and which others have also said they found helpful).

3 Likes

In his critique of Bayesian methods in meta-analysis, Stephen Senn highlights several key concerns. He argues that Bayesian approaches introduce subjectivity, rely heavily on often arbitrary or incoherent prior specifications, and do not inherently resolve the limitations of frequentist methods. Vague priors, particularly for nuisance parameters like between-study variance, can lead to implausible or unstable inferences. Assumptions such as independence between treatment effects and heterogeneity are often unrealistic, and Bayesian results are difficult to generalize or reuse unless others share the same priors. Moreover, Bayesian simulations frequently misrepresent real-world inference by comparing subjective models to fixed “truths,” and sophisticated computational methods like MCMC cannot compensate for poor model assumptions. Senn ultimately emphasizes the continued value of frequentist approaches—especially fixed-effect models—when trial heterogeneity is poorly understood, cautioning against a blanket shift to Bayesian methods.

A journal should maintain a balance, favoring approaches that emphasize transparency, reproducibility, and robustness, qualities that frequentist methods often provide more readily. A blanket switch to Bayesian analyses could undermine these principles and sounds very much like what BASP did when it banned p values and uncertainty intervals; Daniel Lakens’s review found no real gain in the quality of that journal’s articles. Indeed, Senn suggests using Bayesian techniques as supplementary tools for sensitivity analysis rather than as primary methods. Senn, for example, states that: “It is unreasonable to expect a Bayesian analysis to produce some automatic answer that is acceptable to all as the relevant posterior distribution to use. Indeed, a Bayesian analysis is effectively unusable by anyone not sharing the prior distribution that has been used to produce it.”

No, not a neurosurgery journal but one that is open to new ideas (JAMA was not)

To extend the analogy further: When I started surgical training 40 years ago, a surgeon did all the tasks necessary to provide surgical care. Then came PAs and NPs. Now, the neurosurgeon may do the operation, but almost all the other perioperative care is provided by individuals with substantially less training.

Statistics have also evolved. When I was a research fellow at UCLA, we had unfettered access to very senior statisticians. One simply went to their offices and got help. In those days access to statistical knowledge and computer programs was very limited.

Now, if I want to talk to a statistician, I need to pay a hefty consultation fee. However, very sophisticated statistical software is freely available, and AI can answer almost any question one has about statistical analysis. Given that we do not have infinite resources to pay for statistical consultation, we rely on what we can get done on our own.

An example: I asked Grok to design a methods section and power analysis for an adaptive trial. I am posting it below. It is remarkably similar to a study design I paid a consultant about $1,000 to write a few years ago.

Just as the role of neurosurgeons has changed, so has it for statistical consultants. Authors will continue to do their own analyses and neither they nor the journals have the resources to pay for statistical consultation. Given this reality, we need to find the optimal balance between having expert input and having a process to minimize errors when that input is not available.

Grok: “Write a sample method section describing the study power calculation for an adaptive trial design Using bayes methodologies”

Methods

Study Design

This phase II clinical trial utilized a Bayesian adaptive design to evaluate the efficacy of a novel therapeutic agent (Intervention A) compared to standard care (Control B) in patients with moderate-to-severe inflammatory disease. The trial was a randomized, double-blind, parallel-group study with a 1:1 allocation ratio. The primary endpoint was the proportion of patients achieving clinical remission at 24 weeks, defined as a binary outcome (remission vs. no remission).

The adaptive design incorporated interim analyses at prespecified enrollment milestones (after 40%, 60%, and 80% of the initial target sample size) to allow for early stopping for efficacy, futility, or sample size re-estimation. Decisions were based on Bayesian posterior probabilities, enabling data-driven modifications while maintaining trial integrity. The initial target sample size was 200 patients (100 per arm), with a maximum allowable increase to 300 patients if interim results were inconclusive.

Bayesian Framework

A Bayesian hierarchical model was employed for inference on the primary endpoint. Let pA and pB denote the true remission rates for Intervention A and Control B, respectively. The number of remissions in each arm followed a binomial distribution: YA ∼ Binomial(nA, pA) and YB ∼ Binomial(nB, pB), where nA and nB are the sample sizes per arm at each analysis stage.

Non-informative prior distributions were assigned to the remission rates: pA ∼ Beta(1, 1) and pB ∼ Beta(1, 1), equivalent to a uniform prior on [0,1]. This choice reflected limited prior knowledge and allowed the data to dominate posterior inference. At each interim analysis, posterior distributions were updated using Markov Chain Monte Carlo (MCMC) methods implemented in R (version 4.2.1) with the rjags package, drawing 10,000 samples after a 5,000-iteration burn-in to ensure convergence (assessed via Gelman-Rubin diagnostic <1.1).

Adaptive rules were as follows:

  • Efficacy stopping: Stop early if the posterior probability P(pA > pB + δ | data) > 0.975, where δ = 0.10 (minimum clinically important difference).
  • Futility stopping: Stop early if P(pA > pB + δ | data) < 0.05.
  • Sample size re-estimation: If 0.05 ≤ P(pA > pB + δ | data) ≤ 0.975, increase enrollment by up to 50 patients per arm, based on predictive simulations of trial success.

Final success at trial completion was declared if P(pA > pB + δ | data) > 0.95.

Power Calculation (Simulation-Based Assurance)

Given the adaptive nature and Bayesian decision rules, traditional frequentist power calculations were not directly applicable. Instead, we computed the trial’s operating characteristics, including Bayesian assurance (the predictive probability of trial success averaged over plausible parameter values) and frequentist-like metrics (e.g., empirical power and type I error rate), via Monte Carlo simulations.

Simulations were conducted using R software. We generated 5,000 replicated trials under a range of scenarios for the true remission rates:

  • Null scenario: pA = pB = 0.30 (no treatment effect).
  • Alternative scenarios: pA = 0.45, pB = 0.30 (moderate effect); pA = 0.50, pB = 0.30 (strong effect); and boundary cases (e.g., pA = 0.35, pB = 0.30).
  • Priors were incorporated by first sampling true rates from the prior distributions (for assurance) or fixing them at scenario values (for conditional power).

For each replicated trial:

  1. Patient data were simulated sequentially according to the true rates and binomial model.
  2. Interim analyses were performed at the specified milestones, updating posteriors and applying adaptive rules (e.g., stopping or re-estimating sample size).
  3. If the trial continued to completion, the final posterior was evaluated against the success criterion.

The Bayesian assurance was calculated as the proportion of simulations resulting in a successful outcome (declaring superiority of Intervention A), averaged over the prior distribution of effect sizes. Conditional power was the success proportion given fixed true rates under the alternative. Type I error was the false-positive rate under the null.

Results indicated:

  • Assurance of approximately 82% for detecting a moderate effect (δ = 0.15), assuming priors.
  • Conditional power of 85% under the primary alternative scenario (pA = 0.45, pB = 0.30).
  • Type I error controlled at <2.5% under the null.
  • Average sample size across simulations: 185 patients (range: 120–280), demonstrating efficiency gains from adaptations.
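For anyone who wants to sanity-check numbers like these before they go into a SAP: under the Beta(1, 1) priors stated above, the binomial likelihood is conjugate, so each arm’s posterior is simply Beta(1 + y, 1 + n − y) and no MCMC is needed. Here is a minimal Python sketch of the superiority probability and a crude single-look simulation of the design’s operating characteristics. The function names and the single-look simplification are mine, not part of the quoted output, and I make no claim that this reproduces the quoted figures:

```python
import random

def prob_superiority(y_a, n_a, y_b, n_b, delta=0.10, draws=20_000, rng=None):
    """Estimate P(pA > pB + delta | data) under independent Beta(1, 1) priors.
    Conjugacy gives a Beta(1 + y, 1 + n - y) posterior for each arm, so the
    probability can be estimated by direct posterior sampling (no MCMC)."""
    rng = rng or random.Random(1)
    hits = sum(
        rng.betavariate(1 + y_a, 1 + n_a - y_a)
        > rng.betavariate(1 + y_b, 1 + n_b - y_b) + delta
        for _ in range(draws)
    )
    return hits / draws

def success_rate(p_a, p_b, n_per_arm=100, delta=0.10, threshold=0.95,
                 n_trials=400, seed=2):
    """Single-look simplification of the quoted design: simulate trials at
    fixed true remission rates and count how often the final success
    criterion P(pA > pB + delta) > threshold is met.  (The quoted design
    adds interim looks and sample-size re-estimation, which change the
    operating characteristics.)"""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        y_a = sum(rng.random() < p_a for _ in range(n_per_arm))
        y_b = sum(rng.random() < p_b for _ in range(n_per_arm))
        wins += prob_superiority(y_a, n_per_arm, y_b, n_per_arm,
                                 delta, draws=2_000, rng=rng) > threshold
    return wins / n_trials

print(prob_superiority(45, 100, 30, 100))  # posterior prob at one interim look
print(success_rate(0.45, 0.30))            # conditional "power" under the alternative
print(success_rate(0.30, 0.30))            # type I error under the null
```

In my runs, this single-look simplification gives a success rate under the alternative well below the quoted 85% conditional power. The interim adaptations may explain part of the gap, but a discrepancy like that is exactly the kind of number a reviewer should ask the authors to verify by simulation rather than accept from an AI’s summary.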

Did you vet the accuracy of these calculations, beyond noting that it looks similar to something you saw in the past? If not, I would argue that it is negligence to just stick it in a SAP. “AI can answer almost any question one has about statistical analysis” - willingness to provide “an answer” does not speak to the veracity of that answer. Accuracy of ChatGPT on Statistical Methods

1 Like

Using your neurosurgery analogy, isn’t asking Grok to design a trial akin to asking Grok to do the brain operation? Trial design is orders of magnitude more complex than writing post-op orders…

As noted in other threads, I think many physicians drastically underestimate the expertise required to design and interpret a trial properly. And, arguably, a lot more people’s lives are at stake if a trial is designed poorly… Frankly, I think anyone who wants to experiment on humans should be required to pay the fee of a professional applied statistician; IMO, to NOT do this would be unethical. I second the post above: outsourcing trial design to AI at this time would be completely irresponsible.

3 Likes

Also, using Grok is one heck of a signalling device.

These are excellent suggestions - thanks!

1 Like

I think the suggestion to have both methods used and compared is a good one. It might be useful to ask authors to prepare a supplement to a manuscript with a Bayes analysis and compare it to the frequentist one that they will have used initially.

1 Like

I feel that a blanket change to Bayes would do far more good than harm. That’s because the current statistical paradigm is highly flawed, and the parts of it that aren’t flawed are usually misinterpreted anyway. Of course you can do bad things with Bayes, but I’m convinced we need wholesale change to break out of the current situation.

4 Likes

I think $1000 for a qualified statistician is pretty cheap. A good plan typically takes 20-100 hours of work to complete, especially if the statistician does simulations of how the planned analysis will perform.

To get back to Ed’s original question, and thinking also of his comment that the authors of the journal may not have sufficient access to qualified statisticians, I suggest that we all brainstorm a phased approach. Here is a very rough sequence:

  • Set up the journal so that there is full statistical review no matter which paradigm is used
  • Invite Bayesian methods, clearly stating their advantages and giving pointers to resources such as those listed by others above (and I’ll shamelessly push Bayes). Also highlight the greater actionability and interpretability of Bayesian analysis, with simple examples like “yes we can actually compute the probability that a treatment works; we don’t always have to compute the probability of getting impressive data if it doesn’t”.
  • Have a special section of the invitation that urges researchers to consider Bayesian sequential designs that don’t need any arbitrary sample size calculation and can have any number of stopping criteria.
  • After a while, give a review or publication incentive for using Bayes
  • After another while, start criticizing submissions for lost opportunities where a Bayesian approach would have been better.

Concerning the AI-designed adaptive trial, it’s not half bad. But it’s very poor on the choice of priors and perhaps on the metric on which to base the priors. I’d like to see such AI used more by the statistician than the PI, with the statistician correcting AI mistakes then engaging with the PI to refine to optimally meet clinical goals.

6 Likes

A Bayesian result is, in a sense, a posterior probability of a probability. It is less intuitive for the patient and the clinician than a plain absolute risk reduction with a confidence interval.

Unfortunately, some report the posterior probability of a relative risk, which makes me think of “the posterior probability of a probability ratio”.

I think the interpretation should be facilitated for those on the “reading side” of the journal, especially the clinicians who might benefit from the information.

1 Like

You say “Please convert the statistical comparison you did between 2 groups of continuous data from a t-test to the Bayes equivalent”. But the problem is that there is no single Bayesian equivalent of a t-test. I’d suggest continuing to give a p value, but supplementing it with a Bayesian test that uses a skeptical prior, in the sense that it puts a lump of prior probability (commonly 0.5) on the point null. E.g., for smallish samples, use James Berger’s approach, or even mine. Both of these approaches suggest that, if you have observed p = 0.05 in a well-powered experiment and you declare a discovery on that basis, the chance that you are wrong (the false positive risk) is 25-30%.
https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1543135
https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529622
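The 25-30% figure can be reproduced from the Sellke-Berger calibration, which converts an observed p-value into a minimum Bayes factor of −e·p·ln(p) for the point null and hence a lower bound on the false positive risk. A short sketch, assuming 50:50 prior odds on the null by default (the exact figures in the two linked papers differ slightly by method):

```python
import math

def false_positive_risk_bound(p, prior_odds_h1=1.0):
    """Lower bound on the false positive risk, via the Sellke-Berger
    minimum Bayes factor -e * p * ln(p) for the point null (valid for
    p < 1/e).  prior_odds_h1 is the prior odds of H1 against H0; the
    default of 1.0 corresponds to a 0.5 lump of prior probability on H0."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound requires 0 < p < 1/e")
    min_bf_h0 = -math.e * p * math.log(p)        # best case for H1
    posterior_odds_h0 = min_bf_h0 / prior_odds_h1
    return posterior_odds_h0 / (1.0 + posterior_odds_h0)

print(round(false_positive_risk_bound(0.05), 3))  # prints 0.289
```

So even in the most favorable case for the alternative, declaring a discovery on p = 0.05 carries a false positive risk of about 29% under these assumptions, consistent with the 25-30% quoted above.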

2 Likes

Makes sense, just will add $.02 that any Bayesian t-test that has more than 2 parameters is going to be an improvement and give more realistic (wider) uncertainty intervals.

1 Like

An absolute risk reduction is intuitively easy to understand, but a confidence interval… isn’t.

3 Likes

And most clinicians, and a good many statisticians, don’t realize that absolute risk reduction must be covariate-specific.

2 Likes

Do you have any advice for how to check on code developed by LLMs? I’ll assume that many authors will do this and not know exactly what to look for in the results. What should we be looking for from a journal perspective?

2 Likes