Bayesian Qualitative Analysis of Credibility for Meta-analysis of Heterogeneous Data from RCTs


There exist clinical questions where effect size aggregation procedures are inappropriate. In the context of highly heterogeneous data, frequentists can use various p value combination procedures. The research hypothesis would be:

“Assuming there is zero effect in all studies, is this collection of research results sufficiently surprising for me to conclude there is at least 1 study with a true effect?”

For most clinicians, that meta-analytic hypothesis is not very satisfying or informative.

A Bayesian alternative, based on the work of Robert Matthews and James Berger, would extend the work on the Bayesian Analysis of Credibility and on the calibration of p values to Bayes factors, providing a qualitative evidence measure for the direction of effect.

The result is the aggregated evidence (expressed as betting odds) of the primary information contained in the individual reports.

For a set of N studies, the Bayesian combination procedure is: $\exp\left(\sum_{i=1}^{N} \ln(\mathrm{BFB}_i)\right)$
where $\mathrm{BFB}_i$ (the Bayes Factor Bound) is $1/(-e \, p_i \ln(p_i))$, and $p_i$ is the p value for study $i$.

If a study shows benefit vs control, keep the sign of ln(BFB) positive. Otherwise, multiply ln(BFB) by -1.
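The procedure above can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the example p values are hypothetical, and mapping any p ≥ 1/e to a bound of 1 is an assumption on my part (the bound is not informative beyond that point).

```python
import math

def bayes_factor_bound(p):
    """Best-case Bayes factor for the alternative: 1 / (-e * p * ln(p)).

    Assumption: for p >= 1/e the bound is set to 1 (no evidence either way).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if p >= 1.0 / math.e:
        return 1.0
    return 1.0 / (-math.e * p * math.log(p))

def combined_odds(studies):
    """Combine (p_value, favors_treatment) pairs into best-case betting odds.

    ln(BFB) keeps a positive sign when the study favors treatment and is
    multiplied by -1 otherwise; the signed logs are summed and exponentiated.
    """
    log_sum = 0.0
    for p, favors_treatment in studies:
        log_bfb = math.log(bayes_factor_bound(p))
        log_sum += log_bfb if favors_treatment else -log_bfb
    return math.exp(log_sum)

# Hypothetical p values, not taken from any real trial.
example = [(0.04, True), (0.20, True), (0.03, False), (0.01, True)]
print(combined_odds(example))
```

A result above 1 is best-case odds in favor of benefit; a result below 1 is best-case odds in favor of the control.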

A very large number of professional associations have adopted the philosophy of “evidence based decision making” as it evolved in medicine, without considering the context in which their members practice. In my field of physical rehabilitation and disabilities, this has created a number of constraints that the following position paper beautifully describes:

Challenges of Evidence in Rehabilitation and Disability.

Evidence of these challenges can be demonstrated by examining the research on neuromuscular electrical stimulation (NMES) for the rehabilitation of stroke. A strong case can be made that reasonable clinical hypotheses can be asked (Does NMES improve recovery from CVA caused motor deficits?), but formal tools such as effect size meta-analytic techniques are not appropriate for currently available data.

A narrative review described the following:

  1. Since NESS H200 (Bioness Inc.)28,29 is the only commercially available FES device for the upper limb and hand (Figure 4B), more robust and versatile devices are required for a wider group of people. Since hand or upper limb motion is more diverse in comparison to lower limb motion, the electrical stimulation system associated with stimulating the former is also complicated.

  2. The evidence of the applications of NMES in rehabilitation is still limited.

A relatively recent meta-analysis on NMES for CVA (2015) was published in the Archives of Physical Medicine and Rehabilitation:

Functional Electrical Stimulation Improves Activity After Stroke: A Systematic Review With Meta-Analysis (2015)

Looking at the descriptions of the primary reports, the authors aggregated studies that used NMES on the upper extremity (UE) and the lower extremity (LE), and that differed substantially in how outcomes were measured. Some studies used an ordinal outcome measurement; others used a reduction in time on a standardized test. All were aggregated using the standardized mean difference.

Most clinicians who would be motivated to conduct a meta-analysis can be forgiven for not realizing that this procedure is problematic. For example:

The author constructs an example of an outcome measurement with ratio units, and conducts a meta-analysis using the natural units as well as using the standardized effect. The standardized effect meta-analysis is more heterogeneous and inconsistent with the natural units version of the analysis.

Individuals who have experienced a stroke represent a clinically heterogeneous population. This clinical heterogeneity leads to methodological heterogeneity related to the various possible assessments used to measure effect magnitude. These two sources of heterogeneity lead to a wide array of procedures for statistical analysis and data preparation. This adds another layer of statistical heterogeneity that is very slowly being noticed. Readers should note that ordinal logistic regression is one of the recommended techniques. I’d say it should be the preferred technique, considering the data we are working with.

Can We Improve the Statistical Analysis of Stroke Trials?

Compare the above peer reviewed meta-analysis with the following guidelines from the AHRQ:

Quantitative Synthesis Methods – An Update

The critical question they advise asking is: “Is it appropriate to pool the available studies?”

In the case of the stroke rehabilitation literature on this topic, a strong case could be made that it should not.

The references and YouTube video in this post by @Sander were very useful in helping me think about this issue:

Video Summary
Robert Matthews has written a number of papers on the Bayesian Analysis of credibility, using the frequentist confidence intervals and assumptions to derive what he calls the “honest skeptic” and “honest advocate” priors. We then elicit our own prior by comparing the derived prior with what we understand about the contextual information.

Jim Berger and a number of colleagues have done work on calibrating p values to Bayes factors (ratios of marginal likelihoods). In the video and in this paper, he argues for reporting -e * p * ln(p), valid for p < 1/e, as a lower bound on the Bayes factor in favor of the null; its reciprocal is the upper bound on the evidence for a directional (nested) alternative. In the video he goes into detail on why this is useful for both frequentists and Bayesians.
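To get a feel for the calibration, here is the bound worked out for a few conventional p values. This is my own arithmetic, not figures quoted from the paper, and BFB simply denotes the reciprocal 1/(-e * p * ln(p)):

$$
\begin{aligned}
p = 0.05:&\quad -e\,p\ln p \approx 0.407, &\mathrm{BFB} &\approx 2.46 \\
p = 0.01:&\quad -e\,p\ln p \approx 0.125, &\mathrm{BFB} &\approx 7.99 \\
p = 0.005:&\quad -e\,p\ln p \approx 0.0720, &\mathrm{BFB} &\approx 13.9
\end{aligned}
$$

So even in the best case, p = 0.05 corresponds to odds of only about 2.5 to 1 against the null.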

Calibration of p Values for Testing Precise Null Hypotheses

In the context of highly heterogeneous data, frequentists can use various p value combination procedures. But for most clinicians, the meta-analytic hypothesis these procedures test is not very satisfying or informative.

A proposal

I’d like a Bayesian alternative to frequentist p value aggregation that builds upon Matthews’s Bayesian Analysis of Credibility. It would move researchers away from naive, mistaken interpretations of p values, and encourage thinking about future experiments in a decision theoretic context.

For a set of N studies, I’d calculate $\exp\left(\sum_{i=1}^{N} \ln(\mathrm{BFB}_i)\right)$, where $\mathrm{BFB}_i$ (the Bayes Factor Bound) is $1/(-e \, p_i \ln(p_i))$, and $p_i$ is the p value for study $i$. If a study shows benefit vs control, keep the sign of ln(BFB) positive. Otherwise, multiply ln(BFB) by -1.

That calculation would be the most optimistic scenario for the advocate, who should know that this factor should be discounted for various sources of bias and error.

Likewise, the skeptic should re-think his/her beliefs if background information is inconsistent with even a discounted report of the aggregate data. From there, it becomes easier to list which predictive variables to account for, and which possible biases need to be handled in a Bayesian decision theoretic framework.

Addendum: Here is a direct link to the YouTube video featuring James Berger, Sander Greenland, and Robert Matthews on the misuse of p values, and alternatives for correct interpretation.

Some Related Threads


Here is a blog post critical of the mapping of p values to Bayes factors. I still believe this mapping has value, but will need to think about the arguments presented here.


I do not think the author of the blog post considered all ways of using p values.
His critique focuses on hypotheses that are related to the magnitude of the difference from the null, as well as the least informative hypothesis of $H_1 \ne H_0$.

Retrospectively, any reader is free to convert the 2 sided p value to a 1 sided directional hypothesis.
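A minimal sketch of that conversion, assuming a symmetric test statistic (for example a z or t test). The function name and arguments are hypothetical:

```python
def one_sided_p(two_sided_p, effect_in_hypothesized_direction):
    """Convert a two-sided p value to a one-sided, directional p value.

    Assumption: the test statistic has a symmetric null distribution,
    so each tail carries half of the two-sided p value.
    """
    if not 0.0 <= two_sided_p <= 1.0:
        raise ValueError("two_sided_p must be in [0, 1]")
    half = two_sided_p / 2.0
    # If the observed effect points the other way, the one-sided p value
    # covers everything except the opposite tail.
    return half if effect_in_hypothesized_direction else 1.0 - half
```

For a reported two-sided p of 0.04, this gives 0.02 when the observed effect is in the hypothesized direction and 0.98 when it is not.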

As for his main point, that there is no function that maps p values to effect sizes (magnitudes), he is correct.

If we were to use an experiment to settle a bet where the point null of absolutely no effect (where the effect is in $\mathbb{R}$) is somewhat plausible, how would we negotiate which procedure settles the question with a counter-party who says it is not, but does not pick a direction? If we had a choice between fixing the p value threshold and fixing the sample size N, which should we choose?

If we decide to fix the p value, our counter-party could choose a high N and guarantee a “significant” result. Therefore, we can conclude there is no one to one mapping of p values to effect sizes.

(In reality, we would want to select some maximal level for p and an upper bound for N).
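The dependence on N can be illustrated with an idealized sketch. The assumptions are mine: a one-sample z test with known standard deviation, and an observed mean that equals a tiny hypothetical true effect exactly. All names and numbers are made up for illustration.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def p_for_sample_size(effect, n, sd=1.0):
    """p value when the observed mean equals `effect` with n observations."""
    z = effect / (sd / math.sqrt(n))
    return two_sided_p_from_z(z)

tiny_effect = 0.01  # hypothetical effect, in standard deviation units
for n in (100, 10_000, 1_000_000):
    print(n, p_for_sample_size(tiny_effect, n))
```

The same effect magnitude produces very different p values as N grows, which is why a p value alone cannot be mapped back to an effect size.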

However, from a meta-analysis POV, we can compare the Bayesian conversion of one sided p values (where there is a mapping between the sign of the effect, and a p value) to the Frequentist p value combination procedures. If we compel our counter-party to select a direction, he/she would be required to have some belief about the outcome before agreeing to the bet.

The Bayes Factor Bound, in the one tailed scenario, provides the best case odds on the direction of the effect. If those are reported (or can be calculated) from an individual study, based on the law of likelihood, they provide an alternative to the Frequentist p value combination techniques.
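For comparison, here is a sketch of one frequentist combination procedure, Fisher's method. It is my choice of example; the thread does not single out a particular procedure. It tests the global null that there is zero effect in every study. The function name is hypothetical, and the inputs are assumed to be independent one-sided p values.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k df under the null.

    For even degrees of freedom the chi-square survival function has a
    closed form, so no statistics library is needed:
        P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p value")
    half_stat = -sum(math.log(p) for p in p_values)  # test statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Hypothetical one-sided p values, not taken from any real trial.
print(fisher_combined_p([0.05, 0.05]))
```

Fisher's method returns a single p value against the global null, while the signed Bayes Factor Bound combination returns best-case odds on the direction of the effect.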

If I remember the theory correctly, there is a problem in converting a two-tailed frequentist test to a Bayesian one, but no problem in converting a one-tailed frequentist test to a Bayesian version.