Synopsis:
There exist clinical questions where effect size aggregation procedures are inappropriate. When the data are highly heterogeneous, frequentists can use various p value combination procedures. The research hypothesis would be:
“Assuming there is zero effect in all studies, is this collection of research results sufficiently surprising for me to conclude there is at least 1 study with a true effect?”
For most clinicians, that metaanalytic hypothesis is not very satisfying or informative.
A Bayesian alternative, based on the work of Robert Matthews and James Berger, would extend the work on Bayesian Analysis of Credibility and on the calibration of p values to Bayes factors to provide a qualitative evidence measure for the direction of effect.
The result is the aggregated evidence (expressed as betting odds) of the primary information contained in the individual reports.
For a set of N studies, the Bayesian combination procedure is: \exp\left(\sum_{i=1}^{N} \ln(BFB_i)\right)
where BFB_i (Bayes Factor Bound) is 1/(-e * p_i * ln(p_i)), and p_i is the p value for study i.
If a study shows benefit vs control, keep the sign of ln(BFB_i) positive. Otherwise, multiply ln(BFB_i) by -1.
Context:
A very large number of professional associations have adopted the philosophy of “evidence based decision making” as it evolved in medicine, without considering the context in which their members practice. In my field of physical rehabilitation and disabilities, this has created a number of constraints that the following position paper beautifully describes:
Challenges of Evidence in Rehabilitation and Disability.
https://ktdrr.org/ktlibrary/articles_pubs/ncddrwork/tfse_challenge/index.html
Evidence of these challenges can be demonstrated by examining the research on neuromuscular electrical stimulation (NMES) for the rehabilitation of stroke. A strong case can be made that reasonable clinical hypotheses can be posed (Does NMES improve recovery from CVA-caused motor deficits?), but that formal tools such as effect size metaanalytic techniques are not appropriate for the currently available data.
A narrative review described the following:

Since NESS H200 (Bioness Inc.)28,29 is the only commercially available FES device for the upper limb and hand (Figure 4B), more robust and versatile devices are required for a wider group of people. Since hand or upper limb motion is more diverse in comparison to lower limb motion, the electrical stimulation system associated with stimulating the former is also complicated.

The evidence of the applications of NMES in rehabilitation is still limited.
A relatively recent metaanalysis on NMES for CVA (2015) was published in The Archives of Physical Medicine and Rehabilitation:
Functional Electrical Stimulation Improves Activity After Stroke: A Systematic Review With MetaAnalysis (2015)
https://www.archivespmr.org/article/S00039993(15)000441/fulltext
Looking at the descriptions of the primary reports, the authors aggregated studies that used NMES on the upper extremity (UE) and the lower extremity (LE), and that differed substantially in how outcomes were measured. Some studies used an ordinal outcome measurement; others used a reduction in time on a standardized test. All were aggregated using the standardized mean difference.
Most clinicians who would be motivated to conduct a metaanalysis can be forgiven for not realizing that this procedure is problematic. For example:
The author constructs an example of an outcome measurement with ratio units, and conducts a metaanalysis using the natural units as well as using the standardized effect. The standardized effect metaanalysis is more heterogeneous and inconsistent with the natural units version of the analysis.
Individuals who have experienced a stroke represent a clinically heterogeneous population. This clinical heterogeneity leads to methodological heterogeneity, because there are many possible assessments for measuring effect magnitude. These two sources of heterogeneity lead to a wide array of procedures for statistical analysis and data preparation. This adds another layer of statistical heterogeneity that is only very slowly being noticed. Readers should note that ordinal logistic regression is one of the recommended techniques. I’d say it should be the preferred technique, considering the data we are working with.
Can We Improve the Statistical Analysis of Stroke Trials?
https://www.ahajournals.org/doi/10.1161/strokeaha.106.474080
Compare the above peer reviewed metaanalysis with the following guidelines from the AHRQ:
Quantitative Synthesis Methods – An Update
https://effectivehealthcare.ahrq.gov/products/methodsquantitativesynthesisupdate/methods
The critical question they advise asking is: “Is it appropriate to pool the available studies?”
In the case of the stroke rehabilitation literature on this topic, a strong case could be made that it is not.
The references and YouTube video in this post by @Sander were very useful in helping me think about this issue:
Video Summary
Robert Matthews has written a number of papers on the Bayesian Analysis of Credibility, using frequentist confidence intervals and assumptions to derive what he calls the “honest skeptic” and “honest advocate” priors. We then elicit our own prior by comparing the derived prior with what we understand about the contextual information.
Jim Berger and a number of colleagues have done work on calibrating p values with Bayes factors (ratios of marginal likelihoods). In the video and in this paper, he argues for reporting 1/(-e * p * ln(p)), valid for p < 1/e, as the upper bound on the evidence for a directional (nested) alternative. He goes into detail in the video on why this is useful for both frequentists and Bayesians.
Calibration of p Values for Testing Precise Null Hypotheses
https://amstat.tandfonline.com/doi/abs/10.1198/000313001300339950
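To give a feel for the scale of this bound, here is a minimal Python sketch that evaluates 1/(-e * p * ln(p)) for a few conventional p values. The function name `bayes_factor_bound` is my own, and returning 1 for p at or above 1/e (where the formula no longer applies) is an assumption made for illustration.

```python
import math


def bayes_factor_bound(p: float) -> float:
    """Upper bound on the Bayes factor favouring the alternative: 1 / (-e * p * ln(p)).

    Assumption for this sketch: for p >= 1/e the bound is taken to be 1
    (no evidence either way), since the formula only applies for p < 1/e.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if p >= 1.0 / math.e:
        return 1.0
    return 1.0 / (-math.e * p * math.log(p))


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.005):
        print(f"p = {p}: BFB = {bayes_factor_bound(p):.2f}")
```

Under these assumptions, p = 0.05 corresponds to odds of at most about 2.5 to 1 in favour of the alternative, which is far weaker than the "1 in 20" reading many people attach to it.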
When the data are highly heterogeneous, frequentists can use various p value combination procedures. But for most clinicians, the metaanalytic hypothesis those procedures test is not very satisfying or informative.
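For readers who have not seen a p value combination procedure, Fisher's method is one familiar example (my choice for illustration; it is not singled out above). A minimal sketch, assuming independent studies and using a hypothetical function name, `fisher_combined_p`:

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method: combine independent p values into one global p value.

    The statistic -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the global null (zero effect in all k studies).
    For even degrees of freedom the upper tail has a closed form:
        P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    p_values = list(p_values)
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("need at least one p value, each in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals statistic / 2
    term, series = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        series += term
    return math.exp(-half_stat) * series


if __name__ == "__main__":
    # Hypothetical p values, for illustration only.
    print(fisher_combined_p([0.05, 0.05]))
```

A small combined p value only rejects the global null of zero effect everywhere. It says nothing about direction or magnitude, which is exactly why the hypothesis is unsatisfying for clinicians.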
A proposal
I’d like a Bayesian alternative to frequentist p value aggregation that builds upon Matthews’s Bayesian Analysis of Credibility. It would move researchers away from naive, mistaken interpretations of p values, and encourage thinking about future experiments in a decision theoretic context.
For a set of N studies, I’d calculate: \exp\left(\sum_{i=1}^{N} \ln(BFB_i)\right) where BFB_i (Bayes Factor Bound) is 1/(-e * p_i * ln(p_i)), and p_i is the p value for study i. If a study shows benefit vs control, keep the sign of ln(BFB_i) positive. Otherwise, multiply ln(BFB_i) by -1.
That calculation would be the most optimistic scenario for the advocate, who should know that this factor should be discounted for various sources of bias and error.
Likewise, the skeptic should rethink his/her beliefs if background information is inconsistent with even a discounted report of the aggregate data. From there, it becomes easier to list which predictive variables to account for, and which possible biases need to be handled in a Bayesian decision theoretic framework.
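The aggregation proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example p values are hypothetical, the studies are treated as independent, and a study with p at or above 1/e is assumed to contribute ln(BFB) = 0 because the bound formula does not apply there.

```python
import math


def signed_log_bfb(p: float, favours_treatment: bool) -> float:
    """Signed ln(BFB) for one study, with BFB = 1 / (-e * p * ln(p)).

    Positive when the study shows benefit vs control, negative otherwise.
    Assumption: studies with p >= 1/e contribute 0 (no evidence either way).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if p >= 1.0 / math.e:
        return 0.0
    log_bfb = -math.log(-math.e * p * math.log(p))
    return log_bfb if favours_treatment else -log_bfb


def combined_odds(studies) -> float:
    """exp(sum of signed ln(BFB)) over (p_value, favours_treatment) pairs."""
    return math.exp(sum(signed_log_bfb(p, fav) for p, fav in studies))


if __name__ == "__main__":
    # Hypothetical results: (p value, does the study favour treatment?)
    studies = [(0.04, True), (0.20, True), (0.01, False), (0.003, True)]
    print(f"Best-case odds for the advocate: {combined_odds(studies):.2f} to 1")
```

Working on the log scale means a study favouring control cancels a study of equal strength favouring treatment, and the final number reads directly as betting odds before any discounting for bias.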
Addendum: Here is a direct link to the Youtube video featuring James Berger, Sander Greenland, and Robert Matthews on the misuse of P values, and alternatives for correct interpretation.
Some Related Threads