A Bayes-Frequentist approach to the Analysis of Credibility of meta-analyses using p-values

I wanted to post this for some input to see if there is any flaw in my reasoning for the proposed procedure. It incorporates a lot of ideas from dozens of papers and posts. Following the ideas from A Theory of Experimenters, this description is how an individual with no need to convince an external audience might use it. There are some adaptations I can imagine that would bring it in line with Robert Matthews’s Analysis of Credibility, to model a dialogue with a skeptical audience.

Statistical practice changes slowly because the teaching of statistics changes slowly … Once sociologists and physicians have learned about significance levels well enough to use them, a major reorganization of the thought process is required to adapt to decision theoretic or Bayesian analysis … With the help of theory, I have developed insights and intuitions that prevent me from giving weight to data dredging and other forms of statistical heresy. This feeling of freedom and ease does not exist until I have a decision theoretic, Bayesian view of the problem. I am a Bayesian decision theorist in spite of my use of Fisherian tools.
Herman Chernoff, in commentary on “Why Isn’t Everyone a Bayesian?” by Bradley Efron (1986)

Procedure Description

  • Sequence of signed test statistics and/or 2-sided p-values,
  • Prior and Posterior Odds,
  • Sequence of changes in Posterior Odds as the number of studies increases,
  • Sequence of changes in \alpha allocation as the number of studies increases (see this article on adaptive \alpha levels),
  • Expected Power for individual study
  • Step procedure (Holm or Hochberg)
  • P value combination procedure (Stouffer-Liptak, Fisher, Logit)

Statement of Posterior Odds: There exist s of N studies that indicate a clear direction of effect, where

  • s = H(r) + C(b)
  • H(r) = nulls rejected by the Hochberg step-up method (or Holm step-down)
  • C(b) = -1 or +1 if the combination procedure, applied to the studies not rejected by Hochberg, rejects at the remaining \alpha; 0 otherwise.
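To make the two pieces concrete, here is a minimal sketch in Python of how H(r) and C(b) could be computed. It is only an illustration under my own assumptions: the function names, the choice of an unweighted Stouffer Z for the combination step, the example z-values, and the split of \alpha into `alpha_step` and `alpha_comb` are all hypothetical, not part of the proposal.

```python
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def hochberg_reject(pvalues, alpha):
    """Hochberg step-up: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= alpha / (m - k + 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    k_max = 0
    for rank in range(m, 0, -1):  # start from the largest p-value
        if pvalues[order[rank - 1]] <= alpha / (m - rank + 1):
            k_max = rank
            break
    rejected = [False] * m
    for rank in range(k_max):
        rejected[order[rank]] = True
    return rejected

def stouffer_sign(z_values, alpha):
    """C(b): sign of the unweighted Stouffer Z if its two-sided p-value
    is at most alpha, otherwise 0."""
    if not z_values:
        return 0
    z = sum(z_values) / len(z_values) ** 0.5
    if two_sided_p(z) > alpha:
        return 0
    return 1 if z > 0 else -1

# Hypothetical signed z statistics from six studies.
z_stats = [3.2, 1.9, 2.9, 1.7, 1.8, 1.2]
alpha_step, alpha_comb = 0.04, 0.01  # illustrative split of a 0.05 family level

rejected = hochberg_reject([two_sided_p(z) for z in z_stats], alpha_step)
H = sum(rejected)
C = stouffer_sign([z for z, r in zip(z_stats, rejected) if not r], alpha_comb)
print("H(r) =", H, " C(b) =", C)  # prints: H(r) = 2  C(b) = 1
```

In this made-up example two studies are rejected individually, and the four remaining studies jointly indicate a positive sign at the leftover \alpha.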

The familywise \alpha level will be set using Bayesian reasoning. Following Cheng and Sheng, \alpha will be allocated between a step procedure and a combination procedure. The input for the combination procedure will be the remaining statistics not rejected by the step procedure. Following Naaman and Pericchi and Pereira, the familywise \alpha will decline as the number of studies increases. In addition, less error will be allocated to the combination procedure, permitting more power for the sequential testing procedure to find a local effect.
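As a rough sketch of what such an allocation could look like, the snippet below shrinks both the family level and the combination procedure's share as studies accumulate. The 1/sqrt(N) schedule, the starting values, and the function name `allocate_alpha` are placeholders I made up for illustration; the actual schedule would come from the prior odds and expected power specified in advance.

```python
def allocate_alpha(n_studies, base_alpha=0.05, base_comb_share=0.5):
    """Return (family_alpha, alpha_step, alpha_comb) for n_studies.

    Hypothetical schedule: both the family level and the share given to
    the combination procedure shrink at rate 1/sqrt(N).
    """
    if n_studies < 1:
        raise ValueError("need at least one study")
    shrink = n_studies ** 0.5
    family_alpha = base_alpha / shrink
    alpha_comb = family_alpha * (base_comb_share / shrink)
    alpha_step = family_alpha - alpha_comb  # remainder goes to the step procedure
    return family_alpha, alpha_step, alpha_comb

for n in (1, 4, 9, 16):
    fam, step, comb = allocate_alpha(n)
    print(f"N={n:2d}  family={fam:.5f}  step={step:.5f}  combination={comb:.5f}")
```

The only properties this is meant to show are that the two pieces always sum to the family level, and that the combination share falls as N grows.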

All values will be specified by the user prior to data collection as part of the SAP (statistical analysis plan).

Rationale: Contrary to Chernoff’s quote, current discussions of a scientific replication crisis indicate that “significance” levels are not understood well enough to be used correctly in large areas of science. This suboptimal state of affairs continues to exist because:

Classical Fisherian significance testing is immensely popular because it requires so little from the scientist…
Bradley Efron, Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction (p. 48)

This major reorganization of thought begins by placing elementary statistical procedures in a Bayesian context. Particularly important is that correction for multiplicity and the information fusion approach via p-value combination both have Bayesian and information-theoretic justifications.

Omnibus p-value procedures remain useful and research into extensions of these methods is ongoing.
Hedges and Olkin (1985) note:

An important application of omnibus test procedures is to screen for any effect … Alternatively, combined test procedures can be used to combine effect size analyses based on different outcome variables.

In the context of meta-analysis, it is far from unusual to have what would ordinarily be a small sample (i.e., no more than 20 studies). This limits the application of preferred methods (i.e., effect size aggregation or meta-regression) when explanation of variability would be useful. Rather than produce a misleading effect size estimate from heterogeneous studies, it would be preferable to compare and contrast the local results of individual studies to design more informative ones given current knowledge.

The following procedure builds upon Cheng and Sheng (2017).

Their procedure divides the familywise \alpha among 2 uncorrelated p-value combination procedures. This \alpha division is essentially a Bonferroni adjustment for multiple tests.
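For readers who want to see the mechanics, here is a minimal sketch of splitting a family level across two combination tests. Fisher and Stouffer-Liptak are used only as familiar stand-ins; I am not claiming they are the pair Cheng and Sheng use, and the function names and the example one-sided p-values are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_combine(pvalues):
    """Fisher: X = -2 * sum(ln p) ~ chi-square with 2k df under the null.
    For even df the survival function has a closed form."""
    k = len(pvalues)
    half_x = -sum(math.log(p) for p in pvalues)  # X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i      # (X/2)^i / i!
        total += term
    return math.exp(-half_x) * total

def stouffer_combine(pvalues):
    """Unweighted Stouffer-Liptak on one-sided p-values."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1 - p) for p in pvalues) / len(pvalues) ** 0.5
    return 1 - nd.cdf(z)

def either_rejects(pvalues, family_alpha=0.05):
    """Bonferroni split: each combination test gets family_alpha / 2."""
    split = family_alpha / 2
    return fisher_combine(pvalues) <= split or stouffer_combine(pvalues) <= split

one_sided = [0.04, 0.06, 0.10]  # made-up one-sided p-values
print(fisher_combine(one_sided), stouffer_combine(one_sided))
print(either_rejects(one_sided))
```

None of the three example p-values would be rejected on its own after a multiplicity adjustment at 0.05, yet both combination tests fall below the split level.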

Taking guidance from Bayarri, Berger, Benjamin, and Sellke (2016), Pericchi and Pereira (2016), and Naaman (2016), this Bayes-Frequentist procedure will overcome limitations of classical combination procedures and provide a more intuitive Bayesian interpretation.
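As a small worked example of the Bayesian interpretation, the sketch below computes the pre-experimental rejection odds (prior odds times power over \alpha) and the -e p ln(p) upper bound on the Bayes factor against the null, both as described in the rejection odds literature. The function names and the numbers plugged in are my own illustrative choices.

```python
import math

def rejection_odds(prior_odds, power, alpha):
    """Pre-experimental odds of a true to a false rejection:
    prior odds (alternative : null) times power / alpha."""
    return prior_odds * power / alpha

def bayes_factor_bound(p):
    """Upper bound 1 / (-e * p * ln p) on the Bayes factor in favour of the
    alternative, valid for 0 < p < 1/e."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound only applies for 0 < p < 1/e")
    return 1 / (-math.e * p * math.log(p))

prior_odds = 1.0  # illustrative 1:1 prior odds
print(rejection_odds(prior_odds, power=0.8, alpha=0.05))
print(prior_odds * bayes_factor_bound(0.05))  # bound on posterior odds at p = 0.05
```

With even prior odds, 80% power and \alpha = 0.05, the rejection odds are 16 to 1, while an observed p = 0.05 supports posterior odds of at most about 2.5 to 1.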

Why would this be at all useful in science? Shouldn’t we always aim to be our own, harshest critics – Einsteins, not amoebas?

In an information-theoretic (and decision-theoretic) sense, this is neither realistic nor necessary.

Scientific concerns and decision-theoretic ones coincide when maximizing information is equivalent to maximizing utility. An agent might not have enough of a budget to run a “definitive” experiment, and must factor in the cost of information within the context in which the decision will be made.

Put another way, the “harsh critic” perspective might entail having such a low \alpha for your experiment that it has no power to detect anything surprising.

It sounds to me as if you are concerned with the problem of constructing beliefs that are ‘useful’, rather than constructing theories that are true. (I can’t help pointing out that we have in the US a neofascist party utterly committed to the program of constructing what are for them highly useful ‘beliefs’, without regard for truth). By making budgetary (and other such) excuses for ourselves, and setting out to find a science that is useful, we obtain neither science nor usefulness. I’m sure you’ve heard the old saw about how ‘we ultimately have to make a clinical decision’ and so must dichotomize.

I don’t want to get bogged down in philosophy or politics, although I agree that the Feynman idea of a scientist with integrity seems to have been long forgotten.

Suffice it to say there is a modification that can incorporate a range of priors to model the dialogue between an honest advocate and an honest skeptic that will satisfy your concerns. I just want to know if I am making some glaring error in logic before I write any simulations.

Arguably, your #inductivism IS the logical error ;). It will be interesting to see if your simulations prove telling on this point. I think sufficiently good simulations will!

I don’t understand @R_cubed Robert’s motivation well enough. The procedure he outlined is a complex mix of Bayesian and frequentist ideas. Mixing the two seems to cause nothing but confusion. And the use of odds implies point hypotheses for both reference and alternative conditions. Even though zero is special in a certain way for a reference hypothesis, there is no special alternative value. So the procedure should use posterior distributions instead of odds even if it mixes in some frequentist ideas. But I would recommend just going with a Bayesian hierarchical model and being done with it.

Regarding multiplicity, we will stay clear if we distinguish likelihood-modifying multiplicity from non-likelihood-modifying multiplicity.


Fair question. Sticking strictly to frequentist criteria, do you see any issue where the familywise \alpha would be higher than the nominal level for this method? Is there any frequentist reason why someone would reject the procedure? That seems to be the most critical point for me ATM.

As for the “alternative condition”: in the analysis of p-values, the default assumption is that none of the studies are surprising enough to reject the no-effect model (i.e., no clear sign in any study). Placing odds on this omnibus hypothesis allows me to have a procedure that states “These specific studies indicate a clear direction (sign), and this set of studies indicates at least 1 additional study with direction, with a posterior odds of X to 1” that will satisfy both Frequentists and Bayesians.

By using the step procedure, I’m able to argue that the estimated signs are the “true” signs based upon the frequentist criteria. The studies that aren’t rejected still contain information about the sign and, at least according to the frequentist interpretation, count as at least 1 study.

One problem I commonly find in rehabilitation is the aggregation of heterogeneous studies with different outcome scales. Take for example this meta-analysis of functional electrical stimulation for CVA, or the meta-analysis I linked to above on ACL injury prediction.

It has been a while since I read the CVA paper, but they lumped together a variety of patient populations, outcome assessments, and statistical procedures using the “standardized mean difference”, which we know is problematic. There was a very similar meta-analysis published in that journal a decade earlier, using similar analysis methods.

For the sake of discussion, assume that each study used reasonable procedures for the particular study conducted (i.e., locally rational). I think various ordinal scales provide information; I also think interval and ratio measurements provide value.

But in the face of such heterogeneity of methods, populations, etc, it isn’t clear how to combine them on an estimation scale to argue to a skeptic (who engages in what Sander calls nullism) that either better studies would be useful, or there is at least a preponderance of evidence to believe an intervention is worthwhile.

I have no doubt your suggestion is correct – when there is a well defined question to ask (and answer). In rehabilitation, we are usually in the stage where there is a jungle of ill-specified models, along with misapplied frequentist methods, applied to an equally large variety of assessment methods.

Other motivations:

  1. Clinicians will continue to see p-values for a long time to come for the very reason Efron noted – they don’t ask for much from the researcher. Since p-values have a Bayesian interpretation as a lower bound for the sign of the effect, there is no need to stick with the fixed cut-offs used in textbooks. I view this as part of a campaign to embrace, extend, and then extinguish this misleading “null” hypothesis significance testing and move toward more decision theoretic ideas.

  2. Suppose we have a meta-analysis that finds “no evidence” via effect size averaging, but in which 1 study is rejected (with multiplicity adjustment) while the aggregate of the other studies is rejected with the opposite sign. We can then open the discussion to the differences among the studies, talk about covariates that need to be accounted for, and then introduce some of the ideas from RMS on what future studies should account for.

  3. The method of analysis should be determined by the amount of information contained in the data.
    A large majority of meta-analyses in rehabilitation contain far fewer than 20 studies. Given the sources of heterogeneity above, I don’t see how textbook suggestions like meta-regression would be helpful. It is more honest to discuss direction of effect with small samples, and then figure out how to estimate it appropriately.

  4. I can later argue for your ordinal longitudinal approach after they understand p-values well enough to comprehend that the parametric methods used on ordinal data don’t even calculate a valid p-value (or estimate) that can be used in a synthesis.

I have other motivations, but that is good for now. I always appreciate your input.

Boiling down the question to whether a sequential add-one-study-at-a-time procedure has any \alpha issues in the frequentist sense, it depends on what you might do upon completing one of the early analyses. If you might stop with a conclusion that you have evidence for an effect, then it seems that multiplicity adjustment is needed. Bayesians can argue that this frequentist procedure is not very logical, but if you are going to be a frequentist you should probably be a frequentist.
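A quick simulation makes the stopping issue visible. The sketch below is not the proposed procedure; it assumes the simplest possible version (an unweighted Stouffer Z recomputed after each new study, with stopping the first time it crosses the unadjusted two-sided critical value), and the function name, seed, and simulation size are arbitrary choices.

```python
import random
from statistics import NormalDist

def sequential_type1_rate(n_looks, alpha=0.05, n_sims=20000, seed=1):
    """Share of null simulations in which the cumulative Stouffer Z crosses
    the unadjusted critical value at any of the first n_looks analyses."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for k in range(1, n_looks + 1):
            total += rng.gauss(0.0, 1.0)       # z statistic of study k under the null
            if abs(total) / k ** 0.5 > crit:   # "stop and claim an effect"
                hits += 1
                break
    return hits / n_sims

print("1 look :", sequential_type1_rate(1))
print("5 looks:", sequential_type1_rate(5))
```

With a single analysis the false positive rate stays near the nominal 5%, while allowing a stop at any of five unadjusted analyses pushes it well above that.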

One application is more of a sanity check on published syntheses that mangle the analysis.

One last question: Safe to assume you aren’t all that enthusiastic about the Berger proposal in his Bayesian Rejection Ratio paper? This very method was used in large N genetic data to substantially improve prediction regarding future studies (by dramatically lowering \alpha for individual tests).

If there is a complaint – it is the notion of “Rejection” that might be troublesome for those who prefer estimation. Perhaps a better title would be “The unreasonable effectiveness of Bayesian methods to derive frequentist procedures.”

I think it may miss the point. The point is what the prior is for the signals in the high dimensional setting. This prior can be viewed equally as the prior information about the effect of one feature, or the population distribution of effects of all features together. This prior should be chosen carefully. The horseshoe prior is one good possibility. Then all posterior inference flows without modification for dimensionality.
