Nonparametric Effect Size estimation, Likelihood Methods, and Meta Analysis

Posted as a new topic at Prof. Harrell’s request.

Study heterogeneity is a big issue in research synthesis, and is related to various issues in robust statistics.

I have an intuition that those involved in research synthesis might be better served by using nonparametric estimates of effect (Hodges-Lehmann estimates/generalized odds ratios, etc), instead of necessarily relying upon standardized mean differences reported by primary researchers.

But I am also persuaded by the work of Richard Royall and Jeffrey Blume that the likelihood perspective is the proper way to evaluate and synthesize statistical evidence.

Can anyone refer me to any recent research that discusses either robust likelihood methods, or nonparametric likelihood methods in the context of meta analysis? I am aware of the work on Empirical Likelihood by Art Owen, and some work on generalized likelihood by Fan and Jiang, but it is a bit over my head at the moment, and not directly related to the issues of research synthesis.

The impression I get is that likelihood methods can be computationally challenging, which is why things like p values and NHST persist.

Some references:
Generalized Likelihood Ratio Statistics and Wilks Phenomenon


Just running out but this is an interesting argument I’d like to see hashed out more. Are you just referring to analysis of ordinal data here? It’s worth noting that if ordinal data are reported fully, you can just do an ordinal meta-analysis (e.g. for skin outcomes in Psoriasis).

Why is it the proper way?

“Just running out but this is an interesting argument I’d like to see hashed out more. Are you just referring to analysis of ordinal data here? It’s worth noting that if ordinal data are reported fully, you can just do an ordinal meta-analysis (e.g. for skin outcomes in Psoriasis).”

No, I am thinking about how to integrate various studies, but relaxing the parametric assumptions virtually everyone reports.

I am not sure if you have read any of the work done in robust statistics, but the initial work used mixture (contaminated) distributions, i.e. an observation comes from a normal distribution with probability 90% and from a normal with much larger variance with probability 10%. I’m thinking the scenario a meta analyst faces is similar to a normal + random error distribution. How would you extract information from that scenario? A nonparametric technique should be able to handle it.
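To make the mixture scenario concrete, here is a minimal Python sketch. The function names, the contamination settings (10% of draws from a normal with standard deviation 10), and the choice of the one-sample Hodges-Lehmann estimate as the robust summary are my own illustrative assumptions, not something taken from a particular paper.

```python
import random
from statistics import mean, median

def contaminated_sample(n, eps=0.10, wide_sd=10.0, seed=0):
    """Draw n values: N(0, 1) with probability 1 - eps,
    N(0, wide_sd^2) with probability eps (illustrative settings)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, wide_sd if rng.random() < eps else 1.0)
            for _ in range(n)]

def hodges_lehmann(xs):
    """One-sample Hodges-Lehmann estimate: the median of all
    pairwise (Walsh) averages, including each value with itself."""
    n = len(xs)
    walsh = [(xs[i] + xs[j]) / 2.0 for i in range(n) for j in range(i, n)]
    return median(walsh)

# The true center is 0; compare how far each summary drifts from it.
sample = contaminated_sample(200)
print("mean:", round(mean(sample), 3))
print("Hodges-Lehmann:", round(hodges_lehmann(sample), 3))
```

On a small hand-made sample such as [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 25.0], the mean is pulled above 3 by the single wild value, while the Hodges-Lehmann estimate stays close to the bulk of the data.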

The use of parametric inference might be debatable in many cases, but that does not mean the information contained in the summaries is useless. Means and standard deviations are sufficient statistics (for the normal distribution), so I can (in principle) recover the information contained in the reported sample for re-analysis using a different (hopefully more robust in aggregate) effect measure.

The generalized odds ratio is a distribution-free measure of effect related to the Wilcoxon test and to the Hodges-Lehmann estimator (the pseudo-median). It expresses, as an odds, how much more likely a randomly chosen observation from one group is to score higher than a randomly chosen observation from the other group.
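As a rough illustration of what this measure computes from raw data, here is a small Python sketch. The function names are hypothetical, and the two tie conventions shown (splitting ties evenly versus dropping them) are stated as assumptions; published definitions differ on how ties are handled.

```python
def pairwise_counts(treatment, control):
    """Compare every treatment value with every control value."""
    wins = sum(1 for x in treatment for y in control if x > y)
    losses = sum(1 for x in treatment for y in control if x < y)
    ties = len(treatment) * len(control) - wins - losses
    return wins, losses, ties

def wmw_odds(treatment, control):
    """Odds p / (1 - p), where p = P(X > Y) + 0.5 * P(X = Y).
    Assumption: ties are split evenly between the two groups."""
    wins, losses, ties = pairwise_counts(treatment, control)
    favorable = wins + 0.5 * ties
    unfavorable = losses + 0.5 * ties
    return float("inf") if unfavorable == 0 else favorable / unfavorable

def generalized_odds_ratio(treatment, control):
    """P(X > Y) / P(X < Y). Assumption: ties are ignored."""
    wins, losses, _ = pairwise_counts(treatment, control)
    return float("inf") if losses == 0 else wins / losses

# Made-up scores, purely for illustration.
treated = [3, 4, 5]
controls = [1, 3, 6]
print(pairwise_counts(treated, controls))
print(wmw_odds(treated, controls), generalized_odds_ratio(treated, controls))
```

With these made-up scores there are 5 wins, 3 losses, and 1 tie out of 9 pairs, so the two conventions give 5.5/3.5 and 5/3 respectively.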

With care and caution, I should be able to aggregate these effect ratios to provide an estimate of the odds of improvement that requires very few assumptions to interpret.

A gap in the recommendations for applied researchers exists. For example, if one author reports conventional parametric procedures, but another author used nonparametric techniques, there is no easy way to synthesize them into a plausible effect. You might argue (as is done in the link I cited) that the papers using nonparametrics are more credible, yet they are the ones that get omitted.

There is a short book chapter floating around the internet that discusses how to convert among various effect sizes, but that chapter does not go into detail on the distributional assumptions involved in their meaning. I do not like simply converting a Wilcoxon-Mann-Whitney (WMW) odds to a standardized mean difference.

On the contrary, I would go the other way and convert the standardized mean difference to the generalized odds/WMW odds ratio. It is a broader effect measure that is related to the standardized mean difference but carries fewer assumptions about the distribution.
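A minimal sketch of that conversion in Python follows. It assumes both groups are normal with a common variance, in which case P(X > Y) = Φ(d/√2) for a standardized mean difference d; that assumption is only needed to get from the reported summary to the odds, and the function names are my own.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def smd_to_prob_superiority(d):
    """P(X > Y) for two normal groups with equal variance and
    standardized mean difference d (assumption: normality)."""
    return normal_cdf(d / math.sqrt(2.0))

def smd_to_wmw_odds(d):
    """Convert a standardized mean difference to WMW odds p / (1 - p)."""
    p = smd_to_prob_superiority(d)
    return p / (1.0 - p)

for d in (0.0, 0.2, 0.5, 0.8):
    print(d, round(smd_to_prob_superiority(d), 3), round(smd_to_wmw_odds(d), 3))
```

A difference of zero maps to odds of 1, and a "medium" difference of 0.5 maps to odds of roughly 1.76 under these assumptions.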

As to the proper way to aggregate evidence, the Likelihood viewpoint has me convinced. I recommend reading anything Jeffrey Blume has written to get started.

Richard Royall is also essential reading. For a Bayesian POV, Michael Evans has some cutting edge work in the realm of relative belief.
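For readers new to the likelihood viewpoint, here is a small Python sketch of the basic quantities. It assumes each study's estimate is approximately normal with a known standard error and that the studies share a common effect; the study values are made up, the function names are hypothetical, and the benchmark k = 8 is one of the likelihood-ratio thresholds Royall uses (the other being 32).

```python
import math

def log_likelihood_ratio(estimate, se, theta1, theta0):
    """Log of L(theta1) / L(theta0) for a normal likelihood
    centered at the estimate with known standard error."""
    return ((estimate - theta0) ** 2 - (estimate - theta1) ** 2) / (2.0 * se ** 2)

def combined_log_lr(studies, theta1, theta0):
    """Independent studies multiply likelihoods, so log ratios add.
    Assumption: a common effect across studies."""
    return sum(log_likelihood_ratio(est, se, theta1, theta0)
               for est, se in studies)

def support_interval(estimate, se, k=8):
    """All theta whose likelihood is within a factor 1/k of the maximum."""
    half_width = se * math.sqrt(2.0 * math.log(k))
    return estimate - half_width, estimate + half_width

# Hypothetical (estimate, standard error) pairs, for illustration only.
studies = [(0.5, 0.25), (0.3, 0.20), (0.6, 0.40)]
print(math.exp(combined_log_lr(studies, theta1=0.4, theta0=0.0)))
print(support_interval(0.5, 0.25, k=8))
```

The output is a ratio of support for one effect value over another, plus an interval of well-supported values, rather than a p value.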

The challenge will be to extend these parametric likelihood techniques to the nonparametric realm. Some work has been done within the past 20 years in that regard in terms of generalized likelihood, but to be honest, it is beyond my current level of “mathematical maturity”. My intuition is that there will be a lot of overlap between nonparametric likelihood and Bayesian nonparametrics.

I should thank Tim for mentioning the term “ordinal meta analysis.” Certain things were not appearing in searches with various combinations of “ordinal” + “meta analysis”.

It wasn’t a term that had occurred to me to search for, and it turned up a lot of prior work (some relatively recent) done in the CVA (stroke) outcome literature. I am more familiar/comfortable with the term “nonparametric” or “distribution free”, because I see the benefits of nonparametric tests and estimates.

“Ordinal” statistics seems to have the connotation of “less power” that I think is mistaken.

There are some gaps in my knowledge that I can now slowly fill.

Added 4/2/2019:

Looking through the Cochrane Handbook, I found a useful citation that confirmed my intuition regarding odds ratios.

As a difference in NED is the effect size [1], a meta-analysis of ln(odds) is equivalent, albeit with loss of power, to one of effect size except for the scaling factor of 1.81.
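The 1.81 in that quote is π/√3, the standard deviation of the standard logistic distribution. A minimal Python sketch of the rescaling follows, with function names of my own choosing. Note that this is the log odds ratio from dichotomizing a logistic-distributed outcome, which is a different quantity from the WMW odds discussed earlier.

```python
import math

# Standard deviation of the standard logistic distribution, about 1.81.
LOGISTIC_SD = math.pi / math.sqrt(3.0)

def smd_to_log_odds_ratio(d):
    """Rescale a standardized mean difference to a log odds ratio,
    assuming logistic distributions with equal scale in both groups."""
    return d * LOGISTIC_SD

def log_odds_ratio_to_smd(log_or):
    """Inverse rescaling: log odds ratio back to a standardized difference."""
    return log_or / LOGISTIC_SD

print(round(LOGISTIC_SD, 4))
print(round(math.exp(smd_to_log_odds_ratio(0.5)), 3))
```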

4/9/2019: Some updated links

  1. Christian Robert (author of The Bayesian Choice) reviews Michael Evans’s Statistical Evidence and Relative Belief

  2. Michael Evans responds.

  3. Detailed paper on the formal statistical hypotheses modelled by the t-test and the Wilcoxon test. Also refutes the notion that transforming data into ranks loses power. Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules.

4/15/2019: Added links:

4/17/2019 – Nice technical report from Agency for Healthcare Research and Quality on Meta-Regression from 2004 (entire report is available online – 70 pages).

Michael Evans takes as gospel that every aspect of analysis must be data-checked. This is completely at odds with the known excellent properties of Bayesian borrowing of information and of prior specification that “kicks off” the analysis in reasonable fashion. For example, Andrew Gelman has written at length about the need for informative priors that help prevent the dumb mistakes that come from just fitting the data.


That is an interesting observation; I have not made my way through his entire book on Relative Belief, but from skimming it, some aspects seemed appealing.

I admit I find the Bayesian philosophy very attractive. But my attitude in terms of meta-analysis currently is to avoid the whole problem of prior specification and look at things from a Likelihood point of view, at least when doing work in research synthesis. Should a prior be needed, I would “cheat” and go the Empirical Bayes route. This would be the easiest improvement I could likely “sell” to colleagues (i.e., one I could communicate to them effectively).

I work in rehabilitation; mathematical sophistication is not very high (myself included, but I am improving). Many studies are small. Many improperly use parametric methods on inherently ordinal data. Many commit the errors you describe in BBR. I am trying to figure out how much info can be extracted from the reports without being nihilistic and saying “nothing”.

I want constructive (in the mathematical sense) measures of quality that can be defended when including, excluding, or discounting the reports of various studies. That is why I am a big fan of Jeff Blume’s work which I found through your blog a few months ago.

I fear producing naive Bayesians will be worse than naive frequentists. I am trying to figure out how to balance the need for rigor while minimizing the need for calculus and analysis to use these modern techniques. (no pun intended).


I would just add that naive Bayesians are easily found out. It is much easier to hide frequentist assumptions, especially related to the sample space and to the real unknown accuracy of p-values and confidence limits.


I definitely agree with the problems with p values and testing. I do think the theoretical work done in the testing area can still be useful when translating the problem into an estimate.

I cannot recall where exactly I read this (perhaps in Fixation of Belief), but in Charles Sanders Peirce’s pragmatist epistemology, both skepticism and belief need justification. In spirit, this is consistent with the mathematically constructive point of view, where we do not argue about “truth” but about proof.

(Constructivists do not use the technique of proof by contradiction; constructive proofs have interpretations as algorithms.)

This constructive attitude is what makes Michael Evans’s work somewhat attractive to me, but I have yet to study it in depth.

What do you think about the use of likelihood methods to nudge people towards a more Bayesian POV?

In a research synthesis from a decision-theoretic POV, the output would help a decision maker judge whether a policy can be justified using prior research, or whether more data should be collected. If so, what is the justified skeptical prior constrained by the data (appropriately weighted and adjusted for constructive measures of quality which come from likelihood theory)? From a Bayesian decision theory view, the appropriate study design can then be formally derived.

Ideally, data could be used to assist in both group policy making as well as individual decision making (aka prediction), with necessary modifications.

Unfortunately, this seems to lead me into empirical Bayesian nonparametrics, which is anything but simple mathematically, and I hesitate to even try to introduce this to most clinicians and other consumers of the research literature.

Edit: 11/3/2019 – I’m thinking of creating a BibTeX database of all the links related to meta-analysis in this thread. If anyone is interested, I’d be happy to share it on github.

I’m adding some especially relevant citations related to the assumption of normality in meta-analytic procedures:

Especially useful is Table 3: (Eight main assumptions made by conventional methods for meta‐analysis)

The recommendation in the discussion section was the use of logistic regression.

Comparing these results to those in Section 2.1 (μ̂ = 0.65 with standard error 0.20), this analysis is in reasonable agreement with the conventional analysis presented above. Given the small study sizes, and so the crudeness of the conventional methods, it is perhaps surprising that the inferences from the logistic regression are so similar. The alternative common‐effect Mantel‐Haenszel method used in the Cochrane Review has also slightly, but to a lesser extent, diluted the estimated treatment effect.

Calling the DerSimonian Laird procedure “unreliable” is serious.

I strongly recommend this paper by David Hoaglin. It is probably the best article I’ve seen on the technical issues related to logistic regression and generalized linear mixed effects models for meta-analysis to date.

Further elaboration of various versions of the generalized linear mixed-effects model for meta-analysis using logistic regression.

The authors of the original DerSimonian Laird Meta-analysis technique reflect on its widespread use and modify it with robust variance estimates.
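For reference, the classic DerSimonian-Laird calculation that these papers critique can be sketched in a few lines of Python. This is the textbook moment estimator only, not the robust-variance modification described in the linked paper; the function name and the input values are hypothetical.

```python
import math

def dersimonian_laird(effects, variances):
    """Classic DerSimonian-Laird random-effects pooling.
    Returns (pooled estimate, its standard error, tau^2)."""
    k = len(effects)
    w = [1.0 / v for v in variances]                  # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)                # truncated at zero
    w_star = [1.0 / (v + tau2) for v in variances]    # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

# Made-up effect sizes and within-study variances, for illustration only.
print(dersimonian_laird([0.0, 1.0], [0.1, 0.1]))
```

With only two or three small studies, the between-study variance estimate rests on very little information, which is part of why the procedure's reliability gets questioned.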

An empirical evaluation of meta-analytic procedures. Suffice it to say GRADE criteria undervalue “low quality” studies and overvalue “high quality” ones.

The aggregation of effect sizes on anything other than interval and ratio scaled variables is much more complicated than I initially anticipated. You would not know this from most of the readily available texts on meta-analysis, as far as I can tell.

This discussion hinted at the problems:

Relevant article by Stephen Senn on Effect size measurements, particularly non-parametric ones.

Stephen Senn (2011) U is for Unease: Reasons for Mistrusting Overlap Measures for Reporting Clinical Trials

Stephen Senn (2000) The Many Modes of Meta

Some issues related to standardized effect:
Peter Cummings MD, MPH (2004) Meta-analysis Based on Standardized Effects Is Unreliable

Peter Cummings MD, MPH (2011) Arguments for and Against Standardized Mean Differences (Effect Sizes)

Milo A Puhan, Irene Soesilo, Gordon H Guyatt & Holger J Schünemann (2006) Combining scores from different patient reported outcome measures in meta-analyses: when is it justified?


I’m adding at least 2 new links as they are related to this problem.

Here is an interesting question where there isn’t an easily accessible answer in the standard meta-analysis texts:

After reading Senn’s article (posted above) on using overlap measures as effect sizes, in a meta-analysis of medians, I’d calculate the Wilcoxon-Mann-Whitney log odds as described in the Missing Medians paper (link above) if the U statistics were available. If not, I’d have to calculate the log odds assuming a shift in the logistic distribution.

Here is a pre-print that recommends the use of weighted median effect sizes at the meta-analytic level to reduce the need to model publication bias. I had thought this was a plausible idea, but I think the HL estimate would be even better.

Here is an interesting critique of “quality scores” for individual studies in meta-analysis.

The following is a great paper in my field of rehabilitation that describes the problems with making naive resource allocation decisions or drawing conclusions about effectiveness from improper analyses of ordinal data. Everything the author warned of has come true. I don’t know why I had not found it sooner.

One of the many key quotes:

The regular use of rating scales by many institutions tends to give them credence. Administratively and editorially sanctioned mathematical mistreatment of these data gives the appearance that such treatment has been found to be scientifically valid. Despite the most noble of intentions, the misuse of rating scale data encourages retention of misinformation. Even if we are astute enough to recognize methodologic errors while reading an article, we will often remember the author’s conclusions. In clinical practice, we tend to remember the numbers associated with a specific patient, even though we know them to be of questionable validity. This cannot help but shape our thinking and practice.