That sounds very interesting - do you have a shareable copy that I could see?
Many thanks @Sander - not the response I was hoping for, but I really appreciate the time and thought put into the answer.
I don’t know enough about what has already been done here, but it seems timely, given the growing recognition that p-values are problematic yet no popular solutions have been put forth, to reanalyze the results of a couple of hundred trials with a skeptical prior.
Dear Pavel,
I offer the following response from the perspective of an applied statistician whose main tasks have been analyzing data and writing up the results for research on focused problems in health and medical science (as opposed to, say, a data miner at Google):
Contextual narratives I see in areas I can judge are often of very questionable validity. So are most frequentist and Bayesian analyses I see in those areas. Bayesian methods are often touted as a savior, but only because they have been used so infrequently in the past that their defects are not yet as glaring as the defects of the others (except to me and those who have seen horrific Bayesian inferences emanating from leading statisticians). Bayesian methodologies do provide useful analytic tools and valuable perspectives on statistics, but that’s all they do - they don’t prevent or cure the worst problems.
All the methods rely rather naively on the sterling objectivity, good intent, and skills of the producer. Hence none of these approaches have serious safeguards against the main threats to validity such as incompetence and cognitive biases such as overconfidence, confirmation bias, wish bias, bandwagon bias, and oversimplification of complexities - often fueled by conflicts of interest (which are often unstated and unrecognized, as in studies of drug side effects done by those who have prescribed the drug routinely).
To some extent each approach can reveal deficiencies in the others, and that’s why I advocate doing them all in tasks with error consequences severe enough to warrant that much labor. I simply hold that it is unlikely one will produce a decent statistical analysis (whether frequentist, Bayesian, or whatever) without first having done a good narrative analysis for oneself - and that means having read the contextual literature for yourself, not just trusting the narrative or elicitations from experts. The latter are not only cognitively biased, but they are often based on taking at face value the conclusions of papers in which those conclusions are not in fact supported by the data. So one needs to get to the point that one could write up a credible introduction and background for a contextual paper (not just a methodologic demonstration, as in a stat journal).
Statistics textbooks I know of don’t cover any of this seriously (I’d like to know of any that do) but instead focus all serious effort on the math. I’m as guilty of that as anyone, and understand it happens because it’s way easier to write and teach about neat math than messy context. What would remedy that problem without getting very context-specific, and what I think is most needed and neglected among general tools for a competent data analyst, is an explicit, systematic approach to dealing with human biases at all stages of research (from planning to reviews and reporting), rather than relying on blind trust of “experts” and authors (the “choirboy” assumption). That’s an incredibly tough task, however, which is only partially addressed in research audits - those need to include analysis audits.
It’s far harder than anything in math stat - in fact I hold that applied stat is far harder than math stat, and the dominant status afforded the latter in statistics is completely unjustified (especially in light of some of the contextually awful analyses in the health and med literature on which leading math statisticians appear). That’s hardly a new thought: both Box and Cox expressed that view back in the last century, albeit in a restrained British way; e.g., see Box, Comment, Statistical Science 1990, 5, 448-449.
As a consequence, I advocate that basic stat training should devote as much time to cognitive bias as to statistical formalisms; e.g., see my article from last year, “The need for cognitive science in methodology,” American Journal of Epidemiology, 186, 639–645, available as a free download at https://doi.org/10.1093/aje/kwx259.
That’s in addition to my previous advice to devote roughly equal time to frequentist and Bayesian perspectives on formal (computational) statistics.
Hi @Sander, I came across this post while reading up on Magnitude-Based Decisions, which has been described as a statistical cult.
I went to the author of the method (MBD), and he is using this Q and A as a sort of justification for MBD.
After reading both, maybe I am misunderstanding, but I don’t see how the comments here justify MBD.
I will have to look at Magnitude Based Inference (aka Magnitude Based Decisions) a bit more carefully, but Jeffrey Blume has developed a superficially similar approach called Second-Generation P-values. His approach is derived from likelihood theory and has attractive properties I don’t think the MBI/MBD people are cognizant of yet.
I have not yet proven to myself that Blume’s Second-Generation P-values and MBI are equivalent. It may be the case that these scholars stumbled upon Blume’s good idea but do not have enough background statistical knowledge to demonstrate correctness.
I’m adding for reference an openly published paper by Welsh and Knight (referenced by Sander below) that examines the method in light of known statistical principles.
Magnitude-based Inference: A Statistical Review (link)
The Vindication of Magnitude-Based Inference (link)
The statistical reviewer was Kristin L. Sainani, PhD from Stanford, who is critical of MBI/MBD:
https://web.stanford.edu/~kcobb/
Her critique is here, and it looks like a thorough analysis. I managed to find a freely available copy below:
The Problem with “Magnitude-based Inference” (link)
I am somewhat concerned that, in my scan of the PDF you linked to, I found no references to the statistical literature. For example:
I forgot to add Dr. Blume’s American Statistician paper on Second Generation P values, where he goes into a bit more theory.
An Introduction to Second-Generation p-Values (link)
Yeah, one of the weird things about MBI is that there isn’t really a peer-reviewed mathematical formulation anywhere other than the MBI founder’s website.
Regarding MBx and SGPV, I see them both as rediscovering (in my view, in less clear form than the original sources) the ideas of testing interval hypotheses, as set forth in Hodges & Lehmann 1954:
https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1954.tb00169.x
and extended in many sources since, including books. Lakens et al. provide a basic review of standard interval-testing methods here:
https://journals.sagepub.com/doi/full/10.1177/2515245918770963
and Lakens and Delacre can be viewed as showing that a basic “equivalence” test procedure (TOST, which actually tests nonequivalence) is preferable to the SGPV: https://osf.io/7k6ay/download
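For anyone who wants to see what TOST amounts to computationally, here is a minimal sketch for a single normally distributed estimate; the estimate, standard error, equivalence bounds, and alpha below are made-up illustrative values, not numbers from any of the papers above.

```python
# Minimal sketch of a two one-sided tests (TOST) procedure for a normal estimate.
from scipy import stats

def tost_normal(estimate, se, lower, upper, alpha=0.05):
    """Two one-sided tests of H01: effect <= lower and H02: effect >= upper."""
    p_lower = stats.norm.sf((estimate - lower) / se)   # evidence the effect lies above the lower bound
    p_upper = stats.norm.cdf((estimate - upper) / se)  # evidence the effect lies below the upper bound
    p_tost = max(p_lower, p_upper)  # equivalence declared only if both one-sided tests reject
    return p_tost, p_tost < alpha

# Hypothetical example: estimate 0.1, SE 0.2, "trivial effect" region (-0.5, 0.5)
p, within_bounds = tost_normal(0.1, 0.2, -0.5, 0.5)
print(p, within_bounds)
```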
As for MB “inference”, one way it can be salvaged is by recasting it as MB decisions, taking the intervals it uses and applying standard interval-testing methods (like TOST) to them. The chief disputes then seem to come down to
- the alpha levels one should use,
- the terminology for the multiple possible results, and
- how the results should be linked to Bayesian interpretations.
These are problems shared by any NP test, including the infamous point-null hypothesis tests. There is currently some discussion toward producing a consensus manuscript with opposing parties that explains and settles all this, but I cannot say when or even whether it will emerge.
I suspect the problem MBI attempted to address could have been solved more clearly and uncontroversially by graphing the upper and lower P-value functions with the interval boundaries marked, or tabulating the upper and lower P-values at the boundaries, and then explaining the decision rules based on the resulting graph or table. One could also graph the maximum-likelihood ratios (likelihoods divided by the likelihood maximum) or include them in the table of boundary results.
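To make that tabulation concrete, here is a minimal sketch under a normal approximation for the estimate; the estimate, standard error, and interval boundaries are hypothetical values supplied only for illustration. The two one-sided P-values at each boundary play the role of the upper and lower P-value functions evaluated there, and the maximum-likelihood ratio is the normal likelihood at the boundary divided by the likelihood at its maximum.

```python
# Minimal sketch: one-sided P-values and maximum-likelihood ratios at interval boundaries.
from scipy import stats

estimate, se = 0.30, 0.15     # hypothetical effect estimate and its standard error
boundaries = [-0.20, 0.20]    # hypothetical boundaries of the "trivial effect" interval

print("boundary  p(H: effect<=b)  p(H: effect>=b)  max-LR")
for b in boundaries:
    z = (estimate - b) / se
    p_le_b = stats.norm.sf(z)    # one-sided P-value for the hypothesis: effect <= b
    p_ge_b = stats.norm.cdf(z)   # one-sided P-value for the hypothesis: effect >= b
    mlr = stats.norm.pdf(z) / stats.norm.pdf(0.0)  # likelihood at b / maximum likelihood
    print(f"{b:8.2f}  {p_le_b:15.3f}  {p_ge_b:15.3f}  {mlr:6.3f}")
```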
Hi @Sander Thank you for your answer.
When you say salvage, are you saying that MBD as currently implemented in the Excel spreadsheets available here is not doing what it says it is doing? More specifically, would you recommend using MBD as it’s currently framed/implemented, or would you encourage people to do something like Hodges & Lehmann instead?
As a broader question (maybe @f2harrell can help here too): should people be using MBD in its current form and its current implementation?
I should emphasize again the problem is not the math of MBD - as far as I can see, that’s just doing a set of hypothesis tests on intervals as per H&L 1954 and many sources since. The debate instead concerns where to set its default Type-I error ceilings (alpha levels) and how to label its outputs.
That’s important, but really no different than the endless debate around the same issues for basic statistical hypothesis testing. For example, should we label failure to reject “accept”? No, because people will inevitably misinterpret that label as “accept as true”, not as shorthand for “decide for now to continue using the hypothesis” or something similar. And should 0.05 or 0.005 or anything else be mandated as a universal default for alpha (as pseudoscientific journals like JAMA do)? No, because any default will be a disaster in some situations - those in which power and cost (risk) considerations would tell us to use a very different alpha. Thus forced alpha defaults are like forcing all speed limits to a default of 50 mph, whether in a school zone or on a straight 4-lane desert interstate highway. Efforts to do better are what need to be mandated.
Really, the only issue with MBD is how to describe it accurately but succinctly, and how to deal carefully with its settings. Arguably it may be best to simply translate it into existing interval-testing theory and describe it that way. Any such rethinking ought to be reflected in spreadsheet revisions.
As with all methods, however, even the most careful revisions will not stop abuse of MBD. But then, if abuse were a sufficient reason to ban a method, all familiar frequentist and Bayesian methods would have to be banned, and the life expectancy of any replacement would be very, very short.
Thank you for your reply @Sander
I was reading this
https://europepmc.org/abstract/med/29683920
My takeaway from it is that the actual math behind MBD/MBI is the problem.
For example, we have this quote taken from the article:
“If I was ever to peer review a paper using magnitude-based inference then I would reject it and tell the authors to completely redo their analysis,” says Professor Adrian Barnett, president of the Statistical Society of Australia.
We also have these slides from Welsh
I think this is a nice example of why it’s so hard for people outside of statistics to even pick and choose which methods to use, and why I am so thankful to @f2harrell for providing this forum, so that people like myself without statistical backgrounds can ask questions of experts.
For example, @f2harrell has called it complete BS.
Again, thank you for your comments @Sander - they are really helping me understand the issues at hand.
When I look closely at the criticisms of MBI, the main substance of what I see is complaint about unusually high alpha-level settings. Now high settings are not unheard of in regulatory contexts, e.g. see the ICH document link below, which explains the choice of 0.25; that however appears to be 2-sided and implicitly predicated on the idea that false negatives may be more costly than false positives in their context.
Regarding the Welsh-Knight critique: It seems the current dispute arose because the MBI originators promoted their entire package without attention to these details, and apparently without knowledge of the existing statistical literature on the problem they were addressing. They then supplied no power-cost rationale for their alpha settings in the contexts at issue (or at least supplied none acceptable to their critics). They also offered questionable Bayesian interpretations of their test results and doubled down on some contestable claims, eliciting hard counterattacks.
My view is that conversion from MBI to MBD (a change in description) and its absorption into standard testing theory should follow the advice (which can be found in Neyman’s own writings) to make alpha user-supplied. That calls for careful instructions to users to provide a rationale for what is chosen in terms of power (or expected false-negative rates) and the costs or risks of each decision, specifying that this is best done and documented at the design stage of the study. And then editors and reviewers should require the rationale and documentation from authors. Again, this is a general problem that testers should address; e.g., see https://www.researchgate.net/publication/319880949_Justify_your_alpha
There should also be explication of how any Bayesian interpretation requires a further rationale for the prior implicit in the inversion (which many would argue should not be part of the standard output).
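As a minimal sketch of why that prior matters (under assumptions added purely for illustration: a normal likelihood and an improper flat prior on the effect), a “chance the effect exceeds the boundary” statement reproduces one minus a one-sided P-value only because of that flat prior; any other prior gives a different number.

```python
# Minimal sketch: the prior hidden in reading a test result as a posterior probability.
# Normal likelihood + improper flat prior are illustrative assumptions; numbers are made up.
from scipy import stats

estimate, se, benefit_threshold = 0.30, 0.15, 0.20  # hypothetical values

# Under a flat prior, the posterior for the effect is N(estimate, se^2)
post_chance_of_benefit = stats.norm.sf(benefit_threshold, loc=estimate, scale=se)

# One-sided P-value against H: effect <= threshold, from the same normal model
z = (estimate - benefit_threshold) / se
one_sided_p = stats.norm.sf(z)

# The two agree only because of the flat prior implicit in the inversion
print(post_chance_of_benefit, 1 - one_sided_p)
```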
ICH Q1E Guideline links to the International Conference on Harmonization (ICH) document Q1E; see paragraph B.2.2.1 on p. 10:
“Before pooling the data from several batches to estimate a retest period or shelf life, a preliminary statistical test should be performed to determine whether the regression lines from different batches have a common slope and a common time-zero intercept. Analysis of covariance (ANCOVA) can be employed, where time is considered the covariate, to test the differences in slopes and intercepts of the regression lines among batches. Each of these tests should be conducted using a significance level of 0.25 to compensate for the expected low power of the design due to the relatively limited sample size in a typical formal stability study.”
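For readers who want to see roughly what such a poolability check looks like, here is a minimal sketch; the simulated data and variable names are invented for illustration, and only the 0.25 preliminary-test level comes from the guideline.

```python
# Minimal sketch of an ICH Q1E-style ANCOVA poolability check at alpha = 0.25.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Simulated stability data for three batches assayed at six time points (months)
rng = np.random.default_rng(0)
time = np.tile([0, 3, 6, 9, 12, 18], 3).astype(float)
batch = np.repeat(["A", "B", "C"], 6)
assay = 100 - 0.3 * time + rng.normal(0, 0.5, time.size)
df = pd.DataFrame({"assay": assay, "time": time, "batch": batch})

full = smf.ols("assay ~ time * C(batch)", data=df).fit()          # separate slopes and intercepts
common_slope = smf.ols("assay ~ time + C(batch)", data=df).fit()  # common slope, separate intercepts
pooled = smf.ols("assay ~ time", data=df).fit()                   # common slope and intercept

alpha = 0.25  # the guideline's preliminary-test level, chosen for power rather than the usual 0.05
slope_test = anova_lm(common_slope, full)            # F test for equality of slopes across batches
print(slope_test)
if slope_test["Pr(>F)"].iloc[1] > alpha:             # slopes poolable at the 0.25 level
    intercept_test = anova_lm(pooled, common_slope)  # then test equality of intercepts
    print(intercept_test)
```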
Once again thank you @Sander for your explanation. Is it your view that the conversion from MBI to MBD was done in such a way that users should feel comfortable with the tools provided at the moment?
Once again thank you for your time answering - I know a few people within sports science have already messaged me to read these posts.
I think these changes are improvements; but then I would, since I gave my views and the author considered and used them as he saw fit.
Here’s an experiment for which the outcome is likely to be very informative no matter what it is: Ask Welsh, Knight, and Sainani to read this blogpost series, especially my comments below; then look at the updates; and then post their view about whether the MBD idea is moving in the right direction away from MBI toward standard decision theory, and what more would help it do so.
As I see it, the critics assumed “conservative” preferences (preferring the null or anchored around it), which I call nullism, and traditional defaults, as shown by their orientation toward CIs (which fix traditional alpha levels). In contrast, the originators assumed “liberal” (wide open to moving far off the null) preferences, which I call agnosticism, and are not wedded to traditional alpha defaults either; but their justification for breaking from tradition was not apparent to the critics.
The justification might be apparent in a fully developed MBD theory based on a decision function, in which alpha is a function of explicit utilities and data distributions under competing models (instead of an input parameter). In this formulation, Bayes rules are admissible, and MBD yields a limit on Bayes rules that is calibrated against the sampling-model frequencies.
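As a toy sketch of what “alpha as an output” could look like (the one-sided z test, point alternative, and all numbers below are illustrative assumptions, not part of any MBD proposal), one can choose the rejection cutoff to minimize expected loss given the error costs, the model weights, and the data distributions under the competing models; the implied alpha then falls out of the minimization rather than being fixed in advance.

```python
# Toy sketch: an error rate derived from losses and data distributions, not fixed in advance.
import numpy as np
from scipy import stats

se = 0.15                    # standard error of the effect estimate
effect_alt = 0.40            # effect size under the competing (non-null) model
p_null, p_alt = 0.5, 0.5     # weights placed on the two models
cost_fp, cost_fn = 1.0, 4.0  # loss from a false positive vs. a false negative

def expected_loss(cutoff):
    alpha = stats.norm.sf(cutoff)                       # P(reject | null), z ~ N(0, 1)
    beta = stats.norm.cdf(cutoff, loc=effect_alt / se)  # P(accept | alternative), z ~ N(delta/se, 1)
    return p_null * cost_fp * alpha + p_alt * cost_fn * beta

cutoffs = np.linspace(0, 5, 2001)                       # grid of candidate rejection cutoffs
best = cutoffs[np.argmin([expected_loss(c) for c in cutoffs])]
print("implied alpha:", stats.norm.sf(best))
```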
This places the theoretical substance of the dispute in the loss functions implicit in the different preferences for alpha levels. These preferences are the irreducible subjective inputs to decision rules, and need to be made explicit because decisions are driven by them. They will vary across stakeholders (those who experience measurable losses or gains from decisions); so, unless a compromise is reached, stakeholders will often disagree on the preferred (“optimal”) decision. This point is recognized in both frequentist and Bayesian decision theories, which parallel one another far more than they diverge.
Whatever one thinks of MBI, it was proposed to solve a real-world statistical problem whose solutions had not been taught enough for its developers to know about. I find it interesting the developers got as close as they did to the standard solutions in form if not in description. I think their effort demonstrates a need for interval testing, and so the theory for that needs to be settled enough to introduce at an elementary level. Unfortunately, as with all things statistical, there is no consensus on all aspects of interval testing - in part because of differences in hidden assumptions about loss or utility functions.
Once again, @Sander, thank you for your answer.
Would I be interpreting you correctly if I took away the message that MBD as it’s currently formed (and hence what has been in use) is not fully developed?
That might well be why there are a lot of “problems with MBI/MBD”.
Hence it would be best for practicing sports scientists to stay away until a fully developed MBD theory is released in spreadsheet form.
Maybe I am reading too much into things, but from an applied perspective it seems as though using a not-fully-developed theory in practice is all but doomed to produce misleading findings.
Does @f2harrell have any thoughts? I see quite a bit of similarity between sports science and medicine (small samples, etc.).
Once again thank you, this is great!
@Sander has thought deeply about this and has written clearly, so the only remaining thoughts I have are to wonder why the MBI inventors did not just use existing paradigms with possibly transparently relaxed criteria, and I always push displays of entire posterior distributions.
As has been pointed out by Sander in this thread, the most charitable interpretation of MBI/MBD is that it is interval hypothesis testing with a non-traditional alpha level. I greatly appreciate his insight into this (and saving me lots of study time!).
There is nothing wrong with breaking the irrational tradition of a fixed alpha level (it is almost certainly a welcome development), but those who do should be able to provide the appropriate calculations to justify the decision.
I can empathize with the scholars you mentioned in this thread. It takes an extraordinary amount of study time for a statistical outsider to cover all of the relevant material that someone formally educated in statistics takes for granted. That has been made much easier by the internet, but if you don’t know the right terminology and the prior work that has been done, you will have embarrassing gaps in your knowledge when you talk to actual statisticians. You can’t even benefit from their legitimately earned expertise without some real study, and an intro stats class is not enough.
I can also empathize with the critics. A lot of excellent scholarship is buried in the old journals that should have been referenced in their papers. With a bit more research, they would not have made some indefensible claims that undermined their credibility.
Example: I was aware of the Hodges-Lehmann interval hypothesis testing paper before it was mentioned in this thread. I had known of (but not studied) the review of SGPV he also referenced here. I am more optimistic about SGPV than others in this thread might be, but I haven’t worked out a full argument yet.
Outsiders who propose a new technique should understand the basic methods of justification of statistical procedures in order to present a rational argument regarding the benefits of the proposal.
Statisticians of whatever philosophy are very smart, and if they tell you something is not correct, you had better go back and check your work and/or your comprehension.
Thank you @r_cubed. I guess my concern is that @Sander seems to be saying (I could be wrong) that MBI as it was wasn’t correct, and that MBD (not as currently formed, and hence not as currently used) might in future be on more solid statistical ground, but that this is not currently the case.
The issue, obviously, is that hundreds of studies were done and are continuing to be done using MBI (as formulated and readily downloadable on the creator’s website), while what it needs to move towards (MBD) isn’t fully formulated in a way that makes experts comfortable.
Without submission to any peer-reviewed statistics journal, it’s very hard, from reading the above commentary, to know what the takeaway should be for sports scientists.
Should we be using what is available online as is? And if we have used what was available online as it was (MBI), converting p-values to MBD, should we stand by those findings?
Once again, I greatly appreciate everyone’s time.
Are the primary data available for re-analysis?
If not, I’d remain skeptical of the conclusions as presented by the primary researchers in the narrative section unless I could see the actual data to do my own calculations. But some useful things might still be learned for future experiments by examining the reported descriptive statistics.