As I pointed out above, Romain, MBI has turned out to be equivalent to two one-sided interval hypothesis tests, with not unreasonable alphas, so MBI/MBD has a solid frequentist footing. The hypotheses for clinical MBI are that the effect is harmful and that it is beneficial; for non-clinical MBI, they are that the effect is substantially positive and substantially negative. Rejection of both of these hypotheses, which is what you want to do in the worst-case scenario of the true effect being trivial, sets the minimum desirable sample size. If only one of the hypotheses is rejected, the evidence concerning the other magnitude, which in MBI is expressed in a Bayesian fashion, is also equivalent to testing an hypothesis (that the effect does not have that magnitude). MBI became MBD when Sander objected to the use of the word inference. I provide an argument for the Bayesian interpretation at the end of this message.
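To make the equivalence concrete, here is a minimal sketch in Python of how the clinical decision can be recast as two one-sided interval tests against the thresholds for harm and benefit. The thresholds, the alphas and the example effect are illustrative assumptions for this sketch, not the defaults built into my spreadsheets.

from scipy import stats

def mbi_as_interval_tests(effect, se, harm=-0.2, benefit=0.2,
                          alpha_harm=0.005, alpha_benefit=0.25):
    # One-sided p-value for the hypothesis of harm (H0: true effect <= harm)
    p_harm = stats.norm.sf(effect, loc=harm, scale=se)
    # One-sided p-value for the hypothesis of benefit (H0: true effect >= benefit)
    p_benefit = stats.norm.cdf(effect, loc=benefit, scale=se)
    return p_harm, p_benefit, p_harm < alpha_harm, p_benefit < alpha_benefit

# Example: a trivial observed effect with good precision rejects both hypotheses
print(mbi_as_interval_tests(effect=0.0, se=0.07))

Rejecting both hypotheses in that worst-case (trivial) scenario is what drives the minimum desirable sample size: the standard error has to be small enough for both one-sided p-values to fall under their alphas.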
MBx is the version of MBI/MBD that Janet and Daniel are working on. They have used frequentist terms to describe the outcomes of the hypothesis tests, which I am happy to incorporate in frequentist versions of my spreadsheets. A point of contention at the moment is justification of sample size. They have tied sample-size estimation to an error rate or power defined by minimum-effects testing (MET), which involves an hypothesis about a non-substantial effect with a threshold for substantial that is somewhat larger than (e.g., 2x) the smallest important magnitude. I disagree with this approach, for several reasons. First, the somewhat larger magnitude is arbitrary, so it seems to me that it’s just shifting the goal-posts. Second, when a non-clinical true effect equals the smallest important magnitude and the alpha for the test of the non-substantial hypothesis is 0.05 (substantial now referring to the smallest important, not 2x the smallest important), you will reject the hypothesis and thereby “discover” the effect only 5% of the time, whatever the sample size; that is, your power will be only 5%. (This low power does not apply to clinical MBI/MBD, since failure to reject the hypothesis of benefit is sufficient evidence to consider the effect as potentially implementable, if you want a reasonably low rate of failure to implement small beneficial effects.) Finally, the error rates in MBI/MBD are reasonable anyway, as Alan and I showed in our Sports Medicine paper; they have been misrepresented previously.
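A quick simulation makes the second objection obvious. In the sketch below (Python; the smallest important value, the SD and the sample sizes are illustrative assumptions), the true standardized effect sits exactly on the smallest important threshold, so it lies on the boundary of the non-substantial hypothesis, and the one-sided MET-style test at alpha = 0.05 rejects about 5% of the time no matter how large the sample.

import numpy as np

rng = np.random.default_rng(1)
smallest_important = 0.2      # smallest important standardized effect (assumed)
sd = 1.0                      # between-subject SD (assumed)
z_crit = 1.645                # one-sided critical value for alpha = 0.05
trials = 100_000

for n in (20, 100, 1000, 10_000):        # per-group sample sizes
    se = sd * np.sqrt(2 / n)             # SE of the difference in means
    # sampling distribution of the observed difference when the true effect
    # sits exactly on the smallest-important threshold
    diff = rng.normal(smallest_important, se, trials)
    # one-sided MET-style test of H0: true effect <= smallest important
    reject = (diff - smallest_important) / se > z_crit
    print(n, reject.mean())              # stays near 0.05 at every sample size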
We have not yet had a decisive discussion of the Bayesian interpretation. I don’t think anyone apart from Daniel will object to use of a full informative-prior Bayesian analysis of the effect when one of the substantial hypotheses has not been rejected. That then raises an interesting question: can the posterior probabilities of magnitude be interpreted qualitatively, as I have done in MBI, with terms such as possibly, likely, and so on? The Intergovernmental Panel on Climate Change thinks it’s a good idea to use such terms to describe probabilities, and my scale is actually a bit more conservative than theirs. If we can use such terms, that raises a further crucial issue. Data from a study with a large-enough sample size overwhelm any informative prior, in which case the sampling distribution of the effect is the same as the posterior distribution of the true effect, and therefore the “inverse-probability” mantra should not be uttered to dismiss interpretation of the sampling distribution as the distribution of the true effect. Hence, if a researcher with only a small sample size opts for a realistic weakly informative prior that makes no practical difference to the posterior, the researcher should be entitled to use the terms possibly, likely, and so on. I have shown, using Sander’s semi-Bayes approach, that realistic weakly informative priors make no practical difference with the usual small sample sizes in sport and exercise science, so the original Bayesian interpretations of MBI hold up. That said, I do need to make researchers aware of the importance of checking that the weakly informative prior does indeed make no practical difference with their data before they use the probabilistic terms to describe the effect; if the prior does make a difference, they should use the probabilistic terms provided by the posterior. My realistic weakly informative priors are normally distributed with 90% confidence or credibility intervals equal to thresholds for extremely large effects: 0.1 < hazard ratio < 10; -4.0 < Cohen’s d < 4.0; -0.9 < Pearson correlation < 0.9.
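For anyone who wants to perform that check, the semi-Bayes update is just a precision-weighted combination of the prior and the sampling distribution of the effect. Below is a minimal sketch in Python for a standardized (Cohen’s d) effect; the smallest important value of 0.2 and the example effect and standard error are illustrative assumptions, while the prior SD follows from the stated 90% interval of -4.0 to 4.0 (4.0/1.645).

from scipy import stats
import numpy as np

def shrink(effect, se, prior_mean=0.0, prior_sd=4.0/1.645):
    # Precision-weighted (normal-normal) combination of the data and the prior
    w_data, w_prior = 1/se**2, 1/prior_sd**2
    post_se = np.sqrt(1 / (w_data + w_prior))
    post_mean = (w_data*effect + w_prior*prior_mean) / (w_data + w_prior)
    return post_mean, post_se

def magnitude_probs(mean, se, smallest_important=0.2):
    # Probabilities that the true effect is substantially negative, trivial, or positive
    p_neg = stats.norm.cdf(-smallest_important, mean, se)
    p_pos = stats.norm.sf(smallest_important, mean, se)
    return p_neg, 1 - p_neg - p_pos, p_pos

# Example: a small study with observed d = 0.45 and standard error 0.25
effect, se = 0.45, 0.25
post_mean, post_se = shrink(effect, se)
print("no prior:  ", magnitude_probs(effect, se))
print("weak prior:", magnitude_probs(post_mean, post_se))
# If the two sets of probabilities are practically identical, the prior has made no
# practical difference and the qualitative terms can be attached to the unadjusted
# sampling distribution; otherwise use the posterior probabilities.

The same calculation works for hazard ratios (on the log scale) and for correlations (on the Fisher-z scale) with the corresponding priors.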