Another bad statistical method proposed in sports and exercise science

Sports and exercise scientists are becoming increasingly concerned about statistics, and this in part may be due to recent criticism of ‘shoddy’ statistics in sport science (see “magnitude based inference”). One area that is of particular interest in sports and exercise science is classifying individuals in training studies (i.e., longitudinal studies that employ an exercise training intervention) as being “differential responders”—e.g., low, average, and high responders. What is worse, the purpose of discretizing is often for subsequent analyses, with the goal of determining what mechanistic/physiological variables are associated with greater adaptations (example below).

A recent article in Sports Medicine—a prestigious journal in sports and exercise science—describes a new “method” to “account for random error” (measurement error & biological variability) when creating these subgroups. Briefly, the method is as follows:

  1. Use Levene’s test to check whether the variance of change scores in the experimental group is statistically significantly greater than that of the control group.
  2. If Levene’s test is statistically significant, use mean experimental change ± 1.96 × SD of the control group’s change scores as the limits for “average responders” in the experimental group. Those above the upper limit are “high responders” and those below the lower limit are “low responders.”
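As a sketch, the two steps could be implemented as follows (Python; the function name, the inputs as change scores, and the decision to return nothing when Levene’s test is not significant are our reading of the description above, not the authors’ implementation):

```python
import numpy as np
from scipy import stats

def classify_responders(exp_change, ctrl_change, alpha=0.05):
    """Sketch of the two-step classification described above.

    exp_change, ctrl_change: arrays of individual change scores.
    Returns an array of labels for the experimental group, or None
    when Levene's test is not significant (the method stops there).
    """
    # Step 1: Levene's test comparing the variability of change scores.
    _, p = stats.levene(exp_change, ctrl_change)
    if p >= alpha:
        return None  # method is only applied when variances differ

    # Step 2: limits = mean experimental change +/- 1.96 * SD of control.
    lo = exp_change.mean() - 1.96 * ctrl_change.std(ddof=1)
    hi = exp_change.mean() + 1.96 * ctrl_change.std(ddof=1)
    return np.where(exp_change > hi, "high",
           np.where(exp_change < lo, "low", "average"))
```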

The authors purport that this method has a 5% error rate with regard to classifying “average responders” as “high” or “low” responders. Thus, the authors believe that the “high” and “low” groups are more robust to random error than standard approaches (e.g., clustering or median splits) and should be used to establish groups for subsequent t-tests or ANOVAs.

Clearly, there are several issues with this approach. It fails philosophically and, perhaps more concretely, it fails statistically. The tacit assumption of homogeneous random error is not necessarily justified for many exercise science measurement instruments, and even when error is homogeneous, it assumes that the error (a property of the joint distribution of observed and true values in the experimental group) can be extrapolated from a different distribution (i.e., the marginal distribution of observed scores in the control group). It seems that, if error is a concern, a more efficient and philosophically grounded approach would be to simply use an error-in-variables model and treat all data continuously.
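For concreteness, here is one minimal flavor of such an approach (a Python sketch of the classical method-of-moments attenuation correction for a predictor measured with error, assuming the error variance of the change scores is known or estimable, e.g., from a reliability study; this is an illustration, not a full error-in-variables treatment):

```python
import numpy as np

def attenuation_corrected_slope(x_obs, y, error_var):
    """Method-of-moments correction for a predictor measured with
    known random-error variance (classical errors-in-variables).

    x_obs: observed change scores (true score + random error)
    y: downstream outcome (e.g., gene expression)
    error_var: variance of the random error in x_obs (assumed known,
               e.g., estimated from a reliability study)
    """
    var_obs = np.var(x_obs, ddof=1)
    naive_slope = np.cov(x_obs, y, ddof=1)[0, 1] / var_obs
    # Reliability ratio: share of observed variance due to true scores.
    reliability = (var_obs - error_var) / var_obs
    # The naive slope is attenuated by the reliability ratio; undo it.
    return naive_slope / reliability
```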

My colleagues and I drafted a letter to the editor, but instead of submitting it, we decided to correspond with the authors for clarification and to see whether they had evidence (i.e., proofs or simulations) to support their stance. The authors provided little more than their thought process and failed to provide evidence to support their assertions. We asked the authors to retract their paper, but they refused.

Before taking this to the next step, we wanted to ask people on this forum (statisticians and methodologists) if they can provide feedback regarding the letter, simulations, and mathematical evidence that we’ve generated to combat this method.

Original paper:

Draft of letter:
Rmarkdown documents of our simulations and proofs can be found within the OSF project.

Example of how exercise science researchers are splitting groups for subsequent analyses: An investigator may carry out a training study in which participants perform resistance training over the course of a couple of months, with the primary outcome being muscle hypertrophy (growth). Based on individual change scores, investigators will create groups of differential responders (e.g., using cluster analysis), which are used as independent variables in an ANOVA. This ANOVA may look at, for example, gene expression in muscle as a dependent variable. Of course, this approach is vapid as there is no reason not to treat the data continuously; a rationale for distinct taxa is often nonexistent.
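To illustrate the information loss, one could simulate data where the association is truly linear and compare how often each analysis detects it (a Python sketch; all effect sizes, sample sizes, and variable names are invented for illustration, not taken from any real study):

```python
import numpy as np
from scipy import stats

def detection_rates(n=30, reps=2000, seed=1):
    """Compare how often a continuous analysis vs. a tertile-group ANOVA
    detects a (truly linear) association between change scores and a
    downstream outcome. All parameter values are made up."""
    rng = np.random.default_rng(seed)
    hits_cont = hits_grouped = 0
    for _ in range(reps):
        change = rng.normal(5, 2, n)                # hypertrophy change scores
        gene = 0.3 * change + rng.normal(0, 2, n)   # outcome, linear in change
        # Continuous analysis: test the linear association directly.
        _, p_cont = stats.pearsonr(change, gene)
        # Dichotomized analysis: tertile "responder" groups, then ANOVA.
        cuts = np.quantile(change, [1 / 3, 2 / 3])
        g = np.digitize(change, cuts)
        _, p_anova = stats.f_oneway(*(gene[g == k] for k in (0, 1, 2)))
        hits_cont += p_cont < 0.05
        hits_grouped += p_anova < 0.05
    return hits_cont / reps, hits_grouped / reps
```

Under a setup like this, the grouped analysis discards within-group information and spends extra degrees of freedom, so one would expect the continuous analysis to flag the association at least as often.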




It seems as if classical hypothesis testing gets so imprinted in the minds of researchers that they cannot conceive that other methods exist.

I’ve seen this use of Levene’s test before in the physical rehabilitation literature. These are improper “adaptive” procedures: we do not know the size (α) or the power (1 − β) of the combined procedure. A proper adaptive procedure can give asymptotic guarantees regarding size (O’Gorman 2004, O’Gorman 2012). To do simulations of possible results, we would have to make some assumptions about the magnitude of the effect (i.e., differences in variances), and that defeats the whole point of using a frequentist analysis in the first place, especially if there is no simple, linear relationship between variance and predicting response. Some sort of regression model would be better.
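As an illustration, the operating characteristics of the combined (Levene-gated) procedure can only be estimated by Monte Carlo under an assumed data-generating model (a Python sketch; normal errors and all parameter values are arbitrary assumptions on my part):

```python
import numpy as np
from scipy import stats

def misclassification_rate(n=20, true_effect=2.0, error_sd=1.0,
                           extra_response_sd=0.0, reps=2000, seed=0):
    """Monte Carlo estimate of how often the combined procedure labels a
    subject 'high' or 'low' when every subject's true response is the
    same (so any such label is a misclassification). Assumed model:
    observed change = true_effect + normal noise."""
    rng = np.random.default_rng(seed)
    mislabelled = total = 0
    for _ in range(reps):
        ctrl = rng.normal(0, error_sd, n)
        exp = true_effect + rng.normal(
            0, np.hypot(error_sd, extra_response_sd), n)
        # Gate: the classification is only applied if Levene's test fires.
        _, p = stats.levene(exp, ctrl)
        if p >= 0.05:
            continue
        lo = exp.mean() - 1.96 * ctrl.std(ddof=1)
        hi = exp.mean() + 1.96 * ctrl.std(ddof=1)
        mislabelled += np.sum((exp < lo) | (exp > hi))
        total += n
    return mislabelled / total if total else float("nan")
```

Because the classification step is applied only in samples where Levene’s test is significant, the conditional error rate need not match any nominal 5% figure; the simulation makes that selection effect visible under whatever model one assumes.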


Looks OK to me, though I would probably try to be a bit more constructive and less combative. Those are all very subjective categories so I don’t think it is sensible to discuss my suggestions thoroughly here, just take them as my opinion and either accommodate them or not, as you see fit :slight_smile:

  • I would prefer to move the overall framing from “This method is bad” to “What we believe researchers should do (probably not this method)”
  • It would be great if one of the supplements showed code and results for a reanalysis of the dataset from the D&L paper, or any other dataset (I can’t access the paper, so I’m not sure whether theirs is a good dataset), done the way you believe it “should be done”, or if you provided a link to some resource showing the preferable method in practice
  • Do you agree with D&L that their method is indeed better (at least in some sense) than “percentile ranks, standard deviations from the mean, and cluster analyses”? If so and if those are actually used in practice it would be nice to give D&L credit for raising the bar at least a bit.
  • Depending on the results of the above it might be more sensible to ask for correction rather than retraction (remove claims about 5% error rate and similar but keep the claim that it is better than standard practice).

Very Minor Formal Suggestions

  • Figure 1 is IMHO not very friendly to people with visual impairment. I would change brightness as well as hue over the gradient. I would also put a contour line at 0.05 misclassified.
  • The link to OSF should be somewhere prominent (definitely before first mentioning documents on the OSF)
  • Some README.md or similar file describing which file contains what would IMHO be useful to have at the OSF

Thank you for the feedback.

To clarify a few things,

  1. We think the method performs poorly in every scenario and does not necessarily constitute an improvement over currently used methods, even if those methods are equally bad. Specifically, while other methods (e.g., median splits) may create “fuzzier” groups, they are also more efficient because they do not throw away participants—there’s a tradeoff. In either case, we find the overall concept of categorizing subjects for analyses like these to be philosophically ungrounded and statistically inefficient.
  2. We’re working on a tutorial that describes other analytical approaches, but in general, we have concerns about responder analysis (especially with the small sample sizes in exercise science).

While we want to be firm in our stance that this method performs poorly, we certainly do not want to be combative or demeaning. (We will note that, in our correspondence with the authors, we offered to write a new paper with them that describes more established approaches (e.g., EIV models).) If there is anything in particular that comes off as condescending or combative, please let us know and we are happy to update the text to be more appropriate.

In the meantime we will work on the minor points!