Sports and exercise scientists are becoming increasingly concerned about statistics, which may be due in part to recent criticism of ‘shoddy’ statistics in sport science (see “magnitude based inference”). One practice of particular interest in sports and exercise science is classifying individuals in training studies (i.e., longitudinal studies that employ an exercise training intervention) as “differential responders” (e.g., low, average, and high responders). What is worse, the discretized groups are often carried into subsequent analyses, with the goal of determining which mechanistic/physiological variables are associated with greater adaptations (example below).
A recent article in Sports Medicine—a prestigious journal in sports and exercise science—describes a new “method” to “account for random error” (measurement error & biological variability) when creating these subgroups. Briefly, the method is as follows:
- Use Levene’s test to see whether the variance of responses in the experimental group is statistically significantly greater than that in a control group.
- If Levene’s test is statistically significant, use mean experimental change ± 1.96*SD_control as the limits for “average responders” in the experimental group. Those above the upper limit are “high responders” and those below the lower limit are “low responders.”
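To make the second step concrete, here is a minimal Python sketch of the classification rule as I read it. This is not the authors' code. The function name `classify_responders` and the example numbers are made up for illustration, and the Levene's test gate from the first step is deliberately left out (the sketch assumes it has already come back significant).

```python
from statistics import mean, stdev


def classify_responders(exp_changes, ctrl_changes, z=1.96):
    """Label each experimental change score as 'low', 'average', or 'high'.

    Limits are: mean experimental change +/- z * SD of control changes.
    Hypothetical helper; assumes Levene's test was already significant.
    """
    centre = mean(exp_changes)
    half_width = z * stdev(ctrl_changes)  # sample SD of the control group
    lower, upper = centre - half_width, centre + half_width

    labels = []
    for change in exp_changes:
        if change > upper:
            labels.append("high")
        elif change < lower:
            labels.append("low")
        else:
            labels.append("average")
    return labels, (lower, upper)


# Made-up change scores, purely for illustration.
labels, limits = classify_responders([0.0, 5.0, 10.0], [-1.0, 0.0, 1.0])
print(labels, limits)
```

Note that the only thing the control group contributes is a single SD, which is then treated as the error SD for every individual in the experimental group.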
The authors purport that this method has a 5% error rate with regard to classifying “average responders” as “high” or “low” responders. Thus, the authors believe that the “high” and “low” groups are more robust to random error than those from standard approaches (e.g., clustering or median splits) and should be used to establish groups for subsequent t-tests or ANOVAs.
Clearly, there are several issues with this approach. It fails philosophically and, perhaps more concretely, it fails statistically. The tacit assumption of homogeneous random error is not necessarily justified for many exercise science measurement instruments, and even when error is homogeneous, the method assumes that the error (which arises from a joint distribution: observed vs. true values in the experimental group) can be extrapolated from a different distribution (i.e., the marginal distribution of observed scores in the control group). It seems that, if error is a concern, a more efficient and philosophically grounded approach would be to simply use an errors-in-variables model and treat all data continuously.
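One way to see why the 5% figure is fragile is to ask how likely a single individual is to be labelled “high” or “low” as a function of their true change. The sketch below is a simplification and not taken from our letter or the original paper. It assumes normal, homogeneous error whose SD equals the control-group SD, treats the experimental mean as known, and uses a hypothetical function name, `prob_labelled_non_average`.

```python
from statistics import NormalDist


def prob_labelled_non_average(true_offset, error_sd, z=1.96):
    """P(observed change falls outside mean +/- z * error_sd) for one person.

    true_offset: the individual's true change minus the experimental mean.
    Assumes normal, homogeneous error with SD equal to the control SD,
    and treats the experimental mean as known (all simplifications).
    """
    observed = NormalDist(mu=true_offset, sigma=error_sd)
    limit = z * error_sd
    prob_inside = observed.cdf(limit) - observed.cdf(-limit)
    return 1.0 - prob_inside


# Offsets are in units of the error SD (error_sd = 1.0 here).
for offset in (0.0, 1.0, 1.96):
    print(offset, round(prob_labelled_non_average(offset, 1.0), 3))
```

Under these assumptions the probability is about 5% only for someone whose true change sits exactly at the experimental mean, and it climbs to about 50% for someone whose true change sits at one of the limits. So the error rate depends on where the true values lie, which is information the control group's marginal distribution does not carry.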
My colleagues and I drafted a letter-to-the-editor, but instead of submitting it, we decided to correspond with the authors for clarification, in addition to seeing if they had evidence (i.e., proofs or simulation) to support their stance. The authors provided little more than their thought process and failed to provide evidence to support their assertions. We asked the authors to retract their paper, but they refused.
Before taking this to the next step, we wanted to ask people on this forum (statisticians and methodologists) if they can provide feedback regarding the letter, simulations, and mathematical evidence that we’ve generated to combat this method.
Original paper: https://www.ncbi.nlm.nih.gov/pubmed/31254258
Draft of letter: https://osf.io/6ezyp/
Rmarkdown documents of our simulations and proofs can be found within the OSF project.
Example of how exercise science researchers are splitting groups for subsequent analyses: An investigator may carry out a training study in which participants perform resistance training over the course of a couple of months, with the primary outcome being muscle hypertrophy (growth). Based on individual change scores, investigators will create groups of differential responders (e.g., using cluster analysis), which are used as independent variables in an ANOVA. This ANOVA may look at, for example, gene expression in muscle as a dependent variable. Of course, this approach is vapid as there is no reason not to treat the data continuously; a rationale for distinct taxa is often nonexistent.
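As a rough illustration of what is lost by grouping, the following Python sketch simulates change scores and a hypothetical continuous marker (standing in for something like gene expression), then compares the association when change is kept continuous versus split at the median. The data-generating model, the slope, the sample size, and the names `pearson` and `simulate` are all assumptions chosen for illustration. A median split is used only because it is the simplest grouping to code, not because it is the method from the paper.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation computed from scratch (standard library only)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n=5000, slope=0.5, seed=1):
    """Return (r with continuous change, r with median-split change)."""
    rng = random.Random(seed)
    change = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Hypothetical marker linearly related to change, plus noise.
    marker = [slope * c + rng.gauss(0.0, 1.0) for c in change]
    cutoff = sorted(change)[n // 2]
    group = [1.0 if c > cutoff else 0.0 for c in change]  # "high" vs "low"
    return pearson(change, marker), pearson(group, marker)


r_continuous, r_split = simulate()
print(r_continuous, r_split)
```

In this toy setup the grouped version recovers a weaker association than the continuous one, even though the underlying relationship is identical; the grouping simply throws information away.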