Note: This topic is a wiki, meaning that this main body of the topic can be edited by others. Use the Reply button only to post questions or comments about material contained in the body, or to suggest new statistical myths you’d like to see someone write about.
I am not claiming to be the leading authority on any or all of the things listed below, but several of us on Twitter have repeatedly floated the idea of creating a list of references that may be used to argue against some common statistical myths or no-nos.
I will start the thread, but others should feel free to chime in. However, it MAY be easier if the first post (or one of the first posts) is continually updated so all references on a particular topic are indexed at the top. I am not sure of the best way to handle this - either I will try to periodically edit the first post to keep the references updated, or maybe Frank will have a better idea for how to structure and update this content.
I was hoping to organize this into a few key myths/topics. While I am happy to add any topic that authors think is important, the intent here is not to recreate an entire statistical textbook. I’m hoping to provide an easy-to-navigate list of references so that when we get one of the classic review comments, like “the authors should add p-values to Table 1”, we have rebuttal evidence that’s easy to find.
I’ve listed a few below to start. Please feel free to email, Twitter DM, or comment below. This will be a living ‘document’ so if there’s something you think is missing, or if I cite a paper that you feel has a fatal flaw and does not support its stated purpose, let me know. We’ll see how this goes.
Reference collection on P value and confidence interval myths
https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108
https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913
Are confidence intervals better termed “uncertainty intervals”?
Reverse-Bayes analysis of two common misinterpretations of significance tests
https://amstat.tandfonline.com/doi/full/10.1080/00031305.2018.1529625
P-Values in Table 1 of Randomized Trials
Rationale: In RCTs, it is commonly believed that one should always present a table with p-values comparing the baseline characteristics of the randomized treatment groups. This is not a good idea, for the reasons given in the references below (a small simulation after the references makes the core problem concrete).
- Senn SJ. Baseline comparisons in randomized clinical trials. Stat Med 1991; 10: 1157-1160.
- Senn SJ. Testing for baseline balance in clinical trials. Stat Med 1994; 13: 1715-1726.
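A quick simulation sketch (Python with numpy/scipy; the trial sizes and covariate counts are invented for illustration) shows the point: when randomization is done properly, baseline p-values are uniformly distributed, so roughly 5% will be “significant” purely by chance, and the test tells you nothing about whether the groups are usefully comparable.

```python
# Sketch: under proper randomization, p-values comparing baseline covariates
# between arms are uniformly distributed, so about 5% fall below 0.05 by
# chance alone. The test checks the randomization mechanism, which we
# already know is sound by design, not whether the groups are comparable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_trials, n_per_arm, n_covariates = 1000, 100, 10  # invented sizes

p_values = []
for _ in range(n_trials):
    for _ in range(n_covariates):
        # Both arms are drawn from the SAME population: any observed
        # baseline difference is pure chance.
        arm_a = rng.normal(size=n_per_arm)
        arm_b = rng.normal(size=n_per_arm)
        p_values.append(stats.ttest_ind(arm_a, arm_b).pvalue)

frac = np.mean(np.array(p_values) < 0.05)
print(f"Fraction of baseline tests with p < 0.05: {frac:.3f}")  # ~0.05
```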
Covariate Adjustment in RCT
Rationale: Somewhat related to the above, many consumers of randomized trials believe that there is no need for any covariate adjustment in RCT analyses. While it is true that adjustment is not required for an RCT to be valid, there are real benefits to adjusting for baseline covariates that have strong relationships with the study outcome, as explained by the references below. If a reader/reviewer questions why you have chosen to adjust, these may prove helpful; the small simulation below shows the precision gain directly.
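Here is a minimal sketch (Python with numpy/statsmodels; the effect sizes and sample size are arbitrary) showing that adjusting for a strongly prognostic baseline covariate shrinks the standard error of the estimated treatment effect in a linear model.

```python
# Sketch: adjusting for a prognostic baseline covariate reduces the standard
# error of the treatment effect estimate in a randomized trial (linear model).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
treat = rng.integers(0, 2, size=n)               # 1:1 randomization
x = rng.normal(size=n)                           # prognostic baseline covariate
y = 0.5 * treat + 2.0 * x + rng.normal(size=n)   # true treatment effect = 0.5

unadj = sm.OLS(y, sm.add_constant(treat.astype(float))).fit()
adj = sm.OLS(y, sm.add_constant(np.column_stack([treat, x]))).fit()
print(f"Unadjusted: effect = {unadj.params[1]:.2f}, SE = {unadj.bse[1]:.2f}")
print(f"Adjusted:   effect = {adj.params[1]:.2f}, SE = {adj.bse[1]:.2f}")
# The adjusted SE is far smaller because x explains much of the outcome
# variance; the unadjusted estimate is unbiased but needlessly noisy.
```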
Analyzing “Change” Measures in RCTs
Rationale: Many authors and pharmaceutical clinical trialists make the mistake of analyzing change from baseline instead of making the raw follow-up measurements the primary outcomes, covariate-adjusted for baseline. Computing change scores requires many assumptions to hold (for more detail, see Frank’s blog post on this: Statistical Thinking - Statistical Errors in the Medical Literature). It is generally better to analyze the follow-up measurement as the outcome with covariate adjustment for the baseline value, as this better matches the question of interest: for two patients with the same pre-trial value of the study outcome, one given treatment A and the other treatment B, will the patients tend to have different post-treatment values?
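A minimal contrast of the two analyses (Python with numpy/statsmodels; the data-generating numbers are invented) is below. Both target the same treatment effect here, but ANCOVA estimates the baseline slope from the data instead of forcing it to equal 1, which is what the change score implicitly does.

```python
# Sketch: change-score analysis vs. ANCOVA (follow-up adjusted for baseline).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 300
treat = rng.integers(0, 2, size=n)
baseline = rng.normal(10, 2, size=n)
# Follow-up tracks baseline imperfectly (slope 0.5, i.e. regression to the
# mean) plus a true treatment effect of 1.0.
followup = 5 + 0.5 * baseline + 1.0 * treat + rng.normal(size=n)

# Change-score analysis implicitly assumes the baseline slope is exactly 1.
change = followup - baseline
cs = sm.OLS(change, sm.add_constant(treat.astype(float))).fit()
# ANCOVA estimates the baseline slope instead of assuming it.
ancova = sm.OLS(followup, sm.add_constant(np.column_stack([treat, baseline]))).fit()
print(f"Change score: effect = {cs.params[1]:.2f}, SE = {cs.bse[1]:.2f}")
print(f"ANCOVA:       effect = {ancova.params[1]:.2f}, SE = {ancova.bse[1]:.2f}")
# ANCOVA is noticeably more precise whenever the true slope differs from 1.
```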
Using Within-Group Tests in Parallel-Group Randomized Trials
Rationale: Researchers often analyze randomized trials and other comparative studies by separately testing the change from baseline within each parallel group. Sometimes they will incorrectly conclude that their study proves a treatment effect exists because the within-group test for the treated arm yields a “significant” p-value. This ignores the control group entirely: the within-group test never compares the treated group against the control group, which is the whole point of having a control arm.
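Here is a small sketch (Python with numpy/scipy; the sizes and the shared improvement are invented) in which both arms improve over time but the treatment does nothing; both within-group tests come out highly “significant” while the only comparison that matters does not.

```python
# Sketch: within-group "significance" in BOTH arms with a true treatment
# effect of exactly zero (e.g., everyone improves over time regardless).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 80
improvement = 1.0  # both arms improve by the same amount on average
change_treated = improvement + rng.normal(size=n)
change_control = improvement + rng.normal(size=n)

print(f"Within treated: p = {stats.ttest_1samp(change_treated, 0).pvalue:.2e}")
print(f"Within control: p = {stats.ttest_1samp(change_control, 0).pvalue:.2e}")
print(f"Between groups: p = {stats.ttest_ind(change_treated, change_control).pvalue:.2f}")
# The within-group tests answer "did this arm change?", not "did the
# treatment cause a change?". Only the between-group test uses the control.
```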
Sample Size / Number of Variables for Regression Models
Rationale: It is common to see regression models with far too many variables included relative to the amount of data (as a reviewer, I’ll see papers reporting a “risk score” that includes 20+ variables in a logistic regression model with ~200 patients and ~30 outcome events). A commonly cited rule of thumb is “10 events per variable” in logistic regression; the appropriate number is actually more nuanced than any single ratio, but it can function as a useful “BS test” at first glance.
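As a rough first-pass screen, the arithmetic is trivial; a tiny sketch follows (Python; the function name is mine, and the example numbers come from the hypothetical scenario above). Note that the limiting sample size in logistic regression is the smaller of the number of events and non-events.

```python
# Sketch of the quick "BS test": events per candidate parameter.
def events_per_variable(n_events: int, n_nonevents: int, n_parameters: int) -> float:
    """The limiting sample size for logistic regression is min(events, non-events)."""
    return min(n_events, n_nonevents) / n_parameters

# The hypothetical "risk score" from the rationale above: ~30 events,
# ~170 non-events, 20+ candidate variables.
epv = events_per_variable(n_events=30, n_nonevents=170, n_parameters=20)
print(f"EPV = {epv:.1f}")  # 1.5, far below even the rough 10-20 guideline,
# so such a model is almost certainly badly overfit.
```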
Stepwise Variable Selection (Don’t Do It!)
Rationale: Though stepwise selection procedures are taught in many introductory statistics courses as a way to make multivariable modeling easy and data-driven, statisticians generally advise against them for several reasons, many of which are explained in the reference below.
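One of those reasons is easy to demonstrate: run forward stepwise selection on pure noise and it will still “find” predictors. A minimal sketch (Python with numpy/statsmodels; the dimensions and the simple p-to-enter rule are my choices for illustration):

```python
# Sketch: forward stepwise selection on PURE NOISE still selects "significant"
# predictors, and their p-values are biased because they were picked for
# being extreme.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n, p = 100, 30
X = rng.normal(size=(n, p))   # 30 candidate predictors, all pure noise
y = rng.normal(size=n)        # outcome unrelated to every predictor

selected, remaining = [], list(range(p))
while remaining:
    # Try adding each remaining variable; keep the best one if p < 0.05.
    pvals = []
    for j in remaining:
        fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
        pvals.append(fit.pvalues[-1])   # p-value of the newly added variable
    best = int(np.argmin(pvals))
    if pvals[best] >= 0.05:
        break
    selected.append(remaining.pop(best))

print(f"Noise variables 'significantly' selected: {len(selected)}")
```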
Screening covariates to include in multivariable models with bivariable tests
Rationale: People sometimes decide to include variables in multivariable models only if they are “significant” predictors of the outcome when modeled by themselves (i.e., they are crudely associated with the outcome). This is a bad idea, partly for the same reasons as stepwise regression (it is essentially a variant of stepwise regression done manually), and partly because it ignores the joint structure of the predictors: a variable’s association with the outcome can look very different in isolation than it does once other variables are considered simultaneously.
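The second problem is worth seeing once: a variable can show essentially no crude association with the outcome yet be a strong predictor once a correlated variable is in the model, so the univariable screen would discard exactly the wrong variable. A small sketch (Python with numpy/statsmodels; the coefficients are chosen to make the marginal association vanish):

```python
# Sketch: x2 is a genuine predictor, but its construction makes its marginal
# (bivariable) association with y approximately zero, so a univariable
# screen would throw it away.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)          # strongly correlated with x1
y = 2 * x1 - x2 + rng.normal(size=n)  # x2 matters, given x1

solo = sm.OLS(y, sm.add_constant(x2)).fit()
joint = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(f"x2 alone (screening test): p = {solo.pvalues[1]:.2f}")
print(f"x2 adjusted for x1:        p = {joint.pvalues[2]:.2e}")
# Cov(y, x2) = 2*Cov(x1, x2) - Var(x2) = 2*1 - 2 = 0, so the crude
# association is null by design even though the adjusted effect is strong.
```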
Post-Hoc Power (Is Not Really A Thing)
Rationale: In studies that fail to yield “statistically significant” results, it is common for reviewers, or even editors, to ask the authors to include a post hoc power calculation. In such situations, editors would like to distinguish true negatives from false negatives (concluding there is no effect when there actually is one, and the study was simply too small to detect it). However, post-hoc power is nothing more than the p-value re-expressed on a different scale, and therefore cannot answer the question editors want answered.
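The one-to-one relationship is easy to verify. For a two-sided z test, the observed p-value pins down the observed z, which in turn pins down the “observed power”. A minimal sketch (Python with scipy; the function is mine, for illustration):

```python
# Sketch: "observed power" is a deterministic transformation of the p-value.
from scipy import stats

def posthoc_power(p_two_sided: float, alpha: float = 0.05) -> float:
    """Power computed by plugging the observed effect back into the formula."""
    z_obs = stats.norm.isf(p_two_sided / 2)   # |z| implied by the p-value
    z_crit = stats.norm.isf(alpha / 2)
    return stats.norm.sf(z_crit - z_obs) + stats.norm.cdf(-z_crit - z_obs)

for p in [0.05, 0.10, 0.30, 0.50]:
    print(f"p = {p:.2f}  ->  'observed power' = {posthoc_power(p):.2f}")
# p = 0.05 always maps to observed power of about 0.50; a "non-significant"
# study always has low observed power by construction. Nothing new is learned.
```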
Misunderstood “Normality” Assumptions
Rationale: One piece of information that many people retain from an introductory statistics class is that the Normal distribution underlies everything, and they often assume that the data must be normally distributed for ALL statistical procedures to work correctly. However, in many of the procedures and tests we use, it is the normality of the error terms (the residuals) that matters, not the normality of the raw data themselves.
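A sketch of the distinction (Python with numpy/scipy; the two-group setup is invented): the pooled outcome below fails a normality test spectacularly because it is a mixture of two groups with different means, yet the residuals, which are what the assumption actually concerns, are exactly normal.

```python
# Sketch: the raw DATA look very non-normal while the model RESIDUALS,
# the thing the normality assumption actually concerns, are normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n = 500
group = rng.integers(0, 2, size=n)    # two groups with very different means
y = 10 * group + rng.normal(size=n)   # normal errors around each group mean

fitted = np.where(group == 1, y[group == 1].mean(), y[group == 0].mean())
residuals = y - fitted
print(f"Raw outcome: Shapiro-Wilk p = {stats.shapiro(y)[1]:.2e}")   # tiny
print(f"Residuals:   Shapiro-Wilk p = {stats.shapiro(residuals)[1]:.2f}")
# The pooled outcome is a bimodal mixture and "fails normality" badly, yet
# the assumption behind the t test / linear model holds perfectly here.
```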
Absence of Evidence is Not Evidence of Absence
Rationale: It’s also common to break results into “significant” (p < 0.05) and “not significant” (p ≥ 0.05); when the latter occurs, many interpret the phrase “no significant effect” as evidence that there is no effect, when a non-significant result alone provides no such evidence (thanks to @davidcnorrismd for adding another reference below).
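One way to see this is to look at the confidence interval rather than the p-value alone. A small sketch (Python with numpy/scipy; a deliberately underpowered two-arm study with an invented true effect of 0.5 SD):

```python
# Sketch: a "non-significant" result whose confidence interval is compatible
# with no effect AND with a large, clinically important one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 20                                  # small, underpowered study
treated = rng.normal(loc=0.5, size=n)   # true effect = 0.5 SD
control = rng.normal(loc=0.0, size=n)

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)
t_crit = stats.t.ppf(0.975, df=2 * n - 2)   # approximate degrees of freedom
print(f"difference = {diff:.2f}, "
      f"95% CI = ({diff - t_crit * se:.2f}, {diff + t_crit * se:.2f})")
# A CI running from slightly below zero to well above it crosses zero
# ("not significant") yet clearly does not rule out a meaningful benefit.
```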
Inappropriately Splitting Continuous Variables Into Categorical Ones
Rationale: People often choose to split a continuous variable into dichotomized groups or a few bins (e.g., using quartiles to divide the data into four groups, then comparing the highest versus the lowest quartile). There are a few limited situations in which partitioning continuous variables into categories is defensible, but more often than not it is a bad idea, done simply because it is believed to be “easier” to perform or understand.
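The information loss is not hypothetical. A quick power comparison (Python with numpy/scipy; the effect size and sample size are invented) of analyzing a predictor as measured versus after a median split:

```python
# Sketch: dichotomizing a continuous predictor at the median costs power
# compared with using the variable as measured.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, n_sims, alpha = 100, 2000, 0.05
hits_continuous = hits_split = 0
for _ in range(n_sims):
    x = rng.normal(size=n)
    y = 0.3 * x + rng.normal(size=n)   # modest true association
    hits_continuous += stats.pearsonr(x, y)[1] < alpha
    high = x > np.median(x)            # median split of the predictor
    hits_split += stats.ttest_ind(y[high], y[~high]).pvalue < alpha

print(f"Power, continuous x:   {hits_continuous / n_sims:.2f}")
print(f"Power, median-split x: {hits_split / n_sims:.2f}")
# The split loses substantial power (roughly comparable to discarding a
# third of the sample); comparing extreme quartiles discards even more.
```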
Use of normality tests before t tests
Rationale: Researchers who are not statisticians are commonly advised to check normality with a formal test (e.g., Shapiro-Wilk) before running a t test, on the grounds that the validity of parametric tests depends on the normality assumption. The problem is that this combined, two-stage procedure changes the nominal size and power of the unconditional t test in unknown ways.
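A sketch of why (Python with numpy/scipy; the sample size, the skewed distribution, and the use of Shapiro-Wilk as the pre-test are my choices): restricting attention to datasets that “pass” the normality pre-test changes the t test's conditional error rate, because the screening and the test use the same data.

```python
# Sketch: the size of the t test CONDITIONAL on passing a normality pre-test
# differs from its unconditional size, so the two-stage procedure does not
# have the error rate the t test advertises.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, n_sims = 10, 20000
passed = rejected_all = rejected_passed = 0
for _ in range(n_sims):
    # The null is true: both samples come from the same skewed distribution.
    a = rng.exponential(size=n)
    b = rng.exponential(size=n)
    t_rej = stats.ttest_ind(a, b).pvalue < 0.05
    rejected_all += t_rej
    if stats.shapiro(np.concatenate([a, b]))[1] > 0.05:  # "looks normal"
        passed += 1
        rejected_passed += t_rej

print(f"t test size, all samples:             {rejected_all / n_sims:.3f}")
print(f"t test size, given pre-test 'passed': {rejected_passed / passed:.3f}")
```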
I² in meta-analysis does not refer to an absolute measure of heterogeneity
Rationale: When reporting heterogeneity in a meta-analysis, the value of I² is often misinterpreted and treated as an absolute measure of heterogeneity, when in fact it quantifies heterogeneity only relative to the studies’ sampling error.
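The point is easy to demonstrate, since I² = max(0, (Q - df)/Q) measures between-study variation relative to within-study sampling error. In the sketch below (Python with numpy; the number of studies, the value of tau², and the study sizes are invented), the absolute between-study variance tau² never changes, yet I² climbs toward 100% as the component studies grow larger.

```python
# Sketch: identical absolute heterogeneity (tau^2), very different I^2,
# driven purely by the precision (size) of the component studies.
import numpy as np

rng = np.random.default_rng(6)
k, tau2 = 20, 0.04   # 20 studies, fixed absolute between-study variance

def i_squared(effects, variances):
    """Cochran's Q with inverse-variance weights; I^2 = max(0, (Q - df)/Q)."""
    w = 1 / variances
    mu = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - mu) ** 2)
    return max(0.0, (q - (len(effects) - 1)) / q) * 100

for n_per_study in [25, 100, 1000]:
    var_within = np.full(k, 2 / n_per_study)   # within-study variances
    effects = rng.normal(0.3, np.sqrt(tau2 + var_within))
    print(f"n per study = {n_per_study:4d}:  I^2 = {i_squared(effects, var_within):5.1f}%")
# tau^2 never changed, but I^2 rises steadily with study size: it quantifies
# heterogeneity relative to sampling error, not in absolute terms.
```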
Number Needed to Treat (NNT)
Propensity-Score Matching - Not Always As Good As It Seems
Rationale: Conventional covariate adjustment is sufficient in most settings with adequate sample size, and propensity-score matching is not necessarily superior.
Responder Analysis
Rationale: In some cases, authors attempt to dichotomize a continuous primary efficacy measure into “responders” and “non-responders”. This is discussed at length in another thread on this forum, but here are some scholarly references:
Significance testing in pilot studies
Rationale: Authors often perform null-hypothesis testing in pilot studies and report p-values. However, the purpose of a pilot study is to identify issues in all aspects of the study, ranging from recruitment to data management and analysis; pilot studies are not usually powered for inferential testing. If testing is done, p-values should not be emphasized, confidence intervals should be reported, and results on potential outcome measures should be regarded as descriptive. A CONSORT extension for pilot and feasibility studies exists and is a useful reference to include in submissions and cover letters, as editors may not be aware of it.
P-values do not “trend towards significance”
Rationale: It is common for investigators to observe a “non-significant” result and say that it was “trending towards significance”, suggesting that had they only been able to collect more data, the result would surely have become significant. This misunderstands how volatile p-values are when the treatment under test has no effect. Simply put, p-values do not “trend”, and “almost significant” results are not guaranteed to become significant with more data - far from it.
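A sketch of that volatility (Python with numpy/scipy; the sample sizes are invented and there is no true effect) simply tracks one experiment’s p-value as data accumulate:

```python
# Sketch: track a single experiment's p-value as data accumulate when the
# treatment truly does nothing. The p-value wanders; it does not "trend".
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
x = rng.normal(size=500)   # treatment arm, no true effect
y = rng.normal(size=500)   # control arm

for n in [50, 100, 200, 300, 400, 500]:
    p = stats.ttest_ind(x[:n], y[:n]).pvalue
    print(f"n per arm = {n:3d}:  p = {p:.3f}")
# Under the null, sequences like 0.08 -> 0.25 -> 0.04 -> 0.40 are ordinary;
# an "almost significant" p carries no promise about what more data will show.
```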
Additional Requested Topics
Feel free to add your own suggestions here; we are happy to revisit and update whenever practical.