Reference Collection to push back against "Common Statistical Myths"


Not sure if I can edit the entry directly but I found this to be helpful:

Gary King on Why Propensity Scores Should Not Be Used for Matching

@f2harrell can comment but there may be a restriction that prevents one from editing unless you have contributed to the forum before, or posted a certain number of times (a bot/quality control issue, I think).

I’ll add this to the wiki, though. Thanks!

Fantastic thread. I can’t edit at the moment because I’m a new user, but under TOPIC: Misunderstood “Normality” Assumptions this paper might be relevant:

Williams, M. N., Grajales, C. A. G., & Kurkiewicz, D. (2013). Assumptions of multiple regression: Correcting two misconceptions. Practical Assessment, Research & Evaluation, 18(11).

(Excuse the self-promotion!)


What an incredibly useful post. There must be a ‘bravo’ emoji, but I am for sure far too old to know where to find it.


Thanks Ronan - please feel free to add your own suggestions or references!

This thread should be highlighted on Twitter and other social media platforms, perhaps on the Facebook Psychological Methods Discussion page.

Since I haven’t posted before, I don’t think I’m able to directly edit the wiki, but I wanted to provide a nice pair of references that might be valuable additions to the propensity score matching section resources!

Brooks JM, Ohsfeldt RL. Squeezing the balloon: propensity scores and unmeasured covariate balance. Health services research. 2013 Aug;48(4):1487-507.


Ali MS, Groenwold RH, Klungel OH. Propensity score methods and unobserved covariate imbalance: comments on “squeezing the balloon”. Health services research. 2014 Jun;49(3):1074-82.


I stumbled on this article about issues with categorization (responder analysis):

Ideally, a clinical trial should be able to demonstrate not only a statistically significant improvement in the primary efficacy endpoint, but also that the magnitude of the effect is clinically relevant. One proposed approach to address this question is a responder analysis, in which a continuous primary efficacy measure is dichotomized into “responders” and “non-responders.” In this paper we discuss various weaknesses with this approach, including a potentially large cost in statistical efficiency, as well as its failure to achieve its main goal. We propose an approach in which the assessments of statistical significance and clinical relevance are separated.
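
The efficiency cost mentioned in the abstract can be illustrated with a small simulation. This is only a rough sketch under assumptions that are mine, not the paper's: a normally distributed outcome, a true effect of 0.5 SD, 50 patients per arm, a "responder" cutoff of 0, and normal-approximation tests at the two-sided 5% level. The function name `simulate_power` and all parameter values are hypothetical.

```python
import math
import random
import statistics


def simulate_power(n_per_arm=50, effect=0.5, cutoff=0.0, n_sims=2000, seed=1):
    """Estimate power of a continuous vs. a dichotomized ("responder") analysis."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% level, normal approximation (an assumption)
    hits_cont = hits_resp = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        treated = [rng.gauss(effect, 1.0) for _ in range(n_per_arm)]

        # Continuous analysis: difference in means with unpooled standard error.
        se = math.sqrt(statistics.variance(control) / n_per_arm
                       + statistics.variance(treated) / n_per_arm)
        diff = statistics.fmean(treated) - statistics.fmean(control)
        if abs(diff / se) > z_crit:
            hits_cont += 1

        # Responder analysis: dichotomize at the cutoff, then compare proportions.
        p_c = sum(y > cutoff for y in control) / n_per_arm
        p_t = sum(y > cutoff for y in treated) / n_per_arm
        p_pool = (p_c + p_t) / 2
        se_p = math.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
        if se_p > 0 and abs((p_t - p_c) / se_p) > z_crit:
            hits_resp += 1
    return hits_cont / n_sims, hits_resp / n_sims


power_continuous, power_responder = simulate_power()
print(f"continuous: {power_continuous:.2f}, responder: {power_responder:.2f}")
```

Under these assumptions the dichotomized analysis should detect the same true effect noticeably less often than the analysis of the continuous outcome; the size of the loss depends on the cutoff and the effect size chosen.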


I agree with Dr. Harrell. But I think an argument for adjustment could be made, in addition to the known gain in precision of the effect estimate. Randomizing treatments is not a foolproof method. Even if you randomize a million patients, there is no guarantee that the potential outcomes will be the same in the treated and the “untreated”. It would be very unlikely for the potential outcomes not to be very similar, but unlikely/rare things do happen. They are bound to happen due to the very nature of randomization.

If we put aside issues of variable selection and how variables will be modelled, adjusting will provide evidence about the exchangeability of the treatment groups beyond the evidence provided in the traditional Table 1 comparing prognostic factors in treated and untreated. Even if each prognostic factor in Table 1 is balanced, this does not imply that combinations of multiple prognostic factors are also balanced. In other words, prognostic factors and treatment may not be associated in a crude analysis (presented in Table 1), but may be associated in a multivariate analysis (never presented).

To avoid conscious or unconscious manipulation of the analysis, we could decide which variables to adjust for a priori, as part of the study protocol. Actually, what we report in Table 1 is a list of the variables we believe we should adjust for. These variables could be selected using the same substantive-based approaches we use in observational studies. There doesn’t seem to be a methodological reason for adjusted effect estimates from an RCT to be more biased than crude estimates (again, assuming modeling assumptions are correct). In most cases, particularly in mid-size and small trials, the validity of the estimate of the treatment effect will be enhanced, and the credibility of the RCT findings would increase, if crude and adjusted estimates are consistent.

This is described in the “Table one” topic where it is shown that even if you don’t bother to measure any covariates the inference is sound (though not efficient). So I can’t say I agree with this angle on the problem.

I’ll add that to the separate responder analysis “loser x4” topic. Great paper.

I do not argue that the unadjusted estimates are biased. I argue that in “small” and “moderate” size trials the exchangeability of treatment arms may be compromised, and that small differences in several prognostic factors could lead to significant bias in the estimate of effect. This cannot be appreciated in univariate comparisons of the distribution of prognostic factors across treatment groups, which is what is presented in Table 1. Therefore, if I see small differences in several prognostic factors, or a large difference in a single prognostic factor, I would present crude and adjusted estimates and, if they differ, give more weight to the adjusted one for the purpose of inference. I also argue that even in the case of “large” trials, adjusting would not introduce bias. This is a direct consequence of the independence between assigned treatment and potential outcomes that results from randomization. Therefore, if adjusted and crude estimates differ in a large trial, I’d be inclined to believe something was wrong with the model used for the adjustment. Briefly, there is nothing wrong with adjusting for prognostic factors in an RCT, either from the perspective of precision or bias, unless the model used for the adjustment is misspecified.
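
For readers who want to see the precision and bias points side by side, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: a linear outcome model with one standard-normal prognostic covariate, 100 patients randomized 1:1, a true treatment effect of 1.0, and a correctly specified adjustment model. The adjusted estimate is the treatment coefficient from a least-squares fit of outcome on treatment and covariate, computed here by first residualizing treatment on the covariate (the Frisch-Waugh-Lovell result). The helper name `one_trial` is hypothetical.

```python
import random
import statistics


def one_trial(rng, n=100, effect=1.0, beta=1.0):
    """Simulate one two-arm RCT; return (crude, adjusted) effect estimates."""
    t = [0] * (n // 2) + [1] * (n - n // 2)
    rng.shuffle(t)  # randomization: treatment is independent of x by design
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # prognostic covariate
    y = [effect * ti + beta * xi + rng.gauss(0.0, 1.0) for ti, xi in zip(t, x)]

    # Crude (unadjusted) estimate: difference in arm means.
    treated = [yi for yi, ti in zip(y, t) if ti == 1]
    control = [yi for yi, ti in zip(y, t) if ti == 0]
    crude = statistics.fmean(treated) - statistics.fmean(control)

    # Adjusted estimate: residualize treatment on the covariate, then
    # regress the outcome on that residual (equals the OLS coefficient on t).
    t_bar, x_bar = statistics.fmean(t), statistics.fmean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxt = sum((xi - x_bar) * (ti - t_bar) for xi, ti in zip(x, t))
    slope = sxt / sxx
    t_res = [(ti - t_bar) - slope * (xi - x_bar) for ti, xi in zip(t, x)]
    adjusted = sum(r * yi for r, yi in zip(t_res, y)) / sum(r * r for r in t_res)
    return crude, adjusted


rng = random.Random(2024)
TRUE_EFFECT = 1.0
results = [one_trial(rng, effect=TRUE_EFFECT) for _ in range(2000)]
crude_estimates = [c for c, _ in results]
adjusted_estimates = [a for _, a in results]

print("crude:    mean %.3f, SD %.3f" % (statistics.fmean(crude_estimates),
                                        statistics.stdev(crude_estimates)))
print("adjusted: mean %.3f, SD %.3f" % (statistics.fmean(adjusted_estimates),
                                        statistics.stdev(adjusted_estimates)))
```

Across repeated trials both estimators should center on the true effect, while the adjusted one varies less from trial to trial. The sketch says nothing about what happens when the adjustment model is misspecified, which is the caveat raised above.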

Great post
I wonder if the first topic could be broadened to also apply to observational “table ones”, such as descriptives of baseline data in different exposure groups in a cohort study? The STROBE criteria argue against significance testing.

It would be interesting to hear your thoughts on observational studies as well.

Added topic on significance testing in pilot studies with some useful references, feel free to expand.

I feel that this issue of not calculating and presenting p-values in Table 1 extends to observational studies, for multiple reasons. That said, I am not aware of any published papers that have made this argument.


Thanks for initiating this list. I discovered 3 articles I didn’t know about before.
Please see below some new references, by topic.

TOPIC: Analyzing “Change” Measures in RCT’s

  • Archie J.P. Mathematic coupling of data – A common source of error. Annals of Surgery, 1980, 193: 296-303
  • Yanez N.D. et al. The effects of measurement error in response variables and tests of association of explanatory variables in change models. SiM, 1998, 17: 2597-2606.
  • Senn S. Change from baseline and analysis of covariance revisited. SiM, 2006, 25: 4334-4344.
  • Tu Y-K., et al. Revisiting the relation between change and initial value: A review and evaluation. SiM, 2007.
  • Braun J., et al. Accounting for baseline differences and measurement error in the analysis of change over time. SiM, 2013.
  • Tu Y-K. Testing the relation between percentage change and baseline value. ScientificReports, 2016.
  • Clifton et al. Comparing different ways of calculating sample size for two independent means: A worked example. CCT, 2019.
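
As a quick illustration of the mathematical coupling problem that several of these references discuss, here is a small sketch. The setup is my own assumption, not taken from any of the papers: each subject has a stable true value (SD 1), baseline and follow-up are that value plus independent measurement error (SD 1), and there is no real change at all. With equal variances like these, the expected correlation between change and baseline is $-\sigma_e^2 / \sqrt{2\sigma_e^2(\sigma_T^2 + \sigma_e^2)} = -0.5$, even though nothing changed.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(7)
n = 5000
# Stable true value per subject; no true change between visits (assumption).
true_value = [rng.gauss(0.0, 1.0) for _ in range(n)]
baseline = [v + rng.gauss(0.0, 1.0) for v in true_value]   # true value + error
follow_up = [v + rng.gauss(0.0, 1.0) for v in true_value]  # independent error
change = [f - b for f, b in zip(follow_up, baseline)]

r_change_baseline = pearson(change, baseline)
print(f"correlation of change with baseline: {r_change_baseline:.2f}")
```

The negative correlation arises only because the baseline measurement error appears on both sides of the comparison, which is one reason change-versus-baseline relationships need careful handling.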

TOPIC: Stepwise Variable Selection (Don’t Do It!)

TOPIC: Inappropriately Splitting Continuous Variables Into Categorical Ones

Hope it will be useful to you all.


Related to power and the measurement of change scores vs. group differences, does anyone know of references on increasing the number of measurements (for instance, adding an ‘in-between measurement’) to increase power? Does it matter if you are only going to investigate group differences, or is just a pre-measurement enough?

Since this does not fit in with the ‘myths’ topic please start a new topic with appropriate primary and secondary topic and tag choices. Then I’ll remove this one.

TOPICS suggestion: It would be useful to have a reference collection on P value and confidence interval myths

@EpiLearneR You might find this open access paper valuable:

Search for @Sander (Sander Greenland) here, and you will find a lot of excellent papers on these misinterpretations as well as corrective measures. If you read them slowly and are prepared to look up the mathematics you do not know, they will teach you a lot.

Here is a good link to some of the things he has written on this:
