I feel that the argument against calculating and presenting p-values in Table 1 extends to observational studies, for multiple reasons. That said, I am not aware of any published papers that have made this argument.
Thanks for initiating this list. I discovered three articles I didn’t know about before.
Please see below some new references, by topic.
TOPIC: Analyzing “Change” Measures in RCTs (a simulation sketch follows this list)
- Archie J.P. Mathematic coupling of data – A common source of error. Annals of Surgery, 1981, 193: 296-303.
- Yanez N.D. et al. The effects of measurement error in response variables and tests of association of explanatory variables in change models. SiM, 1998, 17: 2597-2606.
- Senn S. Change from baseline and analysis of covariance revisited. SiM, 2006, 25: 4334-4344.
- Tu Y-K., et al. Revisiting the relation between change and initial value: A review and evaluation. SiM, 2007. https://doi.org/10.1002/sim.2538
- Braun J., et al. Accounting for baseline differences and measurement error in the analysis of change over time. SiM, 2013. https://doi.org/10.1002/sim.5910
- Tu Y-K. Testing the relation between percentage change and baseline value. Scientific Reports, 2016. https://doi.org/10.1038/srep23247
- Clifton et al. Comparing different ways of calculating sample size for two independent means: A worked example. CCT, 2019. https://doi.org/10.1016/j.conctc.2018.100309
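Since several of these papers contrast change scores with ANCOVA, here is a minimal simulation sketch of the core point (my own code, not taken from any of the papers; the sample size, correlation, and effect size are illustrative assumptions): ANCOVA recovers the treatment effect with a smaller standard error than either the follow-up value alone or the change score.

```python
# Sketch: efficiency of three analyses of a two-arm parallel-group trial.
# All parameters (n, rho, effect) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
n, rho, effect, n_sim = 100, 0.6, 0.3, 2000

def one_trial():
    # baseline and follow-up correlated at rho; true effect added to follow-up
    cov = [[1.0, rho], [rho, 1.0]]
    y = rng.multivariate_normal([0.0, 0.0], cov, size=2 * n)
    base, follow = y[:, 0], y[:, 1]
    treat = np.repeat([0.0, 1.0], n)            # first n control, last n treated
    follow = follow + effect * treat
    # ANCOVA: regress follow-up on treatment and baseline
    X = np.column_stack([np.ones(2 * n), treat, base])
    ancova = np.linalg.lstsq(X, follow, rcond=None)[0][1]
    change = follow - base
    return (follow[n:].mean() - follow[:n].mean(),   # follow-up only
            change[n:].mean() - change[:n].mean(),   # change score
            ancova)

est = np.array([one_trial() for _ in range(n_sim)])
for name, col in zip(["follow-up only", "change score", "ANCOVA"], est.T):
    print(f"{name:15s} SD of effect estimate: {col.std():.3f}")
# With rho = 0.6 the change score beats follow-up only, but ANCOVA beats both;
# with rho < 0.5 the change score is WORSE than ignoring baseline entirely.
```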
TOPIC: Stepwise Variable Selection (Don’t Do It! A simulation sketch follows this list.)
- Heinze G. et al. Variable selection – A review and recommendations for the practicing statistician. BiomJ, 2017. https://doi.org/10.1002/bimj.201700067
- Ahamadi M. et al. Operating characteristics of stepwise covariate selection in pharmacometric modeling. JPKPD, 2019. https://doi.org/10.1007/s10928-019-09635-6
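A minimal sketch of the operating-characteristics problem these papers describe (my own illustration; n, the number of predictors, and the screening threshold are assumptions): with an outcome that is pure noise, the univariate screen used as the first step of naive stepwise procedures still “selects” predictors most of the time.

```python
# Sketch: a pure-noise outcome still yields 'selected' predictors under a
# univariate p < 0.05 screen. All parameters are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, n_sim = 100, 20, 500
selected_any = 0

for _ in range(n_sim):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)        # outcome unrelated to every predictor
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(p)]
    selected_any += min(pvals) < 0.05

print(f"Trials selecting >= 1 noise predictor: {selected_any / n_sim:.0%}")
# Roughly 1 - 0.95**20 ~ 64%: selection alone manufactures apparent signal,
# and post-selection coefficients and p-values are biased optimistic.
```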
TOPIC: Inappropriately Splitting Continuous Variables Into Categorical Ones (a simulation sketch follows this list)
- Weinberg C.R. How bad is categorization? Epidemiology, 1995, 6:345-346.
- Senn S. Disappointing dichotomies. PharmStat, 2003. https://doi.org/10.1002/pst.090
- Chen H. et al. Biased odds ratios from dichotomization of age. SiM, 2007. https://doi.org/10.1002/sim.2737
- van Walraven C. et al. Leave ‘em alone – why continuous variables should be analyzed as such. Neuroepidemiology, 2008. https://doi.org/10.1159/000126908
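To make the cost concrete, a minimal simulation sketch (my own; the slope, n, and the median split are assumptions): splitting a continuous predictor at its median attenuates the observed association, an information loss equivalent to discarding roughly a third of the sample.

```python
# Sketch: median-splitting a continuous predictor attenuates the correlation.
# All parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)
n, beta, n_sim = 200, 0.3, 2000
r_cont, r_dich = [], []

for _ in range(n_sim):
    x = rng.standard_normal(n)
    y = beta * x + rng.standard_normal(n)
    r_cont.append(np.corrcoef(x, y)[0, 1])
    high = (x > np.median(x)).astype(float)   # dichotomize at the median
    r_dich.append(np.corrcoef(high, y)[0, 1])

print(f"mean r, continuous x : {np.mean(r_cont):.3f}")
print(f"mean r, median split : {np.mean(r_dich):.3f}")
# For normal x the correlation shrinks by about sqrt(2/pi) ~ 0.80, i.e. an
# efficiency loss comparable to throwing away ~36% of the observations.
```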
Hope it will be useful to you all.
Related to power and the measurement of change scores vs. group differences: does anyone know of references on the number of measurements (for instance, adding an ‘in-between’ measurement) needed to increase power? Does it matter whether you are only going to investigate group differences, or is a pre-measurement alone enough?
Since this does not fit in with the ‘myths’ topic, please start a new topic with appropriate primary and secondary topic and tag choices. Then I’ll remove this one.
TOPIC suggestion: It would be useful to have a reference collection on P value and confidence interval myths.
@EpiLearneR You might find this open access paper valuable:
Search for @Sander (Sander Greenland) here, and you will find a lot of excellent papers on these misinterpretations as well as corrective measures. If you read them slowly and are prepared to look up the mathematics you do not know, they will teach you a lot.
Here is a good link to some of the things he has written on this:
This has been added to the sticky at the top now, thanks.
I added another good paper on the p-value misinterpretations by Leonhard Held:
Held L (2013). Reverse-Bayes analysis of two common misinterpretations of significance tests. Clinical Trials, 10(2).
Later on I’ll add some of the position papers issued by the American Statistical Association on this issue.
I would add this blog post to the list:
I found a great resource on the fallacies of using parametric methods on ordinal data. To which section of this wiki should it be added?
Thanks, this will be very helpful. I think Grieve’s paper should be on the list of references on NNT.
How about something on the importance of prespecifying primary vs. secondary endpoints and handling multiplicity, including hierarchical procedures? However, this may not fit that well under the heading of statistical misconceptions. Rather, it seems to me that people often don’t appreciate how easily outcome switching, P-hacking, and HARKing can occur.
This one is more controversial than it appears. I follow the Cook & Farewell approach of requiring a pre-specified priority ordering of endpoints, but not doing formal multiplicity correction after that.
The following threads had a discussion of this issue. It seems like it could fall under the p value misconception heading, but I think it deserves a section on its own.
I think the Bayesian POV provides a framework for guidance on this. @Sander provided references to Bayesian justifications for adjustment.
Some of his own writing on the issue:
I posted a few references in the context of clinical trials in this thread:
The following is as close to a complete theory of multiple comparison procedures (MCPs) as we are going to get: the perceived need to adjust is closely related to the ratio \frac{1-\beta}{\alpha} an experimenter wishes to achieve, and to the plausibility of the hypothesis under consideration.
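As a hedged worked example of that ratio (my numbers, not the paper’s): with the conventional \alpha = 0.05 and 1-\beta = 0.80, the ratio is 0.80/0.05 = 16. An experimenter who wants that same ratio to hold family-wise across, say, 10 tests of a priori implausible hypotheses is pushed toward a smaller per-test \alpha, which is exactly what a multiplicity adjustment delivers; for a single highly plausible hypothesis there is no such pressure.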
I found some additional resources to push back against a decision to use change from baseline. I thought some of these would be useful and could be added to the list (a small simulation sketch follows at the end of this post):
Manuscripts:
Vickers AJ, Altman DG. Analysing controlled trials with baseline and follow up measurements. BMJ 2001; 323: 1123.
Vickers AJ. The use of percentage change from baseline as an outcome in a controlled trial is statistically inefficient: a simulation study. BMC Med Res Methodol. 2001;1:6. doi: 10.1186/1471-2288-1-6. Epub 2001 Jun 28. PMID: 11459516; PMCID: PMC34605.
Bland JM, Altman DG. Comparisons against baseline within randomised groups are often used and can be highly misleading. Trials. 2011 Dec 22;12:264. doi: 10.1186/1745-6215-12-264. PMID: 22192231; PMCID: PMC3286439.
Bland JM, Altman DG. Best (but oft forgotten) practices: testing for treatment effects in randomized trials by separate analyses of changes from baseline in each group is a misleading approach. Am J Clin Nutr. 2015 Nov;102(5):991-4. doi: 10.3945/ajcn.115.119768. Epub 2015 Sep 9. PMID: 26354536.
An interesting response to the paper above:
Stanhope KL, Havel PJ. Response to “Best (but oft forgotten) practices: testing for treatment effects in randomized trials by separate analyses of changes from baseline in each group is a misleading approach”. Am J Clin Nutr. 2016 Feb;103(2):589. doi: 10.3945/ajcn.115.125989. PMID: 26834111.
Leo Törnqvist, Pentti Vartia & Yrjö O. Vartia (1985) How Should Relative Changes be Measured?, The American Statistician, 39:1, 43-46, DOI: 10.1080/00031305.1985.10479385
Kaiser L. Adjusting for baseline: change or percentage change? Stat Med. 1989 Oct;8(10):1183-90. doi: 10.1002/sim.4780081002. PMID: 2682909.
Blogs/ Online Resources:
Magnusson K. Change over time is not “treatment response”. Rpsychologist blog 2018. https://rpsychologist.com/treatment-response-subgroup
Harrell F. Statistical Errors in the Medical Literature: Change from Baseline. Statistical Thinking blog, 2021 (last updated).
Interactive Simulation by Frank Harrell:
Harrell F. Transformations, Measuring Change, and Regression to the Mean. BBR Ch. 14.
Great Discussion on QOL Analysis and Change from Baseline:
davidcnorrismd. Exemplary QOL analyses that avoid change-from-baseline blunders? Datamethods forum, 2019.
Twitter post by Stephen Senn on change from baseline: https://twitter.com/stephensenn/status/1224362916423573504
Textbooks:
Harrell F, Slaughter J. Change from Baseline in Randomized Studies. Biostatistics for Biomedical Research. Chapter 14.4.1. (https://hbiostat.org/doc/bbr.pdf)
Senn S. Baselines and Covariate Information. Statistical Issues in Drug Development. Chapter 7.
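As mentioned above, here is a small simulation sketch of the Bland & Altman (2011, 2015) point (my own code; n, the correlation, and the shared improvement are assumptions): when both arms improve for reasons unrelated to treatment, within-arm pre-post tests are routinely “significant” even though the randomized between-arm comparison correctly finds nothing.

```python
# Sketch: within-arm pre-post tests vs the correct between-arm comparison,
# with NO treatment effect but natural improvement in both arms.
# All parameters are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, rho, drift, n_sim = 40, 0.5, 0.5, 1000   # drift: improvement in BOTH arms
sig_within = sig_between = 0

for _ in range(n_sim):
    cov = [[1.0, rho], [rho, 1.0]]
    trt  = rng.multivariate_normal([0.0, drift], cov, size=n)
    ctrl = rng.multivariate_normal([0.0, drift], cov, size=n)
    # misleading: paired pre-post test within the treated arm alone
    sig_within += stats.ttest_rel(trt[:, 1], trt[:, 0]).pvalue < 0.05
    # correct: compare changes between the randomized arms
    sig_between += stats.ttest_ind(trt[:, 1] - trt[:, 0],
                                   ctrl[:, 1] - ctrl[:, 0]).pvalue < 0.05

print(f"'significant' within-arm pre-post tests: {sig_within / n_sim:.0%}")
print(f"significant between-arm comparisons    : {sig_between / n_sim:.0%}")
```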
Might I gingerly add to the textbooks: Statistical aspects of the design and analysis of clinical trials (revised edn). Brian S. Everitt and Andrew Pickles, Imperial College Press, London, 2004. Section 5.1 argues against the use of change scores in RCTs and gives a nice, didactic comparison of methods for incorporating baseline data, referring to some of the above-cited material:
- Senn, S. (1994a). Repeated measures in clinical trials: analysis using mean summary statistics and its implications for design. Statist. Med. 13:
- Senn, S. (1994b). Testing for baseline balance in clinical trials. Statist. Med. 13: 1715-1726.
I just came across a recently published commentary by Horwitz et al. discussing “Table 1”. [I haven’t added it to the wiki post because it doesn’t quite fit in with that topic title.]
In the commentary, the authors argue that more “biographical variables” (in addition to “biological variables”) should be included in Table 1. For example, they write that race and/or ethnicity are not “strictly biological”. They cite another commentary which suggests other “social and behavioral determinants of health” (e.g., social isolation) that should be recorded/measured.
While the commentary doesn’t appear to give specific recommendations – it seems more like a “call to action” – its title caught my eye and I thought it could be of interest to epidemiologists and others on this forum.
- Horwitz, R.I., Lobitz, G., Mawn, M., Conroy, A.H., Cullen, M.R., Sim, I. and Singer, B.H., 2022. Rethinking Table 1. Journal of Clinical Epidemiology, 142, 242-245.
- Adler, N.E. and Stead, W.W., 2015. Patients in context – EHR capture of social and behavioral determinants of health. The New England Journal of Medicine, 372(8), 698-701.
P.S. If the moderators think this would be more appropriate as a separate topic, that is OK with me. I don’t know if it “pushes back against common myths”, but it does argue for a change in practice.
Here is another paper I found arguing that “post-hoc power” is a nonsensical concept:
Fraser, R.A. Inappropriate use of statistical power. Bone Marrow Transplantation (2023).
Unfortunately, the article also writes about the difference between frequentist confidence intervals and Bayesian credible intervals, and to my understanding both the authors and the editors get the interpretations wrong: the frequentist interpretation is presented as the limiting relative frequency of an event, a view refuted by Hájek’s “Arguments against frequentism”, and the Bayesian interpretation is treated as subjective probability assigned to events rather than to parameters.
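On the post-hoc power point, a minimal sketch of why the concept is empty (my own code; the two-sided z-test framing is an assumption): “observed power” computed at the observed effect is a deterministic, monotone function of the p-value, so it cannot add information, and p = 0.05 always maps to observed power of about 0.50.

```python
# Sketch: for a two-sided z-test, 'observed power' is a function of p alone.
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)

for p in [0.001, 0.01, 0.05, 0.20, 0.50]:
    z_obs = stats.norm.ppf(1 - p / 2)          # |z| implied by the p-value
    # power if the true effect equalled the observed one (both rejection tails)
    obs_power = stats.norm.sf(z_crit - z_obs) + stats.norm.cdf(-z_crit - z_obs)
    print(f"p = {p:5.3f}  ->  observed power = {obs_power:.2f}")
```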
Using within-group tests in parallel-group randomized trials is not recommended. Does this also apply to cross-over randomized trials? E.g., analyzing pre-post change in Condition 1 and pre-post change in Condition 2.
The following paper discusses Table 1 in observational studies, including the use of p-values. I hope it helps others interested in this.
Who is in this study, anyway? Guidelines for a useful Table 1
The appropriateness of including a column containing inferential statistics (e.g. p-values) is a topic of some controversy. Statistical testing of distributions of variables (e.g. between exposed and unexposed) is common and even occasionally required by journals;1,6,9,10 although this is a tempting way to assess confounding, it is not best practice. Statistical significance is often misunderstood: non-significance of a p-value does not indicate that no difference in the distribution of a variable exists, and significance does not mean that the difference is meaningful or that the difference indicates presence of confounding.10–13 As a result, confounder assessment should not be based on p-values (Figure 2, Point 3; Figure 3, Point 4).1,2 Rather, authors should consider whether the relationship between the exposure and hypothesized confounders is as expected according to the causal theory, and consider whether the magnitude of an observed difference for a potential confounder represents a meaningful difference.1,9,10 Similarly, when considering external validity, statistical tests are not a helpful way to assess meaningful differences between source and target populations.
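To illustrate the quoted point, a minimal simulation sketch (my own code; the data-generating model and all parameters are assumptions): in a small study a genuine confounder often shows a non-significant Table 1 p-value while still biasing the crude effect estimate materially.

```python
# Sketch: a real confounder can look 'balanced' by a Table 1 p-value in a
# small study yet still bias the crude estimate. Parameters are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, n_sim = 50, 2000
biases_when_ns = []

for _ in range(n_sim):
    c = rng.standard_normal(n)                                    # confounder
    x = (rng.random(n) < 1 / (1 + np.exp(-0.5 * c))).astype(int)  # exposure ~ c
    y = 1.5 * c + rng.standard_normal(n)                          # outcome ~ c only
    if x.sum() in (0, n):                                         # skip degenerate arms
        continue
    p = stats.ttest_ind(c[x == 1], c[x == 0]).pvalue
    if p > 0.05:                                   # 'balanced' according to Table 1
        biases_when_ns.append(y[x == 1].mean() - y[x == 0].mean())

print(f"runs with a non-significant Table 1 check: {len(biases_when_ns) / n_sim:.0%}")
print(f"mean crude 'effect' in those runs (truth is 0): {np.mean(biases_when_ns):.2f}")
```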