This list is not all-inclusive, and nominations for additional entries are welcomed. The list is not meant to imply that a lead author should wait until the end to correct statistical errors, especially with regard to design flaws. To make your research reproducible, you need a statistical analysis plan to be completed before starting any data analysis that reveals patterns about which you are hypothesizing. The plan is best developed with close collaboration of a statistician. One other reason that collaborating with a statistician up front is important is the high frequency with which statisticians find flaws in the response variable chosen by the investigator, with regard to information content, power, precision, and the more subtle issue of finding that the response variable’s definition depends on another variable, e.g., when the response variable has different meanings for different subjects.
If a study is designed to detect a certain effect size with a given power, the effect size should never be the observed effect from another study, which may be estimated with error and be overly optimistic. The effect size to use in planning should be the clinically or biologically relevant effect one would regret missing. Usually the only quantities from prior studies that are useful in sample size estimation are (in the case of a continuous response variable with a symmetric distribution) estimates of the standard deviation or the correlation between two measurements on the same subject measured at two different times, or (in the case of a binary or time to event outcome) event probabilities in control subjects. For more about choice of effect sizes see this.
Many researchers use Cohen’s standardized effect sizes in planning a study. This has the advantage of not requiring pilot data. But such effect sizes are not biologically meaningful and may hide important issues as discussed by Lenth. Studies should be designed on the basis of effects that are relevant to the investigator and human subjects. If, for example, one plans a study to detect a one standard deviation (SD) difference in the means and the SD is large, one can easily miss a biologically important difference that happened to be much less than one SD in magnitude. Note that the SD is a measure of how subjects disagree with one another, not a measure of an effect (e.g., the shift in the mean).
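The point about standardized effect sizes can be illustrated with a rough sample size calculation. The sketch below uses the usual normal-approximation formula for comparing two means; the SD of 20 units, the relevant difference of 5 units, and the function name `n_per_group` are hypothetical choices made only for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.90):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


sd = 20.0                 # hypothetical between-subject SD
relevant_difference = 5.0  # hypothetical clinically relevant difference

n_one_sd = n_per_group(delta=sd, sd=sd)                    # designed for a 1 SD shift
n_relevant = n_per_group(delta=relevant_difference, sd=sd)  # designed for the relevant shift

print(n_one_sd, n_relevant)
```

A study sized to detect a one-SD difference is far too small to detect the smaller difference that actually matters in this hypothetical setting.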
Categorizing continuous predictor or response variables into intervals, as detailed here, causes serious statistical inference problems including bias, loss of power, and inflation of type I error.
Some analysts use tests or graphics for assessing normality in choosing between parametric and nonparametric tests. This is often the result of an unfounded belief that nonparametric rank tests are not as powerful as parametric tests. In fact on the average nonparametric tests are more powerful than their parametric counterparts, because data are non-normally distributed more often than they are Gaussian. At any rate, using an assessment of normality to choose a test relies on the assessment having nearly perfect sensitivity. If a test of normality has a large type II error, there is a high probability of choosing the wrong approach. Coupling a test of normality to a final nonparametric vs. parametric test only has the appearance of increasing power. If the normality test has a power of 1.0 one can, for example, improve on the 0.96 efficiency of the Wilcoxon test vs. the t-test when normality holds. However, once the uncertainty of the normality test is accounted for, there is no power gain.
When all that is desired is an unadjusted (for other variables) P-value and a parametric test is used, the resulting inference will not be robust to extreme values, will depend on how the response variable is transformed, and will suffer a loss of power if the data are not normally distributed. Parametric methods are more necessary when adjusting for confounding or for subject heterogeneity, or when dealing with a serially measured response variable.
When one wants a unitless index of the strength of association between two continuous variables and only wants to assume that the true association is monotonic (is always decreasing or always increasing), the nonparametric Spearman’s rho rank correlation coefficient is a good choice. A good nonparametric approach to getting confidence intervals for means and differences in means is the bootstrap.
Use of Fisher’s “Exact” test
Fisher’s “exact” test is called “exact” only because its type I assertion probability is guaranteed to not exceed the nominal \alpha level. But it tends to result in a probability that is below that level, often by meaningful amounts. Thus its p-values are too large and power suffers. Fisher’s test also conditions on the margins, e.g. conditions on the total number of outcome events, a conditioning that would be more appropriate for case-control studies than cohort studies. It is a myth that the ordinary Pearson \chi^2 test is not very accurate. In fact for the 2\times 2 table it is very accurate as long as expected frequencies exceed 1.0 if N is replaced by N-1 in the Pearson \chi^2 formula.
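As a minimal sketch of the N-1 version of the Pearson statistic mentioned above, the following computes the statistic and its 1 d.f. P-value for a 2×2 table using only the standard library. The function names and the example counts are hypothetical.

```python
from math import erfc, sqrt


def n_minus_1_chi2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] with N replaced
    by N-1, plus the upper-tail P-value from chi-square with 1 d.f."""
    n = a + b + c + d
    r1, r2 = a + b, c + d
    c1, c2 = a + c, b + d
    chi2 = (n - 1) * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)
    p_value = erfc(sqrt(chi2 / 2))  # P(chi-square_1 > chi2)
    return chi2, p_value


def min_expected(a, b, c, d):
    """Smallest expected cell frequency under independence."""
    n = a + b + c + d
    return min(r * col / n for r in (a + b, c + d) for col in (a + c, b + d))


# Hypothetical counts: rows are treatment groups, columns are event / no event.
table = (8, 2, 3, 7)
chi2, p_value = n_minus_1_chi2(*table)
print(round(chi2, 3), round(p_value, 4), min_expected(*table))
```

The `min_expected` check corresponds to the condition that expected frequencies exceed 1.0.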
Inappropriate Descriptive Statistics
The mean and standard deviation are not descriptive of variables that have an asymmetric distribution such as variables with a heavy right tail (which includes many clinical lab measurements). Quantiles are always descriptive of continuous variables no matter what the distribution. A good 3-number summary is the lower quartile (25th percentile), median, and upper quartile (75th percentile). The difference in the outer quartiles is a measure of subject-to-subject variability (it is an interval containing half the subjects’ values). The median is always descriptive of “typical” subjects. By comparing the difference between the upper quartile and the median with the difference between the median and the lower quartile, one obtains a sense of the symmetry of the distribution. Above all don’t provide descriptive statistics such as “the mean hospital cost was $10,000 plus or minus $20,000.” Nonparametric bootstrap confidence intervals will prevent impossible values being used as confidence limits.
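A small sketch of the 3-number summary and of a nonparametric percentile bootstrap interval for the mean follows. The cost values are made up, and the percentile method is only one of several bootstrap interval constructions; it is used here as an assumption for simplicity.

```python
import random
from statistics import mean, quantiles


def three_number_summary(x):
    """Lower quartile, median, and upper quartile."""
    q1, med, q3 = quantiles(x, n=4, method="inclusive")
    return q1, med, q3


def bootstrap_ci_mean(x, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(x)
    means = sorted(mean(rng.choices(x, k=n)) for _ in range(n_boot))
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi


# Hypothetical right-skewed hospital costs (dollars).
costs = [1200, 1500, 1800, 2100, 2500, 3000, 3600, 4500, 6000, 9000, 15000, 70000]

q1, med, q3 = three_number_summary(costs)
ci_lo, ci_hi = bootstrap_ci_mean(costs)
print(q1, med, q3)
print(ci_lo, ci_hi)
```

Because every bootstrap mean is an average of observed (positive) costs, the interval cannot extend below zero, unlike a "mean plus or minus SD" summary.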
When using the Wilcoxon-Mann-Whitney test for comparing two continuous or ordinal variables, use difference estimates that are consistent with this test. The Wilcoxon test does not test for difference in medians or means. It tests whether the Hodges-Lehmann estimate of the difference between two groups is zero. The HL-estimate is the median difference over all possible pairs of subjects, the first from group 1 and the second from group 2. See this for an example statement of results, and a good reference for this.
Note: When there are excessive ties in the response variable such as a variable with clumping at zero, quantiles such as the median may not be good descriptive statistics (the mean with associated bootstrap confidence limits may be better), and the Hodges-Lehmann estimate does not work well.
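The Hodges-Lehmann estimate described above (the median of all pairwise differences between groups) takes only a few lines to compute. This is a sketch with made-up data and a hypothetical function name.

```python
from statistics import median


def hodges_lehmann(group1, group2):
    """Median of all pairwise differences x - y, x from group1, y from group2."""
    return median(x - y for x in group1 for y in group2)


# Hypothetical responses in two groups.
g1 = [5, 7, 9]
g2 = [1, 2, 6]

hl = hodges_lehmann(g1, g2)
diff_in_medians = median(g1) - median(g2)
print(hl, diff_in_medians)
```

In this toy example the Hodges-Lehmann estimate and the difference in medians are not equal, which is why the estimate reported alongside the Wilcoxon test should be the former.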
Although seen as appealing by some, the so-called “number needed to treat” suffers from a long list of problems and is not recommended.
Confidence intervals for key effects should always be included. Studies should be designed to provide sufficiently narrow confidence intervals so that the results contain useful information. See section 6.6 of this and The End of Statistical Significance? as well as this.
In a single-subject-group study in which there are paired comparisons (e.g., pre vs. post measurements), researchers too easily take for granted the appropriate measure of change (simple difference, percent change, ratio, difference of square roots, etc.). It is important to choose a change measure that can be taken out of context, i.e., is independent of baseline. See this for more information. In general, change scores cause more problems than they solve. For example, one cannot use summary statistics on percent changes because of improper cancellation of positive and negative changes.
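The improper cancellation of percent changes can be seen in a two-subject toy example. The data are hypothetical, and the log ratio is shown only as one symmetric alternative, not as a recommendation from the text above.

```python
from math import log
from statistics import mean

# Hypothetical pre/post values: one subject doubles, the other halves.
pre = [50.0, 100.0]
post = [100.0, 50.0]

percent_changes = [100 * (b - a) / a for a, b in zip(pre, post)]
mean_percent_change = mean(percent_changes)

# Log ratios treat doubling and halving as equal and opposite changes.
mean_log_ratio = mean(log(b / a) for a, b in zip(pre, post))

print(percent_changes, mean_percent_change, mean_log_ratio)
```

The group total is unchanged (150 before and after), yet the mean percent change suggests a 25% increase.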
When there is more than one subject group, for example a two-treatment parallel-group randomized controlled trial, it is very problematic to incorporate change scores into the analysis. First, it may be difficult to choose the appropriate change score as mentioned above (e.g., relative vs. absolute). Second, regression to the mean and measurement error often render simple change scores inappropriate. Third, when there is a baseline version of the final response variable, it is necessary to control for subject heterogeneity by including that baseline variable as a covariate in an analysis of covariance. If a change score is used as the response, the baseline variable then appears on both the left-hand and right-hand sides of the regression model, making interpretation of the results more difficult. It is much preferred to keep the pure response (follow-up) variable on the left side and the baseline value on the right side of the equation.
Some researchers analyze serial responses, when multiple response measurements are made per subject, as if they were from separate subjects. This will exaggerate the real sample size and make P-values too small. Some researchers still use repeated measures ANOVA even though this technique makes assumptions that are extremely unreasonable for serial data. An appropriate methodology should be used, such as generalized least squares or mixed effects models with an appropriate covariance structure, GEE, or an approach related to GEE in which the cluster bootstrap or the cluster sandwich covariance estimator is used to correct a working independence model for within-subject correlation.
Making conclusions from large P-values | More Information | Absence of Evidence is not Evidence of Absence
In general, the only way that a large P-value can be interpreted is for example “The study did not provide sufficient evidence for an effect.” One cannot say “P = 0.7, therefore we conclude the drug has no effect”. Only when the corresponding confidence interval excludes both clinically significant benefit and harm can one make such a conclusion. A large P-value by itself merely means that a higher sample size is required to allow conclusions to be drawn. See section 5.11.3 of this for details, along with this.
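To illustrate why treating serial measurements as independent overstates precision, the sketch below contrasts a cluster bootstrap (resampling whole subjects) with a naive bootstrap (resampling individual measurements) for an overall mean. The data, function names, and the use of percentile intervals are all assumptions made for illustration; a real analysis would apply this to a fitted working independence model.

```python
import random
from statistics import mean


def percentile_interval(estimates, level=0.95):
    """Percentile interval from a list of bootstrap estimates."""
    s = sorted(estimates)
    k = len(s)
    return s[int((1 - level) / 2 * k)], s[min(k - 1, int((1 + level) / 2 * k))]


def cluster_bootstrap(data, n_boot=1000, seed=1):
    """Resample whole subjects, keeping each subject's measurements together."""
    rng = random.Random(seed)
    ids = list(data)
    estimates = []
    for _ in range(n_boot):
        sampled = rng.choices(ids, k=len(ids))
        estimates.append(mean(v for s in sampled for v in data[s]))
    return percentile_interval(estimates)


def naive_bootstrap(data, n_boot=1000, seed=1):
    """Resample measurements as if they came from separate subjects (incorrect)."""
    rng = random.Random(seed)
    pooled = [v for values in data.values() for v in values]
    estimates = [mean(rng.choices(pooled, k=len(pooled))) for _ in range(n_boot)]
    return percentile_interval(estimates)


# Hypothetical serial data: 6 subjects, 4 highly correlated measurements each.
data = {
    "s1": [10, 11, 9, 10], "s2": [20, 21, 19, 20], "s3": [30, 29, 31, 30],
    "s4": [40, 41, 39, 40], "s5": [50, 49, 51, 50], "s6": [60, 61, 59, 60],
}

cluster_ci = cluster_bootstrap(data)
naive_ci = naive_bootstrap(data)
print(cluster_ci, naive_ci)
```

With strongly correlated measurements within subject, the naive interval behaves as though there were 24 independent subjects rather than 6.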
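The logic above can be written as a small decision rule on the confidence interval for a treatment difference. This is only a sketch: the function name, the symmetric margin, and the wording of the returned messages are hypothetical.

```python
def interpret_ci(lower, upper, margin):
    """Classify a confidence interval for a treatment difference against a
    clinically important margin (assumed symmetric around zero)."""
    if lower > -margin and upper < margin:
        return "excludes clinically important benefit and harm"
    if lower > 0 or upper < 0:
        return "evidence for a nonzero effect"
    return "inconclusive: compatible with no effect and with an important effect"


# Two hypothetical studies, both with a large P-value (interval contains 0).
narrow = interpret_ci(lower=-1.0, upper=1.5, margin=5.0)
wide = interpret_ci(lower=-8.0, upper=12.0, margin=5.0)
print(narrow)
print(wide)
```

Only the first hypothetical study supports a statement that the treatment has no clinically important effect.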
There are many ways that authors have been seduced into taking results out of context, particularly when reporting the one favorable result out of dozens of attempted analyses. Filtering out (failing to report) the other analyses is scientifically suspect. At the very least, an investigator should disclose that the reported analyses involved filtering of some kind, and she should provide details. The context should be reported (e.g., “Although this study is part of a planned one-year follow-up of gastric safety for Cox-2 inhibitors, here we only report the more favorable short term effects of the drug on gastric side effects.”). To preserve type I error, filtering should be formally accounted for, which places the burden on the investigator of undertaking often complex Monte Carlo simulations.
Here is a checklist of various ways of filtering results, all of which should be documented, and in many cases, re-thought:
- Subsets of enrolled subjects
- Selection of endpoint
- Subset of follow-up interval
- Selection of treatments
- Selection of predictors
- Selection of cutpoints for continuous variables
There must be a complete accounting of all subjects or animals who entered the study. Response rates to follow-up assessments must be quoted, and interpretation of study results will be very questionable if more than perhaps 10% of subjects do not have response data available unless this is by design.
In a randomized comparison of treatments, the “intent to treat” analysis should be emphasized.
It is not appropriate to merely exclude subjects having incomplete data from the analysis. No matter how missing data are handled, the amount of missing baseline or response data should be carefully documented, including the proportion of missing values for each variable being analyzed and a description of the types of subjects having missing variables. The latter may involve an exploratory analysis predicting the tendency for a variable to have a missing value, based on predictors that are usually not missing. When there is a significant proportion of subjects having incomplete records, multiple imputation is advisable. Adding a new category to a variable to indicate missingness renders interpretation impossible and causes serious biases. See the October 2006 issue of Journal of Clinical Epidemiology for more about this and for other useful papers about missing data imputation. An excellent book is van Buuren’s.
A commonly used approach to handling dropouts in clinical trials is to use the “last observation carried forward” method. This method has been proven to be completely inappropriate in all situations. One of several problems with this method is that it treats imputed (carried forward) values as if they were real measurements. This results in overconfidence in estimates of treatment effects (standard errors and P-values are too low and confidence intervals are too narrow).
When using traditional frequentist statistical methods, adjustment of P-values for multiple comparisons is necessary unless
- The investigator has pre-specified an ordered priority list of hypotheses, and
- The results of all hypothesis tests performed are reported in the pre-specified order, whether the P-values are low or high
P-values should be adjusted for filtering as well as for tests that are reported in the current paper.
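The text above does not name a particular adjustment method. As one common illustration (an assumption, not a recommendation from this checklist), the Holm step-down procedure can be sketched as follows; the example P-values are made up.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted P-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest P-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvalues[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03]  # hypothetical unadjusted P-values
adjusted = holm_adjust(raw)
print(adjusted)
```

Note that the count m should include every test performed, including those filtered out of the paper.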
For an example where multiplicity adjustments are unnecessary see the classic Cook-Farewell paper: Multiplicity Considerations in the Design and Analysis of Clinical Trials
In general there is no reason to assume that the relationship between a predictor and the response is linear. However, categorizing a continuous predictor can cause major problems. Good solutions are to use regression splines or nonparametric regression.
Researchers frequently formulate their first model in such a way that it encapsulates a model specification bias that affects all later analytical steps. For example, the March 2009 issue of J Clin Epi has a paper examining the relationship between change in cholesterol and mortality. The authors never questioned whether the mortality effect was fully captured by a simple change, i.e., is the prediction equation of the form f(post)-f(pre) where f is the identity function? It is often the case that the effect of a simple difference depends on the “pre” value, that the transformation f is not linear, or that there is an interaction between pre and post. All of these effects are contained in a flexible smooth nonlinear regression spline surface (tensor spline) in three dimensions where the predictors are pre and post. One can use the surface to test the adequacy of its special case (post-pre) and to visualize all patterns.
Stepwise variable selection, univariable screening, and any method that eliminates “insignificant” predictor variables from the final model causes a multitude of serious problems related to bias, significance, improper confidence intervals, and multiple comparisons. Stepwise variable selection should be avoided unless backwards elimination is used with an alpha level of 0.5 or greater. See also the Stepwise FAQ, this and these articles:
- Step away from stepwise by Gary Smith
- Five myths about variable selection by Georg Heinze and Daniela Dunkler
- Variable selection - A review and recommendations for the practicing statistician by Georg Heinze, Christine Wallisch, Daniela Dunkler
Unless the sample size is huge, a final model in which every variable is significant is usually the result of the authors using stepwise variable selection or some other approach for filtering out “insignificant” variables. Hence the presence of a table of variables in which every variable is significant is usually the sign of a serious problem.
Authors frequently use strategies involving removing insignificant terms from the model without making an attempt to derive valid confidence intervals or P-values that account for uncertainty in which terms were selected (using for example the bootstrap or penalized maximum likelihood estimation). J Clin Epi 2009-03-01, Volume 62, Issue 3, Pages 232-240 cited Ockham’s razor as a principle to be followed when building a model, not realizing that parsimony obtained by using the data at hand to make modeling decisions is only apparent parsimony. Removing insignificant terms causes bias, inaccurate (too narrow) confidence intervals, and failure to preserve type I error in the resulting model’s P-values, which are calculated as though the model was completely pre-specified.
When a multivariable model is reported, an unbiased validation (at least an internal validation) should be reported in the paper unless
- The model terms were pre-specified and
- The purpose of model fitting was not to report on the predictive accuracy of the model but to compute pre-specified partial test statistics, estimates, and confidence intervals for a small selected set of predictors or
- The dataset meets the “20:1” rule
The 20:1 rule of thumb is a crude approximation to a much better way to compute sample sizes needed for developing predictive models. The 20:1 rule is as follows. Let m denote the effective sample size (the number of subjects if the response variable is a fully-observed continuous one; the number of events if doing a survival analysis; the lower of the number of events and number of non-events if the response is dichotomous) and p denote the number of candidate predictor terms that were examined in any way with respect to the response variable. p includes nonlinear terms, product terms, different transformations attempted, the total number of cutoffs attempted to be applied to continuous predictors, and the number of variables dropped from the final model in a way that was unblinded to the response. If the ratio of m to p exceeds 20, the model is likely to be reliable and there is less need for the model to be validated.
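The rule as stated can be expressed directly in code. The function and argument names below are hypothetical, and the counts are invented for illustration.

```python
def effective_sample_size(response, n_subjects=0, n_events=0, n_nonevents=0):
    """Effective sample size m as defined by the 20:1 rule."""
    if response == "continuous":   # fully observed continuous response
        return n_subjects
    if response == "survival":     # time-to-event response
        return n_events
    if response == "binary":       # dichotomous response
        return min(n_events, n_nonevents)
    raise ValueError(f"unknown response type: {response}")


def meets_20_to_1(m, p):
    """True if the ratio of effective sample size to candidate terms exceeds 20."""
    return m / p > 20


# Hypothetical study: 1000 subjects, 120 events, 10 candidate predictor terms
# (counting nonlinear terms, interactions, and anything screened against the response).
m = effective_sample_size("binary", n_events=120, n_nonevents=880)
ok = meets_20_to_1(m, p=10)
print(m, ok)
```

Here the limiting quantity is the 120 events, not the 1000 subjects, so the ratio is 12 and validation is still needed.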
When a validation is needed, the best approach is typically the bootstrap. This is a Monte Carlo simulation technique in which all steps of the model-building process (if the model was not pre-specified) are repeated for each of, say, 400 samples with replacement of size n from the original sample containing n subjects.
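A minimal sketch of bootstrap validation follows, using the optimism-correction form of the bootstrap (an assumption; the paragraph above does not spell out the estimator) and a one-predictor least squares model standing in for "all steps of the model-building process". The simulated data, function names, and the choice of R-squared as the accuracy index are all hypothetical.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y on x; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def r_squared(model, x, y):
    """R-squared of a fixed model evaluated on the data (x, y)."""
    intercept, slope = model
    my = mean(y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst


def optimism_corrected_r2(x, y, n_boot=400, seed=1):
    """Apparent R-squared and its bootstrap optimism-corrected version."""
    rng = random.Random(seed)
    n = len(x)
    apparent = r_squared(fit_line(x, y), x, y)
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # sample n subjects with replacement
        xb, yb = [x[i] for i in idx], [y[i] for i in idx]
        model = fit_line(xb, yb)                     # repeat the whole fitting process
        optimism.append(r_squared(model, xb, yb) - r_squared(model, x, y))
    return apparent, apparent - mean(optimism)


# Simulated data: a noisy linear relationship in 30 subjects.
data_rng = random.Random(42)
x = [float(i) for i in range(30)]
y = [2 + 0.5 * xi + data_rng.gauss(0, 4) for xi in x]

apparent, corrected = optimism_corrected_r2(x, y)
print(apparent, corrected)
```

If the model had been built with variable selection or other data-driven steps, those steps would need to be repeated inside the loop in place of `fit_line`.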
When a predictive model or instrument is intended to provide absolute estimates (e.g., risk or time to event), it is necessary to validate the absolute accuracy of the instrument over the entire range of predictions that are supported by the data. It is not appropriate to use binning (categorization) when estimating the calibration curve. Instead, the calibration curve should be estimated using a method that smoothly (without assuming linearity) relates predicted values (formed from a training set) to observed values (in an independent test set, or overfitting-corrected using resampling; see Stat in Med 15:361;1996). For testing whether the calibration curve is ideal (i.e., is the 45 degree line of identity) consider using the single d.f. Spiegelhalter z-test (Stat in Med 5:421;1986). The mean absolute error and the 90th percentile of absolute calibration error are useful summary statistics. All of these quantities and tests are provided by the R rms package.
As described in detail here, using improper accuracy scoring rules to quantify predictive accuracy gives misleading results and often requires arbitrary dichotomization of predictions. Measures such as sensitivity, specificity, precision, recall, classification accuracy, false positive, and false negative probabilities are not consistent with good decision making and should not be used except in the special case where classification is justified and subjects are sampled according to outcome status (e.g., a case-control study). Besides constructing full-resolution calibration curves as described above, proper accuracy scores should be used to quantify overall accuracy or discrimination ability. Proper accuracy scores include mean squared error, the Brier score (quadratic accuracy score), and the logarithmic accuracy score. For continuous outcomes, mean absolute error is also helpful. The logarithmic accuracy score for probability forecasts is a function of the log likelihood. There are other good measures based on log likelihood such as pseudo R-squared measures. A measure of pure discrimination that is not a proper accuracy score but is still useful is the c-index. The c-index is good for describing the predictive discrimination of a single model but is not sensitive enough for comparing two models. For that a proper accuracy score is required, or a likelihood ratio chi-square test. See here for more information.
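Two of the proper accuracy scores named above are easy to compute directly. In the sketch below, the forecasts and outcomes are invented, and the 0.5 cutoff is used only to show what classification accuracy ignores.

```python
from math import log
from statistics import mean


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))


def log_score(probs, outcomes):
    """Mean log likelihood of the observed 0/1 outcomes (higher is better)."""
    return mean(y * log(p) + (1 - y) * log(1 - p) for p, y in zip(probs, outcomes))


def classification_accuracy(probs, outcomes, cutoff=0.5):
    """Proportion classified correctly after dichotomizing at an arbitrary cutoff."""
    return mean(int((p > cutoff) == bool(y)) for p, y in zip(probs, outcomes))


outcomes = [1, 1, 0, 0]
sharp = [0.9, 0.8, 0.2, 0.1]  # hypothetical confident, well-calibrated forecasts
vague = [0.6, 0.6, 0.4, 0.4]  # hypothetical forecasts barely away from 0.5

for name, probs in (("sharp", sharp), ("vague", vague)):
    print(name, brier_score(probs, outcomes), log_score(probs, outcomes),
          classification_accuracy(probs, outcomes))
```

Both sets of forecasts get the same classification accuracy, while the Brier and logarithmic scores distinguish them.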
Use of Imprecise Language | Glossary
It is important to distinguish rates from probabilities, odds ratios from risk ratios, and various other terms. The word risk usually means the same thing as probability. Here are some common mistakes seen in manuscripts:
- risk ratio or RR used in place of odds ratio when an odds ratio was computed
- reduction in risk used in place of reduction in odds; for example an odds ratio of 0.8 could be referred to as a 20% reduction in the odds of an event, but not as a 20% reduction in risk
- risk ratio used in place of hazard ratio when a Cox proportional hazards model is used; the proper term hazard ratio should be used to describe ratios arising from the Cox model. These are ratios of instantaneous event rates (hazard rates) and not ratios of probabilities.
- multivariate model used in place of multivariable model; when there is a single response (dependent) variable, the model is univariate. Multivariate is reserved to refer to a model that simultaneously deals with multiple response variables.
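The difference between a reduction in odds and a reduction in risk, as in the odds ratio of 0.8 example above, can be checked numerically. The baseline risks below are hypothetical.

```python
def risk_after_odds_ratio(baseline_risk, odds_ratio):
    """Risk in the comparison group implied by a baseline risk and an odds ratio."""
    odds = baseline_risk / (1 - baseline_risk) * odds_ratio
    return odds / (1 + odds)


def relative_risk_reduction(baseline_risk, odds_ratio):
    """1 minus the risk ratio implied by the odds ratio at this baseline risk."""
    return 1 - risk_after_odds_ratio(baseline_risk, odds_ratio) / baseline_risk


for baseline in (0.01, 0.50):
    rrr = relative_risk_reduction(baseline, odds_ratio=0.8)
    print(f"baseline risk {baseline:.2f}: risk reduction {100 * rrr:.1f}%")
```

An odds ratio of 0.8 is a 20% reduction in odds at every baseline risk, but the implied reduction in risk depends on the baseline and approaches 20% only when the event is rare.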
Graphics | BBR Section 4.3 | Handouts | Advice from the PGF manual Chapter 6
- Pie charts are visual disasters
- Bar charts with error bars are often used by researchers to hide the raw data and thus are often unscientific; for continuous response variables that are skewed or have for example fewer than 15 observations per category, the raw data should almost always be shown in a research paper.
- Dot charts are far better than bar charts, because they allow more categories, category names are instantly readable, and error bars can be two-sided without causing an optical illusion that distorts the perception of the length of a bar
- Directly label categories and lines when possible, to allow the reader to avoid having to read a symbol legend
- Multi-panel charts (dot charts, line graphs, scatterplots, box plots, CDFs, histograms, etc.) have been shown to be easier to interpret than having multiple symbols, colors, hatching, etc., within one panel
- Displays that keep continuous variables continuous are preferred
Tables | BBR Section 4.4 | Examples (see section 4.2)
As stated in Northridge et al (see below), “The text explains the data, while tables display the data. That is, text pertaining to the table reports the main results and points out patterns and anomalies, but avoids replicating the detail of the display.” In many cases, it is best to replace tables with graphics.
- Require that the Methods section includes a detailed and reproducible description of the statistical methods.
- Require that the Methods section includes a description of the statistical software used for the analysis and sample size calculations.
- Require authors to submit a diskette with their data files as a spreadsheet or statistical software file when submitting manuscripts for publication.
- Pay an experienced biostatistician to review every manuscript.
- Require exact P values, reported consistently to 3 decimal places, rather than NS or P<0.05, unless P<0.001 or space does not permit exact P Values - as in a complex table or Figure.
- Require that the Methods section contains enough detail about how the sample size was calculated so that another statistician could read the report and reproduce the calculations.
- Do not allow ambiguous reporting of percentages, such as “The recurrence rate in the control group was 50% and we calculated that the sample size required to detect a 20% reduction would be 93 in each group.” Some authors mean 30% (50%-20%=30%) and some mean 40% (20% of 50% is 10%, 50%-10%=40%). Require that the authors clarify this.
- Print the Methods section in a font the same size as the rest of the paper.
- Require 95% confidence interval for all important results, especially those supporting the conclusions. Require authors to justify the logic of using standard errors.
- Identify every statistical test used for every P value. In tables, this can be accomplished with footnotes and in figures the legend can describe the test used.
- Enforce some consistency of statistical reporting. Do not allow authors to invent names for statistical methods.
- Require that the authors describe who performed the statistical analysis. This is especially important if the analyses were performed by the biostatistics section of a pharmaceutical company.
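The ambiguity in the "20% reduction" item in the list above comes down to two different calculations, shown here with the numbers quoted in that item. The variable names are hypothetical.

```python
control_rate = 0.50
reduction = 0.20

# Reading 1: absolute reduction of 20 percentage points.
absolute_reading = control_rate - reduction

# Reading 2: relative reduction of 20% of the control rate.
relative_reading = control_rate * (1 - reduction)

print(f"absolute: {absolute_reading:.0%}, relative: {relative_reading:.0%}")
```

The two readings imply different treatment-group rates and therefore different required sample sizes, so authors should state which one was used.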
- Northridge ME, Levin B, Feinleib M, Susser MW: Editorial: statistics in the journal - significance, confidence, and all that. Am J Public Health 87:1092-1095, 1997.
- STROBE guidelines for STrengthening the Reporting of OBservational studies in Epidemiology
- The EQUATOR network for reporting of health research
- Peter Norvig’s Common Mistakes
- Reflections from a statistical editor by Miguel Marino
- Guidelines for reporting of statistics for clinical research in urology
- Guidelines for reporting of figures and tables for clinical research in urology
- Prognosis research strategy (PROGRESS)
Vanderbilt University Department of Biostatistics 2004—