The STROBE explanation and elaboration document (https://annals.org/aim/fullarticle/737187/), which describes best practices for observational studies, states that “Inferential measures such as standard errors and confidence intervals should not be used to describe the variability of characteristics, and significance tests should be avoided in descriptive tables.” The document goes on to explain why selecting confounders based on p-value criteria is poor practice. However, it’s not clear why simply reporting inferential measures is bad practice. It seems like a perfectly valid question to ask whether, e.g., some patient characteristics vary between two or more (non-randomized) strata.

I am seeing reviewers press for p-values to be omitted from table one of observational studies because of this. Why is this in the STROBE guidelines?

I think the key term here is “descriptive”. Standard errors, confidence intervals and p-values are inferential measures that try to infer something about the population. Descriptive statistics, on the other hand, try to describe the sample characteristics (say Gini’s mean difference or a sample mean, for example).
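To make the distinction concrete, here is a minimal sketch using simulated ages (the data, seed, and function names are my own assumptions, not anything from STROBE). The descriptive summaries (mean, SD, Gini’s mean difference) stay roughly the same as the sample grows, while the standard error shrinks with sample size, because it describes the precision of the estimate rather than the spread of the sample.

```python
import math
import random
import statistics

def gini_mean_difference(values):
    """Mean absolute difference over all distinct pairs of observations."""
    n = len(values)
    total = sum(abs(a - b) for i, a in enumerate(values) for b in values[i + 1:])
    return total / (n * (n - 1) / 2)

def summarize(values):
    """Descriptive summaries (mean, SD, GMD) plus one inferential summary (SE)."""
    sd = statistics.stdev(values)
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "sd": sd,
        "gmd": gini_mean_difference(values),
        "se": sd / math.sqrt(len(values)),  # precision of the mean, not spread
    }

# Hypothetical data: simulated ages, mean 60 and SD 10 (an assumption).
rng = random.Random(0)
ages = [rng.gauss(60, 10) for _ in range(400)]

small = summarize(ages[:100])  # first 100 participants
large = summarize(ages)        # all 400 participants

for summary in (small, large):
    print({key: round(value, 2) for key, value in summary.items()})
```

With four times as many participants, the SE should be roughly halved while the SD and Gini’s mean difference barely move.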

I don’t think STROBE says “don’t use inferential statistics in observational studies!” but “don’t use inferential measures to describe your sample!”. I think these issues are touched upon in the paper of Douglas Altman (2005): Standard deviations and standard errors. BMJ; 331:903 (link).

Is the reason that the study is not designed or powered to detect differences between groups, so that such tests would be misleading, with a high false-negative rate? Is this fair to say?

The rationale for omitting “inferential summaries” (standard error/confidence interval/p-value) from descriptive tables is not clearly provided. Because the authors discuss the poor practice of selecting confounders based on p-values immediately afterward, I suspect that part of their rationale was to limit the temptation to do this by excluding p-values and other inferential summaries.

Thanks for sharing the link. You may be right that the authors of the STROBE document were thinking that descriptive summaries should be used only for description. But what use is description for description’s sake? We need to incorporate that data into our thinking about the problem at hand, and I think that means drawing inferences. Indeed, in the paper you linked, Altman states that the standard deviation (i.e., a “descriptive” rather than “inferential” summary) is an “estimate of the variability of the population from which the sample was drawn.” That makes sense to me, but it makes clear that even descriptive statistics are used to infer something about a population.

Note that the following is aimed primarily at the propensity score (PS) crowd, but I think it captures the concern about using statistical tests to assess whether two samples differ importantly on a variable:

“Significance testing has been frequently used to compare the distribution of measured baseline covariates between treated and untreated subjects in the propensity-score matched sample [7–9]. Imai et al. have argued that the use of significance testing to assess baseline balance in the matched sample is inappropriate for two reasons [29]. First, the sample size of the matched sample is less than that of the original unmatched sample. Thus, there is reduced statistical power to detect imbalance in the matched sample compared with in the original unmatched sample; apparent improvement in balance may be due to the reduced sample size. Second, Imai et al. suggest that balance is a property of a particular sample and makes no reference to a super-population. Imai et al.'s proscription against the use of significance testing has been echoed elsewhere [7, 8]. All of the methods described in Sections 3 and 4 are not influenced by sample size and are not based upon statistical hypothesis testing.”

I find the argument that important imbalance should not be assessed with summary statistics that depend on sample size (e.g., p-values) compelling, mostly because my own work often involves large differences between samples in terms of the standardized mean difference (SMD), with samples too small for a t-test to flag those differences as important.
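A rough sketch of why this happens, using made-up summary statistics (a 5-year difference in mean age with an SD of 10 in both groups; all numbers and function names are assumptions for illustration). The SMD depends only on the means and SDs, while the Welch t-statistic also grows with the group sizes.

```python
import math

def standardized_mean_difference(mean1, mean0, sd1, sd0):
    """Difference in means divided by the pooled SD; does not involve n."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd0 ** 2) / 2)
    return (mean1 - mean0) / pooled_sd

def welch_t_statistic(mean1, mean0, sd1, sd0, n1, n0):
    """Difference in means divided by its standard error; grows with n."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd0 ** 2 / n0)
    return (mean1 - mean0) / standard_error

# Hypothetical Table 1 row: mean age 65 vs 60, SD 10 in both groups.
for n_per_group in (10, 200):
    smd = standardized_mean_difference(65, 60, 10, 10)
    t = welch_t_statistic(65, 60, 10, 10, n_per_group, n_per_group)
    print(n_per_group, round(smd, 2), round(t, 2))
# prints:
# 10 0.5 1.12
# 200 0.5 5.0
```

The imbalance is identical in both rows (SMD of 0.5), yet only the larger sample would produce a t-statistic that a conventional test flags as significant.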

Thanks for the comment and links! Looks like citations 7-9 also consider addressing the balance in potential confounders after propensity score matching. I think this is a special case, because matching deliberately alters the balance, and it’s not really clear what hypothesis we’re testing when we do this.

If you undertake a household survey with clusters sampled with probability proportional to their size, then of course it makes sense to infer something about the population based on the description of the sample characteristics.

But sometimes your sample is not representative of the population. Maybe you are undertaking a case-control study, and then it doesn’t make much sense to infer the prevalence of a covariate in the population based on the prevalence in the sample. Or maybe you stratified your sample based on some categorical variable, for example by recruiting an equal number of participants with low, intermediate and high cardiovascular risk.

The p-value is frequently used in Table 1 when you want to test whether the randomization was successful or, in non-randomized studies, whether covariates are equally distributed between the categories of the main exposure variable. This is controversial, because (1) it doesn’t make much sense to test whether the between-group differences are due to chance when you know they are due to chance, because you did the randomization; and (2) tests without adequately sized/powered samples don’t mean much, and if you are considering such a test then maybe you should go ahead and fully adjust your analysis for said covariates.

Very good comments and references about the use of p-values and measures of variability in the context of observational studies. In my opinion, descriptive tables in observational studies have two purposes: 1) to provide a picture of who was included in the study and, therefore, to whom the findings of the study could be extrapolated; and 2) to provide a picture of the exchangeability (comparability) of the exposed (treated) and non-exposed (untreated) groups. I think using confidence intervals for purpose 1 would be fine. For instance, one could report the average age of the participants with its 95% confidence interval. Reporting standard deviations (SD) or standard errors (SE) seems of little use to me. But this is because I have a hard time making sense of SD and SE, unless I use them to calculate a 95% CI. P-values are useless for purpose 1, since the goal is different from testing a null hypothesis. Regarding the second purpose, I do not use p-values. What is more, if I can get away with it, I do not report crude estimates of the effect of the exposure (i.e. crude comparisons between exposed and non-exposed). I don’t report crude or adjusted effects for other exposures. I just report the frequency/mean of the outcome in each group, exposed and non-exposed, with their 95% CI. I shy away from reporting the crude estimates of effect because they are likely biased and, therefore, not amenable to causal inference. Statistical tests and confidence intervals for crude comparisons are not valid. Indeed, they are uninterpretable because they assume systematic sources of error (say confounding) have been eliminated before conducting the test. Also, I think reporting them may mislead some readers. As explained before, in the context of an RCT, p-values for a comparison of the distribution of prognostic factors in the treated and untreated groups are pointless, because we already know that the process that generated any difference between the two groups was random assignment.
When explaining this to my students, I tell them that Table 1 in an RCT aims to alert us about differences that may make the two treatment groups non-exchangeable (non-comparable). This lack of comparability is a systematic error, not a random error. Statistical tests aim to quantify random error, under the assumption of little or no systematic error. Therefore, a statistical test and the corresponding p-value are the wrong medicine for this patient (Table 1).
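The point about randomization can be illustrated with a small simulation, sketched here under my own assumptions: a single simulated baseline covariate, 100 patients per arm, and a large-sample normal approximation in place of an exact t-test. Because assignment is random, the null hypothesis is true by construction, so about 5% of baseline comparisons come out “significant” at the 0.05 level no matter what.

```python
import math
import random
import statistics

def baseline_p_value(treated, control):
    """Two-sided p-value for a difference in means (normal approximation)."""
    standard_error = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    z = (statistics.fmean(treated) - statistics.fmean(control)) / standard_error
    return 2 * (1 - statistics.NormalDist().cdf(abs(z)))

def fraction_significant(n_trials=1000, n_per_arm=100, alpha=0.05, seed=1):
    """Share of simulated randomized trials with a 'significant' baseline test."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_trials):
        # Hypothetical baseline covariate (e.g. age) for the whole cohort.
        cohort = [rng.gauss(60, 10) for _ in range(2 * n_per_arm)]
        rng.shuffle(cohort)  # random assignment to the two arms
        treated, control = cohort[:n_per_arm], cohort[n_per_arm:]
        if baseline_p_value(treated, control) < alpha:
            significant += 1
    return significant / n_trials

fraction = fraction_significant()
print(f"Fraction of trials with p < 0.05 at baseline: {fraction:.3f}")
```

The printed fraction should land close to 0.05, which tells us only that the randomization was random, something we already knew.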

Thanks for your comment. I like the way you put the two purposes. However, I want to push back a tiny bit regarding the quote above. The p-values and tests themselves are valid, in the sense that they quantify the evidence against the null hypothesis, e.g., that the means are identical among the strata, and have the correct type-I error under repeated sampling, respectively. In other words, they are valid assessments of association.

I think what you mean here is that these tests are not valid for drawing causal inferences. This is true, of course, in non-randomized studies due to the possibility of confounding. However, it seems that purpose 2 is not an assessment of a causal association, but rather an assessment of any association, where any evidence of the latter would render the groups non-exchangeable.

I argue that statistical tests and p-values are not valid unless systematic errors have been minimized. Only when systematic errors have been minimized can one say that statistical tests and p-values provide a valid assessment of the role of chance in the magnitude of the observed (tested) difference. This pertains to both randomized and non-randomized studies. Of course, you can calculate statistical tests and p-values even if systematic errors (biases) are present, but in such cases they reflect both random and systematic errors and, therefore, are not interpretable. Suppose you conduct a randomized trial in a large sample of patients, so that it is reasonable to expect the treatment groups will be exchangeable. Unfortunately, your measurement of the outcome has low sensitivity and low specificity in both treated and control patients. You conduct a statistical test for the effect of the exposure/treatment and find that the p-value is non-significant. There are two possible explanations of this finding: 1) the treatment under evaluation has no meaningful effect on the outcome; 2) the random error is so large that one cannot reject the null hypothesis. To be able to infer that the treatment actually has no effect on the outcome, you have to assume that information error in the measurement of the outcome was minimal. IMO, if you cannot make this assumption, the p-value is meaningless.
On the other hand, statistical tests are not useful to evaluate whether groups are exchangeable or not. Lack of exchangeability is a systematic error. Statistical tests are the “wrong medicine” in this context, because they evaluate random error. A small but statistically significant imbalance in a prognostic factor will likely result in an insubstantial lack of exchangeability (i.e. small confounding bias). In contrast, a large but statistically non-significant imbalance in a prognostic factor will likely result in substantial lack of exchangeability (i.e. large confounding bias). This happens because the p-value reflects not only the difference in the level of the prognostic factor, but also the sample size of the study. This applies equally to randomized and non-randomized studies of an exposure/treatment. If there is concern about lack of exchangeability, you need to address the issue by comparing crude and adjusted estimates of the effect of the treatment/exposure. Of course, a judgement about exchangeability (i.e. about the relevance of the confounding bias) must be based on subject matter, since statistical tests are not appropriate for comparing crude and adjusted estimates.
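As a sketch of what comparing crude and adjusted estimates can look like, here is a simulated non-randomized study with one binary prognostic factor. Everything in it is an assumption made for illustration: the true effect of 1.0, the strength of the confounder, and the use of simple stratification with stratum-size weights as the adjustment method.

```python
import random
import statistics

def simulate_study(n=4000, true_effect=1.0, seed=2):
    """Simulate (confounder, exposure, outcome) rows for a non-randomized study."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        confounder = 1 if rng.random() < 0.5 else 0
        # Exposure is more likely when the prognostic factor is present.
        exposure = 1 if rng.random() < (0.7 if confounder else 0.3) else 0
        outcome = true_effect * exposure + 5.0 * confounder + rng.gauss(0, 1)
        rows.append((confounder, exposure, outcome))
    return rows

def mean_difference(rows):
    """Mean outcome in the exposed minus mean outcome in the unexposed."""
    exposed = [y for _, x, y in rows if x == 1]
    unexposed = [y for _, x, y in rows if x == 0]
    return statistics.fmean(exposed) - statistics.fmean(unexposed)

def adjusted_difference(rows):
    """Average of stratum-specific differences, weighted by stratum size."""
    estimate = 0.0
    for level in (0, 1):
        stratum = [row for row in rows if row[0] == level]
        estimate += len(stratum) / len(rows) * mean_difference(stratum)
    return estimate

rows = simulate_study()
crude = mean_difference(rows)
adjusted = adjusted_difference(rows)
print(f"crude: {crude:.2f}, adjusted: {adjusted:.2f}, true effect: 1.00")
```

In this setup the crude estimate should come out near 3 and the adjusted one near 1. Whether a gap of that size matters is a subject-matter judgement rather than something a significance test can settle.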