Should adverse effects be adjusted for multiple comparisons?

RCTs are powered for primary end points and adjustments for multiple comparisons are usually used for secondary end points but not for adverse effects. However, underpowered studies are prone to false positive or false negative findings. Should adverse effects be adjusted for multiple comparisons?

For example, the CONFIRM trial found that terlipressin was effective for Type 1 Hepatorenal Syndrome. Death within 90 days (a secondary end point) was not increased after adjustment for multiple comparisons. But terlipressin increased death due to respiratory failure within 90 days (without any adjustment for multiple comparisons).

Does this result imply that deaths due to causes other than respiratory failure were increased?
Do we need another adequately powered RCT to prove that terlipressin indeed increased death due to respiratory failure in 90 days in patients with Type 1 Hepatorenal Syndrome?


This result seems like a pretty big red flag (?) Even though the trial met its primary endpoint, overall mortality was also 6% higher in the terlipressin arm after 90 days of follow-up. I’m not sure any adverse mortality signal should ever be ignored, regardless of whether a primary endpoint was met.


I collected a number of references to multiple comparisons in this thread. This isn’t an easy issue. I’d start with some of Sander Greenland’s recent writings to clarify the issues, which inevitably involve subjective considerations of loss and cost of error.

Peter Westfall (one of the leading experts in this area) recommends reporting both adjusted and nominal p values. Adjusted p values are relevant to the local context for which the study is conducted. Nominal p values can be used by readers who have a different reference set and wish to use the information in the report for a separate analysis.

If I had to apply the recommendations in the literature to this issue, this is what seems reasonable:

  1. To be even more charitable to those who have concerns about adverse effects, a one-sided test (of a sufficiently surprising adverse event under treatment) seems justifiable here.

  2. Based on an understanding of the mechanistic or chemical action of the intervention, create an ordered list of plausible adverse events, and perform unadjusted analyses (i.e. the Cook and Farewell approach advocated by @f2harrell). The \alpha level should be higher than the traditional cut-off for this context due to lower power. (Simply doing a one-sided test at 0.05 would satisfy this, but even that level might be too low.)

  3. For unanticipated adverse effects, perform adjusted analyses (but not naive Bonferroni); use a higher than normal family-wise error rate (FWER) \alpha, considering the low power of the study under consideration. This treats the test for any unspecified adverse effect as a family (i.e. as a disjunction test, or as Frank says “there exists” an effect in this direction). Some disagree and argue that adjustment isn’t needed even in this case:

Rubin, M. (2022), That’s not a two-sided test! It’s two one-sided tests!. Significance, 19: 50-53.

I think treating unspecified adverse effects as a family, using a higher \alpha level as the FWER, and applying the most powerful adjustments to appropriate one-sided tests balances power vs error and maximizes the available information for the question.
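To make that concrete, here is a minimal sketch of what the adjusted analysis could look like. Everything in it is an assumption for illustration: the event names and one-sided p values are made up (they are not from CONFIRM), the FWER \alpha of 0.10 is just an example of a "higher than normal" level, and Holm's step-down procedure is only one stand-in for a "more powerful than Bonferroni" adjustment. It prints nominal and adjusted p values side by side, in the spirit of Westfall's recommendation.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p values, in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted p values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def bonferroni_adjust(p_values):
    """Naive Bonferroni adjustment, shown only for comparison."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


# Hypothetical one-sided nominal p values for unanticipated adverse events.
nominal = {"event_a": 0.004, "event_b": 0.03, "event_c": 0.20, "event_d": 0.45}
fwer_alpha = 0.10  # assumed, deliberately higher than the usual 0.05

names = list(nominal)
p_nominal = [nominal[name] for name in names]
p_holm = holm_adjust(p_nominal)
p_bonf = bonferroni_adjust(p_nominal)

flagged_holm = [n for n, p in zip(names, p_holm) if p <= fwer_alpha]
flagged_bonf = [n for n, p in zip(names, p_bonf) if p <= fwer_alpha]

for n, p, ph, pb in zip(names, p_nominal, p_holm, p_bonf):
    print(f"{n}: nominal={p:.3f} holm={ph:.3f} bonferroni={pb:.3f}")
print("flagged (Holm):", flagged_holm)
print("flagged (Bonferroni):", flagged_bonf)
```

With these made-up numbers, Holm flags one more event than naive Bonferroni at the same FWER, which is the kind of power gain I have in mind when I say "not naive Bonferroni".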

I’d be interested in objections to that approach.

If one wants to use the “per comparison error” approach described by Rubin, I’d think they would be duty bound to specify it before seeing the data, as I described in point 2. Otherwise, this is a case where \alpha increases faster than 1-\beta, reducing the information rate, possibly to 0. That can’t be correct.
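A rough sketch of the \alpha side of that argument, under simplifying assumptions that are mine (independent tests, every null hypothesis true, the same per-comparison \alpha for each test): the chance of at least one false positive among m unadjusted comparisons is 1 - (1 - \alpha)^m. It says nothing about 1-\beta, which depends on the effect sizes and sample size.

```python
def familywise_error(alpha, m):
    """P(at least one false positive) across m independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** m


per_comparison_alpha = 0.05  # assumed per-comparison level
for m in (1, 5, 10, 20):
    fwer = familywise_error(per_comparison_alpha, m)
    print(f"m={m:2d} unadjusted comparisons -> FWER = {fwer:.3f}")
```

Real adverse event counts are correlated, so the true inflation will differ from this, but the direction is the point: the more unspecified events one is allowed to test post hoc without adjustment, the less a single nominally significant result tells you.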
