CREOLE Trial: Interpretation of results doesn't line up with statistical plan?

Hey all, this is my first post on this board. I was referred by Dr. Brian Byrd from Twitter. I would really appreciate any thoughts on how to interpret this recent paper:

It seems to me that the authors’ conclusions don’t line up with their statistical analysis plan. They compared three groups and applied a Bonferroni adjustment. They state the following in their statistical analysis section: “Since the P values are not adjusted for multiple comparisons among the three groups, the values should be interpreted in the context of the planned P value threshold of 0.017 (0.05÷3) after Bonferroni adjustment.” However, their conclusion doesn’t seem to line up with their stated P value threshold.

Am I missing something?


Hi Stan. I picked up on similar questions about the paper after you referred me to it. I will be interested to see what others think!

I haven’t yet looked behind the paywall, but can you answer an unrelated question? The authors seem to fall into the trap of analyzing change from baseline instead of covariate-adjusting the raw follow-up blood pressure. For (only) an ordinary linear model you can rescue such an analysis by covariate adjusting the change from baseline for the baseline, i.e., making baseline appear twice in the model. Did the authors do that?
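To illustrate the point about making baseline appear twice, here is a minimal sketch on simulated data (not the trial's data). In an ordinary least-squares fit, regressing change from baseline on treatment plus baseline gives the same treatment coefficient as regressing the raw follow-up value on treatment plus baseline; only the baseline slope shifts by 1. The data-generating numbers and the helper `ols` are assumptions made up for the illustration.

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Simulated two-arm data; all numbers here are hypothetical.
random.seed(1)
n = 300
treat = [i % 2 for i in range(n)]
baseline = [random.gauss(140, 10) for _ in range(n)]
followup = [60 + 0.5 * b - 3.0 * t + random.gauss(0, 9)
            for b, t in zip(baseline, treat)]
change = [f - b for f, b in zip(followup, baseline)]

# Same design matrix for both models: intercept, treatment, baseline.
X = [[1.0, float(t), b] for t, b in zip(treat, baseline)]
b_follow = ols(X, followup)  # follow-up ~ treatment + baseline
b_change = ols(X, change)    # change   ~ treatment + baseline

print("treatment effect, follow-up model:", b_follow[1])
print("treatment effect, change model:   ", b_change[1])
print("difference in baseline slopes:    ", b_follow[2] - b_change[2])
```

The two treatment estimates agree up to floating-point error. This only demonstrates the ordinary linear model case mentioned above; it says nothing about other model types.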

Related to your question, I’m one of those statisticians who don’t like multiplicity adjustment for answering pre-specified pre-prioritized separate questions. My favorite paper on this is Cook and Farewell.


Frank, yes, they did adjust for baseline BP as a covariate in their change-from-baseline-BP model. I’ve marked what I see as discussion-worthy sections with bold-face text.


The sample size of 702 patients was based on the detection of a minimal clinically important difference of 3.0 mm Hg in the ambulatory systolic blood pressure on the assumption of a standard deviation of 9.0 mm Hg.21 To allow for a 10% dropout rate, we required the enrollment of 210 patients who could be evaluated in each group so that the trial would have a power of at least 84% at a significance level of 0.05, using a conservative adjustment for the three comparisons. Since the P values are not adjusted for multiple comparisons among the three groups, the values should be interpreted in the context of the planned P value threshold of 0.017 (0.05÷3) after Bonferroni adjustment.

For the primary end point, we used a linear mixed-effects model to estimate the mean difference in the ambulatory systolic blood pressure between the groups. We used the restricted maximum-likelihood method to fit the model, which included adjustment for the baseline ambulatory systolic blood pressure, age (<55 or ≥55 years), and trial site as a random effect.

Two sensitivity analyses were performed. First, in order to increase the precision of the estimate of the treatment effect, we adjusted the model for clinically important and other variables: sex, the presence of diabetes or dyslipidemia (total cholesterol, >5.2 mmol per liter [200 mg per deciliter] or the receipt of statins), body-mass index (BMI), pulse rate, and duration of hypertension. Second, we performed multiple-imputation analyses using chained equations for patients who had a missing primary end-point value. We generated five imputed-data sets with a maximum number of 1000 iterations, with linear imputation for continuous variables and logistic or multinomial regression for categorical variables. Variables that were included in the imputation model were treatment group, ambulatory systolic and diastolic blood pressures, office blood pressure measurements, age, sex, trial site, BMI, presence of diabetes or dyslipidemia, duration of hypertension, and pulse rate. We used the same framework to analyze ambulatory systolic and diastolic blood pressures.

The mean between-group difference in office blood pressure was estimated for each time point with the use of a linear mixed-effects model that included the patient as a random effect, along with age, baseline office blood pressure, follow-up duration, and interaction between time and treatment. We used logistic-regression analysis to compare the between-group response rate in office blood pressure after adjustment for age and site. All estimates of treatment effect were calculated with a 95% confidence interval. We used P values interpreted as a continuous measure to evaluate the strength of the evidence against the null hypothesis of no between-group difference.

We did not plan for multiple-comparison adjustments for secondary outcomes, so the results are reported with point estimates and 95% confidence intervals only. The widths of the confidence intervals have not been adjusted for multiple comparisons, which should be taken into consideration when interpreting these results.

We performed an efficacy analysis using the intention-to-treat principle and included all the patients for whom primary end-point data were available. Adverse events were assessed in all the patients until the end of follow-up and were tabulated according to trial group. All analyses were performed with Stata software, version 15 (StataCorp), and SPSS software, version 20 (IBM). The statistical methods are detailed in the Supplementary Appendix, available at"

An aside: the Supplementary Appendix states who did the statistical analysis & reports some additional results, but I could not find any other details of the statistical methods in the appendix.
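As a rough sanity check on the quoted sample-size statement, here is a sketch that computes power for one pairwise comparison using a two-sample normal approximation. That approximation is my assumption; the paper does not say which formula was used. The inputs (3.0 mm Hg difference, SD 9.0 mm Hg, 210 evaluable patients per group, 0.05÷3) come from the quoted text.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(delta, sd, n_per_group, alpha):
    """Approximate power of a two-sided two-sample z-test.

    Ignores the (negligible) probability of rejecting in the wrong direction.
    """
    z = NormalDist()
    se = sd * sqrt(2 / n_per_group)      # SE of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)    # two-sided critical value
    return z.cdf(delta / se - z_crit)

power_bonferroni = power_two_sample(3.0, 9.0, 210, 0.05 / 3)
power_unadjusted = power_two_sample(3.0, 9.0, 210, 0.05)

print(f"power at alpha = 0.05/3: {power_bonferroni:.3f}")
print(f"power at alpha = 0.05:   {power_unadjusted:.3f}")
```

Under these assumptions the power at the Bonferroni threshold comes out at about 0.85, which is in line with the "at least 84%" in the paper and suggests the trial really was sized for the 0.017 threshold rather than 0.05.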


Methods: “Patients were… assigned in a 1:1:1 ratio to one of the three treatment groups …” / “The primary end point was the mean change in the 24-hour ambulatory systolic blood pressure…” / “Since the P values are not adjusted for multiple comparisons among the three groups, the values should be interpreted in the context of the planned P value threshold of 0.017 (0.05÷3) after Bonferroni adjustment.”

Results: “the between-group difference in the change from baseline was … P=0.03 for amlodipine plus hydrochlorothiazide and … P=0.04 for amlodipine plus perindopril … The difference between the group receiving amlodipine plus hydrochlorothiazide and the group receiving amlodipine plus perindopril was … P=0.92”

Discussion: “In conclusion, … we found that among three commonly recommended drug combinations, amlodipine (a long-acting dihydropyridine calcium-channel blocker) combined with either an ACE inhibitor (perindopril) or a thiazide diuretic (hydrochlorothiazide) was superior to perindopril plus hydrochlorothiazide in lowering both ambulatory and office blood pressures… Since the P values for the three comparisons of the primary end point were not adjusted for multiple comparisons, the conclusions must be interpreted with caution. …”

Thus they seem to encourage the reader to interpret the results against a more conservative threshold for statistical significance, but are reluctant to do so themselves. Maybe we have now entered a grey area where we haven’t quite discarded the old interpretation of statistical significance and are tentatively moving towards what is advocated here: Retire statistical significance
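To make the mismatch concrete, here is a small sketch that holds the three reported P values against both the planned Bonferroni threshold and the conventional 0.05. The P values are the ones quoted from the Results; the dictionary labels and variable names are just mine.

```python
ALPHA = 0.05
N_COMPARISONS = 3

# P values as quoted from the Results section.
reported_p = {
    "amlodipine plus hydrochlorothiazide": 0.03,
    "amlodipine plus perindopril": 0.04,
    "amlodipine plus hydrochlorothiazide vs amlodipine plus perindopril": 0.92,
}

threshold = ALPHA / N_COMPARISONS  # planned threshold, about 0.017

for label, p in reported_p.items():
    # Equivalent view: multiply P by the number of comparisons, cap at 1.
    adjusted = min(1.0, p * N_COMPARISONS)
    print(f"{label}: P={p:.2f}, Bonferroni-adjusted P={adjusted:.2f}, "
          f"P < {threshold:.3f}: {p < threshold}, P < {ALPHA}: {p < ALPHA}")

meets_bonferroni = [k for k, p in reported_p.items() if p < threshold]
meets_nominal = [k for k, p in reported_p.items() if p < ALPHA]
```

None of the three comparisons clears the planned 0.017 threshold, while two clear 0.05, so the "superior" wording in the conclusion only follows if one reads the results at the unadjusted level.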

Edit: I was re-reading an old John Whitehead paper: “A phase III clinical trial, executed according to plan, and analysed to answer the original questions posed, is one of the most powerful tools of clinical research. We relax its rigour at our peril.” The case for frequentism in CTs - food for thought

That’s an excellent summary. The whole idea of multiplicity adjustment here is quite silly, but the authors got convinced by some statistician that it should be done, so they should have stuck to it. See the Cook & Farewell paper. If you are a frequentist, multiplicity adjustment should be reserved for such hypotheses as “there exists a subgroup for which the treatment has an effect, and we tested 15 subgroups” (even though that would be the worst possible way to look for differential treatment effect).

I don’t see why a logistic model was mentioned, and note that the authors’ model did not adjust for age. It adjusted only for an extremely coarse version of age.

Interesting thought. It seemed to me that they didn’t move toward discarding the old interpretation of P values, but rather chose to interpret the results using a cutoff of P < 0.05 as opposed to the interpretation recommended by their statistical analysis plan (flawed as it was).

Yes, that seems right, i.e., if one of the P values had been 0.06, maybe they’d have felt compelled to employ the “trend towards significance” phrase. That phrase, common as it is, shows a contempt for statistical criteria and an impulse to loosen things up. Still, I’m on board with Ioannidis’ claim: “Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important.” Although Sander Greenland responded to that somewhere…