Help required for sample size calculation in a 4 arm study

I am trying to design a trial which will have one experimental arm and 3 standard therapy arms. Considering a baseline incidence of delirium of 60%, with a reduction to approximately 40% in the experimental arm, what will be the sample size? Do we need to apply a Bonferroni correction to the normally used p value of < 0.05? How will this affect the sample size?

All standard online calculators give the option of only 2 arms for calculating sample size.

Can someone help, please?


It is not uncommon to take the sample size for a two-arm study, presuming 1:1 randomization and your power/alpha criteria, and then scale it up as appropriate for more than 2 arms.

So, for example, if your initial calculation suggests 100 per arm for a two arm study (200 total), you need 400 total for a four-arm study (100 * 4).

As you note, the key then becomes defining an appropriate threshold to use for alpha for each comparison, given the presumably planned multiple comparisons and any adjustments to be applied.

If, for example, you plan to do 3 comparisons, and want to use Bonferroni, which is very conservative, you would use 0.01667 (0.05 / 3) for the alpha threshold in the above power/sample size calculations, instead of 0.05.
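As a rough illustration of how the adjusted alpha feeds into the calculation, here is a minimal Python sketch using the standard normal-approximation formula for comparing two proportions. The 80% power, the two-sided test, and the function name `n_per_arm` are assumptions made for the example, not something specified in the question. Other formulas (pooled variance, continuity correction, exact methods) and other power targets will give somewhat different numbers, so results may not match figures quoted elsewhere in this thread.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided z-test of two
    proportions with 1:1 allocation (unpooled-variance approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# 60% vs 40% delirium incidence, as in the question (power is assumed)
n_unadjusted = n_per_arm(0.60, 0.40, alpha=0.05)
n_bonferroni = n_per_arm(0.60, 0.40, alpha=0.05 / 3)  # 3 planned comparisons

print("Per arm, alpha = 0.05:    ", n_unadjusted)
print("Per arm, alpha = 0.05 / 3:", n_bonferroni)
print("Total for 4 arms, Bonferroni-adjusted:", 4 * n_bonferroni)
```

The only change between the two calls is the alpha passed in; the per-arm number is then multiplied by 4 for the total, as described above.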

The question is, which pre-defined comparisons are you interested in at the end of the study, and might you do an initial omnibus test comparing all four arms, and then only do specific inter-arm comparisons if the omnibus test is significant? That kind of planned, tiered approach, for some, will obviate the need for subsequent multiplicity adjustments, and there is no clear consensus on such adjustments. It comes down to how you want to control for possible type I errors and there are numerous ways to do so.

Will you do all possible pairwise comparisons of the 4 arms (6 comparisons)?

Will you only compare the experimental arm to the 3 standard of care arms as paired comparisons (3 comparisons)?
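To make the two scenarios above concrete, this small Python sketch enumerates the comparisons and shows the per-comparison Bonferroni threshold for each. The arm labels are placeholders, and 0.05 is assumed as the overall alpha.

```python
from itertools import combinations

# Placeholder arm labels for illustration
arms = ["Experimental", "SOC 1", "SOC 2", "SOC 3"]
overall_alpha = 0.05  # assumed family-wise alpha

# Every possible pairwise comparison among the 4 arms
all_pairs = list(combinations(arms, 2))

# Only the comparisons that involve the experimental arm
vs_experimental = [pair for pair in all_pairs if "Experimental" in pair]

for label, comparisons in [("All pairwise", all_pairs),
                           ("Experimental vs each SOC", vs_experimental)]:
    per_test_alpha = overall_alpha / len(comparisons)
    print(f"{label}: {len(comparisons)} comparisons, "
          f"Bonferroni alpha per comparison = {per_test_alpha:.5f}")
```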

There is a recent article here:

that you might find of interest, providing a general framework for designing multi-arm trials, with various multiplicity controls, elucidating some of the pros and cons of each. The authors have published an R/Shiny web application here:

that you can use for continuous (normal) and binomial outcome variables, based upon their paper.

You may find those resources helpful. There is a great deal of literature on multiple comparisons, with Hsu’s book:

perhaps being one of the more popular, and which I have on my shelf.


Thanks, that is really helpful. Another question (it will probably sound stupid to you): of the 3 standard arms, what type of test is needed if you want to decide which one is best, when the treatment arm option is NOT available to the user?
What I mean is: suppose the treatment arm proves better than all 3 standard treatments (that is, 3 paired comparisons). Is there a way to find out which of the 3 is better than the other 2?
I will go through the literature you suggested, and I hope it is OK to have more questions. Sometimes knowing a little more confuses me more!


It sounds like you may have multiple hypotheses to test: not only whether the experimental treatment is better than the 3 standard of care (SOC) arms, but perhaps also whether one of the 3 SOC arms is better than the others, at least for the profile of patients in the study?

You indicated in your initial post that you expect a reduction in delirium from 60% at baseline to 40% post treatment in the experimental arm. What reduction, if any, from the 60% do you expect in the 3 SOC arms?

In order to formally compare the 3 SOC arms against each other, you would need to define a minimum clinical difference that you wish to be able to detect and test for, in order to have a large enough sample size, and therefore power, to formally test that hypothesis at a pre-defined alpha.

If the difference across the 3 SOC arms is relatively small clinically, as compared to the difference in the experimental arm, that may not be something that you can formally test from a practical standpoint, because it will materially inflate your required sample size for the study, increasing costs, risks, and other factors. Thus, you may only be able to assess those differences on an exploratory basis.
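To give a feel for how quickly the sample size grows as the difference to be detected shrinks, here is a minimal Python sketch using the usual normal-approximation formula for two proportions. The 55% and 50% rates are purely hypothetical stand-ins for SOC arms that differ only slightly from 60%, and the 80% power and two-sided alpha of 0.05 are also assumptions for the example.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-arm n for a two-sided z-test of two proportions (1:1)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


reference_rate = 0.60  # incidence from the original question

# Hypothetical comparator rates: 0.40 is the experimental-arm assumption,
# 0.50 and 0.55 are made-up examples of small differences between SOC arms.
sizes = {rate: n_per_arm(reference_rate, rate) for rate in (0.40, 0.50, 0.55)}

for rate, n in sizes.items():
    diff = round((reference_rate - rate) * 100)
    print(f"{reference_rate:.0%} vs {rate:.0%} "
          f"(difference of {diff} points): about {n} per arm")
```

Because the required n scales roughly with one over the squared difference, halving the difference roughly quadruples the sample size per arm.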

You should have access to historical data that can give you some insights into whether or not that comparison is reasonable from a study design perspective, and what clinical parameters would be relevant in deciding how to treat a patient in the absence of the experimental treatment, presuming that the experimental treatment is ultimately better. That information should be reasonably well established for the SOC treatments.

A first step for you is to explicitly define, a priori, what hypotheses you wish to formally test, and what the explicit assumptions are for the differences in outcomes across the 4 arms. Once you can do that, you can better begin to assess the study design, and how to go about determining what sample size(s) you may need.

You may also have to determine a primary hypothesis, for which you formally power the study, and perhaps one or more secondary hypotheses, for which the study is not formally powered.

If you have access to local statistical expertise, that may be a better approach for you, rather than going back and forth here, as you can work interactively and more productively through the details of the study design process.


If the difference across the 3 SOC arms is relatively small clinically, as compared to the difference in the experimental arm, that may not be something that you can formally test from a practical standpoint, because it will materially inflate your required sample size for the study, increasing costs, risks, and other factors.
This is perfect, and yes, this was exactly what I was looking for! We do have historical data suggesting how the SOC arms are likely to rank (best to worst): 1, 2, 3.
I am very much obliged for the help and feel I can now proceed with my sample size calculation and then the study.
warm regards


In addition to Marc’s excellent advice, consider using simultaneous confidence intervals which adjust for all possible comparisons and are not as conservative as Bonferroni and most other p-value adjustment methods.
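One way to see why simultaneous intervals can be less conservative is to compare critical values. The Python sketch below is only an illustration under strong simplifying assumptions (large samples, equal n per arm, equal variances, so that the 6 pairwise z-statistics are built from 4 independent standard normals); it is not necessarily the method the poster has in mind. It simulates the distribution of the largest absolute pairwise statistic and takes its 95th percentile as a simultaneous critical value, then compares that with the Bonferroni critical value for 6 comparisons. The function name and simulation settings are made up for the example.

```python
import random
from itertools import combinations
from statistics import NormalDist


def simultaneous_critical_value(n_arms=4, alpha=0.05, n_sims=100_000, seed=1):
    """Monte Carlo estimate of the critical value for the maximum absolute
    pairwise z-statistic (equal n and equal variances assumed)."""
    rng = random.Random(seed)
    pairs = list(combinations(range(n_arms), 2))
    max_stats = []
    for _ in range(n_sims):
        # Standardized arm means under the null hypothesis of no differences
        z = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
        # Each pairwise difference has variance 2, so divide by sqrt(2)
        max_stats.append(max(abs(z[i] - z[j]) for i, j in pairs) / 2 ** 0.5)
    max_stats.sort()
    return max_stats[int((1 - alpha) * n_sims)]


n_comparisons = 6  # all pairwise comparisons among 4 arms
bonferroni_crit = NormalDist().inv_cdf(1 - 0.05 / (2 * n_comparisons))
simultaneous_crit = simultaneous_critical_value()

print(f"Bonferroni critical value:   {bonferroni_crit:.3f}")
print(f"Simultaneous critical value: {simultaneous_crit:.3f}")
```

A smaller critical value means narrower intervals for the same overall coverage, which is where the gain over Bonferroni comes from. The simulation accounts for the correlation between comparisons that share an arm, which Bonferroni ignores.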


Thank you very much for your wonderful help.
warm regards

Another naive question. When I calculate the sample size for 2 arms (for applying to 4 arms), do I use the usual p value (< 0.05) or the corrected p value for 4 arms (i.e. < 0.0125)?
With p value < 0.05, the sample size is 67 per arm (therefore 268 total), whereas for p value < 0.0125 it is 95 per arm (i.e. 380 total), which sounds correct for this p value.
Is my second calculation correct?
Harell, I am not very good at statistics, so you lost me at the first word.
What do I need to do differently if I do not use the Bonferroni correction? In other words, how do I calculate the sample size using “simultaneous confidence intervals which adjust for all possible comparisons”?
Please forgive my ignorance and the stupid questions.
warm regards

You need to do a lot of background reading, and don’t emphasize p-values so much. Start with
