Sample size calculation - ANCOVA

In our group we are planning a small RCT. We have data from a feasibility study, and the setting is the following:

  • three arms: Placebo, Therapy1, Therapy2
  • two timepoints of measurement of our parameter of interest: Baseline, Follow-up
  • possible confounders are measured, let’s say blood pressure and age

We want to compare the effect of Therapy1 to Placebo and Therapy2 to Placebo.
Our hypothesis is that there is no meaningful difference between Therapy1 and Therapy2, but to show “no difference” the sample size would have to be much higher, I think.

How do I compute the needed sample size for these comparisons?
I thought of using library(Superpower) and power_oneway_ancova(), but I don’t know how to specify the r2 parameter (see the Superpower ANCOVA vignette and “Power Analysis and Sample Size Planning in ANCOVA Designs”, Psychometrika).

The other arguments are n_cov = 3 (age, blood pressure, FU measurement), sd = pooled_sd, mu = means, alpha_level = 0.05 (or 0.025 because of multiple testing?), beta_level = 0.2.

Is r2 just the squared correlation between baseline and follow-up, without adjustment for age and bp?
Or do I have to run lm(baseline - FU ~ age + bp + group, data = data) or lm(FU ~ age + bp + baseline, data = data)?
I also found an ANCOVA-adjusted effect size for Cohen’s d, d_adj = d / sqrt(1 - R2), where R2 comes from the lm, but I don’t know why?

It would be super helpful if someone could point me in the right direction for computing the sample size.
I made some sample data:

data <- data.frame(
  group = c(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2),
  age = c(40, 50, 45, 45, 48, 52, 40, 44, 38, 50, 46, 47),
  bp = c(120, 110, 115, 140, 120, 140, 130, 120, 120, 130, 140, 150),
  baseline = c(6, 8, 5, 6, 7, 5, 9, 7, 8, 8, 5, 6),
  followup = c(6, 7, 5, 5, 5, 5, 6, 6, 7, 5, 5, 5)
)

mean_baseline = c(6.25, 7, 6.75), mean_change (baseline - FU) = c(0.5, 1.5, 1.25)

PS: The statistical consultancy at our clinic, who are responsible for trials, just told me to run a t-test and use the estimated sample size, because it’s only an estimate anyway and the adjustments are irrelevant. This doesn’t seem right to me, so I decided to get into it myself…

Two additional questions came to mind:

  • Am I right that if a covariate doesn’t change over time, it does not control any additional variance and does not need to be used in the sample-size estimation?
  • Which one of those is the “right”/most efficient model to use?
    1. lm(baseline - FU ~ age + bp + group, data = data)
    2. lm(baseline - FU ~ age + bp + group + baseline, data = data)
    3. lm(FU ~ age + bp + group + baseline, data = data)
      or even 4. lm(value ~ timepoint*group + age + bp, data = data), where value stacks baseline & FU in long format

My current approach is the following:

  • compute means & pooled sd
  • get R2 from lm(baseline - FU ~ age + bp + baseline, data = data)
  • use power_oneway_ancova

But I am completely unsure if the R2 is the right one :worried:

Here is the full code, if somebody wants to look at it:

data <- data.frame(
  group = c(0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2),
  age = c(40, 50, 45, 45, 48, 52, 40, 44, 38, 50, 46, 47),
  bp = c(120, 110, 115, 140, 120, 140, 130, 120, 120, 130, 140, 150),
  baseline = c(6, 8, 5, 6, 7, 5, 9, 7, 8, 8, 5, 6),
  FU = c(6, 7, 5, 5, 5, 5, 6, 6, 7, 5, 5, 5)
)

data$difference <- data$FU - data$baseline

mean_control <- mean(data$difference[data$group == 0])
mean_treat1 <- mean(data$difference[data$group == 1])
mean_treat2 <- mean(data$difference[data$group == 2])
sd_control <- sd(data$difference[data$group == 0])
sd_treat1 <- sd(data$difference[data$group == 1])
sd_treat2 <- sd(data$difference[data$group == 2])

n <- 4

sd_pooled <- sqrt((n - 1) * (sd_control^2 + sd_treat1^2 + sd_treat2^2) / (3*n - 3))
sd_pooled

# simple correlation:
# R2 <- cor(data$FU, data$baseline)^2

# FU model with baseline as covariate
# R2 <- summary(lm(FU ~ age + bp + baseline, data = data))$r.squared

# Difference model with baseline as covariate
R2 <- summary(lm(baseline - FU ~ age + bp + baseline, data = data))$r.squared

library(Superpower)

power_oneway_ancova(
  mu = c(mean_control, mean_treat1, mean_treat2),
  n_cov = 3,
  sd = sd_pooled,
  r2 = R2,
  alpha_level = .05,
  beta_level = .2,
  round_up = TRUE,
  type = "exact"
)

Hi Julia,

A few thoughts:

  • My understanding is that it is better to include the baseline measurement as a covariate rather than using a difference score as the outcome.
  • I noticed that in your sample data, there is not much variation in the outcome. Maybe this is just how you set up the example, but I wonder if your outcome is ordinal and should be assessed with ordinal methods.
  • I believe that the function you’re using requires R^2 from a covariate-only model: R2 <- summary(lm(FU ~ age + bp + baseline, data = data))$r.squared
  • R^2 does matter and should not be ignored. The more variation in the outcome that is associated with the covariates, the greater your power.

I was curious, so I ran some simulations to better understand the relationship between covariate-associated R^2 and power. Below, I simulated values of the outcome, y, based on prespecified group means and a sampled Baseline measurement (I included only one covariate for simplicity). For each simulation, I randomly varied the association between y and Baseline and determined the relationship between power and covariate-only R^2. As you can see, a higher correlation between Baseline and y leads to a more powerful analysis.

library(tidyverse)
library(Hmisc)
library(magrittr)
N <- 15 # Sample size per group
# Three group means. Group 0 is placebo, groups 1 and 2 are different treatments
mean0 <- 6; mean1 <- 8; mean2 <- 8 # We expect treatments to be equally effective
grand_mean <- mean(c(6, 8, 8)) # Total mean
sd <- 2 # Standard deviation of outcome

# Create dataframe with groups
d <- tibble(group = c(rep(0, N), rep(1, N), rep(2, N)))

# Add means
d %<>% mutate(
  mean = case_when(
    group == 0 ~ mean0,
    group == 1 ~ mean1,
    group == 2 ~ mean2
  ))

# Number of repeated simulations
reps <- 10000

# Store simulation outcomes
out <- tibble(
  # R2 with just covariates (only baseline in this example)
  r2_covariates = rep(NA_real_, reps),
  # Total R2 in full model with treatment and covariate
  r2_total = rep(NA_real_, reps),
  # P value from full model
  p  = rep(NA_real_, reps)
)
for (i in 1:reps) { # For each simulation...
  d1 <- d %>%  
    # y is drawn from a normal distribution with mean based on group
    mutate(y = rnorm(n(), mean, !!sd)) %>% 
    rowwise() %>% 
    # Baseline is taken from normal distribution based on total mean/sd
    mutate(baseline = mean(c(
      # The correlation between baseline and y is controlled by how many
      # draws from the grand mean are averaged with the single draw that
      # uses y as its mean: fewer grand-mean draws give a higher correlation
      rnorm(as.integer(runif(1, 1, 30)), !!grand_mean, !!sd),
      rnorm(1, y, !!sd)
    )))
  
  # Update output
  out$r2_covariates[[i]] <- summary(lm(data = d1, y ~ baseline))$r.squared
  out$r2_total[[i]] <- summary(lm(data = d1, y ~ group + baseline))$r.squared
  out$p[[i]] <- anova(lm(data = d1, y ~ group + baseline))['group', 'Pr(>F)']
}
options(scipen = 999)
# Cut R2 (from covariate only model) into bins
out %>% 
  mutate(cuts = cut2(r2_covariates, cuts = seq(0, 1, by = .05))) %>% 
  group_by(cuts) %>% 
  # Find number of models in each bin and power (with alpha = .05)
  summarise(
    mean_r2_covariates = mean(r2_covariates),
    number_of_models = length(r2_covariates),
    power = mean(p < .05)
    )

Here’s the output showing an increase in power with greater R^2 from the covariate.

# A tibble: 10 × 4
   cuts        mean_r2_covariates number_of_models power
   <fct>                    <dbl>            <int> <dbl>
 1 [0.00,0.05)             0.0239             2428 0.726
 2 [0.05,0.10)             0.0750             2456 0.759
 3 [0.10,0.15)             0.124              2080 0.785
 4 [0.15,0.20)             0.173              1434 0.815
 5 [0.20,0.25)             0.222               895 0.819
 6 [0.25,0.30)             0.271               414 0.812
 7 [0.30,0.35)             0.320               193 0.850
 8 [0.35,0.40)             0.371                70 0.914
 9 [0.40,0.45)             0.421                24 0.917
10 [0.45,0.50)             0.466                 6 0.833

And a graph of the findings:

# Plot
ggplot(out, aes(x = r2_covariates, y = log10(p))) + geom_point() + geom_smooth()


Side comment: It is important not to compute change from baseline. However, we frequently approximate a baseline covariate-adjusted analysis by pretending that the slope of the baseline measurement is 1.0 and that baseline has a linear effect. Thus we use historical change-from-baseline data in power and precision calculations even though we don’t use differences in final analyses.
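A minimal sketch of such an approximate calculation in R, where the SD of the control-group change scores and the effect size are purely hypothetical placeholders:

sd_change <- 1.2   # assumed SD of change from baseline in controls (from historical data)
delta <- 1.0       # smallest treatment effect we would not want to miss

# Two-sample comparison of "slope 1.0" change scores; alpha halved here on
# the assumption of two primary treatment-vs-placebo comparisons
power.t.test(delta = delta, sd = sd_change,
             sig.level = 0.025, power = 0.80)$n  # n per group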


Thank you so much already, that helps a lot. And thank you for the really helpful link, @f2harrell; I was already reading chapters of your book to find some information, but did not look far enough.

That’s a really good question that I am not yet 100% sure about. My outcome is a variable that can only take integer values between 0 and 10, and as in my sample data, the change will be around -2 to +3. The change from 0 to 1 or from 6 to 7 and so on remains the same, so it is metric, but on an integer scale.
Therefore I also thought of using ordinal methods, but then thought it would be more appropriate to treat it as a continuous variable. (Side note: I also struggled to find the right ordinal methods for sample size calculation for a multivariable ordinal model.)
In the sadly sparse literature about our parameter of interest, it is often more or less trichotomized into <-1, -1 to 1, >1. But this feels like a version of dichotomizing a continuous value, which I definitely cannot support…

Happily, you are supporting my first intuition, which I later discarded partly because of this paper:

When the outcome is also measured at baseline (Y0), the change scores (Y1-Y0) between the treatment groups can be compared, again using a t-test. Another approach is to use analysis of covariance (ANCOVA) and to analyze Y1 or Y1-Y0 in a linear regression model that includes treatment group and Y0 as independent covariates (Y1|Y0 or Y1-Y0|Y0).

But together with your really helpful simulation and @f2harrell’s points on not using difference/change scores, the whole thing is much clearer now.

Two remaining questions:

  • In my situation with an integer outcome between 0 and 10 (where a distance of 1 is meaningful between all integers), would an ordinal model be more appropriate? And how would the computation change?
  • and if not, am I right that with the R2 suggested by @Louis_Martin, R2 <- summary(lm(FU ~ age + bp + baseline, data = data))$r.squared, my approach via power_oneway_ancova is correct?

For power/sample size calculations for ordinal outcomes (but without covariate adjustment) see this.


Thank you, that’s the way I am used to doing sample size calculations for ordinal outcomes, but the covariate adjustment is what makes me struggle, and my feeling is that these covariate effects shouldn’t be neglected in my case. But I am by no means an expert in this.

It’s hard to know how to treat the outcome without knowing more about it. Integer outcomes like what you have are typically either counts or some ordinal measurement (e.g. pain). To meet the assumptions of ordinary least squares regression, the distance between integer values needs to be the same; that is, the difference between 0 and 1 is the same as the difference between 5 and 6. An aggregate “objective” measure like a visual exam score is still an ordinal outcome. You know that higher numbers are better, but you can’t say how much better.

Even if you could do your analysis with ordinary least squares, it’s still appropriate to use ordinal regression—the analysis will just have to be interpreted differently and your sample size may need to be a bit higher (perhaps 5-10%). You could use simulation similar to what I did above, creating data sets similar to what you think the real data will look like and seeing how many have p < \alpha for a given sample size. Without doing that, you could just use the power/sample size methods that Frank mentioned without covariates as a conservative estimate, knowing that variance explained by covariates will increase your power.
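For illustration, here is a minimal sketch of such a simulation using a proportional-odds model (MASS::polr); the sample size, baseline association, and treatment shifts below are hypothetical placeholders to replace with your own assumptions:

library(MASS)  # polr() for proportional-odds (ordinal logistic) regression

set.seed(1)
n_per_group <- 30   # assumed sample size per arm
reps <- 1000
p_vals <- numeric(reps)

for (i in seq_len(reps)) {
  group <- factor(rep(c("placebo", "t1", "t2"), each = n_per_group))
  baseline <- sample(0:10, 3 * n_per_group, replace = TRUE)
  # Latent scale: assumed baseline association and treatment shifts
  latent <- 0.4 * baseline +
    c(placebo = 0, t1 = 1, t2 = 1)[as.character(group)] +
    rlogis(3 * n_per_group)
  # Discretize the latent variable into an ordered 0-10 outcome
  y <- cut(latent,
           breaks = c(-Inf, quantile(latent, seq(0.1, 0.9, 0.1)), Inf),
           labels = 0:10, ordered_result = TRUE)
  # Likelihood-ratio test for the treatment effect, adjusted for baseline
  fit1 <- polr(y ~ group + baseline)
  fit0 <- polr(y ~ baseline)
  p_vals[i] <- anova(fit0, fit1)[2, "Pr(Chi)"]
}
mean(p_vals < 0.05)  # estimated power at this sample size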


Without diving too deep into the background: the outcome measure is very similar to the “RPE scale” (Rating of Perceived Exertion), and most of the time the literature says that differences are the same across the whole range of the scale, so it can be treated as a continuous value.

I did some simulation attempts, and my gut feeling is that I really should not neglect the covariate effect, so the standard methods for the estimation cannot be used. Therefore I think the power_oneway_ancova approach will be the right way. Thank you for all your amazing input!!

Yes, in grant proposals I’ve frequently used the conservative approach when we don’t already have an observational study that shows how the covariates relate to the outcome. I add some text such as “The power calculation is conservative because of lack of accounting for explainable within-treatment outcome heterogeneity due to covariates. The actual analyses will be covariate adjusted and will have higher power than estimated here.”

But as mentioned before, if the outcome were truly continuous and you had an estimate of the SD of the change from baseline for control patients, you could do better by approximating ANCOVA with “assume a slope of 1.0” change scores.


Hi,

Thanks for tagging me in this discussion. A few notes:

Regarding the R² parameter:

The R2 in power_oneway_ancova() represents the hypothetical proportion of variance in the outcome that would be explained by the covariates without the treatment effect included in the model. This is essentially the predictive relationship between your covariates and the outcome variable.
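For instance, using the pilot data and variable names from earlier in the thread, the r2 argument would come from a model containing only the covariates, not from the full model that also contains the treatment:

# Covariate-only model: this R2 is what the r2 argument expects
r2_cov <- summary(lm(FU ~ age + bp + baseline, data = data))$r.squared

# Full model including treatment: not the right input for r2
r2_full <- summary(lm(FU ~ age + bp + baseline + factor(group), data = data))$r.squared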

Important caveat about using observed data for power calculations:

I would be extremely cautious about using observed data from your feasibility study to make power calculations. There are several reasons for this concern:

  1. Small sample sizes (n=4 per group in your example) produce highly unstable estimates of effect sizes, standard deviations, and R-squared values

  2. These estimates are likely to be biased and may not reflect the true population parameters

  3. Power analyses based on observed data (a form of post hoc power) can be overly optimistic

  4. The uncertainty around these parameter estimates is substantial but isn’t captured in the power calculation

Better approaches to consider:

  • Use effect sizes from published literature in similar populations/interventions to help gauge plausible effects

  • Rely on theory to dictate the effect size and other inputs for the analysis (e.g., theory could dictate that at least 15% of the variance in FU is explained by these covariates).

  • Consider conducting sensitivity analyses with a range of plausible effect sizes and R-squared values (see the sketch after this list)

  • As also suggested, look at the “conservative” no-covariate estimate of power and at the ordinal-model approach, at least as robustness checks on these calculations.
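For example, a minimal sensitivity sketch over a grid of plausible R² values; the means, SD, and R² grid below are hypothetical placeholders (0.9025 = 0.95² is the test-retest benchmark mentioned in point 3 further down):

library(Superpower)

for (r2 in c(0.25, 0.50, 0.9025)) {
  res <- power_oneway_ancova(
    mu = c(0.5, 1.5, 1.25),  # assumed group means
    n_cov = 1,               # baseline as the only covariate
    sd = 1,                  # assumed common SD
    r2 = r2,
    alpha_level = 0.025,     # halved for two primary comparisons
    beta_level = 0.2,
    round_up = TRUE,
    type = "exact"
  )
  print(res)
}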

Regarding your additional questions:

  1. You’re correct that time-invariant covariates like age don’t explain within-person change, but they can still improve precision by explaining between-person differences in the outcome. Stephen Senn has written many fantastic articles/blog posts about this.

  2. As mentioned by Dr. Harrell, avoid using difference scores (baseline - FU) as your outcome. The standard ANCOVA approach (option 3: lm(FU ~ age + bp + baseline + group)) is preferred.

  3. Be careful about overfitting. You mentioned this is a “small RCT”, so I’d be very cautious about adding too many covariates to the model. Just baseline as a covariate may be sufficient/prudent. If that is the case, you could use the test-retest correlation as a benchmark for the R2 argument (e.g., if test-retest reliability is r = 0.95, you could, probably overoptimistically, assume that the R2 of the covariates is approx 0.9025).

Additionally, the PASS documentation on ANCOVA might be a helpful resource.

Hope this helps guide you in the right direction.

My $.02 on that is that observed estimated effects should not be used in power calculations, but instead the subject-matter-knowledge-driven effect one would not want to miss. And how about using R^{2}_\text{adj} from previous data in the power calculation? And an estimated \sigma^2 that incorporates the usual error degrees of freedom would be unbiased, but your point about instability still holds; sometimes I think the 0.75 quantile of \sigma^2 should be used to be safer.
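A minimal sketch of that last idea, assuming a flat-prior (scaled inverse chi-square) reading of “the 0.75 quantile of \sigma^2” and reusing sd_pooled and n from the pilot code above:

# Under a flat prior, sigma^2 | data ~ df * s^2 / chi^2_df, so the 0.75
# quantile of sigma^2 gives a safer input for the power calculation
s2 <- sd_pooled^2
df_err <- 3 * n - 3                            # error df from three groups of n
s2_safer <- df_err * s2 / qchisq(0.25, df_err) # 0.75 quantile of sigma^2
sd_safer <- sqrt(s2_safer)                     # use in place of sd_pooled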