Sample size linear regression

I want to calculate a sample size for a study that aims to assess the association between X and Y, where X and Y are both continuous variables.
I am unsure whether the relationship between X and Y is linear.
I thought about using linear regression (in case the relationship between X and Y is linear), but after looking through statistics textbooks and googling, I did not find a way to calculate the sample size required for linear regression other than using the R² value or a rule of thumb for participants per variable (which, according to other papers I have read, is not recommended).

Can someone guide me to the right literature for a sample size calculation for linear regression or suggest a way?
Also, is there a way to calculate the sample size for a regression model when we do not assume linearity between X and Y and use restricted cubic splines?

Thank you!

See here for an example simulation related to sample size for adequate estimation precision when using a restricted cubic spline with logistic regression.

Though not as ideal as considering estimation errors, it is very easy to compute N such that the margin of error in estimating a correlation coefficient is acceptably low. Typically you’ll see minimum sample sizes of 300 or more in the continuous Y case. See here.
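To make the margin-of-error idea concrete, here is a minimal sketch in Python. It assumes the usual Fisher z approximation for a 95% confidence interval on a Pearson correlation, which may not be the exact method used at the link above. The function names `correlation_margin` and `min_n_for_margin` are made up for illustration.

```python
import math
from statistics import NormalDist


def correlation_margin(n, r=0.0, conf=0.95):
    """Half-width of the Fisher-z confidence interval for a correlation.

    Assumption: approximate bivariate normality, so that atanh(r) has
    standard error 1 / sqrt(n - 3). Requires n > 3.
    """
    z_crit = NormalDist().inv_cdf(0.5 + conf / 2)  # e.g. about 1.96 for 95%
    se = 1.0 / math.sqrt(n - 3)
    z = math.atanh(r)
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return (upper - lower) / 2


def min_n_for_margin(margin, r=0.0, conf=0.95):
    """Smallest n whose margin of error is at most `margin`.

    r = 0 is the worst case, because the interval is widest there.
    """
    n = 4
    while correlation_margin(n, r, conf) > margin:
        n += 1
    return n
```

Under these assumptions, `correlation_margin(300)` comes out a bit above 0.11, and `min_n_for_margin(0.10)` lands in the high 300s, which is in line with the "300 or more" figure mentioned above.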

The best approach is by Riley et al.


Thank you very much for your advice, @f2harrell !

I read about the approach by Riley et al. Can this approach be used for the sample size calculation of a regression model that does not aim to predict, but is explanatory, i.e., assesses an association between certain variables Xi and the outcome variable Y?

If yes (and since no shrinkage would be done for an explanatory regression model), would it be reasonable to set the shrinkage parameter to 1 (i.e., no shrinkage if I understand correctly)?

The Riley method would give you a sample size that is perhaps twice as large as you need for your purpose. Regarding shrinkage, leave the main variables of interest unpenalized, and penalize effects of adjustment variables if there are too many of them for the sample size. See