The number of subjects per variable required in linear regression analyses

I would like to discuss this work, wherein Austin and Steyerberg aim to estimate the number of samples required to adequately estimate a regression coefficient.

Their first approach is to take a dataset, fit a full OLS model to it, draw subsets of increasing size from the full data, simulate outcomes for those subsets using the full model's coefficients, then fit models to the reduced subsets and compare their estimated coefficients with those of the full model.
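To make sure I'm reading the procedure correctly, here is a minimal sketch of it as I understand it, on hypothetical simulated data (not the authors' dataset; the number of covariates, noise level, and subset sizes are all my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: n subjects, p covariates (not the paper's data).
n, p = 2000, 10
X = rng.normal(size=(n, p))
X1 = np.column_stack([np.ones(n), X])          # add intercept column
beta_true = rng.normal(size=p + 1)
y = X1 @ beta_true + rng.normal(scale=3.0, size=n)

# Step 1: fit the "full" OLS model on all subjects.
beta_full, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Step 2: for subsets of increasing size, simulate outcomes from the
# full model's coefficients, refit OLS, and compare coefficients.
errors = {}
for spv in (2, 5, 10, 50):                     # subjects per variable
    m = spv * p
    idx = rng.choice(n, size=m, replace=False)
    Xs = X1[idx]
    # Outcomes generated from the full-model coefficients plus noise.
    ys = Xs @ beta_full + rng.normal(scale=3.0, size=m)
    beta_sub, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    errors[spv] = np.abs(beta_sub - beta_full).mean()
    print(f"{spv:>3} subjects/variable: mean |coef error| = {errors[spv]:.3f}")
```

The question I raise below is about how this coefficient error should be judged when the full model itself explains little of the outcome variance.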

But it seems noteworthy to me that the performance of the full models was very low. From the dataset, they used two outcome variables (systolic blood pressure and heart rate at discharge), and the corresponding R-squared values were 0.24 and 0.10.

They proceed to show that the coefficients could be reasonably estimated with only 2 subjects per coefficient. I would think, though, that this finding rather supports the claim that the coefficients of a poor model can be estimated with only 2 subjects per coefficient. That is to say, if the coefficients of the full model are of little use in predicting the outcome, then perhaps fewer samples suffice to reconstruct them. Maybe this is totally incorrect of me, and so I welcome discussion.

Admittedly, they do go on to show similar findings in subsequent simulations, but I understand that simulation results should always come with the disclaimer that the findings are confined to the conditions of the simulation's data-generating process.

This is no criticism of the authors; I just hope to gain a deeper understanding of their methodology and findings. The number of samples required to estimate a regression coefficient is a very relevant question in the current high-dimensional data environment, and the authors' results appear much lower than other results I've come across, hence the curiosity!


I hope that @Ewout_Steyerberg @ESteyerberg can respond.

Your starting point is problematic because R^2 does not inform you of bad performance without taking into account the difficulty of the task. R^2=0.1 may be as good as anyone could get for the dataset. The question, I think, needs to be framed in terms of not losing much from whatever an “optimum” R^2 is for the problem at hand.

I hope so too, thanks Professor Harrell.

I guess I’m exposing my ignorance here, then, because in most situations I am familiar with, an R-squared of 0.1 would be rather disappointing (and I’d say this applies to the case in the paper). I understand test statistics are often problem-specific, but I would have thought R-squared to be relatively immune to this. Could you give me an example in which R-squared = 0.1 might be considered a win?

Example: The difficult task of predicting the number of days until an individual dies. Expect the R^2 to be low. That doesn’t mean it’s not the best model for the task.
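To illustrate the point with a toy simulation (my own construction, not from the paper): when the irreducible noise variance is large relative to the signal variance, even the correctly specified model has a low R^2, and no model can do better.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: one true predictor, but a large irreducible
# noise component, so the *best achievable* R^2 is low by construction.
n = 100_000
x = rng.normal(size=n)
beta = 1.0
noise_sd = 3.0                    # var(signal) = 1, var(noise) = 9
y = beta * x + rng.normal(scale=noise_sd, size=n)

# Theoretical optimum: R^2 = var(signal) / var(total) = 1 / (1 + 9) = 0.1.
optimum_r2 = beta**2 / (beta**2 + noise_sd**2)

# Fit the correctly specified OLS model and compute its R^2.
X1 = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ coef
r2 = 1 - resid.var() / y.var()

print(f"optimum R^2 = {optimum_r2:.3f}, fitted R^2 = {r2:.3f}")
```

Here the fitted R^2 sits right at the theoretical ceiling of 0.1, so an R^2 of 0.1 is a "win" in the sense that it is as good as any model could possibly do for this outcome.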