I would like to discuss this work, in which Austin and Steyerberg aim to determine the number of samples required to adequately estimate a regression coefficient.
Their first approach is to take a dataset, fit a full OLS model to it, draw subsets of increasing size from the full data, estimate outcomes for the reduced subsets using the full-model coefficients, then estimate the coefficients of the reduced-subset models and compare them with those of the full model.
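To make sure I'm following the procedure correctly, here is a rough sketch of how I understand the subsample-and-compare idea. All the data here are made-up placeholders (not the authors' cohort), and I've simplified by skipping the outcome-regeneration step and simply refitting on random subsamples of the full data, then comparing against the full-model fit:

```python
# Minimal sketch of the subsample-and-compare idea, as I understand it.
# Everything below is placeholder data, not the authors' actual dataset or design.
import numpy as np

rng = np.random.default_rng(0)

# Placeholder "full" dataset.
n_full, n_coef = 5000, 10
X_full = rng.normal(size=(n_full, n_coef))
beta_true = rng.uniform(0.5, 2.0, size=n_coef)          # kept away from zero for relative errors
y_full = X_full @ beta_true + rng.normal(scale=5.0, size=n_full)  # noisy outcome

def ols_coefs(X, y):
    """OLS estimates via least squares (intercept omitted for brevity)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_full = ols_coefs(X_full, y_full)  # "full model" coefficients, treated as the reference

# Draw subsets of increasing size (expressed as samples per coefficient)
# and compare the refitted coefficients with the full-model ones.
for spv in (2, 5, 10, 20):
    n_sub = spv * n_coef
    idx = rng.choice(n_full, size=n_sub, replace=False)
    beta_sub = ols_coefs(X_full[idx], y_full[idx])
    rel_err = np.abs(beta_sub - beta_full) / np.abs(beta_full)
    print(f"{spv:>3} samples/coef: mean relative error vs full model = {rel_err.mean():.2f}")
```

If I've misread any step of their procedure, corrections are very welcome.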
But it seems noteworthy to me that the performance of the full models was quite low. From the dataset, they used two variables as outcome variables (systolic blood pressure and heart rate at discharge), and the R-squared values were 0.24 and 0.10, respectively.
They proceed to show that the coefficients can be reasonably estimated with only 2 samples per coefficient. I would think, though, that this finding supports a narrower claim: that the coefficients of a poorly performing model can be estimated with only 2 samples per coefficient. That is to say, if the coefficients of the full model are of little use in predicting the outcome, then perhaps smaller subsamples suffice to reconstruct them. Maybe this is entirely wrong of me, in which case I welcome discussion.
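For what it's worth, here is the kind of toy check I have in mind for that conjecture (again with made-up data, and definitely not the authors' simulation design): generate a low-R-squared and a high-R-squared version of the same kind of linear model, and see how closely 2-samples-per-coefficient subsamples reproduce the full-sample coefficients in each case.

```python
# Toy check of the conjecture above; purely illustrative, with invented data.
import numpy as np

rng = np.random.default_rng(1)
n_full, n_coef, n_reps = 5000, 10, 500

def subsample_recovery(noise_sd):
    """Return (full-model R^2, mean relative error of 2-per-coef subsample fits)."""
    X = rng.normal(size=(n_full, n_coef))
    beta = rng.uniform(0.5, 2.0, size=n_coef)
    y = X @ beta + rng.normal(scale=noise_sd, size=n_full)
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    r2 = 1 - (y - X @ beta_full).var() / y.var()
    errs = []
    for _ in range(n_reps):
        idx = rng.choice(n_full, size=2 * n_coef, replace=False)
        beta_sub = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        errs.append(np.mean(np.abs(beta_sub - beta_full) / np.abs(beta_full)))
    return r2, float(np.mean(errs))

# Larger noise_sd -> lower R^2 of the full model.
for label, sd in (("high-R^2 setting", 1.0), ("low-R^2 setting ", 10.0)):
    r2, err = subsample_recovery(noise_sd=sd)
    print(f"{label}: R^2 = {r2:.2f}, mean relative error at 2 samples/coef = {err:.2f}")
```

I'd be curious whether something like this matches or contradicts my intuition, and whether the comparison even makes sense given how the authors define coefficient recovery.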
Admittedly, they do go on to show similar findings in subsequent simulations, but my understanding is that simulation results always come with the caveat that the findings are confined to the conditions of the data-generating process used in the simulation.
This is not a criticism of the authors; I just hope to gain a deeper understanding of their methodology and findings. The number of samples required to estimate a regression coefficient is a very relevant question in the current high-dimensional data environment, and the authors' results appear much lower than others I've come across, hence the curiosity!