I have read some of the literature proposing sample size calculations for a study of a new prognostic factor, e.g. a new biomarker and its association with 1-year all-cause mortality.
Example: Schmoor, C., Sauerbrei, W. and Schumacher, M. (2000), Sample size considerations for the evaluation of prognostic factors in survival analysis. Statist. Med., 19: 441-452.
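If it helps to see the calculation concretely: my understanding is that the Schmoor et al. formula is essentially Schoenfeld's required-events formula for a binary factor, inflated by a factor of 1/(1 − ρ²) to account for correlation between the new factor and the covariates already in the model. Below is a rough Python sketch of that calculation; the function name and the illustrative inputs are mine, so please check it against the paper before relying on it.

```python
import math
from scipy.stats import norm

def required_events(alpha, power, hr, prevalence, rho=0.0):
    """Events needed to detect hazard ratio `hr` for a binary prognostic
    factor present in a fraction `prevalence` of patients, allowing for
    correlation `rho` between the factor and the adjustment covariates
    (variance inflation factor 1 / (1 - rho^2))."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance level
    z_beta = norm.ppf(power)            # desired power
    vif = 1.0 / (1.0 - rho ** 2)        # penalty for correlated adjustment covariates
    d = (z_alpha + z_beta) ** 2 * vif / (prevalence * (1 - prevalence) * math.log(hr) ** 2)
    return math.ceil(d)

# Illustrative inputs only: HR = 2, 30% marker-positive, rho = 0.3 with the other covariates
d = required_events(alpha=0.05, power=0.80, hr=2.0, prevalence=0.30, rho=0.3)
print(d)
# Dividing d by the anticipated 1-year event probability gives the number of patients to recruit.
```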
I thought this or similar formulas could be used before data collection to determine how many patients to include. I have two questions.
If we have already collected the data and want to investigate a new prognostic factor (as is frequently done with biobanks), is there any method that avoids the problems of a post hoc sample size calculation?
Since these are usually explanatory models in which we adjust for covariates chosen on subject-matter grounds, what can I do if I am still worried about overfitting and want to spend fewer degrees of freedom? (All the sample size formulas I know of are for prediction models.)
Be sure that “using fewer degrees of freedom” does not mean using standard statistical assessments to remove “unimportant” associations from the model. Approaches such as unsupervised learning are less problematic.
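To make that concrete, one way of spending fewer degrees of freedom without consulting the outcome is to summarise the adjustment covariates with an unsupervised data-reduction step (e.g. principal components) and then fit the Cox model with the biomarker plus the component scores. A rough sketch, assuming lifelines and scikit-learn are available and using toy data with made-up column names:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from lifelines import CoxPHFitter

# Toy data standing in for an existing biobank cohort
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "time": rng.exponential(1.0, n),
    "event": rng.integers(0, 2, n),
    "biomarker": rng.normal(0, 1, n),
    "age": rng.normal(60, 10, n),
    "stage": rng.integers(1, 4, n),
    "grade": rng.integers(1, 4, n),
    "albumin": rng.normal(40, 5, n),
})
adjust_cols = ["age", "stage", "grade", "albumin"]

# Unsupervised data reduction: the survival outcome is never consulted here,
# so nothing is kept or dropped because of its apparent association with mortality.
scores = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(df[adjust_cols])
)

model_df = df[["time", "event", "biomarker"]].copy()
model_df["adj_pc1"] = scores[:, 0]
model_df["adj_pc2"] = scores[:, 1]

# The biomarker costs 1 df; the adjustment set now costs 2 df instead of 4
cph = CoxPHFitter()
cph.fit(model_df, duration_col="time", event_col="event")
cph.print_summary()
```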
For my exploratory prognostic factor analysis, overfitting considerations for continuous biomarkers were guided by Parmar and Machin (1995) in their book Survival Analysis: A Practical Approach. They suggest limiting the number of candidate predictors in an adjusted survival model so that it does not exceed the fourth root of the number of observed events, or, alternatively, allowing roughly 15–20 events per variable included in the model. However, these rules of thumb are offered as practical guidelines rather than being backed by a formal derivation or a simulation study. In addition, the sample size formula developed by Schmoor et al. assumes proportional hazards, which may not hold for biomarkers in real-world settings.
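For what it is worth, the two rules of thumb imply quite different budgets for candidate predictors; a quick illustration (the event count is made up):

```python
import math

events = 120  # anticipated number of deaths at 1 year (illustrative)

# Fourth-root rule: candidate predictors should not exceed events ** 0.25
max_df_fourth_root = math.floor(events ** 0.25)      # 3 for 120 events

# Events-per-variable rule: roughly 15-20 events per candidate predictor
max_df_epv = (events // 20, events // 15)            # (6, 8) for 120 events

print(max_df_fourth_root, max_df_epv)
```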