Sample size in prognostic factor research

Dear all,

I have read some literature proposing sample size calculations for studies of a new prognostic factor, e.g. a new biomarker and its association with 1-year all-cause mortality.

Example: Schmoor, C., Sauerbrei, W. and Schumacher, M. (2000), Sample size considerations for the evaluation of prognostic factors in survival analysis. Statist. Med., 19: 441-452.

I thought this or other formulas could be used before data collection to determine how many patients to include. I was wondering about two topics.

  1. If we have already collected the data and want to investigate a new prognostic factor (as is frequently done with biobanks), is there any method that avoids the problems of a post hoc sample size calculation?
  2. These are usually explanatory models, and we adjust based on subject matter knowledge. What if I am still worried about overfitting and want to use fewer degrees of freedom? (All the sample size formulas I know are for prediction models.)

Thanks in advance!

Koray

I’m not sure I can address your questions, but this recent paper may be of interest if you haven’t already seen it: “Design aspects for prognostic factor studies” (PMC).

Be sure that any attempt to use fewer degrees of freedom does not rely on standard statistical assessments to remove “unimportant” associations from the model. Approaches such as unsupervised learning, which reduce the predictors without looking at the outcome, are less problematic.
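As a rough illustration of what an unsupervised reduction could look like, here is a minimal Python sketch that collapses several correlated adjustment covariates into a couple of principal component scores. Principal components are only one possible choice and are my assumption, not something prescribed in this thread; the function name `pca_scores` and the simulated covariates are hypothetical. The point is that the outcome is never used when choosing the reduced variables.

```python
import numpy as np

def pca_scores(X, n_components):
    """Return the leading principal component scores of X and their variance shares.

    The outcome is never consulted, so the reduction is unsupervised.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each column so that no covariate dominates by scale alone.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized matrix: rows of Vt are the component loadings.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Simulated data (purely illustrative): six correlated covariates
# driven by one latent trait, for 200 patients.
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=n)
X = np.column_stack([latent + rng.normal(scale=0.5, size=n) for _ in range(6)])

scores, explained = pca_scores(X, n_components=2)
print("share of covariate variance per component:", np.round(explained, 3))
```

The scores could then enter the outcome model in place of the six original covariates, spending two degrees of freedom instead of six. Whether that is sensible depends on whether the components are interpretable enough for an explanatory model.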

When the sample size may be inadequate and patients cannot be added, the best approach is to present confidence intervals for quantities of interest, such as intervals for interquartile-range effect ratios and especially for the relative explained variation due to each predictor. For the latter, see Chapter 5, “Describing, Resampling, Validating, and Simplifying the Model”, of Regression Modeling Strategies, and “Challenges of High-Dimensional Data Analysis”.
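To make the confidence-interval idea concrete, here is a minimal Python sketch of a percentile bootstrap interval for an interquartile-range odds ratio, i.e. the odds ratio for 1-year mortality comparing the 75th with the 25th percentile of the biomarker, adjusted for one covariate. Everything in it is an assumption for illustration: the data are simulated, the outcome model is a plain logistic regression fitted by Newton-Raphson, the helper names (`fit_logistic`, `iqr_odds_ratio`, `bootstrap_ci`) are hypothetical, and the percentile bootstrap is just one of several ways to obtain such an interval. It does not cover relative explained variation.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        # A tiny ridge term keeps the Hessian invertible in awkward resamples.
        H = X.T @ (X * w[:, None]) + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(H, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def iqr_odds_ratio(marker, covars, y, iqr):
    """Adjusted odds ratio for an increase of `iqr` units in the marker."""
    X = np.column_stack([np.ones(len(y)), marker, covars])
    beta = fit_logistic(X, y)
    return float(np.exp(beta[1] * iqr))

def bootstrap_ci(marker, covars, y, iqr, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling patients with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[b] = iqr_odds_ratio(marker[idx], covars[idx], y[idx], iqr)
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(stats, [alpha, 1.0 - alpha])
    return float(lo), float(hi)

# Simulated cohort (purely illustrative): biomarker, age, 1-year mortality.
rng = np.random.default_rng(1)
n = 400
marker = rng.normal(size=n)
age = rng.normal(size=n)
logit = -1.5 + 0.6 * marker + 0.3 * age
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

# The quartiles are fixed from the observed data so that every resample
# estimates the same contrast.
q25, q75 = np.percentile(marker, [25, 75])
iqr = q75 - q25

est = iqr_odds_ratio(marker, age, y, iqr)
lo, hi = bootstrap_ci(marker, age, y, iqr)
print(f"IQR odds ratio {est:.2f}, 95% bootstrap interval {lo:.2f} to {hi:.2f}")
```

A wide interval here is the honest signal that the available sample cannot pin down the association, which is more informative than a post hoc power calculation. For time-to-event data the same resampling loop would apply to a hazard ratio from a Cox model.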