Sample size in prognostic factor research

Dear all,

I have read some literature proposing sample size calculations for a study of a new prognostic factor, e.g. a new biomarker and its association with 1-year all-cause mortality.

Example: Schmoor, C., Sauerbrei, W. and Schumacher, M. (2000), Sample size considerations for the evaluation of prognostic factors in survival analysis. Statist. Med., 19: 441-452.

I thought this or other formulas could be used before data collection to determine how many patients to include. I was wondering about two topics.

  1. If we have already collected the data and want to investigate a new prognostic factor (as is frequently done with biobanks), is there any method that avoids the problems of a post hoc sample size calculation?
  2. These are usually explanatory models, and we adjust based on subject-matter knowledge. What if I am still worried about overfitting and want to use fewer degrees of freedom? (All sample size formulas I know of are for prediction models.)
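
To make the "before data collection" case concrete, here is a minimal sketch of a Schoenfeld-type approximation for the number of events needed to detect a given hazard ratio for a binary factor. It is not the formula from the paper cited above, which should be consulted directly. The function names and all input values (hazard ratio, prevalence of the factor, 1-year event probability) are hypothetical and chosen only for illustration.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hazard_ratio, prevalence, alpha=0.05, power=0.80):
    """Approximate number of events for a two-sided test of a binary factor.

    Schoenfeld-type approximation under proportional hazards:
    d = (z_{1-alpha/2} + z_{power})^2 / (p * (1 - p) * log(HR)^2)
    where p is the proportion of patients with the factor present.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    d = (z_alpha + z_beta) ** 2 / (
        prevalence * (1 - prevalence) * log(hazard_ratio) ** 2
    )
    return ceil(d)


def required_patients(n_events, event_probability):
    """Convert events to patients, given the expected event probability."""
    return ceil(n_events / event_probability)


# Hypothetical inputs: HR of 2, factor present in half of the patients,
# and an assumed 1-year all-cause mortality of 20%.
events = required_events(hazard_ratio=2.0, prevalence=0.5)
patients = required_patients(events, event_probability=0.2)
print(events, patients)
```

The sketch ignores adjustment for other covariates and any correlation between the new factor and established ones, both of which would increase the required number of events.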

Thanks in advance!

Koray

I’m not sure I can address your questions, but this recent paper may be of interest if you haven’t already seen it: “Design aspects for prognostic factor studies” (available on PMC).

Be sure that using fewer degrees of freedom does not mean using standard statistical assessments to remove “unimportant” associations from the model. Approaches such as unsupervised learning, which reduce the data without looking at the outcome, are less problematic.

When the sample size may be inadequate and patients cannot be added, the best approach is to present confidence intervals for quantities of interest, such as interquartile-range effect ratios and especially the relative explained variation due to each predictor. For the latter, see chapter 5, “Describing, Resampling, Validating, and Simplifying the Model”, of Regression Modeling Strategies, and “Challenges of High-Dimensional Data Analysis”.
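
As an illustration of an interquartile-range effect ratio with its confidence interval, here is a minimal sketch. It assumes the biomarker enters a Cox model linearly, that `beta` and `se` are the fitted log hazard ratio per unit and its standard error, and that a Wald interval is acceptable. The function name and all numbers are hypothetical, not output from a real analysis.

```python
from math import exp
from statistics import NormalDist, quantiles


def iqr_hazard_ratio(beta, se, values, level=0.95):
    """Hazard ratio comparing the 3rd with the 1st quartile of a biomarker.

    beta, se : log hazard ratio per unit of the biomarker and its standard
               error (assumed to come from an already fitted Cox model)
    values   : observed biomarker values, used only to obtain the quartiles
    Returns (point estimate, lower limit, upper limit) of a Wald interval.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    z = NormalDist().inv_cdf(0.5 + level / 2)
    point = exp(beta * iqr)
    lower = exp((beta - z * se) * iqr)
    upper = exp((beta + z * se) * iqr)
    return point, lower, upper


# Hypothetical biomarker values and model estimates.
biomarker = [1.2, 1.9, 2.4, 3.1, 3.8, 4.6, 5.5, 6.3, 7.0]
hr, lo, hi = iqr_hazard_ratio(beta=0.15, se=0.06, values=biomarker)
print(f"IQR hazard ratio {hr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

A wide interval here is the honest signal that the sample is too small to pin down the effect, which is the point of reporting it instead of a post hoc power figure. For nonlinear effects (e.g. splines) the interval would need to come from the model's covariance matrix or from resampling.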

For my exploratory prognostic factor analysis, overfitting considerations for continuous biomarkers were guided by Parmar and Machin (1995), Survival Analysis: A Practical Approach. They suggest limiting the number of candidate predictors in an adjusted survival model so that it does not exceed the fourth root of the number of observed events, or alternatively allowing approximately 15–20 events per variable included in the model. However, these rules of thumb are presented as practical guidelines rather than being formally derived or justified by simulation. In addition, the sample size formula developed by Schmoor et al. was derived under the assumption of proportional hazards, which may not hold for biomarkers in real-world settings.
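
To show what the two rules of thumb imply in practice, here is a small sketch that turns a number of observed events into a maximum number of candidate predictors, using the rules exactly as described in this post. The function names and the event count of 200 are hypothetical.

```python
def max_predictors_epv(n_events, events_per_variable):
    """Largest number of predictors allowed by an events-per-variable rule."""
    return n_events // events_per_variable


def max_predictors_fourth_root(n_events):
    """Largest whole number not exceeding the fourth root of the events."""
    return int(n_events ** 0.25)


n_events = 200  # hypothetical number of observed deaths

print("15 events per variable:", max_predictors_epv(n_events, 15))
print("20 events per variable:", max_predictors_epv(n_events, 20))
print("fourth-root rule:", max_predictors_fourth_root(n_events))
```

The two rules can give very different answers for the same data, which underlines that they are rough guides rather than derived limits.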

The STROBE statement provides some explanation of this issue. Also, don’t calculate the sample size after data collection.

STROBE item 10, Study size: Explain how the study size was arrived at.

See also the 2 papers by Richard Riley and co-authors (including myself) in Statistics in Medicine.