Sample size in prognostic factor research

Dear all,

I read some literature proposing sample size calculations for a study of a new prognostic factor, e.g., a new biomarker and its association with 1-year all-cause mortality.

Example: Schmoor, C., Sauerbrei, W. and Schumacher, M. (2000), Sample size considerations for the evaluation of prognostic factors in survival analysis. Statist. Med., 19: 441-452.
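
For concreteness, here is a rough sketch (in base R) of what I understand the required-events calculation in that paper to look like: a Schoenfeld-type formula with a variance-inflation factor 1/(1 − ρ²) for the correlation between the factor and the adjustment covariates. The exact expression and the numbers below are illustrative only and should be checked against the paper.

```r
## Rough sketch (base R) of a Schoenfeld-type required-events formula with a
## variance-inflation term 1 / (1 - rho^2), which is how I understand the
## Schmoor et al. (2000) approach -- please check against the paper.
events_needed <- function(hr, p, rho = 0, alpha = 0.05, power = 0.80) {
  # hr:  hypothesized hazard ratio for a binary prognostic factor
  # p:   prevalence of the factor
  # rho: multiple correlation between the factor and the adjustment covariates
  z <- qnorm(1 - alpha / 2) + qnorm(power)
  z^2 / (log(hr)^2 * p * (1 - p) * (1 - rho^2))
}

d_events <- events_needed(hr = 1.5, p = 0.3, rho = 0.3)   # illustrative values
ceiling(d_events)          # required number of deaths
ceiling(d_events / 0.15)   # required patients if roughly 15% die within 1 year
```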

I thought this or other formulas could be used before data collection to determine how many patients to include. I was wondering about two topics.

  1. If we have already collected the data and want to investigate a new prognostic factor (which is frequently done with biobanks), is there any method that does not have the problem of a post hoc sample size calculation?
  2. As these are usually explanatory models and we adjust based on subject-matter knowledge, what if I am still worried about overfitting and want to use fewer degrees of freedom (all the sample size formulas I know are for prediction models)?

Thanks in advance!

Koray

I’m not sure I can address your questions, but this recent paper may be of interest if you haven’t already seen it: Design aspects for prognostic factor studies (PMC).

1 Like

Be sure that “using fewer degrees of freedom” does not mean using standard statistical assessments to remove “unimportant” associations from the model. Things like unsupervised learning approaches are less problematic.

When the sample size may be inadequate and patients cannot be added, the best approach is to present confidence intervals for quantities of interest, such as intervals for interquartile-range effect ratios and especially for the relative explained variation due to each predictor. For the latter, see Chapter 5 (Describing, Resampling, Validating, and Simplifying the Model) of Regression Modeling Strategies, and Challenges of High-Dimensional Data Analysis.
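
As a rough illustration of that advice (not an exact recipe), something along the following lines with the rms package would show interquartile-range hazard ratios with confidence intervals and the relative contribution of each predictor; the data frame `d` and all variable names are placeholders.

```r
## Illustrative only: `d` and the variable names below are placeholders.
library(rms)
library(survival)

dd <- datadist(d); options(datadist = "dd")

fit <- cph(Surv(futime, death) ~ biomarker + rcs(age, 4) + sex + comorbidity,
           data = d, x = TRUE, y = TRUE)

summary(fit)      # hazard ratios for inter-quartile-range changes, with CIs
an  <- anova(fit) # Wald chi-square for each predictor
# relative contribution of each predictor (option name may vary by rms version)
plot(an, what = "proportion chisq")
```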

For my exploratory prognostic factor analysis, overfitting considerations for continuous biomarkers were guided by Parmar and Machin (1995) in their book Survival Analysis: A Practical Approach, in which they suggest that the number of candidate predictors in an adjusted survival model should not exceed the fourth root of the number of observed events, or alternatively that there should be approximately 15–20 events per variable included in the model. However, these rules of thumb are offered as practical guidelines rather than derived from formal statistical theory or simulation studies. In addition, the sample size formula developed by Schmoor et al. was derived under the assumption of proportional hazards, which may not hold for biomarkers in real-world settings.
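
Just to make those two rules of thumb concrete, a few lines of base R compute the implied budget of candidate predictors for a given number of observed events (the 120 events below are purely illustrative).

```r
## Base-R version of the two rules of thumb mentioned above: fourth root of
## the number of events, and 15-20 events per candidate variable.
df_budget <- function(n_events, epv = 15) {
  c(fourth_root         = floor(n_events^0.25),
    events_per_variable = floor(n_events / epv))
}
df_budget(n_events = 120)   # 120 observed deaths is purely illustrative
```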

1 Like

The STROBE guidelines provide some explanation of this issue. Also, don’t calculate the sample size after data collection.

Item 10 (Study size): Explain how the study size was arrived at.

1 Like

See also the 2 papers by Richard Riley and co-authors (including myself) in Statistics in Medicine.

Why not? The formulas could be used with retrospective data to choose how many parameters one can consider.

3 Likes

Thank you all for the input!

Actually, this was the intention of my question: I do not want a sample size calculation to justify that my observational study is adequately powered; I want it to choose the number of confounders I can adjust for without overfitting (in an explanatory model showing the association of a prognostic factor with an outcome).

For now, my strategy has been to include established and well-known risk factors (e.g., age and comorbidities) and accept that this may mean adjusting for more than the 10-events-per-variable rule of thumb allows. Here it would be amazing to have something like the pmsampsize package from Riley et al.

I see that Frank Harrell has already suggested the papers that directly address your question: Richard Riley et al. in the BMJ, and their updated thoughts on the matter after years of deliberation (Riley et al., arXiv 2025).

I should emphasize that the only defensible way to test a new prognostic factor is to add it into a prediction model that includes other known, easily obtainable predictors and see if it adds value. So sample size formulas for prediction models are in fact what you need.
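
A minimal sketch of that comparison, assuming an rms workflow and placeholder variable names: fit the model with the known predictors, refit with the biomarker added, and look at the likelihood ratio test for the added value.

```r
## Illustrative only: `d` and the variable names are placeholders.
library(rms)
library(survival)

dd <- datadist(d); options(datadist = "dd")

fit_base <- cph(Surv(futime, death) ~ rcs(age, 4) + sex + comorbidity,
                data = d, x = TRUE, y = TRUE)
fit_ext  <- cph(Surv(futime, death) ~ rcs(age, 4) + sex + comorbidity + biomarker,
                data = d, x = TRUE, y = TRUE)

lrtest(fit_base, fit_ext)   # likelihood ratio test for the biomarker's added value
```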

Having developed several prediction models that are in daily clinical use, I would add a point that I felt was not emphasized as much as I would have liked in the prediction modeling literature I read before doing my own: consider carefully what information the users of your prediction model will have access to at the time of deciding to use it, and consider the cost of obtaining the necessary predictors.

I would suggest you do a formal sample size calculation, as described in the papers by Riley et al. above, and use this to obtain the maximum number of candidate predictor variables you can formally consider. I would then fill your allotted budget by choosing among the predictors that clinical expertise and prior literature suggest are important, considering the availability of the predictor in contemporary practice (will the user have to obtain predictors they wouldn’t normally measure just to use your model?), the cost of obtaining the predictor (would obtaining it be costly in time, complexity, pain or discomfort, or money?), and the acceptability of the predictor to users (will the user understand the relevance of the predictor to the decision that needs to be made?). Obviously you can’t even consider predictors that cannot be available to the user at the time of using your model (such as response to chemotherapy), but I don’t think I need to emphasize this.

Finally, consider deeply what your aim is. I am both a clinician and a developer of clinical prediction models, and the most common error I see in prediction model development is not considering how or why a clinician would use the model. Your stated aim is to severely test the added value of a new biomarker in the prediction of one-year mortality. You have already done what the majority of researchers refuse to do: decided to test your biomarker extensively in the correct framework. However, consider how and why a clinician would then use this biomarker. What decision does one-year mortality inform?

Every prediction model and every biomarker in clinical use is couched in some kind of clinical decision (whether this is justified or not). The results of a model should inform a decision. If NT-proBNP is elevated in the setting of dyspnea, this pushes me in the direction of diuretics. If the CHA2DS2-VASc score is elevated predicting a higher risk of stroke in atrial fibrillation, this pushes me to prescribe anticoagulants.

What does your prediction model push clinicians to do? What will the elevated biomarker or normal levels of the biomarker inform clinicians of? If you have trouble answering these questions, then the likelihood of the model or biomarker being used clinically is small.

3 Likes

Dear Elias,

Thank you for your thoughts, discussions, and elaborations on the other topics.

The nuance of my question is that it is about causal inference and association, irrespective of overall prediction. Explanatory/causal models (with the goal of investigating one specific prognostic factor) should be treated very differently from prediction models; for instance, you cannot adjust for mediators.
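
To illustrate the mediator point with a toy example (the DAG below is entirely hypothetical), the dagitty package will propose adjusting only for the confounder, not for the mediator on the path from the biomarker to mortality.

```r
## Toy DAG (entirely hypothetical) with the dagitty package: the suggested
## adjustment set contains the confounder (age) but not the mediator.
library(dagitty)

g <- dagitty("dag {
  age -> biomarker
  age -> mortality
  biomarker -> mediator -> mortality
  biomarker -> mortality
}")

adjustmentSets(g, exposure = "biomarker", outcome = "mortality")
# expected: { age }
```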

I asked Richard himself during his prediction model course whether the sample size calculations for prediction models (the ones you cited) can be used for these explanatory models, and he clearly answered that they cannot.

The formulas I cited in the very beginning were also recommended to me by Richard, he also posted them on his website: https://www.prognosisresearch.com/guidance-prognostic-factors

And these formulas certainly apply when planning a new study and deciding how many patients to recruit.

However, we already have an established biobank with a certain number of patients. I already believed that using these formulas retrospectively does not make sense, as JiagiLi also stated. Frank suggested that in that case it is crucial to present confidence intervals.

Regarding the number of confounders that can be adjusted for without using penalization (shrinkage), see this.