Justification of sample size for the generation of a prediction model

Dear all,

I plan a retrospective analysis of patients who visited our outpatient clinic. The data will be extracted from our hospital information system. One specific research question is to generate a prognostic model to estimate the risk of death based on continuous variables sampled during a standard examination. Theoretically, I could extract data from more than 50,000 patients.

Now I have to justify the extraction of such a large number of patients against the General Data Protection Regulation.

Is there any epidemiological, methodological, or statistical justification for this?

I suppose that a regression model gets better the more patients and events are available for its generation.

However, I do not have the expert knowledge to substantiate this statement further. Therefore, I would like to start a discussion on this topic and am looking forward to your replies.

Thanks, Alexander

If manual data checking is not required, there is no reason not to extract all available patient data. Then you can check the adequacy of the number of deaths using one of the prediction model sample size papers by Richard Riley, on which I am a co-author.

To follow up on Frank’s comments above:

For ease, the papers that I believe Frank is referring to are:

Minimum sample size for developing a multivariable prediction model: Part I – Continuous outcomes


Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes

With respect to the GDPR, the relevant issue is the principle of data minimization. As one source puts it, you should collect and process only as much data as absolutely necessary for the purposes specified.

To Frank’s point, the relevant data are already available, as opposed to needing to be prospectively collected de novo. Thus, I am in agreement with Frank, in that there is no reason to not take advantage of that, while satisfying relevant data protections vis-a-vis your working datasets.

It is not clear from the description of your project, whether or not you are looking to use the entire cohort from your clinical information system, or if there is a specific clinical sub-group that you are interested in. If the latter, I would tend to focus on describing the inclusion/exclusion criteria for your cohort, as well as the specific data variables that you need, as the parameters for the extracted sample, rather than a specific sample size.

In other words, ideally, you don’t need all patients and all data. You only need the sub-group of patients, and the specific subset of data on those patients, that are relevant to your question.

If that is a reasonable approach, then you should be able to satisfy the GDPR requirements for minimizing the data that you require, notwithstanding any internal policy issues that you may face.

Edited to add: Your inclusion/exclusion criteria should likely also include a relevant time frame (e.g. the past 5 years, 2010 - 2022, …), since it seems reasonable that prior to a relevant time window, clinical practice and/or patient characteristics may have changed, such that the data prior to that time period are no longer relevant to your question regarding predicting mortality outcomes. That can also be another filter to meet the GDPR minimum data requirement.


Thanks for your answers!

Personally, I would like to extract as many patients as possible from the clinical information system.
The data protection issue, however, is that we need to link personal data from the clinical information system with the national death registry. From a data protection point of view, the risk decreases as the number of patients decreases. Therefore, I have to justify the need for a huge sample size from a methodological point of view.

I will look into the papers later. Are there any rules of thumb with regard to the precision of a model or R packages / R examples for the sample size calculation?

The second paper from Riley et al. above, on prediction models with binary or time-to-event outcomes, appears to be the one relevant to your project. It demonstrates that there is no “one size fits all” approach to sample size estimates for stable prediction models. It is a process.

The oft quoted rule of thumb of using 10 or perhaps 20 events per covariate degree of freedom does not cover all scenarios.

You should read that second paper to get a sense of those dynamics that affect the sample size estimates.

If you do not go through the exercise described and referenced in Figure 1 of that paper, you would want to take a conservative approach to estimating the needed sample size. Being conservative risks extracting more data than you need, while you could still end up with too little data for the complexity of your intended model and perhaps need to simplify it.

Again, that the data are already available, as opposed to having to collect the data prospectively, removes the time, resource and cost hurdles from data collection. Thus, there is no disincentive to being conservative and going after more data versus less.

One approach might be to cite the Riley et al paper as a source, and indicate that in the interest of being conservative, based upon the paper, you might use 25 or even 30 events per covariate degree of freedom, which would be at or above the high end of the numbers quoted there, though still subject to being too low for any single model development scenario.

Then you need to estimate the incidence of death in your target cohort and the potential complexity of your prediction model, to yield an estimate of your target sample size.

So, for example, say that the estimated incidence of death in your target cohort is 5%. You need 20 patients to get 1 event.

Say that your model will include 30 covariate degrees of freedom (CDF), where that number is derived from a mix of continuous variables, perhaps K-level factors, interaction terms, and regression splines.

Thus, to use 25 events per CDF as a target, you would need 750 events (25 * 30).

With a 5% incidence of death, to get 750 events, you will need 15,000 patients.

So, take the above, and work through scenarios that conform to your assumptions and come up with an estimate of your target sample size.
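To make the scenario work easier, the arithmetic above can be sketched in a few lines of Python. This is only an illustration of the events-per-CDF heuristic described in this post, not the Riley et al. method; the function name `patients_needed` and the scenario values other than 25 / 30 / 5% are made up for the example.

```python
import math

def patients_needed(events_per_cdf, model_cdf, incidence):
    """Return (required events, required patients) for an events-per-CDF target.

    events_per_cdf: target number of events per covariate degree of freedom
    model_cdf:      covariate degrees of freedom in the planned model
    incidence:      expected proportion of patients who die (0 < incidence <= 1)
    """
    if not 0 < incidence <= 1:
        raise ValueError("incidence must be in (0, 1]")
    events = math.ceil(events_per_cdf * model_cdf)
    # round() guards against floating-point noise before taking the ceiling
    patients = math.ceil(round(events / incidence, 6))
    return events, patients

# Worked example from the post: 25 events per CDF, 30 CDF, 5% incidence
print(patients_needed(25, 30, 0.05))  # (750, 15000)

# Hypothetical scenarios to see how sensitive the target is to the assumptions
for epv in (10, 20, 25, 30):
    for incidence in (0.03, 0.05, 0.10):
        events, patients = patients_needed(epv, 30, incidence)
        print(f"{epv} events/CDF, incidence {incidence:.0%}: "
              f"{events} events, {patients} patients")
```

A table like the one this loop prints can go directly into the justification, since it shows how quickly the required number of patients grows when the incidence is lower than expected.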

Then, another question will be, using the 15,000 figure above, do you take a random sample of 15,000 from the target cohort, or do you take a consecutive sample of 15,000 from the target cohort? My own preference would be the latter, given the vagaries of random sampling, but you may have other motivations. So, once you get an estimate of the target sample size, you also need to define your sampling strategy, presuming that the technical resources that you are working with have the ability to implement your approach.
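As a rough illustration of the two sampling strategies, here is a minimal Python sketch. It assumes that "consecutive" means the first n eligible patients seen on or after a chosen start date, which is only one possible definition; the record layout (`patient_id`, `visit_date`) and the toy cohort are hypothetical and stand in for whatever your clinical information system exports.

```python
import random
from datetime import date, timedelta

def consecutive_sample(records, n, start):
    """First n records with a visit on or after `start`, in date order."""
    eligible = sorted(
        (r for r in records if r["visit_date"] >= start),
        key=lambda r: r["visit_date"],
    )
    return eligible[:n]

def random_sample(records, n, seed=2022):
    """Simple random sample of n records; the fixed seed makes it reproducible."""
    return random.Random(seed).sample(records, n)

# Toy cohort: 100 hypothetical patients, one first visit per day from 2015-01-01
cohort = [
    {"patient_id": i, "visit_date": date(2015, 1, 1) + timedelta(days=i)}
    for i in range(100)
]

consecutive = consecutive_sample(cohort, 10, start=date(2015, 1, 11))
randomized = random_sample(cohort, 10)

print([r["patient_id"] for r in consecutive])  # patients 10 through 19
print(sorted(r["patient_id"] for r in randomized))
```

Whichever strategy you choose, documenting the start date or the random seed makes the extraction reproducible, which may also help in the data protection review.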


A quick follow up here. I decided to check to see if anyone has implemented the methodologies in the Riley et al papers, and did find two.

The first is an online calculator at the Cleveland Clinic:


The second is an R package on CRAN called ‘pmsampsize’ that appears to be written and maintained by the paper authors:


In order to use either one, you will still need to estimate the other required parameters as per the process described in the Riley et al paper, although these tools can save time in performing some of the calculations.

So, you might want to review those.


Thanks for your extensive replies! Your sample size calculation using the “X events per CDF” approach is realistic for me to perform. Maybe it is justifiable to increase X a little bit? Should I double the sample size to provide enough patients and events for the training as well as the validation cohort?


It seems like you have a target number in mind. If that is the case, then you have the basic framework above to work backwards from that number to define the foundation to justify that number.

In terms of validation, split sample validation is generally not preferred; bootstrap based validation methods applied to the entire sample are favored instead. Thus, there is no need to double the sample size a priori.
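To give a feel for what bootstrap validation on the entire sample does, here is a minimal Python sketch of the optimism-corrected bootstrap. It is a toy illustration only: the model is a one-predictor least-squares line scored by R², used as a simple stand-in for a real mortality model and performance measure, and all function names and the simulated data are made up for the example.

```python
import math
import random

def fit_line(xs, ys):
    """Toy model: least-squares line, returned as (slope, intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(model, xs, ys):
    """Proportion of outcome variance explained by `model` on (xs, ys)."""
    slope, intercept = model
    my = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def optimism_corrected(xs, ys, n_boot=200, seed=1):
    """Return (apparent, optimism-corrected) R^2 using all of the data."""
    rng = random.Random(seed)
    n = len(xs)
    apparent = r_squared(fit_line(xs, ys), xs, ys)
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        model = fit_line(bx, by)  # refit the model in the bootstrap sample
        # optimism = performance in the bootstrap sample minus in the original
        optimism.append(r_squared(model, bx, by) - r_squared(model, xs, ys))
    return apparent, apparent - sum(optimism) / n_boot

# Simulated toy data: 100 observations with a weak linear signal
data_rng = random.Random(0)
xs = [data_rng.gauss(0, 1) for _ in range(100)]
ys = [0.5 * x + data_rng.gauss(0, 1) for x in xs]

apparent, corrected = optimism_corrected(xs, ys)
print(f"apparent R^2 = {apparent:.3f}, corrected R^2 = {corrected:.3f}")
```

The point of the sketch is that every patient contributes to both model fitting and validation, so nothing is held out. In a real analysis, every modelling step (including any variable selection) has to be repeated inside the bootstrap loop.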

You might want to get a copy of the second edition of Frank’s book:


where he goes into detail on those aspects of model development, using his R packages as a foundation.

As a high level overview, you might want to review:


from Frank’s online BBR documentation.
