The second paper by Riley et al above, which covers prediction models with binary or time-to-event outcomes and so appears relevant to your project, demonstrates that there is no “one size fits all” approach to estimating the sample size needed for a stable prediction model. It is a process.
The oft-quoted rule of thumb of 10, or perhaps 20, events per covariate degree of freedom does not cover all scenarios.
You should read that second paper to get a sense of the factors that affect the sample size estimates.
If you do not go through the exercise described and referenced in Figure 1 of that paper, you would want to take a conservative approach to estimating the needed sample size. Being conservative risks getting more data than you need. Even so, you may still not get enough data for the complexity of your intended model, and may need to simplify it.
Again, because the data are already available and do not have to be collected prospectively, the time, resource and cost hurdles of data collection are removed. Thus, there is no disincentive to being conservative and going after more data rather than less.
One approach might be to cite the Riley et al paper as a source and state that, in the interest of being conservative, you will use 25 or even 30 events per covariate degree of freedom. That would be at or above the high end of the numbers quoted there, though it could still be too low for any particular model development scenario.
Then you need to estimate the incidence of death in your target cohort and the potential complexity of your prediction model, to yield an estimate of your target sample size.
So, for example, say that the estimated incidence of death in your target cohort is 5%. On average, you need 20 patients to get 1 event.
Say that your model will include 30 covariate degrees of freedom (CDF), where that number is derived from a mix of continuous variables, perhaps K-level factors (each contributing K - 1 degrees of freedom), interaction terms, and regression splines.
Thus, to use 25 events per CDF as a target, you would need 750 events (25 * 30).
With a 5% incidence of death, to get 750 events, you will need 15,000 patients.
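The arithmetic above can be sketched in a few lines of Python. The function name `required_sample_size` and its parameter names are my own, for illustration only. This is just the events-per-CDF heuristic, not the fuller procedure in the Riley et al paper.

```python
import math
from fractions import Fraction

def required_sample_size(events_per_cdf, covariate_df, incidence):
    """Return (target events, patients needed) under the events-per-CDF heuristic.

    Hypothetical helper: it only reproduces the back-of-envelope arithmetic,
    and says nothing about shrinkage, optimism or precision of the model.
    """
    # Convert via str() so that 0.05 is treated as exactly 1/20,
    # which avoids floating-point noise before rounding up.
    rate = Fraction(str(incidence))
    if not 0 < rate <= 1:
        raise ValueError("incidence must be a proportion in (0, 1]")
    target_events = events_per_cdf * covariate_df
    # Round up, since the expected event count must reach the target.
    patients = math.ceil(target_events / rate)
    return target_events, patients

events, patients = required_sample_size(events_per_cdf=25, covariate_df=30, incidence=0.05)
print(events, patients)  # 750 15000
```

Note that the patient count is the number needed for the expected number of events to reach the target. The observed number of events in any one sample will vary around that.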
So, take the above, and work through scenarios that conform to your assumptions and come up with an estimate of your target sample size.
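One way to work through scenarios is to tabulate the patient count over a small grid of assumptions. The values in the grid below are placeholders that I made up for illustration. Substitute the events-per-CDF targets, model complexities and incidence estimates that fit your own project.

```python
import math
from fractions import Fraction
from itertools import product

def patients_needed(events_per_cdf, covariate_df, incidence):
    """Patients needed for the expected event count to reach the target."""
    rate = Fraction(str(incidence))  # exact decimal, avoids float rounding noise
    return math.ceil(events_per_cdf * covariate_df / rate)

# Hypothetical scenario values, not recommendations.
events_per_cdf_options = [20, 25, 30]
covariate_df_options = [20, 30]
incidence_options = [0.03, 0.05, 0.08]

scenarios = {
    (epc, cdf, inc): patients_needed(epc, cdf, inc)
    for epc, cdf, inc in product(
        events_per_cdf_options, covariate_df_options, incidence_options
    )
}

print("events/CDF  CDF  incidence  patients")
for (epc, cdf, inc), n in sorted(scenarios.items(), key=lambda item: item[1]):
    print(f"{epc:>10}  {cdf:>3}  {inc:>9.0%}  {n:>8,}")
```

Sorting by the resulting patient count makes it easy to see which assumption drives the estimate. Here the incidence has a large effect, since halving it doubles the number of patients needed.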
Another question is then, using the 15,000 figure above, whether you take a random sample of 15,000 from the target cohort or a consecutive sample of 15,000. My own preference would be the latter, given the vagaries of random sampling, but you may have other motivations. So, once you have an estimate of the target sample size, you also need to define your sampling strategy, presuming that the technical resources you are working with are able to implement your approach.
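To make the two sampling strategies concrete, here is a minimal sketch. It assumes that "consecutive" means the first n patients in chronological order, and that each record carries some ordering field. The field names `patient_id` and `admit_day` and the toy cohort are hypothetical, and your data source will have its own structure.

```python
import random

def consecutive_sample(records, n, order_key):
    """Return the first n records when sorted by order_key (e.g. admission date)."""
    if n > len(records):
        raise ValueError("cohort is smaller than the target sample size")
    return sorted(records, key=order_key)[:n]

def simple_random_sample(records, n, seed):
    """Return n records drawn without replacement; the seed makes the draw repeatable."""
    if n > len(records):
        raise ValueError("cohort is smaller than the target sample size")
    return random.Random(seed).sample(records, n)

# Hypothetical cohort of 50 patients, stored in an arbitrary order.
cohort = [{"patient_id": i, "admit_day": (i * 7) % 50} for i in range(50)]

first_ten = consecutive_sample(cohort, 10, order_key=lambda r: r["admit_day"])
random_ten = simple_random_sample(cohort, 10, seed=2024)

print([r["admit_day"] for r in first_ten])
print(sorted(r["admit_day"] for r in random_ten))
```

Either way, it helps to record the rule used (the start date and ordering field, or the seed) so that the sample can be reproduced later.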