As a preface: I’m a first-year medical PhD student with limited statistical support in my department, so forgive any naive points in my questions.
I am currently working on developing a clinical prediction model for the 12-month risk of a certain outcome (X) in the presence of competing risks. My dataset comprises ~50,000 individuals and ~40 predictors (both continuous and categorical, which have been selected by an expert panel). My supervisors have outlined a protocol that performs AIC-based stepwise backwards selection. In brief, it is as follows:
Initial model
Include all candidate predictors in an initial Fine-Gray model (e.g., via the fastcmprsk package)
Perform backwards elimination based on AIC
Imputation and bootstrapping
Create 30 imputed sets
For each imputation, generate ~500 bootstrap samples and perform backwards selection in each (a bootstrap loop, for example using fastcmprsk, with AIC-based selection)
Retain the predictors from the initial model that are selected in > 75% (adjustable) of all bootstrap samples (a sketch of this loop follows below).
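To make the selection-frequency loop concrete, here is a minimal sketch for a single imputed dataset. The data frame `dat` and the predictor names are hypothetical; I use cmprsk::crr here for clarity, with fastcmprsk::fastCrr as the faster substitute for the real run:

```r
## Minimal sketch: bootstrap selection frequencies for AIC-based backward
## elimination of a Fine-Gray model in one imputed dataset.
## Assumes `dat` has ftime, fstatus (1 = event of interest, 2 = competing
## event, 0 = censored) plus the candidate predictors; names hypothetical.
library(cmprsk)

predictors <- c("age", "sex", "sbp")                  # hypothetical list

aic_crr <- function(fit) -2 * fit$loglik + 2 * length(fit$coef)

fit_fg <- function(d, vars) {
  X <- model.matrix(reformulate(vars), d)[, -1, drop = FALSE]
  crr(d$ftime, d$fstatus, X)
}

backward_aic <- function(d, vars) {
  repeat {
    best <- aic_crr(fit_fg(d, vars))
    if (length(vars) == 1) return(vars)
    drop <- NULL
    for (v in vars) {                                 # try each single drop
      a <- aic_crr(fit_fg(d, setdiff(vars, v)))
      if (a < best) { best <- a; drop <- v }
    }
    if (is.null(drop)) return(vars)                   # no drop improves AIC
    vars <- setdiff(vars, drop)
  }
}

B   <- 500
sel <- matrix(0, B, length(predictors), dimnames = list(NULL, predictors))
for (b in seq_len(B)) {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]
  sel[b, backward_aic(boot, predictors)] <- 1
}
colMeans(sel)   # per-predictor selection frequency; retain those > 0.75
```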
Internal validation
In each imputation set, repeat bootstrap sampling and full model development
Model performance (the C-index) in each bootstrap sample and in the corresponding original imputed set (or should this be the original complete-case data?) is used to estimate optimism, as sketched below
Pool 30 optimism estimates
Estimate calibration slopes, average the 500 slopes within each imputation set, and combine the means to obtain a uniform shrinkage factor.
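If it helps to make the optimism step concrete, the usual bootstrap recipe within one imputed dataset looks roughly like the sketch below; `fit_model` and `cindex_fun` are placeholders for the full modelling procedure (including any selection) and whatever competing-risks concordance measure is settled on (e.g., from the pec package):

```r
## Sketch of an optimism-corrected C-index within one imputed dataset.
## `fit_model(d)` must rerun the entire model-building procedure
## (including variable selection); `cindex_fun(fit, d)` is a placeholder
## for a competing-risks concordance measure of your choosing.
optimism_cindex <- function(dat, fit_model, cindex_fun, B = 500) {
  apparent <- cindex_fun(fit_model(dat), dat)
  optimism <- replicate(B, {
    boot <- dat[sample(nrow(dat), replace = TRUE), ]
    fitb <- fit_model(boot)
    cindex_fun(fitb, boot) - cindex_fun(fitb, dat)    # train minus test
  })
  c(apparent = apparent, corrected = apparent - mean(optimism))
}
## Repeat per imputed dataset and pool the 30 corrected estimates
## (e.g., by their mean); calibration slopes are pooled analogously.
```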
I’ve been exploring this approach and its practical implementation, and the more I read into it, the more I question whether this is the way to go.
From what I can gather from the RMS book and related questions, stepwise backwards selection has well-known limitations. In addition, the computational burden is quite heavy. So the question is: is it valid for my use case?
As an alternative, I have considered a bootstrapped LASSO Fine-Gray approach. I’ve written a script that successfully does this. However, I am not sure how to proceed from there:
A) I could use a similar approach as in the backwards selection: extract the retention frequency of each variable and include only those with a frequency of > 75% in my final model. However, I feel there are several caveats to this approach as well, since it seems to retain information only on which variables were selected and not on the penalization itself.
B) I could also consider performing optimism correction by calculating performance on the bootstrap sample and on the original imputed set (or should it be the original data prior to imputation?). Overall, I don’t really understand what would be the best way to use the results of my bootstrapped LASSO… A sketch of the kind of loop I have in mind follows below.
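For reference, the loop is conceptually like the sketch below. I show it here via survival::finegray weights plus glmnet’s penalized Cox rather than my actual fastcmprsk code, so treat the details as an assumption-laden illustration; `dat`, the factor coding of `fstatus`, and the predictor names are hypothetical, and with categorical predictors the dummy-column names would need mapping back to their source variables:

```r
## Sketch: bootstrap retention frequencies for a LASSO Fine-Gray model.
## Assumes `dat` has ftime and `fstatus` coded as a factor with levels
## c("censor", "cv", "death") (first level = censoring), and that the
## candidate predictors are numeric; all names are hypothetical.
library(survival)
library(glmnet)

predictors <- c("age", "sbp", "ldl")
B   <- 500
sel <- matrix(0, B, length(predictors), dimnames = list(NULL, predictors))

for (b in seq_len(B)) {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]
  ## finegray() builds the weighted dataset for the subdistribution hazard
  fg   <- finegray(Surv(ftime, fstatus) ~ ., data = boot, etype = "cv")
  x    <- model.matrix(reformulate(predictors), fg)[, -1, drop = FALSE]
  y    <- Surv(fg$fgstart, fg$fgstop, fg$fgstatus)
  cvf  <- cv.glmnet(x, y, family = "cox", weights = fg$fgwt)
  beta <- coef(cvf, s = "lambda.1se")
  sel[b, rownames(beta)[as.vector(beta != 0)]] <- 1
}
colMeans(sel)   # retention frequency per predictor
```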
Please provide more background information. What is the main outcome and its frequency? What is the competing event? What is the burning clinical question? Why is a competing-risks analysis the way to proceed; for example, why would you be interested in modeling the risk of an event that precedes the other, competing event?
Variable selection is no more valid in the presence of competing risks than it is in the simple case, so I’m at a loss as to why you are doing it. And bootstrap selection frequencies just mimic AIC and are also distorted by collinearities. Your problem is in all likelihood more suitable for data reduction (unsupervised learning) followed by fitting a single model on the reduced dimensions.
A bit more background. The goal is to develop a prediction model for the 1-year risk of cardiovascular events in incident cancer patients, with non-cardiovascular death as a competing event. The motivation for this project is that cancer patients are at an elevated risk of CVD, and the influence of specific risk factors on CVD likely differs from that in the general population (as demonstrated by prediction models developed in the general population showing limited performance in cancer populations). The goal is to develop this model and subsequently validate it externally in a different cohort.
With respect to your question regarding the main outcome and its frequency: in this dataset (~50,000 individuals), after one year the main event has occurred in ~700 individuals and the competing event in ~6,500. Missingness is generally low (for incomplete variables, ranging from 2.5% to 10%).
I hope I have provided a bit more clarity, thanks for the help!
Just to complete the discussion of background issues: you can easily censor on unrelated events to get cause-specific hazard functions, but as you know, getting absolute risks involves more than that to obtain cumulative incidence estimates. With competing risks of the type you’ve outlined, a cumulative risk estimate would be, for example, the probability that by 1 year a patient will have a cardiovascular event that precedes a cancer death (or other non-CV death). Is this estimand really interpretable and clinically useful? It’s a kind of conditional probability. Contrast that with a multistate transition model that would estimate unconditional (except for covariates) quantities such as the probabilities of the following (sketched just after the list):
having a CV event or a non-CV death by 1y
having a CV event but not a non-CV death by 1y
having no event within 1y, i.e., having no CV event for 1y and being alive for 1y
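For the simplest version of these (recording only the first event, with no later transitions), the Aalen-Johansen estimator gives the answer nonparametrically. A minimal sketch, with hypothetical variable names:

```r
## Sketch: Aalen-Johansen estimates of 1-year state probabilities.
## Assumes `dat` has time (in years) and `state`, a factor with levels
## c("censor", "cv", "noncv_death"); the first level marks censoring.
library(survival)

fit <- survfit(Surv(time, state) ~ 1, data = dat)
s1  <- summary(fit, times = 1)
## s1$pstate gives P(event-free), P(CV event first), P(non-CV death first)
## at 1 year; a multi-state coxph() (with an id variable and a transition
## structure) extends this to covariate-specific predictions.
s1$pstate
```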
Hmm, yes, I’ve been struggling a bit with the clinical interpretation of Fine-Gray in the context of predictive modelling. So if I understand correctly, using Fine-Gray I would model the risk that ATE is the first event, rather than the risk of ATE in the first year? Given that my competing event (death due to non-ATE causes) is lethal, I’m having trouble understanding how those two things truly differ. (It seems the more I read about Fine-Gray, the less I understand ;). My main goal is a prediction model for the 12-month ATE risk following cancer diagnosis. So then you would say multi-state is more logical?
Assuming I take the multistate approach, I could step away from Fine-Gray if I understand correctly. The question regarding variable selection still remains, then: if I understand correctly, you would suggest unsupervised data reduction rather than backwards selection or LASSO? What aspect of my data determines this suitability?
Hope you don’t mind a few clinically oriented questions/comments from a family physician.
A huge number of clinical prediction tools have been developed over the past few years in many fields, yet only a very small proportion of these will ever see actual clinical application. For this reason, it seems essential as a first step (prior to getting bogged down in statistical considerations) to interrogate, rigorously, the clinical rationale and goals for your project.
Questions:
Who proposed the topic for your project (you or your supervisor)? If your supervisor proposed the project, is he/she a practising oncologist?
Did you do a scoping exercise with several practising oncologists to discuss what you hope to achieve with your project and ask whether/how your results might ultimately be applied in a clinical setting for patients with different types of cancer (and therefore different treatments) and different prognoses?
Comments:
For cancer patients with poor/intermediate prognoses, I’m not sure that a prediction tool like this would see much clinical use (if any). Patients with aggressive cancers (and their physicians) don’t tend to prioritize aggressive control of cardiovascular risk factors, since the purpose of tight control is to help people live to an advanced age;
For cancer patients with better long-term prognoses (especially younger patients), identifying that this group has an elevated future cardiovascular risk (conferred either by their underlying disease or its treatment) could be of interest. However, it’s plausible that many chronic diseases could confer an increased CV risk; the most prominent example is chronic kidney disease (see below). I’d suggest that you investigate whether this is also true in fields other than nephrology (e.g., patients with autoimmune/rheumatologic conditions or other conditions associated with chronic inflammation). It would be important to ask whether researchers in those fields have been able to get a “modified” cardiovascular clinical prediction tool approved for use in practice in their respective fields AND whether there has been clinical uptake of such tools. If CV risk prediction tools already exist for other chronic disease areas (for conditions with better long-term prognoses than cancer) but are not actually being used by clinicians, then it will be important to ask why this is the case (asking clinicians would be the fastest way to get your answer)…
Historically, the list of CV “risk factors” included in clinical risk calculators has remained fairly stable. This website shows the risk factors included in a few different calculators: https://shareddecisions.mayoclinic.org/statindecisionaid/ As you know, the main information that gets entered into these calculators is age, sex, history of treated hypertension (or not), blood pressure, lipid level, smoking status, and diabetes status;
Certain chronic medical conditions like CKD are known to be associated with increased future CV risk, mainly (but perhaps not entirely) because many of the risk factors for CKD (e.g., HTN, DM) overlap with those for CAD. In spite of this association, however, some controversy remains regarding the best approach to primary prevention in this population. Specifically, there is a question as to whether we should continue to treat CKD as a coronary heart disease (CHD) “risk equivalent” (with automatic statin initiation) or whether patients with CKD should instead have their future risk estimated using a risk calculator that incorporates consideration of creatinine level (e.g., the 2023 AHA risk calculator). Note that, until fairly recently, popular CV risk calculators used to guide decisions around statin initiation for primary prevention did NOT include consideration of renal function. If you have a subscription to UpToDate, have a look at the following sections: “Chronic kidney disease and coronary heart disease” (under the heading “CKD as a CHD risk equivalent”) and “Lipid management in patients with nondialysis chronic kidney disease” (note the variability in “Primary Prevention” recommendations…).
Not sure if any of this is helpful; good luck with your research.
Thanks for weighing in on this! The project was proposed by my supervisors, one an internist and one an oncologist. However, this does not make your concerns regarding application any less valid. From what I can gather in my thus-far limited experience as a researcher at the crossroads of oncology and vascular medicine, oncologists are predominantly survival-focused (often rightfully so), and as such cardiovascular risk management is often not top of mind. Consequently, a lot of onco-CVD research is done that is largely ignored or left unimplemented by oncologists. As such, and I fully agree with you on this point, it is vital to scope oncology departments to elucidate their views on utility and the perceived hurdles to application, and this is something we are actively working on.
I’ll look further into your suggested CKD reading, very helpful! Thanks once again!
Correct regarding variable selection. Some of the things that make data reduction the better choice include the ability to explain effects as “themes” rather than arbitrarily selecting one of a set of competing collinear variables, better predictive discrimination, and probably better calibration. Stopping rules are also a bit more straightforward. A sketch of this workflow is below.
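A minimal sketch of that workflow, with hypothetical variable names: cluster the predictors without looking at the outcome, summarize each cluster with a score, then fit one pre-specified model.

```r
## Sketch: unsupervised data reduction before a single pre-specified model.
## `dat` and all variable names are hypothetical.
library(Hmisc)

candidate_vars <- c("sbp", "dbp", "pp", "age", "ldl")

vc <- varclus(~ ., data = dat[, candidate_vars])  # cluster collinear predictors
plot(vc)                                          # inspect the "themes"

## Suppose sbp, dbp, and pulse pressure form a blood-pressure theme:
dat$bp_score <- prcomp(dat[, c("sbp", "dbp", "pp")], scale. = TRUE)$x[, 1]

## ...repeat for the other themes, then fit ONE model on the theme scores
## plus any stand-alone predictors -- with no variable selection step.
```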
Join the club. I spent 20 years trying to understand F-G cumulative incidence. Yes, you would be estimating the probability of ATE as a first event within 12 months. What “ATE as a first event” means is cloudy to me. Multi-state models allow estimation of unrestricted (unconditional) outcomes. This allows you to provide the patient with a series of useful estimates, e.g., “you have a risk of xx of having A without having B, a risk of yy of having A with B,” etc.
There are also non-predictive analyses that help in understanding:
use a nonfatal event as a time-dependent covariate in predicting other events, to see how they co-express (a sketch follows)
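A minimal sketch of that analysis using survival::tmerge; the data frames and column names are hypothetical:

```r
## Sketch: treat the nonfatal CV event as a time-dependent covariate when
## modelling death, to see how the two processes co-express.
## Assumes `base` has one row per patient (id, dtime, dstat, age) and `cv`
## holds CV event times (id, cvtime); all names are hypothetical.
library(survival)

d <- tmerge(base, base, id = id, death = event(dtime, dstat))
d <- tmerge(d, cv, id = id, cvprior = tdc(cvtime))  # 0 before, 1 after CV event

coxph(Surv(tstart, tstop, death) ~ cvprior + age, data = d)
```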