I need to create a model for survival analysis that considers variables available for the whole population and for subsets of the population. I have 3 types of variables:
- variables measured in all patients: age, gender, stage, nodal surgery (Y/N), etc.
- variables only available if patients had nodal surgery: extracapsular extension (ECE, Y/N), number of nodes harvested, number of positive nodes, etc.
- variables only available if the patient had at least 1 positive node after nodal surgery: maximum nodal size.
My outcome variables are time-to-event outcomes (overall survival, progression-free survival, time to local failure).
These variables are known or suspected to be associated with poor time-to-event outcomes in patients with head and neck cancer. However, head and neck cancer comprises multiple primary sites (oropharynx, larynx, oral tongue, etc.), so we don’t know whether they are actually associated with poorer outcomes for a specific primary site. For example, HPV status is well known to be important for oropharynx but not for other sites. In my situation, all patients have the same primary site.
If I only had one variable from the second group (those available only after nodal surgery), I could just create a third level (e.g. ECE yes/no/no nodal surgery) or potentially treat numeric variables as 0 nodes. But if I do that for every variable and include all of them in the model, I will have problems with collinearity.
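To make the coding problem concrete, here is a minimal sketch in plain Python with made-up toy data. The patient records, the column names, and the "not assessed" flag for the numeric variable are all assumptions for illustration, not my real dataset. It shows that every "not assessed" column ends up being an exact copy of the inverse of the nodal surgery indicator.

```python
# Hypothetical toy data: None means "not measured" (no nodal surgery).
patients = [
    {"nodal_surgery": 1, "ece": 1, "positive_nodes": 3},
    {"nodal_surgery": 1, "ece": 0, "positive_nodes": 0},
    {"nodal_surgery": 0, "ece": None, "positive_nodes": None},
    {"nodal_surgery": 0, "ece": None, "positive_nodes": None},
]


def encode(p):
    """Return model columns using the 'extra level' coding described above."""
    return {
        "nodal_surgery": p["nodal_surgery"],
        # ECE as three levels (reference = ECE no): two dummy columns.
        "ece_yes": 1 if p["ece"] == 1 else 0,
        "ece_not_assessed": 1 if p["ece"] is None else 0,
        # Numeric variable: set to 0 when not measured, plus an
        # assumed "not assessed" flag (not part of my original plan).
        "positive_nodes": p["positive_nodes"] or 0,
        "positive_nodes_not_assessed": 1 if p["positive_nodes"] is None else 0,
    }


rows = [encode(p) for p in patients]

# Each "not assessed" column equals 1 - nodal_surgery for every patient,
# so these columns are perfectly collinear with the surgery indicator.
redundant = all(
    r["ece_not_assessed"] == 1 - r["nodal_surgery"]
    and r["positive_nodes_not_assessed"] == 1 - r["nodal_surgery"]
    for r in rows
)
print(redundant)
```

In this toy example the check comes out true, which is the redundancy I am worried about once several subset-only variables are coded this way.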
Has anyone had to perform an analysis in a similar situation?