Multivariable model with variables missing in select patient subsets

I need to create a model for survival analysis that considers variables available for the whole population and for subsets of the population. I have 3 types of variables:

  1. variables measured in all patients: age, gender, stage, nodal surgery (Y/N), etc.
  2. variables only available if patients had nodal surgery: ECE (y/n), number of nodes harvested, number of positive nodes, etc.
  3. variables only available if patient had at least 1 positive node after nodal surgery: maximum nodal size.

My outcome variables are time to event (overall survival, progression free survival, time to local failure).

Those variables are known or suspected to be associated with poor time-to-event outcomes in patients with head and neck cancer. However, head and neck cancer comprises multiple primary sites (oropharynx, larynx, oral tongue, etc.), so we don’t know whether those variables are indeed associated with poorer outcomes for a specific primary site. For example, HPV status is well known to be important for oropharynx but not for other sites. In my situation, all patients have the same primary site.

If I only had one variable from 2), I could just create a third level (e.g. ECE yes/no/no nodal surgery) or potentially use 0 nodes for the numeric variables. But if I do that for all variables and include all of them in the model, I will have problems with collinearity, since every "no nodal surgery" level identifies exactly the same patients.
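To make the collinearity problem concrete, here is a minimal sketch in plain Python. The toy records and all column names (`ece_given_surgery`, `any_positive_node`, and so on) are hypothetical and not taken from the actual dataset. It first shows that a separate "no nodal surgery" level for each type 2 variable yields identical columns, and then shows one possible alternative coding: a single shared indicator per level of availability, with each conditional variable set to 0 where it does not apply. This is only one way to code structurally missing variables, not necessarily the right one for this study.

```python
# Hypothetical toy records; None marks a value that is "not applicable"
# (structurally missing), not one that was simply left unrecorded.
patients = [
    {"age": 61, "nodal_surgery": 1, "ece": 1, "n_pos": 3, "max_node_size": 22.0},
    {"age": 55, "nodal_surgery": 1, "ece": 0, "n_pos": 0, "max_node_size": None},
    {"age": 70, "nodal_surgery": 0, "ece": None, "n_pos": None, "max_node_size": None},
    {"age": 48, "nodal_surgery": 1, "ece": 0, "n_pos": 1, "max_node_size": 9.0},
]

# Naive coding: give every type 2 variable its own "no nodal surgery" level.
# Both dummy columns flag exactly the same patients, so they are duplicates.
naive_ece_na = [1 if p["ece"] is None else 0 for p in patients]
naive_n_pos_na = [1 if p["n_pos"] is None else 0 for p in patients]
columns_identical = naive_ece_na == naive_n_pos_na


def nested_row(p):
    """Code one patient with shared indicators and conditional terms."""
    surgery = p["nodal_surgery"]
    any_pos = 1 if surgery and p["n_pos"] > 0 else 0
    return {
        "age": p["age"],
        # Single shared indicator for "had nodal surgery".
        "nodal_surgery": surgery,
        # Type 2 variables contribute only when nodal surgery was done.
        "ece_given_surgery": p["ece"] if surgery else 0,
        "n_pos_given_surgery": p["n_pos"] if surgery else 0,
        # Single shared indicator for "at least one positive node".
        "any_positive_node": any_pos,
        # Type 3 variable contributes only when a node was positive.
        "max_size_given_positive": p["max_node_size"] if any_pos else 0.0,
    }


design = [nested_row(p) for p in patients]
```

With this coding, the coefficient of `ece_given_surgery` would be read as the effect of ECE among patients who had nodal surgery, and `nodal_surgery` itself would carry the contrast between no nodal surgery and nodal surgery with all conditional variables at 0.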

Has anyone had to perform an analysis in a similar situation?

What is the research question, or what prompted the data collection? Perhaps 2) are outcomes?

Thanks for replying. I edited the question to make that clearer.

Cheers. For the survival outcome, when does the clock start for defining survival times? Maybe the diagnosis date? And are severity of disease, surgery, and primary site wrapped up in the diagnosis?

Survival is measured from surgery, and all variables were collected at the time of surgery or at the clinical consultation prior to surgery.

My first question is whether it makes clinical sense to mix such different situations. If a patient has no nodal dissection, it is sometimes because the tumor is very localized. The question you should ask yourself is whether it makes sense to mix localized and locally advanced tumors. For example, one may wonder what prognostic effect micrometastases in lymph nodes have. It is a very clinically useful question because it helps to determine whether or not radiotherapy is needed. But that variable is simply irrelevant in a localized stage I tumor. Does a prognostic model mixing the variables of locoregional, advanced, and localized cancer make clinical sense? I’m not sure, but they seem to be very different worlds that don’t really make sense to mix.

Poor prognostic variables from more advanced tumors will eclipse more subtle variables from more localized tumors. I would deeply rethink the clinical question you are asking yourselves.
Also remember that you have time-dependent variables, at least those that come from surgery, which are difficult to mix with baseline factors.

Thanks for your input! There are a few questions they are trying to address with this dataset, and one is about the management of the nodes. I will ask the clinicians about the localised vs locally advanced distinction.

The data are from a surgical series. All patients had surgery to the primary, with some also having nodal dissection together with the primary surgery. Survival is being assessed from surgery, so I am not sure which variables you are considering as time-dependent.

I’m sorry, I thought some patients had surgery and others had not, hence my comment about time-dependent variables.
With regard to the other matter, allow me to expand briefly on my previous comment, in case my oncologist’s vision could help you.
In your question, you allude to the fact that you have a database in which patients have received different therapies, which usually depends on their TNM tumor stage. This confronts your statistical analysis with several different clinical scenarios.
In very localized tumors (TNM I) the objective is usually to preserve functionality, and therefore the research tends to seek prognostic factors that guide the aggressiveness of therapies. These are presumably your patients who did not have a certain type of surgery.
In more advanced tumors, the clinical question is usually the opposite: for whom should the treatment be intensified to prevent relapse?
These scenarios are so different that I can hardly think of the reason for trying to mix them, or how a joint analysis could be useful to make some kind of decision.
Therefore, my answer, before looking for complex models, would be to ask clinicians what they specifically need to know about patients.

I think it would be best to analyze each subgroup separately, because they really are very different scenarios, as I said in the previous comment. That way your problem disappears.
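As a rough illustration of the separate-subgroup approach, the sketch below splits patients by variable availability and lists the covariates that could enter each subgroup's model. The toy records, subgroup names, and covariate names are hypothetical; whether node-negative and node-positive patients should be split further is a clinical decision that is assumed here only for illustration.

```python
# Hypothetical toy records; only the fields needed to assign a subgroup.
patients = [
    {"id": 1, "nodal_surgery": 1, "n_pos": 3},
    {"id": 2, "nodal_surgery": 1, "n_pos": 0},
    {"id": 3, "nodal_surgery": 0, "n_pos": None},
]

BASE = ["age", "gender", "stage"]        # measured in all patients
NODAL = ["ece", "n_harvested", "n_pos"]  # only after nodal surgery
POSITIVE = ["max_node_size"]             # only with >= 1 positive node


def subgroup(p):
    """Assign a patient to a subgroup based on which variables exist."""
    if not p["nodal_surgery"]:
        return "no_nodal_surgery"
    if p["n_pos"] > 0:
        return "node_positive"
    return "node_negative"


# Covariates that are defined and can vary within each subgroup.
COVARIATES = {
    "no_nodal_surgery": BASE,
    # n_pos is 0 for everyone in this subgroup, so it is dropped.
    "node_negative": BASE + ["ece", "n_harvested"],
    "node_positive": BASE + NODAL + POSITIVE,
}

# Collect patient ids per subgroup; each subset would get its own model.
groups = {}
for p in patients:
    groups.setdefault(subgroup(p), []).append(p["id"])
```

Each subset could then be passed to whatever survival model is being used, with only the covariates listed for that subgroup. The trade-off is smaller sample sizes and fewer events per model.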