Missing data in logistic regression models

Dear all,

I’d be grateful for your thoughts on designing logistic regression models with incomplete data, where the data are ‘appropriately’ missing (ie not applicable rather than unrecorded).

My team and I are modelling modifications to cancer treatment that occurred during the COVID pandemic - not because of direct COVID infection, but in attempts to reduce the vulnerability of cancer patients by deferring care thought to increase susceptibility to COVID (eg chemotherapy, immunotherapy).

The dataset covers a patient cohort with varying tumour types and multiple treatment modalities (eg chemotherapy, immunotherapy, radiotherapy) - however, not all modalities are applicable or relevant to each patient. For a given patient, each modality is coded as:

  • no.mod: treatment not modified = continued as planned
  • mod: treatment modified in some capacity (eg delayed, dose reduced or omitted entirely)
  • NA: not applicable = that modality was not clinically relevant for that individual patient, so it can be neither modified nor not modified

For a given patient, each modality may or may not be modified independently, eg their chemotherapy was modified but their radiotherapy proceeded as planned.

Goal: a measure of how likely each modality is to be modified (or not), after controlling for other clinical variables (eg age, performance status, tumour type) - ie an odds ratio with 95% CI for each modality.

I had planned to use a binomial logistic regression model with outcome modified vs not modified; however, the (appropriately) missing data preclude fitting such a GLM across the whole cohort.

Any thoughts on how to proceed would be much appreciated!

Representative toy data below:

uid age ps tumour immuno chemo radio
1 25 0 breast mod NA NA
2 30 1 breast no.mod no.mod mod
3 35 1 lung NA mod no.mod
4 40 2 colon mod NA no.mod
5 45 2 colon NA mod no.mod

Data dictionary:
uid: patient unique identifier
age: patient age
ps: Performance status (PS) - validated measure in cancer of fitness or physical reserve
tumour: tumour type
immuno, chemo, radio: modification status for each modality (no.mod, mod or NA, coded as above)

(Will be working in R)
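
For anyone wanting to reproduce the toy data, a minimal sketch of loading it in base R is below. The object names (toy, modalities, counts) are my own choice for illustration, and it assumes the literal string NA is what marks "not applicable" in the real file too.

```r
# Toy data from the post, read as whitespace-separated text.
# read.table() treats the string "NA" as missing by default.
toy <- read.table(text = "
uid age ps tumour immuno chemo radio
1 25 0 breast mod NA NA
2 30 1 breast no.mod no.mod mod
3 35 1 lung NA mod no.mod
4 40 2 colon mod NA no.mod
5 45 2 colon NA mod no.mod
", header = TRUE, stringsAsFactors = FALSE)

# Tabulate each modality column, keeping NA (not applicable) as its own count.
modalities <- c("immuno", "chemo", "radio")
counts <- lapply(toy[modalities], table, useNA = "always")
print(counts)
```

Keeping NA visible in the tabulation makes it easy to check how many patients each modality actually applies to before any modelling.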

Think about which conditional probability you’d like to estimate. For example, do you want P(modified | applicable) or P(modified)? And is it appropriate to think of a 3-level ordinal outcome with NA in the middle? (I doubt it, but it is worth a pause.)
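
To make the distinction concrete, here is a small sketch using only the chemo column of the toy data. The variable names are illustrative, and treating "not applicable" as "not modified" in the second quantity is an assumption made purely to show how the denominator changes.

```r
# Chemotherapy column from the toy data; NA = modality not applicable.
chemo <- c(NA, "no.mod", "mod", NA, "mod")

# P(modified | applicable): the denominator is only the patients
# for whom chemotherapy was clinically relevant.
p_mod_given_applicable <- mean(chemo == "mod", na.rm = TRUE)

# P(modified): the denominator is the whole cohort, so patients with NA
# are counted as "not modified" (an assumption, for illustration only).
p_mod_overall <- mean(chemo %in% "mod")

print(c(conditional = p_mod_given_applicable, overall = p_mod_overall))
```

With these five patients the two quantities are 2/3 and 2/5 respectively, so the choice of denominator matters even before any covariates are involved.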

Hi Prof,

Thanks - yes, I’m interested in P(modified | applicable) for each treatment modality (chemotherapy, immunotherapy etc).

For example:
  • Among patients on chemotherapy, the probability of having that treatment modified is ___ .
  • Among patients on immunotherapy, the probability of having that treatment modified is ___ .

I think this split by modality is important (rather than aggregating all treatments a patient received), as patients often receive therapy with multiple modalities - but these might not all have been modified (eg chemo discontinued, immunotherapy not changed).
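
One possible way to set this up, sketched on simulated data (not the study data): fit a separate logistic model per modality, restricted to the patients for whom that modality was applicable. The helper names (sim_modality, fit_modality), the simulated effect sizes and the choice to enter ps as a numeric covariate are all assumptions for illustration. Separate models also ignore the fact that the same patient can appear in more than one of them.

```r
set.seed(1)  # simulated data only, purely for illustration
n <- 300
sim <- data.frame(
  age    = round(runif(n, 25, 85)),
  ps     = sample(0:2, n, replace = TRUE),
  tumour = factor(sample(c("breast", "lung", "colon"), n, replace = TRUE))
)

# Simulate one modality column: NA when not applicable, otherwise
# "mod"/"no.mod" with the probability of modification rising with age.
sim_modality <- function(d, p_applicable) {
  applicable <- rbinom(nrow(d), 1, p_applicable) == 1
  p_mod <- plogis(-2 + 0.03 * d$age)
  out <- ifelse(rbinom(nrow(d), 1, p_mod) == 1, "mod", "no.mod")
  out[!applicable] <- NA
  out
}
sim$immuno <- sim_modality(sim, 0.5)
sim$chemo  <- sim_modality(sim, 0.7)
sim$radio  <- sim_modality(sim, 0.6)

# One logistic model per modality, fitted only to patients for whom the
# modality was applicable, ie modelling P(modified | applicable, covariates).
fit_modality <- function(d, modality) {
  sub <- d[!is.na(d[[modality]]), ]
  sub$modified <- as.integer(sub[[modality]] == "mod")
  fit <- glm(modified ~ age + ps + tumour, family = binomial, data = sub)
  # Odds ratios with Wald 95% confidence intervals
  or <- exp(cbind(OR = coef(fit), confint.default(fit)))
  list(n = nrow(sub), fit = fit, or = or)
}

results <- lapply(c(immuno = "immuno", chemo = "chemo", radio = "radio"),
                  fit_modality, d = sim)
print(round(results$chemo$or, 2))
```

Note that glm() drops rows with a missing outcome by default, so the explicit subsetting mainly makes the conditioning visible and lets the number of applicable patients (n) be reported alongside each model.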

I don’t think a 3-level ordinal is needed as the NA level would not have a meaningful clinical interpretation.
