Conceptual question about survival analysis vs logistic regression: when is one more appropriate than the other?

AfshanChudry · November 16, 2019, 7:54pm

Hi everyone! I have a hospitalization dataset that spans an 18-year period. Because of this some individuals are hospitalized multiple times. I am attempting to assess the association of diabetes exposure with in-hospital death among patients hospitalized with a salmonella infection, aka everyone in the dataset has this infection (I do not have any info on whether death occurred outside the hospital, I only know if an individual died in the hospital). Additionally I also have the length of stay of each hospitalization event. I am wondering if a survival analysis would be more appropriate in this case (perhaps a Cox proportional hazards model, with length of stay acting as my time variable) rather than a logistic regression model? What benefit conceptually does using a Cox model give?

First, one aspect I am a bit iffy on is that since my outcome is explicitly in-hospital mortality rather than mortality in general, is the censoring framework really appropriate in this scenario? Since censoring implies that the event (death) could have happened but is simply not observed due to loss to follow up/the end of the study, I am wondering if this is even appropriate in my scenario since by definition they can no longer have an in-hospital death once they leave the hospital and I am not trying to (and cannot due to limitations of the data) assess mortality outside the hospital.

Secondarily, I am wondering if a survival analysis would also be inappropriate since some individuals may have had a hospitalization before the time span of the dataset (aka before 2000) and thus I would be missing these previous hospitalizations. Basically I am afraid that my time origin isn’t actually the start of the risk period for some of the individuals.

For both these reasons I am wondering if logistic regression would be the more appropriate method.

Any advice would be much appreciated!

albertoca · November 16, 2019, 9:21pm

I’m a clinician, not a statistician. However, one aspect I have thought of in similar scenarios is that censoring is clearly informative. The one who is discharged from hospital does not die in the hospital, and therefore, that case may not be distinguished from loss to follow-up. I would use logistic regression since the observation period from the onset is expected to be short. Let’s see what other people think.

f2harrell · November 17, 2019, 1:11am

Length of stay is an outcome variable. To analyze it one can consider the outcome variable as time to successful discharge, and censor on death.

AfshanChudry · November 17, 2019, 3:17am

So in that sense it would likely not be appropriate to use it as a time variable to assess in-hospital mortality via Cox PH regression? I suppose the survival analysis would be more appropriate if my outcome of interest was length of stay rather than death?

f2harrell · November 17, 2019, 3:50am

You can fit two models, one for time to death, one to time to successful hospital discharge.
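
As a rough illustration of what "two models" means at the data level, the sketch below derives two (time, event) codings from the same hospitalization records. The records and the function names are made up for illustration, and treating the other outcome as ordinary censoring is itself an assumption that the thread goes on to question.

```python
# Hypothetical records: (length_of_stay_days, died_in_hospital)
records = [(3, False), (5, True), (7, False), (12, True), (14, False)]


def code_for_death(records):
    """Time-to-death coding: event = 1 for in-hospital death, 0 otherwise."""
    return [(los, 1 if died else 0) for los, died in records]


def code_for_discharge(records):
    """Time-to-successful-discharge coding: event = 1 for live discharge,
    0 (censored) for in-hospital death."""
    return [(los, 0 if died else 1) for los, died in records]


death_data = code_for_death(records)
discharge_data = code_for_discharge(records)
print(death_data)
print(discharge_data)
```

Both codings use length of stay as the time variable; only the event indicator changes.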

MSchwartz · November 18, 2019, 1:32pm

Hi,

I have been thinking about this question, and have some questions/comments of my own:

1. What is the distribution of length of stay for these patients? Is it typically on the order of days, weeks or months, and are there presumably high outliers? The basis of my question is that if the length of stay is typically short, say less than 30 days, it might be reasonable to use logistic regression, with the outcome being a dichotomous discharge status (alive or dead) since the relative granularity of time is low.
2. As Frank notes, length of stay is an outcome variable, not an a priori covariate measured at admission. That is, length of stay cannot be known at admission, only once the patient is discharged. Thus, if you use logistic regression, length of stay should not be a covariate, but if you use a Cox model, it would be your time to event metric.
3. Keep in mind that if you should want to generate predictions from the model, the Cox model will give you probabilities of discharge alive (or death) as a function of time, whereas a logistic model will give you the probability of the outcome at discharge as a dichotomous event, irrespective of time. The underlying question that you are trying to answer might influence the approach that you take.
4. Since you have 18 years worth of data, and some patients may have multiple observations, presuming each observation meets your inclusion criteria of having an active salmonella infection at admission, do you need to consider time varying covariates in your model for patients where some baseline characteristic changed from one admission to the next?
5. An extension to my prior point above is the need to consider multiple observations per patient in an appropriate modeling framework. The discussion has focused on Cox and logistic models, but you may need to think about mixed effects variations of both, a robust covariance matrix adjustment, or possibly a GEE model, depending upon the approach you elect to take. Treating multiple observations per patient as independent will overstate your effective sample size, resulting in standard errors that are too small. I am not an expert in mixed effects models, but from what I have seen, there are general rules of thumb suggesting a minimum cluster size of 3, which I suspect your observations do not meet. Thus, a robust covariance adjustment, such as using Frank’s robcov() function in his rms package for R, might make more sense here.
6. With the 18 year time frame for the observations, what changes in the practice of medicine may have occurred over that time frame, such that, for example, the probability of death in the early patients is higher than in the later patients, because of improvements in the treatment of the patients that may mitigate the risks of death over time?

Food for thought…
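
To make the point about repeated observations and standard errors concrete, here is a minimal sketch of a cluster-robust (sandwich) standard error for a logistic model with one binary covariate. It is not robcov() or any SAS option, just a hand-rolled illustration with no small-sample correction; the function names and the toy counts (20 unexposed patients with 2 deaths, 10 exposed with 3 deaths) are invented for the example.

```python
import math
from collections import defaultdict


def logit(q):
    return math.log(q / (1 - q))


def log_or_with_ses(rows):
    """rows: (cluster_id, x, y) with x, y in {0, 1}.
    Returns (log odds ratio for x, model-based SE, cluster-robust SE)."""
    # One binary covariate makes the model saturated, so the MLE is the
    # difference in observed log odds between the two groups.
    n, d = {0: 0, 1: 0}, {0: 0, 1: 0}
    for _, x, y in rows:
        n[x] += 1
        d[x] += y
    p = {g: d[g] / n[g] for g in (0, 1)}
    beta = logit(p[1]) - logit(p[0])

    # "Bread": information matrix A = sum_i p_i (1 - p_i) z_i z_i^T, z_i = (1, x_i)
    a00 = a01 = a11 = 0.0
    for _, x, _ in rows:
        w = p[x] * (1 - p[x])
        a00 += w
        a01 += w * x
        a11 += w * x * x
    det = a00 * a11 - a01 * a01
    i10, i11 = -a01 / det, a00 / det  # second row of A^-1 (A is symmetric)

    # "Meat": sum over clusters of the outer product of summed score vectors
    score = defaultdict(lambda: [0.0, 0.0])
    for cid, x, y in rows:
        r = y - p[x]
        score[cid][0] += r
        score[cid][1] += r * x
    b00 = b01 = b11 = 0.0
    for s0, s1 in score.values():
        b00 += s0 * s0
        b01 += s0 * s1
        b11 += s1 * s1

    # Slope element of V = A^-1 B A^-1
    v11 = i10 * i10 * b00 + 2 * i10 * i11 * b01 + i11 * i11 * b11
    return beta, math.sqrt(i11), math.sqrt(v11)


def make_rows(copies):
    """Toy data; each patient's record is repeated `copies` times in one cluster."""
    rows, pid = [], 0
    for x, n_pat, n_dead in ((0, 20, 2), (1, 10, 3)):
        for i in range(n_pat):
            y = 1 if i < n_dead else 0
            rows.extend((pid, x, y) for _ in range(copies))
            pid += 1
    return rows


single = log_or_with_ses(make_rows(1))
doubled = log_or_with_ses(make_rows(2))
print("one record per patient :", single)
print("two records per patient:", doubled)
```

In this deliberately extreme case, where each patient's record is simply duplicated, the model-based standard error shrinks as if new information had arrived, while the cluster-robust one stays where it was.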

f2harrell · November 18, 2019, 2:39pm

That’s not just food. That’s a feast Marc!

ADAlthousePhD · November 22, 2019, 6:22pm

This is an excellent question.

Here’s an answer that I posted in response to a somewhat similar question

"What about de-coupling it from time, then, and making it something like “discharged alive” or even “discharged alive to home” (to remove patients that were sent to hospice / skilled nursing facilities because they had not truly recovered, had a stroke, etc) or “discharged alive with good functional capacity” (if that could be defined satisfactorily)?

Furthermore, in an acute setting, the risk of censoring or loss-to-follow-up should be extremely low; I should think they’d be able to determine if the patient died in-hospital vs was discharged alive for all patients."

You have hit on a lot of the key issues. In acute settings such as this one, the granularity of the time may be somewhat less important (especially because arguably surviving for 7 days is not better than surviving for 5 days if that meant a few extra days of intensive treatment and suffering for a patient that was already likely not going to survive) and it is very unlikely that you are going to have significant problems with censoring / loss-to-follow-up since presumably you will know for certain the patient’s status at the end of the hospitalization.

I’m a little confused why you’re worried about things that may have happened before the time span of your dataset; this is true of virtually every project, and I don’t see why hospitalizations that happened before your dataset begins are relevant (except as baseline covariates, e.g. “Prior Hospitalization” or something of that nature). This certainly shouldn’t play into your outcome variables.

@MSchwartz brought up some excellent points re: understanding the purpose of your analysis (is this a prediction model? Are you estimating a treatment effect? etc) and the possible presence of recurrent appearances from the same patient (there are reasonable solutions to this whether you decide to treat the outcome for each hospitalization as binary or time-to-event). In your position, I think treating the outcome as a binary “survival to discharge” is fine - even better if you can add some granularity such as “survival to discharge home” to differentiate patients who were “discharged” but in extremely poor health or sent to hospice - and from there you can address the other questions here to figure out your final approach.

f2harrell · November 23, 2019, 2:11pm

Minor technical point: if the analysis includes active patients (not discharged yet) their time to successful discharge will be right-censored in the usual way.

ADAlthousePhD · November 25, 2019, 2:31pm

Agreed; I was (perhaps incorrectly) working under the assumption that analysis would only include patients where the final disposition is already known.

AfshanChudry · November 28, 2019, 3:30am

@MSchwartz Hi Marc, thank you for the detailed reply! And apologies for taking so long to reply; I had a number of exams the last two weeks and was quite busy!

1. Most are within a week or two (~90% have a length of stay of 14 days or less and 98% have a length of stay of 30 days or less).
2. That makes sense!
3. Yes, that’s something I’ve been ruminating over! Originally I was simply interested in understanding the risk of mortality among diabetic vs non diabetic salmonellosis patients. I’m honestly unsure which would be the better aspect to look at. Since this is for my university requirements I believe my advisor recommended a Cox model as more of a learning exercise for my own edification.
4. Since the majority of patients only appear to have at most perhaps a month or two in between their hospitalizations I was going to simply treat the covariates as time independent (HIV status and diabetes status in particular). I somewhat doubt many would be diagnosed with either in such a short time and in fact I’ve noticed that some patients seem to be missing their diabetes/HIV status (for example, one patient is marked as diabetic for their first and third hospitalization but not their second one). Since the data is based on billing codes I would imagine in those cases the clinician or medical tech likely simply neglected to re-input the diabetes/HIV ICD code. Thus I ended up marking all the hospitalizations as diabetic or HIV positive if they had at least one hospitalization with the relevant code.
5. I had considered a mixed effects model; however, as you noted, around 94% of the patients only have one hospitalization, thus the clusters would be too small. I am using SAS to do the analysis (unfortunately haha) and was planning on using the COVSANDWICH option, which requests the robust sandwich estimate developed by Lin and Wei for the covariance matrix, thus giving robust standard errors.
6. This was actually something I was thinking of looking into as an effect modifier/interaction of some sort. I was thinking of using year of admission and perhaps breaking it up into categories of two years each.

Thank you for your reply! Very helpful!
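
The "ever coded" rule described above for diabetes/HIV status can be sketched as follows. The record layout, patient IDs, and function name are hypothetical. Note that this rule also back-fills admissions that occurred before the first coded one, which is only reasonable under the stated assumption that the condition was present throughout.

```python
from collections import defaultdict

# Hypothetical records: (patient_id, admission_number, diabetes_code_present)
admissions = [
    ("A", 1, True), ("A", 2, False), ("A", 3, True),
    ("B", 1, False),
    ("C", 1, False), ("C", 2, False),
]


def carry_ever_coded(admissions):
    """Flag every admission of a patient if any of their admissions has the code."""
    ever = defaultdict(bool)
    for pid, _, coded in admissions:
        ever[pid] = ever[pid] or coded
    return [(pid, k, ever[pid]) for pid, k, _ in admissions]


harmonized = carry_ever_coded(admissions)
print(harmonized)
```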

AfshanChudry · November 28, 2019, 3:54am

Apologies for the double posting but I was unsure of how to reply to multiple posts.

@ADAlthousePhD Thank you for the post Andrew! Would you then suggest simply conducting a logistic regression? Would the outcome at that point be death, discharge home, discharge to hospice etc? In terms of purpose of the analysis, I feel that my question is more of an etiology one rather than a predictive one. I am trying to understand whether diabetes contributes to mortality among salmonellosis patients and if so how strong is the association. Originally I was planning on a logistic regression; however, my advisor recommended a Cox model simply as a learning exercise for my own edification since I have run binary logistic analyses before. Would it be possible to integrate your multiple outcomes idea with a Cox model (i.e. have more than two options: death, discharge home, discharge to hospice, discharge to long term care facility etc.)?

One thing I have realized after reading more literature is that I may actually have an informative censoring issue as @albertoca mentioned earlier. Since being discharged means the patient is no longer at risk of in-hospital death that means the censoring is in fact informative. Interesting paper on this issue:

Resche-Rigon M, Azoulay E, Chevret S. Evaluating mortality in intensive care units: contribution of competing risks analyses. Crit Care. 2006;10(1):R5. doi:10.1186/cc3921

Thus I am thinking I may need to use a competing risks model such as Fine-Gray regression.
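
A small numeric sketch of why this matters: below, the same toy data are summarized two ways, once with one minus Kaplan-Meier treating live discharge as censoring, and once with a cumulative incidence estimate that treats discharge as a competing event. This is not Fine-Gray regression, only the nonparametric quantity behind the competing-risks idea. The ten patients and the function name are invented, and the code assumes every patient's final status is known.

```python
# Hypothetical data: (day of exit, status) for ten hospitalizations
data = [
    (2, "death"), (3, "discharge"), (4, "discharge"), (5, "discharge"),
    (5, "discharge"), (6, "discharge"), (7, "discharge"), (8, "discharge"),
    (9, "death"), (12, "discharge"),
]


def death_risk_estimates(data):
    """Return (naive risk, cumulative incidence) of in-hospital death
    at the end of follow-up."""
    km_death = 1.0    # Kaplan-Meier "survival" with discharge treated as censoring
    in_hospital = 1.0  # probability of no event of either type so far
    cif = 0.0          # cumulative incidence of death
    for t in sorted({day for day, _ in data}):
        at_risk = sum(1 for day, _ in data if day >= t)
        deaths = sum(1 for day, s in data if day == t and s == "death")
        exits = sum(1 for day, _ in data if day == t)  # deaths + discharges
        cif += in_hospital * deaths / at_risk
        km_death *= 1 - deaths / at_risk
        in_hospital *= 1 - exits / at_risk
    return 1 - km_death, cif


naive_risk, cif_risk = death_risk_estimates(data)
raw_proportion = sum(1 for _, s in data if s == "death") / len(data)
print(naive_risk, cif_risk, raw_proportion)
```

With complete follow-up the cumulative incidence matches the raw proportion of in-hospital deaths, while the naive estimate is considerably larger because it behaves as if discharged patients remained at risk of dying in hospital.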

There was a rebuttal paper to the above article where the author argues that survival analyses are not necessarily appropriate for ICU mortality studies and that logistic regression should be used instead. I’m curious to hear everyone’s thoughts on this. I’m not sure if this would really apply to my scenario or not, since I do not have ICU data and instead have inpatient hospitalizations. Link to the paper:

Schoenfeld, D. Survival methods, including those using competing risk analysis, are not appropriate for intensive care unit outcome studies. Crit Care 10, 103 (2005) doi:10.1186/cc3949

Thank you to everyone for your help!

ADAlthousePhD · December 2, 2019, 4:07pm

I like this Schoenfeld paper, and IMO it still applies to your situation - whether it is “ICU” or “inpatient hospitalizations” the same concerns raised within apply; your question seems most concerned about whether people survive an acute incident. As discussed in the Schoenfeld article, there is probably not much practical difference between surviving 7 days versus 9 days in the hospital (and a slightly longer ‘survival’ may not be the best outcome if it came from an escalation to more intensive therapies that were ultimately unsuccessful) so it is probably better not to worry as much about the specific time of the event, and instead focus on whether the patients survived to achieve a good outcome. Which brings me to…

First, I’m not a fan of your advisor suggesting that you do a statistical method simply for your own education - I’d rather they discuss the best method for the study question, and choose accordingly.

Second, I was indeed suggesting logistic regression where the outcome was a binary “survival to discharge home” (or whatever represents a “good” outcome to you - my suggestion is basically just trying to filter out people who were technically alive at the time of “discharge” but were being sent to hospice care or some other negative outcome; in a few studies I have worked on, a “discharge to hospice” was considered equivalent to death). If you so choose, it is possible to perform a logistic regression with a multi-level ordinal outcome (I think Frank has some examples of the proportional-odds model elsewhere in these forums) if you create an ordered hierarchy (e.g. “survival to discharge home” being the best; “discharge to long term care facility” in the middle; “in hospital death” being the worst).

The Fine-Gray regression that you referred to for competing risks also is a time to event analysis, and as discussed above, I’m not sure that time to event is necessary or appropriate here. Frank makes a good point that when there are significant concerns about loss to follow up or administrative censoring then the time to event has an advantage, but if you truly know the final disposition of every patient in the dataset (e.g. you know whether they survived the hospitalization or not) then I think that concern is modest or even nonexistent.

The time to event analysis clearly is more appropriate if your question is how diabetes affects the long-term survival of these patients, but as I understood it was more a question of whether it impacts their ‘acute’ outcome, where IMO ‘time’ is less important than the final status of the patient.

MSchwartz · December 2, 2019, 7:40pm

Hi Afshan,

I was traveling over the holiday here, so just getting back to you now.

I agree with the comments that Andrew makes regarding your advisor and the use of a Cox model. I don’t really see a role for it here, with the relatively short term outcomes.

Having been involved in the development of short term, in-hospital/30 day outcome models, I would advocate the use of logistic regression here for the binary outcome of discharge alive versus discharge dead.

One question, which I don’t see answered above, what is the incidence of death at discharge for these patients?

Andrew raises the notion of a proportional odds model, to handle multiple, exclusive and possibly ordered outcomes. However, that approach has its own assumptions and if violated, you may need to think about a multinomial model instead, if you go down that path.
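
One informal way to eyeball the proportional odds assumption, before fitting anything, is to compare the exposure odds ratio at each cutpoint of the ordered outcome; under proportional odds these should be roughly similar. The sketch below does that for a three-level outcome. The counts are invented purely for illustration and the function name is hypothetical; this is a descriptive check, not a formal test.

```python
# Ordered outcome levels, best to worst
LEVELS = ["discharged home", "long-term care", "in-hospital death"]

# Made-up counts per level, in LEVELS order
counts = {
    "no diabetes": [80, 15, 5],
    "diabetes": [60, 25, 15],
}


def cumulative_odds_ratios(exposed, unexposed):
    """Odds ratio (exposed vs unexposed) of being at or beyond each
    cutpoint on the 'worse' side of the ordered scale."""
    def odds_worse(c, cut):
        return sum(c[cut:]) / sum(c[:cut])

    return [
        odds_worse(exposed, cut) / odds_worse(unexposed, cut)
        for cut in range(1, len(exposed))
    ]


ors = cumulative_odds_ratios(counts["diabetes"], counts["no diabetes"])
for cut, value in zip(LEVELS[1:], ors):
    print(f"worse than or equal to '{cut}': OR = {value:.2f}")
```

If the cutpoint-specific odds ratios differ a lot, a single proportional-odds coefficient may summarize the exposure effect poorly.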

The short time frame between admissions for patients with multiple admissions is curious. Are these actually unique infections, or a recurrence of the prior infection that was not fully resolved? If the former, does it suggest that a re-admission is perhaps a surrogate for some other set of behavioral or environmental characteristics for these patients that put them at risk for a repeat infection? If the latter, is it a surrogate for a set of circumstances around the original care for the patient, perhaps including choices made by the patient and/or care team relative to overly conservative or incomplete care, or the patient being discharged prematurely AMA?

The answer to that question might affect how you model the re-admissions and the covariates that you should consider.

If you do include the repeat admissions, as opposed to the primary admissions only, I would be sure to use re-admission (Yes/No) as a covariate to differentiate those patients.

You raise the notion of using year of admission as a covariate, which I would support. I would use year as a continuous variable, and not break it into groups. The trend, if one exists, may be subtle, and you may even consider using a cubic spline transformation on year, to handle a non-linear (non-straightline, possibly even non-monotonic) relationship.
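
For readers unfamiliar with spline terms, here is a minimal sketch of a restricted cubic spline basis (linear beyond the outer knots), which is one common choice for this kind of covariate. Whether a restricted spline is what was meant above is an assumption, as are the knot locations and the coding of year as years since the start of the dataset.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for a scalar x.
    Returns [x, s_1(x), ..., s_{k-2}(x)] for k knots; the nonlinear terms are
    zero below the first knot and linear in x above the last knot."""
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def pos3(u):
        return max(u, 0.0) ** 3

    gap = t[-1] - t[-2]
    scale = (t[-1] - t[0]) ** 2  # keeps the nonlinear terms on the scale of x
    cols = [float(x)]
    for j in range(k - 2):
        term = (
            pos3(x - t[j])
            - pos3(x - t[-2]) * (t[-1] - t[j]) / gap
            + pos3(x - t[-1]) * (t[-2] - t[j]) / gap
        )
        cols.append(term / scale)
    return cols


# Hypothetical knots for "years since first year of data" over an 18-year span
KNOTS = (2.0, 8.5, 15.0)
design = [rcs_basis(year, KNOTS) for year in range(18)]
print(design[0], design[10], design[17])
```

Each row of `design` would replace the single year column in the regression model, so the fitted year effect can bend between the knots while remaining linear in the tails.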

On the data coding issues that you raise, if this information is coming from an administrative dataset, there are many failings in those, since they are coded to maximize reimbursement, and not to accurately reflect all comorbidities at admission, because there are a finite number of such codes that can be used. The presence or absence of a specific code over time, for patients with multiple admissions, may be a reflection of the coding system in use, and the heuristics used to maximize reimbursement, rather than due to oversight. There can also be problems in differentiating comorbidities versus complications. Keep that in mind as you review the data.

Frank raises an interesting point regarding active, but not yet discharged patients. As did Andrew, I had not considered this, but would recommend that you only include patients where the discharge status is known in the setting of logistic regression.

f2harrell · December 2, 2019, 10:17pm

To an earlier point there is nothing wrong with using time-to-event analysis with 9-day follow-up. There’s just not much to gain unless you have ≥ 1 censored observation or unless you want to estimate P(T > t) where t < 9.
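
The point about censoring can be seen in a few lines: with no censored observations, the Kaplan-Meier estimate of P(T > t) is just the observed proportion of event times beyond t, and the two only part ways once some observation is censored. The times below are made up, and `km_survival` is a bare-bones illustration rather than any package's implementation.

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of P(T > t).
    events[i] is 1 if the event was observed at times[i], 0 if right-censored."""
    surv = 1.0
    event_times = sorted({tm for tm, e in zip(times, events) if e == 1})
    for u in event_times:
        if u > t:
            break
        at_risk = sum(1 for tm in times if tm >= u)
        d = sum(1 for tm, e in zip(times, events) if tm == u and e == 1)
        surv *= 1 - d / at_risk
    return surv


times = [2, 3, 3, 5, 8, 9]  # hypothetical days to event

# No censoring: KM equals the simple proportion with time > 4
uncensored = km_survival(times, [1, 1, 1, 1, 1, 1], t=4)
proportion = sum(1 for tm in times if tm > 4) / len(times)

# One observation censored at day 3: KM now differs from that proportion
censored = km_survival(times, [1, 0, 1, 1, 1, 1], t=4)
print(uncensored, proportion, censored)
```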

EpiLearneR · October 27, 2022, 5:18pm

Dear Prof,
If we take time to successful hospital discharge as the outcome, will death be considered as a competing risk?

f2harrell · October 27, 2022, 5:49pm

No, death would be considered a failure to discharge home. Need to check that death doesn’t bring in some dependent censoring issue.