Warning: not a statistician, so apologies if this is a dumb question!
I was wondering if you could help me out with the following.
When using hospital length of stay (LOS) as a primary endpoint, the power calculation often treats ‘days in hospital’ as a continuous variable. However, it seems to me that it should be treated as a discrete variable, since patients get discharged on a specific day, not at a particular time in the day. If the longest LOS is 5 days, for example, the variable can only take the values 1, 2, 3, 4, or 5.
It seems to me that the information lost by using a discrete variable should lead to a larger required sample size. We lose resolution/granularity compared to, say, LOS in hours, which intuitively (to me) should be accounted for in power calculations.
Am I incorrect? Thank you for your time!
Even though an individual case has a discharge value that is discrete, when a number of cases are combined, it is reasonable to me that a mean is computed as a sufficient statistic to summarize the information in your sample.
The mean may be less reasonable if the distribution is not symmetrical, however.
This section from Ch. 5 of Biostatistics for Biomedical Research: Power and Sample Size discusses how to calculate precision for a continuous variable, which is more relevant and useful.
In addition, it would be good for some experts in LOS analysis to weigh in on which time resolution (days vs hours) is relevant. A primary reason for measuring LOS in hours is to break ties. Statistical methods work better with fewer ties. For power calculations it probably doesn’t matter much, but for the final analysis more accuracy will be found with fewer ties.
There are a few things to consider here:
Length of stay (LOS) is rarely, if ever, normally distributed, and is usually highly right-skewed. As a result, a two-sample t-test (presuming two groups) on the untransformed LOS data, whether in days or hours, with a null hypothesis of no difference in the mean LOS between the two groups, and an alternative hypothesis of a difference in the means of some magnitude, may not be of value in that setting. Median LOS may be more informative.
Are you likely to have any patients with a 0 LOS, presuming LOS in days, where the patients were admitted and discharged on the same day? Or will patients in this setting always have a minimum of at least 1 day LOS?
Will you know the LOS for all patients, or might some patients die before discharge (in-hospital death), which can skew the distribution of LOS and require you to deal with censoring and/or perhaps to consider a multi-state model approach? In this setting, you would not be comparing the mean or median LOS between the two groups, but the probability of being discharged alive on a given day.
Will you have the admission and discharge times for every patient, besides the date, in order to calculate LOS in hours versus days for all? Or, might you end up missing the time for some and only know the date?
If you use LOS in hours, how might administrative/logistics related issues affect that time interval in a material and biased fashion, where those issues are not relevant to your underlying question? For example, patients in one group might tend to be discharged in the morning, while patients in the other group tend to be discharged in the afternoon, leading to a longer LOS in hours in the latter group.
What is the magnitude of the LOS difference that is relevant to your question? Would a difference of 8 or 10 hours be important, or are you really looking for differences on the order of multiple days? Not that you could not convert large numeric values of hours back to days, but it may be relevant to the approach that you take here.
The answers to the above may lead you towards or away from a particular analytic approach that will impact how you conduct your power/sample size analysis.
For example, one approach is to use log(LOS), to attempt to transform the LOS data to approximately normal for subsequent analysis using a two-sample t-test on the log transformed data, and then back-transform the results for interpretation. However, that is problematic if you will have patients with 0 LOS because log(0) is undefined.
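To make the back-transformation step concrete, here is a minimal sketch using only the Python standard library. The function name geometric_mean_ratio is hypothetical, and the interval uses a normal quantile rather than a t quantile, which is a large-sample simplification on my part. The point is that a difference in mean log(LOS) back-transforms to a ratio of geometric means, not a difference in days, and that any 0 LOS value stops the calculation.

```python
import math
from statistics import NormalDist, fmean, variance


def geometric_mean_ratio(los_a, los_b, level=0.95):
    """Compare two groups on the log scale and back-transform.

    Returns (ratio, lower, upper): the ratio of geometric mean LOS
    (group A over group B) with an approximate confidence interval.
    Hypothetical helper; uses a normal quantile as a simplification.
    """
    if min(min(los_a), min(los_b)) <= 0:
        # log(0) is undefined, so 0 LOS values cannot be log-transformed
        raise ValueError("log(LOS) is undefined for LOS <= 0")

    log_a = [math.log(v) for v in los_a]
    log_b = [math.log(v) for v in los_b]

    # Difference in means on the log scale, with an unequal-variance standard error
    diff = fmean(log_a) - fmean(log_b)
    se = math.sqrt(variance(log_a) / len(log_a) + variance(log_b) / len(log_b))

    # Back-transform: exp(difference of log means) is a ratio of geometric means
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.exp(diff), math.exp(diff - z * se), math.exp(diff + z * se)
```

For instance, geometric_mean_ratio([2, 4, 8], [1, 2, 4]) returns a ratio of about 2, meaning the geometric mean LOS in the first group is about twice that in the second.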
Alternative analytic methods might use non-parametric tests to assess location shifts in the distributions, or perhaps use a two-sample t-test on the ranks of the data.
Another approach may be to use count based models (e.g. Poisson or negative binomial) to compare the groups in a regression setting. If you may have patients with 0 LOS, then zero inflated count models may be attractive, especially if the 0 LOS patients may be due to a differing process leading to “excess” 0s.
In these settings, you would likely need to conduct Monte Carlo simulations, generating data from relevant distributions, to perform your power/sample size assessments. A canned approach, such as the one commonly used for a two-sample t-test on normally distributed, continuous data, would not apply.
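As a rough illustration of such a simulation, here is a sketch using only the Python standard library. Everything in it is an assumption chosen for illustration: LOS in hours is drawn from a lognormal distribution with a median of 96 hours, the treatment multiplies LOS by a fixed ratio, days are obtained by rounding hours up (so everyone has at least 1 day), and the test is a Wilcoxon rank-sum test using a tie-corrected normal approximation. The function names are hypothetical. You would replace the data-generating step with distributions that reflect your own setting.

```python
import math
import random
from statistics import NormalDist


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        mid_rank = (i + 1 + j) / 2     # average of ranks i+1 .. j
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t**3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:                     # all values identical: no evidence either way
        return 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_power(n_per_group, ratio, in_days, n_sims=500, alpha=0.05, seed=1):
    """Estimate power by simulation under the assumed (hypothetical) LOS model."""
    rng = random.Random(seed)
    mu, sigma = math.log(96), 0.6      # assumed: median 96 hours, right-skewed
    rejections = 0
    for _ in range(n_sims):
        control = [rng.lognormvariate(mu, sigma) for _ in range(n_per_group)]
        treated = [rng.lognormvariate(mu + math.log(ratio), sigma)
                   for _ in range(n_per_group)]
        if in_days:
            # Coarsen hours to whole days, rounding up (minimum of 1 day)
            control = [math.ceil(h / 24) for h in control]
            treated = [math.ceil(h / 24) for h in treated]
        if rank_sum_p_value(control, treated) < alpha:
            rejections += 1
    return rejections / n_sims
```

Calling simulate_power(50, 0.8, in_days=False) and simulate_power(50, 0.8, in_days=True) with the same seed gives a direct look at how much power is lost by recording days instead of hours under these assumptions. Setting ratio=1.0 checks that the rejection rate stays near alpha.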
Food for thought…
Wow Marc. Lots of incredibly valuable information. I really like the idea of emphasizing the probability of being discharged as a function of time since admission, in a multi-state model.
Incredibly useful - thank you!
Everybody spends at least 1 day in hospital. As you pointed out, the distribution would be right-skewed. We had thought about transforming the data, but your 2nd suggestion (multi-state model) seems like a really good idea. We are actually looking at ‘number of days with chest drain in place’ (evaluating 2 different types of drainage), but some patients go to surgery and for those the primary endpoint cannot be adjudicated.
Finally, it does not make clinical sense to look at a more granular endpoint (hours with drain in place), as it would only make a difference for the patient if it leads to earlier discharge, which only happens on a daily basis. We prefer ‘duration of chest tube drainage’ because our intervention affects this directly, whereas LOS is affected by a million other things (e.g. availability of a rehab facility etc…).