Sample size in mixed model study (exploratory prognostic factor study)

Lingxiao_Chen · March 28, 2019, 6:01am

Dear all,

I have run into a problem while planning my study design.

The study is an exploratory prognostic factor study. There are 10 factors at the individual patient level and 2 factors at the hospital level. I also plan to test one factor as an interaction term.

I have two outcomes. One is length of stay, which will be modeled with negative binomial regression. The other is the reoperation rate, which will be modeled with survival analysis, with death as a competing risk.

I am trying to work out how to confirm that I have a large enough sample to explore these factors, or at least to get reasonably accurate estimates. For length of stay this seems OK, because I can use all of my data after imputation. For the reoperation rate, things are much less tractable.

I have two questions. The first is whether I should fit a separate model for each time point (for example, the 1-, 5-, and 9-year reoperation rates). Maybe I should fit a single model if the proportional hazards assumption holds? I am not sure.

The second is that the sample size differs across time points. For example (here I use the data from JAMA. 2018;320(16):1659-1669), the crude 1-year reoperation risk is 1.4% (1275/90215), the 5-year risk is 2.9% (1508/52715), and the 9-year risk is 3.4% (240/6981). If I run the model for the 9-year reoperation rate, the effective sample size is only 240 events. Considering the possible non-linearity, categorical outcomes and interaction term, this sample size seems too small. I am wondering whether I should do some kind of variable selection (for example, LASSO or backward elimination). If not, what is the right way for me to run the model?
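To make the arithmetic above concrete, here is a minimal Python sketch that recomputes the crude risks from the quoted counts and divides the number of events by a count of candidate parameters. The parameter count of 13 is an assumption (one parameter for each of the 10 patient-level and 2 hospital-level factors, plus one interaction); nonlinear terms or multi-category factors would raise it. The function names are hypothetical.

```python
# (events, patients) quoted in the post for each follow-up horizon in years.
counts = {1: (1275, 90215), 5: (1508, 52715), 9: (240, 6981)}

# Assumption: one parameter per factor (10 patient-level + 2 hospital-level)
# plus one interaction term. Splines or multi-level categories would add more.
n_params = 10 + 2 + 1


def crude_risk(events, n):
    """Crude proportion of patients with the event."""
    return events / n


def events_per_parameter(events, k):
    """Number of events available per candidate model parameter."""
    return events / k


for years, (events, n) in counts.items():
    print(
        f"{years}-year: crude risk = {crude_risk(events, n):.1%}, "
        f"events per parameter = {events_per_parameter(events, n_params):.1f}"
    )
```

Under that assumption the 9-year horizon has roughly 18 events per parameter, and fewer once any nonlinear terms are added.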

Thank you for your time, and I hope you have a great day!

Best wishes,
Lingxiao

f2harrell · March 28, 2019, 12:36pm

Just a side comment regarding length of stay. Can you convince us that the negative binomial distribution fits such data? And how does it handle censoring due to death? I’ve found that the Cox proportional hazards model fits length of stay quite nicely on occasion.

Variable selection is not reliable so I would try to avoid it, whether lasso or other methods.

For re-operation, recurrent event analysis may be applicable. Some of the best information about this is in Therneau and Grambsch’s book Extending the Cox Model and in Terry Therneau’s vignettes inside the R survival package.

Lingxiao_Chen · March 28, 2019, 11:21pm

Thank you so much, Frank!

First, length of stay here means length of hospital stay, because of data availability. As for the negative binomial distribution, the data are right-skewed and the SD (10.03) is larger than the mean (3.84). I think the data are overdispersed, so I chose the negative binomial distribution. Death during hospitalization is rare (0.8%, 109/13599). Do you think I should use the Cox proportional hazards model as the main analysis method? Or should I use it as a sensitivity analysis to test the influence of death? Or is it OK to use the negative binomial distribution, considering that the death rate is small?
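As a rough illustration of the overdispersion argument, the sketch below compares the variance with the mean using the summary statistics quoted above, and backs out a method-of-moments size parameter k under the common parameterization Var(Y) = mu + mu^2/k. That parameterization and the moment-matching step are assumptions made for illustration; they show how far the data are from a Poisson variance, not whether the negative binomial actually fits.

```python
mean_los = 3.84   # mean length of stay quoted in the post (days)
sd_los = 10.03    # standard deviation quoted in the post (days)

variance = sd_los ** 2

# A Poisson model implies variance == mean, so a ratio far above 1
# signals overdispersion.
dispersion_ratio = variance / mean_los

# Method-of-moments size parameter, assuming Var(Y) = mu + mu**2 / k.
# Only defined when the variance exceeds the mean.
k_size = mean_los ** 2 / (variance - mean_los)

print(f"variance = {variance:.2f}")
print(f"variance / mean = {dispersion_ratio:.1f}")
print(f"moment-matched size k = {k_size:.3f}")
```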

Second, I will not do variable selection based on your suggestion. But I am still worried about the sample size issue. Maybe it is ok for an exploratory study?

Third, the proportion of patients who underwent 3 or more surgeries is very small (0.35%). I will read more materials about recurrent event analysis based on your suggestions.

f2harrell · March 29, 2019, 11:57am

For assessing model fit with respect to distributional assumptions, do either an overall analysis like the following, or a stratified analysis, e.g., stratifying on quartiles of predicted values or by the most important predictor. Plot the empirical cumulative distribution function from the raw length of stay values, then superimpose the distribution implied by the negative binomial. For the Cox model, you just look for parallelism of log-log(1 - Kaplan Meier estimates).
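A minimal sketch of the overall comparison described above, in plain Python and printed as a table rather than plotted. The length-of-stay values are hypothetical, and the negative binomial is fitted by moment matching (mean mu, size k, with Var = mu + mu^2/k) instead of maximum likelihood, purely to keep the example short.

```python
import math
from statistics import mean, pvariance


def nb_cdf(y, mu, k):
    """P(Y <= y) for a negative binomial with mean mu and size k."""
    p = k / (k + mu)
    total = 0.0
    for j in range(y + 1):
        log_pmf = (
            math.lgamma(j + k) - math.lgamma(k) - math.lgamma(j + 1)
            + k * math.log(p) + j * math.log(1 - p)
        )
        total += math.exp(log_pmf)
    return total


def ecdf(data, y):
    """Empirical cumulative distribution function evaluated at y."""
    return sum(v <= y for v in data) / len(data)


# Hypothetical lengths of stay in days (illustration only, not study data).
los = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9, 14, 30]

mu = mean(los)
var = pvariance(los)
k = mu ** 2 / (var - mu)  # moment-matched size; requires var > mu

print("days  empirical  neg. binomial")
for y in (1, 2, 3, 5, 10, 20, 30):
    print(f"{y:4d}  {ecdf(los, y):9.3f}  {nb_cdf(y, mu, k):13.3f}")
```

Large gaps between the two columns, especially in the right tail, would suggest that the negative binomial does not describe the data well. The same comparison can be repeated within strata.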

Regarding the sample size, you can use ridge regression (quadratic penalized maximum likelihood estimation) or an initial unsupervised learning (data reduction) step, among other approaches.
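To show what a quadratic penalty does to an estimate, here is a deliberately simplified sketch: ridge regression for a single-predictor linear model with no intercept, where the penalized estimate has a closed form. This is a linear stand-in for illustration only, not penalized maximum likelihood for a Cox or negative binomial model; the data and the function name are hypothetical.

```python
def ridge_slope(x, y, lam):
    """Ridge estimate of b in y = b*x, minimising
    sum((y - b*x)**2) + lam * b**2, which gives b = Sxy / (Sxx + lam)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


# Hypothetical centred data (illustration only).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-3.9, -2.1, 0.2, 1.8, 4.0]

# lam = 0 reproduces ordinary least squares; larger penalties pull the
# estimate toward zero without ever dropping the variable.
for lam in (0.0, 1.0, 10.0):
    print(f"lambda = {lam:5.1f}  slope = {ridge_slope(x, y, lam):.3f}")
```

The point of the sketch is that shrinkage keeps every predictor in the model and reduces the size of its coefficient, in contrast to selection methods that discard predictors.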

timdisher · March 29, 2019, 12:07pm

Wouldn’t death be informative censoring and/or a competing risk?

f2harrell · March 29, 2019, 12:13pm

Great question. I wish I remembered which authors to credit for this (I think it was a paper in J Clin Epi), but I like to change the target of the analysis to time to successful discharge and act as if a death on day 3 of hospitalization is an interruption in this process, treating time until successful discharge home as 3+ (that is, right-censored at day 3). With this new dependent variable, it is possible that there is informative censoring but I think (and hope) not. I would enjoy responses from others who delve into informative missingness details.

If you don’t like the new dependent variable or feel that informative censoring is too big a problem, you can use the Cox model to analyze the regular length of stay variable with no censoring. Aside: the Cox model is excellent for analyzing hospital costs also.

Lingxiao_Chen · March 31, 2019, 10:00pm

Thank you, Frank and timdisher!

I will use the Cox model to analyze the regular length of stay variable (I did not know the Cox model could also handle this situation well), and I plan to delete the records of patients who died during hospitalization, considering that the rate is very small (0.8%).

For the sample size issue, my understanding of your suggestion is to use ridge regression (choosing lambda by cross-validation) to shrink the parameter estimates rather than selecting variables by other methods. Is that right?

f2harrell · March 31, 2019, 10:15pm

Right, and I wouldn’t do a conditional (on surviving hospitalization) analysis. I’d either censor on death or treat the time until death as a complete length of stay.
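The two codings mentioned above can be sketched as follows, with a small hand-written Kaplan-Meier estimator to show how in-hospital deaths stay in the risk set until they occur. The records, labels, and function name are hypothetical, and a real analysis would use an established survival package rather than this illustration.

```python
# Hypothetical stays: (days in hospital, outcome). Illustration only.
stays = [
    (2, "discharged"), (3, "died"), (3, "discharged"), (5, "discharged"),
    (8, "discharged"), (8, "died"), (12, "discharged"),
]

# Coding A: event = discharge alive; a death on day t is censored at t ("t+").
censor_on_death = [(t, 1 if outcome == "discharged" else 0) for t, outcome in stays]

# Coding B: time until death counts as a complete length of stay (no censoring).
death_as_complete = [(t, 1) for t, _ in stays]


def kaplan_meier(data):
    """Return [(time, estimated probability of still being in hospital
    just after that time)] for records of the form (time, event)."""
    surv, curve = 1.0, []
    for t in sorted({u for u, e in data if e == 1}):
        at_risk = sum(1 for u, _ in data if u >= t)
        events = sum(1 for u, e in data if u == t and e == 1)
        surv *= 1 - events / at_risk
        curve.append((t, surv))
    return curve


for label, data in (("censor on death", censor_on_death),
                    ("death as complete stay", death_as_complete)):
    print(label)
    for t, s in kaplan_meier(data):
        print(f"  day {t:2d}: P(still in hospital) = {s:.3f}")
```

Either coding uses every patient. Neither conditions on surviving the hospitalization, which is what deleting the deaths would do.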

Lingxiao_Chen · March 31, 2019, 10:34pm

Thank you so much, Frank!

Yes, you are right!!!

The event should be discharge, so the outcome is time to discharge. A death should be treated as censoring at the time of death. I can use all of the data, and there is no need to delete the deaths.