Maximum number of predictors in pooled logistic regression

hanowell · September 28, 2019, 8:22pm

I am predicting the probability of an event over a 30-day observation period. Many of my subjects are present in the observation period for fewer than 30 days, so I am fitting a pooled logistic regression on the outcome cbind(event_indicator, days_exposure - event_indicator). The maximum number of predictors in a logistic regression model is something like m/15, where m is the smaller of the counts in the two outcome categories (for the case when all subjects have the same exposure). But in my case, I believe that m is in fact the sum of the fractional observation periods among the cases, NOT the simple sum of the event indicators. If so, the maximum number of predictors is far smaller. Am I correct in my calculation of effective m?
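To make the m/15 rule of thumb concrete, here is a minimal sketch of the arithmetic for the equal-exposure case. The function name `max_predictors` and the counts are hypothetical, chosen only for illustration; the divisor 15 is the "something like" figure quoted above, not a hard constant.

```python
def max_predictors(n_events, n_nonevents, per_predictor=15):
    """Rule-of-thumb predictor budget: m / per_predictor,
    where m is the smaller of the two outcome-category counts."""
    m = min(n_events, n_nonevents)
    return m // per_predictor  # round down to whole predictors


# Hypothetical counts, for illustration only (not data from this thread)
budget = max_predictors(n_events=120, n_nonevents=9880)
print(budget)  # 8
```

With a rare event, m is simply the event count, so the budget is driven almost entirely by how many events are observed.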

f2harrell · September 29, 2019, 11:43am

I am not sure of the fractional part, but a good starting point for m is the number of subjects having the event occur at least once.

pau13rown · September 29, 2019, 4:46pm

how many predictors are you interested in? i mean, is it way out there, or are you close? because the rule-of-thumb is just a rule of thumb, and you might check: "How to develop a more accurate risk prediction model when there are few events"

hanowell · September 30, 2019, 11:06pm

I am already using penalized regression. Nevertheless, I don’t trust that penalization will do all the work for me, so I would rather get as close to the rule of thumb as possible.

hanowell · September 30, 2019, 11:12pm

@f2harrell In my case, the event is terminal, so it can occur only once. The reason I wonder whether the event count is appropriate is that some subjects are not observed for the full observation period. For that reason, they only contribute $y_i \times e_i/t$ person-periods to the denominator of the rate, where $y_i$ is the event indicator and $e_i$ the days of exposure for subject $i$, and $t$ is the length of the period, which in my case is 30 days. (In this case I’m thinking I can treat the probability as equivalent to the rate since the event is exceedingly rare.) Therefore, I can’t take $m = \sum_i y_i$ and must instead calculate $m = \frac{1}{t}\sum_i y_i \times e_i$.
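A small sketch of the two candidate definitions of m in this post, side by side. The subject data are made up for illustration; only t = 30 comes from the thread. This only shows the arithmetic of the proposal and does not settle which definition is appropriate.

```python
T = 30  # length of the observation period in days (from the thread)

# Hypothetical subjects: (y_i = event indicator, e_i = days of exposure)
subjects = [(1, 30), (1, 12), (0, 30), (1, 3), (0, 7), (1, 30)]

# Simple event count: m = sum_i y_i
m_events = sum(y for y, _ in subjects)

# Proposed exposure-weighted count: m = (1/t) * sum_i y_i * e_i
m_weighted = sum(y * e for y, e in subjects) / T

print(m_events)    # 4
print(m_weighted)  # 2.5
```

Because every e_i is at most t, the exposure-weighted m can never exceed the simple event count, which is why it implies a smaller predictor budget.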

pau13rown · October 1, 2019, 6:04am

but important covariates are important covariates regardless of any rule of thumb, a rule of thumb cannot dictate the model, that’s why i’d rethink how many do you really have

f2harrell · October 1, 2019, 10:51am

The fraction of the observation period observed for a given subject is very relevant in estimating absolute risks. For relative effects (e.g., hazard ratios) the number of events dominates the effective sample size calculation.

hanowell · October 2, 2019, 12:31am

pau13rown wrote: "but important covariates are important covariates regardless of any rule of thumb, a rule of thumb cannot dictate the model, that’s why i’d rethink how many do you really have"

Indeed, I consider the constraints imposed by the rule of thumb as a way to force myself to think about these things. Don’t worry, I do not use it as a pat rule.