Controlling for baseline in logistic regression (binomial outcome)

I have a binomial outcome measured pre- and post-intervention (number of successes in a task; the number of trials is constant). I’d like to control for the baseline in my model.

It seems to me that using the raw number of correct answers at baseline as a predictor is unnatural, as I can hardly expect linearity here. The natural thing would be to transform the baseline measurement to log-odds and use that as the predictor, so that a coefficient of 1 corresponds to the post measurement matching baseline exactly on the log-odds scale.

This however leads to a problem with handling 0% / 100% success at baseline. A simple fix would be to replace those with 0.5 successes / 0.5 failures, so e.g. with 15 trials I’d treat zero successes as a success rate of 0.5/15 ≈ 0.033. That however seems somewhat arbitrary and inelegant.
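To make this concrete, a minimal sketch in R of what I mean (column names are made up; `d` has one row per subject):

```r
n_trials <- 15  # constant number of trials per subject

# replace 0 successes with 0.5 and 15 successes with 14.5, then
# transform the baseline success rate to log-odds
d$base_succ_adj <- pmin(pmax(d$base_succ, 0.5), n_trials - 0.5)
d$base_logit   <- qlogis(d$base_succ_adj / n_trials)

# binomial GLM for the post counts, controlling for baseline on the
# log-odds scale; a coefficient of 1 on base_logit would mean the
# post log-odds track the baseline one-for-one
fit <- glm(cbind(post_succ, n_trials - post_succ) ~ base_logit + treat,
           family = binomial, data = d)
```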

A more elegant way could be to handle this as a measurement-error/Berkson-style problem, i.e. treat the log-odds at baseline as unknown but informed by the baseline measurement. That seems almost equivalent to having a fixed effect for each subject, which feels less than great. Notably, similar measurement-error considerations should apply to any model controlling for baseline, and I haven’t seen this handled with a measurement-error model anywhere, so I assume there is a reason not to do it?
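For concreteness, the kind of model I have in mind (notation is mine, with $\theta_i$ the unknown baseline log-odds of subject $i$):

$$
\begin{aligned}
y^{\mathrm{pre}}_i &\sim \mathrm{Binomial}\big(n,\ \mathrm{logit}^{-1}(\theta_i)\big), \qquad \theta_i \sim \mathrm{Normal}(\mu, \sigma^2), \\
y^{\mathrm{post}}_i &\sim \mathrm{Binomial}\big(n,\ \mathrm{logit}^{-1}(\alpha + \beta\,\theta_i + \delta\,x_i)\big),
\end{aligned}
$$

where $x_i$ indicates treatment. The hierarchical prior on $\theta_i$ is what keeps this from being exactly a per-subject fixed effect.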

Thanks for any suggestions/literature pointers. The closest I could find is that transforming baseline measurements to the model scale was found beneficial in negative binomial models (https://doi.org/10.1002/bimj.201700103), but nothing about a logistic/binomial response.


I think you’re making it too complicated. If you want to condition on baseline and not assume linearity, you can just use a spline function of the count or proportion. And if you think something special happens at zero, you can add a discontinuity to the model, as exemplified by the gTrans function in rms at the end of this chapter.
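A sketch of what that could look like with a plain binomial GLM (hypothetical names; rcs() in an rms fit would be the analogous restricted cubic spline term):

```r
library(splines)

n_trials <- 15
d$base_prop <- d$base_succ / n_trials  # baseline proportion

# a natural spline in the baseline proportion relaxes the linearity
# assumption without any transformation, so 0% and 100% baselines
# need no special handling
fit <- glm(cbind(post_succ, n_trials - post_succ) ~
             ns(base_prop, df = 3) + treat,
           family = binomial, data = d)
```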


I agree splines would be a decent option if I had enough data to constrain the spline. However, I don’t have that much data, so some extra modelling steps to make the model better match domain knowledge without splines seem warranted (the domain knowledge here being that performance shouldn’t change too much from baseline, at least in the control condition).


Why don’t you use a mixed-effects model and constrain the effect of the intervention at baseline to be 0? The interaction term between intervention and time would then be the estimate you’re interested in, wouldn’t it?
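For instance, with lme4 (made-up names; data in long format with one row per subject and time point):

```r
library(lme4)

n_trials <- 15
# force the treatment effect to be zero at baseline by letting it act
# only at the post measurement (a constrained, cLDA-style coding)
dl$treat_post <- dl$treat * (dl$time == "post")

fit <- glmer(cbind(succ, n_trials - succ) ~ time + treat_post + (1 | id),
             family = binomial, data = dl)
```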


My impression is that there is broad agreement that controlling for baseline is in most cases superior to a random intercept per participant: a random intercept assumes joint multivariate normality of the baseline and outcome measurements (in the latent log-odds space), while controlling for baseline works regardless of the distribution at baseline. Joint normality is most notably violated when trial inclusion depends on the baseline values (which is not my case, though), see e.g.