Logistic regression assumptions

Greetings:

This is my first post to this forum, and I want to thank Prof. Harrell for creating this resource for students of the subject like me.

On to my question:

RMS p. 221 states: “logistic model makes no distributional assumption whatsoever.” [emphasis added]

May I ask how to reconcile this with the work by economists on choice models, which asserts that the logistic model assumes the underlying utility must have an extreme value distribution? For instance, here is a quote:

The relation of the logit formula to the distribution of unobserved utility (as opposed to the characteristics of choice probabilities) was developed by Marley, as cited by Luce and Suppes (1965), who showed that the extreme value distribution leads to the logit formula. McFadden (1974) completed the analysis by showing the converse: that the logit formula for the choice probabilities necessarily implies that unobserved utility is distributed extreme value.

Discrete Choice Methods with Simulation, by Kenneth Train, Cambridge University Press, 2002 (berkeley.edu), Chapter 3, p. 34
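My own reading of the extreme-value-to-logit direction in the binary case (a sketch, not a quote from the book): if alternative $j$ has utility $U_j = V_j + \varepsilon_j$ with the $\varepsilon_j$ independent standard Gumbel (type I extreme value), then the difference $\varepsilon_2 - \varepsilon_1$ has a standard logistic distribution, so

$$
\Pr(\text{choose } 1) = \Pr(U_1 > U_2) = \Pr(\varepsilon_2 - \varepsilon_1 < V_1 - V_2) = \frac{1}{1 + e^{-(V_1 - V_2)}} = \frac{e^{V_1}}{e^{V_1} + e^{V_2}},
$$

which is the logit formula the quote refers to.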

There are lots of ways to derive models. The fact that there is a link to utility theory doesn’t mean that the logistic model makes a distributional assumption. We are talking about the data distribution (and sometimes the sampling scheme) when we refer to model assumptions. The only distributional assumption made by the binary logistic model is that the observations are independent, i.e., you have one observation per subject. Other than that, any time you have a binary Y = 0, 1 you automatically have a Bernoulli distribution no matter what (assuming independence holds), so in reality you are assuming nothing. It is the non-distributional regression assumptions (the right-hand side of the model) that matter in logistic regression.
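In symbols, the binary logistic model is just

$$
Y_i \mid X_i \sim \mathrm{Bernoulli}(p_i), \qquad \operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i} = X_i\beta ,
$$

where the Bernoulli part is automatic for any binary $Y$ (given independence); the substantive assumption is the form of the right-hand side $X_i\beta$.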

1 Like

> May I ask how to reconcile this with the work by economists on choice models, which asserts that the logistic model assumes the underlying utility must have an extreme value distribution?

Not sure if it helps with your question, but we use utilities differently in our group for clinical applications, including for ordinal outcomes: first we use whatever statistical model is most likely to fit the data best and is computationally tractable. Then we convert the relevant parameters of the model into outcomes that are clinically interpretable by clinicians (e.g., probability of response to therapy at 6 weeks), which facilitates eliciting relevant utility functions for these outcomes. This allows conversion of probabilities (including joint probabilities of outcomes of interest) into mean utilities that can be used to guide decisions that maximize mean utility. See an example here within a phase I-II trial modeling context. Note that we used a probit link there (similar to what economists often use) instead of a logit, as the probit aligns with Juhee Lee’s modeling preferences, in part because the effect of frailty vectors can be clearer under the probit. But in theory a logit could have been used under a different modeling philosophy.
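As a minimal sketch of that last step, with invented probabilities and utilities (not the ones from the trial):

```r
# Hypothetical illustration: convert model-based outcome probabilities into
# mean (expected) utilities for two candidate treatments, then prefer the
# treatment that maximizes mean utility. All numbers are made up.

outcomes <- c("response_no_toxicity", "response_toxicity",
              "no_response_no_toxicity", "no_response_toxicity")

# Elicited utilities for each clinical outcome (0 = worst, 100 = best)
u <- c(100, 60, 40, 0)

# Model-based joint probabilities of the outcomes under each treatment
p_trtA <- c(0.35, 0.15, 0.40, 0.10)
p_trtB <- c(0.45, 0.25, 0.20, 0.10)

mean_utility <- function(p, u) sum(p * u)

c(A = mean_utility(p_trtA, u), B = mean_utility(p_trtB, u))
# The treatment with the larger mean utility would be preferred.
```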

Now, since whether to use a logit or something else is itself a decision, the trade-offs can in theory be expressed as utilities as well within such a framework, at a more meta (or hierarchical) level. But I cannot see how the logit would be incompatible with the utility-based framework described above (other than the probit making some aspects of the modeling easier, as mentioned)?

1 Like

This is classic decision making. It is best to separate, as you did, the data model from the utility model. The data model is used to provide evidence for or against various assertions. Utilities measure only the consequences of acting on assertions.

2 Likes

Can we say that the right-hand side should be linearly related to the logit (the left-hand side, i.e., the link function or log odds)? Is that an assumption?

Yes, linear in the coefficients at the very least. More generally, you are assuming that the functional form is correct. If you fit a simple additive model, that assumption translates into linearity in x as well.

The more the true function deviates from the assumed one, the more biased the effect sizes are. If the relation is monotonic, then the p-values may still be OK-ish, but they will still be biased.
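As a small illustration with simulated data (a sketch using the rms package; the data-generating model is made up), one can relax the linearity assumption with a restricted cubic spline and test the nonlinear terms:

```r
# Sketch: simulate a binary outcome whose true logit is nonlinear in x,
# then compare a fit that assumes linearity with one that relaxes it
# via a restricted cubic spline (rms package).
library(rms)

set.seed(1)
n <- 1000
x <- runif(n, -3, 3)
y <- rbinom(n, 1, plogis(1 - 2 * x + 0.7 * x^2))  # true logit quadratic in x
d <- data.frame(x, y)

fit_linear <- lrm(y ~ x, data = d)          # assumes linearity in x
fit_spline <- lrm(y ~ rcs(x, 4), data = d)  # relaxes that assumption

anova(fit_spline)  # includes a chunk test of the nonlinear spline terms
```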

One way to check the assumption is through cross-validation, but for logistic regression in a context where you care about the probabilities, you may want to evaluate the binary cross-entropy (log loss) directly, rather than accuracy, AUC, etc.
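A minimal sketch of that check (base R, made-up data similar to the example above; lower cross-validated log loss indicates better probability estimates):

```r
# Sketch: 10-fold cross-validated binary cross-entropy (log loss) for a
# linear logistic fit vs. a more flexible one, on simulated data.
set.seed(2)
n <- 1000
x <- runif(n, -3, 3)
y <- rbinom(n, 1, plogis(1 - 2 * x + 0.7 * x^2))
d <- data.frame(x, y)

log_loss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

folds <- sample(rep(1:10, length.out = n))
cv_log_loss <- function(formula) {
  mean(sapply(1:10, function(k) {
    fit <- glm(formula, family = binomial, data = d[folds != k, ])
    p   <- predict(fit, newdata = d[folds == k, ], type = "response")
    log_loss(d$y[folds == k], p)
  }))
}

c(linear = cv_log_loss(y ~ x), flexible = cv_log_loss(y ~ poly(x, 3)))
```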

Traditionally, inferential statistics did not care much about out-of-sample performance, but as described in the tidymodels chapter here: 9 Judging model effectiveness | Tidy Modeling with R, some modern statisticians still consider it important because it gives an idea of “goodness of fit”.

This is also why there are newer methods nowadays, such as targeted maximum likelihood estimation (a doubly robust approach) and the SuperLearner, but they get pretty complex and are still quite new.

1 Like

Thank you for the eye-opening discussion.

I don’t think that cross-validation or out-of-sample performance is the way to look at model fit.

3 Likes

Yes, these are tools for internal validation (better to stick with 10-fold cross-validation with replications, or the bootstrap) rather than for checking assumptions.

How else would you assess “goodness of fit” as a number to report? P-values from, say, some GOF test don’t say much (not rejecting the null doesn’t mean it is true).

In the Bayesian world there is also loo: Information criteria and cross-validation — loo.stanreg • rstanarm

(I don’t think it’s good to do model selection and inference on the same dataset with these, as that has issues; I mean it more as just a GOF metric.)
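A rough sketch of that (placeholder data frame d and formulas, not a recommendation about model selection):

```r
# Sketch: approximate leave-one-out cross-validation for Bayesian logistic
# models fit with rstanarm; 'd' and the formulas are placeholders.
library(rstanarm)

fit1 <- stan_glm(y ~ x,          family = binomial(), data = d, refresh = 0)
fit2 <- stan_glm(y ~ poly(x, 3), family = binomial(), data = d, refresh = 0)

loo1 <- loo(fit1)
loo2 <- loo(fit2)
loo_compare(loo1, loo2)  # higher elpd = better expected out-of-sample fit
```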

Directed GOF tests, e.g., a combined chunk test of all nonlinear terms in a model, have proper degrees of freedom that completely take overfitting into account. But as you stated, p-values from them are not what they seem. To get away from p-values and to be sample-size independent, I would compute likelihood ratio $\chi^2$ statistics (LR) for various chunk tests, such as all nonlinear effects and all two-way interaction effects. From each, subtract its degrees of freedom to correct for chance. Divide these by the total model LR after subtracting its d.f. This will provide you with optimal measures. For example, if interaction effects explained only 0.01 of the total explainable outcome variation, you’d not be too worried about omitting interactions from the model.
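A sketch of that computation with the rms package (placeholder variable names; the interaction chunk LR is the difference in model LR $\chi^2$ between the full model and the model without interactions):

```r
# Sketch with placeholder variables: fraction of explainable outcome
# variation attributable to two-way interactions, corrected for chance
# by subtracting degrees of freedom from each LR chi-square.
library(rms)

full    <- lrm(y ~ (rcs(x1, 4) + rcs(x2, 4) + x3)^2, data = d)  # with interactions
reduced <- lrm(y ~  rcs(x1, 4) + rcs(x2, 4) + x3,    data = d)  # without

lr_full <- full$stats["Model L.R."];    df_full <- full$stats["d.f."]
lr_red  <- reduced$stats["Model L.R."]; df_red  <- reduced$stats["d.f."]

lr_int <- lr_full - lr_red   # LR chunk statistic for all two-way interactions
df_int <- df_full - df_red

# Chance-corrected fraction of explainable variation due to interactions
(lr_int - df_int) / (lr_full - df_full)
```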

4 Likes