Disentangling the modeling choice of Categorical Covariates vs Hierarchical Modeling

alanlhutchison · May 21, 2020, 9:57pm

I’ve been working out my understanding of modeling, and I’ve come across a conceptual snag that I wanted to air:

In keeping with prior posts, say I am looking at the likelihood that a patient presenting to the emergency department will be admitted to the hospital. My data looks like this:

Time   Admission  Age  Birth-State  Gender
01:14  TRUE       45   Illinois     M
03:25  FALSE      23   Wyoming      F
14:25  FALSE      89   Ohio         M
23:22  TRUE       34   Delaware     F

The question I am trying to understand is the relative benefits / correctness of modeling these different categorical variables as covariates or hierarchically.

If I think of ‘Birth State’ as exchangeable, then couldn’t I just model the probability of admission hierarchically on ‘Birth State’, where each category would contribute to my estimate of the probability of admission? This seems more logical, since I can pool information across the categories when estimating the probability of admission. However, does this make it more difficult to estimate the contribution of being ‘Illinoisan’ to admission?

Or should I model it as a series of coefficients on a binary vector [Alabama, Alaska, Arizona,…,etc.], so that an Alaskan presenting to the ED is [0,1,0,0,…,0]? This gives me a lot more coefficients and lets me see the relative contribution of each ‘Birth State’, but it feels like I’m not sharing or combining data across these categories properly.
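To make the two options concrete, here is a minimal sketch in Python. It is not a fitted regression: the records are made-up toy data, `one_hot` and `PRIOR_STRENGTH` are hypothetical names, and the "partial pooling" step is a simple beta-binomial shrinkage toward the overall admission rate, used here only as a stand-in for a hierarchical model (which would estimate the amount of shrinkage from the data rather than fix it).

```python
from collections import defaultdict

# Hypothetical toy records (birth_state, admitted); not real data.
records = [
    ("Illinois", True), ("Illinois", True), ("Illinois", False), ("Illinois", True),
    ("Wyoming", False),
    ("Ohio", False), ("Ohio", True),
    ("Delaware", True),
]

states = sorted({state for state, _ in records})


def one_hot(state):
    """Indicator coding: one slot per state, 1 in the patient's own slot."""
    return [1 if s == state else 0 for s in states]


# Tally admissions and visits per state.
counts = defaultdict(lambda: [0, 0])  # state -> [admissions, visits]
for state, admitted in records:
    counts[state][0] += int(admitted)
    counts[state][1] += 1

total_admissions = sum(a for a, _ in counts.values())
total_visits = sum(n for _, n in counts.values())
overall_rate = total_admissions / total_visits

# No pooling: each state's rate uses only that state's own patients,
# which is what a separate, unpenalized coefficient per state amounts to.
no_pool = {s: a / n for s, (a, n) in counts.items()}

# Partial pooling (assumed stand-in): posterior mean under a Beta prior
# centered on the overall rate. PRIOR_STRENGTH is an arbitrary choice here;
# a hierarchical model would learn the equivalent quantity from the data.
PRIOR_STRENGTH = 4.0
partial_pool = {
    s: (a + PRIOR_STRENGTH * overall_rate) / (n + PRIOR_STRENGTH)
    for s, (a, n) in counts.items()
}

for s in states:
    print(s, one_hot(s), round(no_pool[s], 3), round(partial_pool[s], 3))
```

In this sketch each state still gets its own estimate under partial pooling, so the per-state contribution remains readable. States with few patients are pulled strongly toward the overall rate, while states with many patients stay close to their own observed rate.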

Regarding Age, since it’s a continuous variable, it seems more logical to model it as a covariate. For gender (let’s presume it’s binary for the sake of this discussion only), it seems like it could go either way, hierarchical modeling vs categorical covariate, though it just requires one coefficient.

Does it really just come down to the question I’m trying to answer with my model? Relative contribution of variable vs best estimate of probability of admission?

P.S. Birth State is a silly metric for most EDs, but let’s say it’s the interplanetary ED stationed in low-Earth orbit above the United States in the future.

f2harrell · May 23, 2020, 2:51am

Instead of putting location in the model I would use distance traveled from home and the socioeconomic status of the county/zip code/census tract of their home.

alanlhutchison · May 23, 2020, 3:13am

That’s a good solution! I was also wondering, though, what to do about other categorical variables, such as ‘Race’, where that kind of ordinal/continuous transformation isn’t as available (ignoring, for the purposes of discussion, the socioeconomic complexities tied up in race).