Comorbidity adjustment in regression: score versus individual components

Comorbidity scores (such as Charlson and Elixhauser) are widely used to adjust for patients’ comorbidities in observational studies. They’re usually built by regressing the outcome on several (often dozens of) comorbidities in a large historical cohort and then publishing the coefficients in some form.

I think many aspects of using such scores can be questioned, but for now let’s focus on a single issue: if the components of the score are available individually in our study (i.e. we have the dummy variables indicating the presence of each comorbidity), should we still use the score as a covariate or add all the components instead?

My intuition was that it is almost surely better to add the components individually: this gives finer control and, much more importantly, the weights will pertain to our concrete dataset rather than being borrowed from a cohort from another country, at another time, with other patients… The only drawback is of course that it is less efficient, as we’ll have more parameters to estimate, which clearly matters for small sample sizes. (Which, according to my understanding, is specifically the reason why these scores were developed at all!)

But I now have a database with tens of thousands of patients, so I thought that in this case there is absolutely no reason why I should use the score instead of the components.
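To make the trade-off concrete, here is a minimal simulation sketch (not taken from any of the papers discussed). Everything in it is an assumption chosen for illustration: five hypothetical comorbidity dummies, made-up "local" effects `beta_local`, made-up published weights `external_weights`, and a continuous outcome fitted by ordinary least squares. It compares the estimated exposure effect when adjusting for the external score versus the individual components.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20_000, 5

# Hypothetical comorbidity dummies, each present in about 30% of patients.
X = rng.binomial(1, 0.3, size=(n, k)).astype(float)

# Assumed effects of the comorbidities on the outcome in OUR data, and
# assumed published weights from some other cohort (both are made up).
beta_local = np.array([2.0, 0.5, 0.0, 1.5, 1.0])
external_weights = np.array([1.0, 1.0, 1.0, 1.0, 1.0])

# Exposure depends on the comorbidities, so they confound its effect.
gamma = np.array([1.0, -1.0, 0.5, 0.0, 0.8])
p_exposed = 1.0 / (1.0 + np.exp(-(-0.5 + X @ gamma)))
exposure = rng.binomial(1, p_exposed).astype(float)

true_effect = 1.0
y = true_effect * exposure + X @ beta_local + rng.normal(size=n)


def exposure_coef(covariates):
    """OLS of y on intercept, exposure and covariates; return the exposure coefficient."""
    design = np.column_stack([np.ones(n), exposure, covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


score = X @ external_weights          # single summary covariate
est_score = exposure_coef(score)      # adjust for the external score only
est_components = exposure_coef(X)     # adjust for all dummies individually

print(f"true effect:          {true_effect:.3f}")
print(f"adjusted for score:   {est_score:.3f}")
print(f"adjusted for dummies: {est_components:.3f}")
```

In this setup the score-adjusted estimate is left with residual confounding because the external weights are not proportional to the local effects, while the component-adjusted estimate is not. With a much smaller `n` the extra parameters would cost more precision, which is the efficiency argument mentioned above.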

However, I recently came across this paper: Steven R. Austin, Yu-Ning Wong, Robert G. Uzzo, J. Robert Beck, and Brian L. Egleston. Why summary comorbidity measures such as the Charlson Comorbidity Index and Elixhauser score work. Med Care. 2015 Sep; 53(9): e65–e72. doi: 10.1097/MLR.0b013e318297429c.

They claim that they’ve demonstrated “…the utility of the summary comorbidity measures as substitutes for use of the individual comorbidity variables”, and they even present a mathematical proof for this.

But I simply don’t understand them. What they prove (E[Y | X, b(X)] = E[Y | b(X)], where b(X) is the comorbidity score) is of course true, but is totally irrelevant: it says that adding the variables individually in addition to the score is not better, but no one wants to add them in addition, we are talking about adding them instead of the score!!

(Their derivation also seems to assume that the coefficients obtained from the historical cohort are miraculously the same as what we would get for the particular sample we have.)
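A short sketch of where the weights enter the argument may help. This is my own reading, not the paper's notation: I assume the outcome regression in the study population has the single-index form $E[Y \mid X] = g(\beta^\top X)$ and that the published score is $b(X) = w^\top X$, where $g$, $\beta$, $w$ and $c$ are symbols introduced here for illustration.

```latex
\begin{align*}
&\text{Assumptions: } E[Y \mid X] = g(\beta^\top X), \qquad b(X) = w^\top X. \\[4pt]
&\text{Since } b(X) \text{ is a function of } X: \quad
  E[Y \mid X, b(X)] = E[Y \mid X] = g(\beta^\top X). \\[4pt]
&\text{If } w = c\,\beta \text{ with } c \neq 0: \quad
  g(\beta^\top X) = g\big(b(X)/c\big)
  \;\Longrightarrow\;
  E[Y \mid b(X)] = g\big(b(X)/c\big) = E[Y \mid X, b(X)]. \\[4pt]
&\text{In general: } \quad
  E[Y \mid b(X)] = E\big[\, g(\beta^\top X) \,\big|\, b(X) \,\big].
\end{align*}
```

The last line equals $g(\beta^\top X)$ only when $g(\beta^\top X)$ is itself a function of $b(X)$. That holds when the published weights are proportional to the coefficients in the current sample, but it is not guaranteed when the weights come from a different cohort.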

Nevertheless, it is published in a peer-reviewed journal and I have found no obvious reply or erratum, so I was thinking that I might be overlooking something after all…


Without bothering to check the index, which I don’t know well: does it handle missing values in any way, or do any of the components have very low incidence? And how many components are there? In the end I would probably do whatever is least likely to raise a query from a reviewer. If they expect to see the score because it is so entrenched in their thinking, then I’d likely use it.

First of all, avoid the Charlson comorbidity score because it is not very sensitive and it has an arithmetic error in its construction.

To your main question, it is advisable to use summary indexes derived from the Elixhauser comorbidities if your sample size is small (e.g., < 5000 or 10000). Otherwise it is best to customize the comorbidity weights by fitting all the separate indicator variables in your sample. One of the reasons is that you may be adjusting for different things than were adjusted for when an Elixhauser summary score was computed.


Thanks for the question - it is one I have often thought of myself, but failed to ask of others! I take a pragmatic approach and that is to use a score instead of individual components if (and only if) (1) I have a problem with too many degrees of freedom if I use individual components OR (2) the score is part of a normal (common) clinical assessment process and the research question is looking at the added value of other variables over and above the score.


First of all, thanks for all the comments!

That is totally in line with my intuition. The question is: can you provide a literature reference, which I could cite to substantiate this…? As I’ve written, the only one I found says just the opposite (and I don’t even think their derivation is correct at all).

Sorry I don’t have a reference. That should be in the health services research literature somewhere.

It seems unlikely a score based on external data would be better than your own data to address potential confounding by co-morbidities. I don’t know how these indexes were developed, but the idea of a “Swiss knife” index to adjust for confounding seems kind of naive (no offense intended). A confounding variable must be a parent of the outcome. It seems unlikely any co-morbidity included in those indexes would be the parent (i.e. the cause) of any outcome. Even if the outcome is death, there would be co-morbidities that will not act as confounding variables. Their confounding effect may disappear by adjusting for other variables. Also, depending on the context, some comorbidities may be mediators of the effect of the exposure on the outcome. Basically, these indexes assume that confounding is a generic problem, instead of a context-specific (study/population/design) problem. Moreover, these indexes assume that the effect of comorbidity X on outcome Y is constant in all populations. For instance, they assume the effect of diabetes on mortality is the same in all populations. Of course, that is not the case, as this largely depends on timing of diagnosis and treatment, which will vary from population to population. These indexes also assume that the relative effect of comorbidity X and comorbidity Z is constant in all populations, which is not necessarily the case.
Using these indexes could help improve precision by reducing the number of parameters needed in your model (under the assumption that adjustment for all comorbidities included in the index is actually necessary). Although they are easy to use, they do not necessarily increase, and may decrease, validity.
Bottom line, in most cases you would be better off using your own data if your goal is to draw causal inferences. There are other options for reducing the number of variables in your model. Propensity scores are just one of them. These options require weaker assumptions than those implicit in the development and use of scores for comorbidity adjustment.

Thank you @lbautista ! I can virtually repeat what I’ve written to Frank: the only thing I’d still appreciate after these wonderful comments is a literature reference which I could cite to substantiate my choice of adding all comorbidities individually as covariates instead of the score. (As I’ve written, the only paper I found says just the opposite, the Austin et al paper I cited.) Especially a reference that includes the point mentioned by @f2harrell , i.e. that this is definitely the right way to go if the sample size is large (which you also reiterated when mentioning precision).

Frank, does memory fail you :wink: ? We were advising Joe Piccirillo on his work in this area; it appeared that the comorbidity scoring with ACE-27 could easily be improved.
There are some more papers in JCE, I believe.


The nearest thing I recall of relevance is “When measuring comorbidity, select a measure that has been validated in a population most similar to the study and for the outcome under investigation.” in Maybe there’s a reference in there of use to you. Pragmatically, many reviewers have problems with less familiar but superior indices.

Tamás, I don’t really have much knowledge about these scores. Maybe proponents of these scores have evaluated their performance. However, you may draw from studies on the performance of propensity scores. They increase validity if the propensity score model is correctly specified (see Williamson et al. Statistical Methods in Medical Research 21(3) 273–293). However, they decrease precision. Therefore, in a context with a lot of data, where precision is a non-issue, a regular regression with a correctly specified model for the outcome will be as good as using propensity scores. In the case of comorbidity scores, you are assuming that the model used to generate the scores is a good model for your data. There is no way for you to test that assumption without generating the scores with your own data. In that case, there will be no practical benefit in using the original scores. Now, suppose the original scores are based on three morbid conditions: A, B, and C. This implies that A, B, and C were associated with mortality in the data used to develop the score. However, it may be that in your data B is not independently associated with mortality. Therefore, there would be no need to adjust for B. But you will be including B in your estimate if you use the original scores. By indirectly including B in your analysis, you are diluting the effects of A and C, because their effects are averaged with the effect of B when you calculate the score. This will lead to bias.
I hope this helps.
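The A, B, C example can be sketched in a few lines. This is only an illustration under assumptions I made up: three hypothetical dummies, hypothetical `published_weights` of 1 each, a continuous outcome, and local effects in which B contributes nothing. Refitting the components on your own data shows what the local weights look like, while the score-only model forces a single slope onto all three conditions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30_000

# Hypothetical dummies for conditions A, B and C (25% prevalence each).
X = rng.binomial(1, 0.25, size=(n, 3)).astype(float)

# Assumed published weights from the development cohort, and assumed
# effects in our own sample, where B has no independent effect.
published_weights = np.array([1.0, 1.0, 1.0])
local_effects = np.array([1.0, 0.0, 2.0])

y = X @ local_effects + rng.normal(size=n)

# Refit the weights on our own data: one coefficient per condition.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
local_weights = coef[1:]

# Score-only model: a single slope shared by A, B and C.
score = X @ published_weights
design_score = np.column_stack([np.ones(n), score])
coef_score, *_ = np.linalg.lstsq(design_score, y, rcond=None)
score_slope = coef_score[1]

for name, w in zip("ABC", local_weights):
    print(f"local weight for {name}: {w:.2f}")
print(f"slope per score point (implied effect of each condition): {score_slope:.2f}")
```

Here the single score slope lands between the local effects, so B is credited with an effect it does not have in this sample and C is understated, which is the dilution described above.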

I just did a quick read on the Austin paper, I don’t find it compelling.

#1 - their test group of patients is a narrow cohort of patients with a slow-acting cancer. I don’t think it’s wise to generalize to the entire potential application across all kinds of clinical patients from that.

#2 - If you are working with any kind of a narrowed patient cohort, their premise that the summary weight alone tells you everything is akin to saying that there are NO (statistically detectable) interactions between the primary disease category and any of the individual comorbidity features used in the Charlson Comorbidity Index. Or with things that aren’t explicitly used, such as age and sex. Zero interactions, anywhere in the data space. That’s just implausible.

#3 - My biggest peeve with the Charlson Index isn’t how it was built but how it is often used … it was designed to predict risk of 365-day DEATH, and that’s about the least frequent thing it seems to be actually used for. For example, I had a fight years ago with a P.I. (er, “philosophical discussion”) about its appropriateness in a utilization analysis of E.D. high-flyers. If you use it with a different dependent variable than mortality, the mortality weights are almost certainly going to be sub-optimal. And therefore you should re-train it for the dependent variable/s you are using.

My theory is that if you’re going to the Comorbidity SuperStore, you might as well pick up all the component ingredients, not the pre-cooked conglomeration that’s been sitting on the hot-plate for hours.

The only reason to go with pre-cooked is if you don’t have the d.o.f. to spend. Or think they would be better spent elsewhere.


Thank you all for all these references and comments!

Thanks for your comments. I find it even more disturbing (if I’m correct) that their “derivation” is, while mathematically correct, completely meaningless… (see my opening post).

Elixhauser et al (1998) - “Comorbidity Measures for use with Administrative Data” is still a good read, and easy to find, for anyone interested in this niche.

They very specifically chose to NOT provide a summary value, precisely because using one standard set of weights would not be appropriate for different patient cohorts. They clearly demonstrate that some of the indicators have very different influence when looking at mortality vs LOS vs charges … as you’d expect. Some flip signs. So one pre-canned set of weights would be a poor compromise for many use cases.
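A rough sketch of what outcome-specific weights look like in practice follows. The data are simulated, the condition labels are placeholders, and the sign flip for `cond_b` is built in on purpose; none of the numbers come from Elixhauser et al. For brevity both outcomes are fitted by least squares (a linear probability model for death), which is a simplification rather than a recommended analysis.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40_000
names = ["cond_a", "cond_b", "cond_c"]  # placeholder labels, not real indicators

X = rng.binomial(1, 0.2, size=(n, 3)).astype(float)

# Assumed data-generating effects: cond_b lowers the death risk slightly
# but lengthens the stay (opposite signs chosen deliberately).
p_death = 0.05 + X @ np.array([0.10, -0.03, 0.00])
death = rng.binomial(1, p_death).astype(float)
los = 4.0 + X @ np.array([2.0, 1.0, 0.5]) + rng.normal(scale=2.0, size=n)


def fit_weights(outcome):
    """Least-squares weights for each indicator (intercept dropped)."""
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef[1:]


w_death = fit_weights(death)   # linear probability model, for brevity
w_los = fit_weights(los)

for name, wd, wl in zip(names, w_death, w_los):
    print(f"{name}: death weight {wd:+.3f}, LOS weight {wl:+.2f}")
```

A single set of weights cannot represent `cond_b` correctly for both outcomes in this setup, which is the point made above about one pre-canned set of weights being a poor compromise.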

One significant and perhaps under-appreciated “design choice” difference between Charlson & Elixhauser is that, within its framework, Charlson looks at ALL conditions that are (my words) concurrent at the time of admission, whereas Elixhauser looks for pre-existing conditions that are NOT related to the primary diagnosis, under the assumption that users will typically be also using stratification or some other measure of severity that captures the primary condition causing the admission.

The AHRQ software for computing Elixhauser makes the don’t-count exclusions primarily based on DRGs, but DIY implementations may or may not be true to the spirit of the Elixhauser team. Standard inpatient discharge datasets (in the US) will usually have the DRG info, but custom data collections may not.