How to evaluate the quality of risk adjustment models?

I was recently debating a health economist about a risk adjustment model used for a heart failure (HF) readmission metric. The metric is meant to measure the number of HF hospitalizations over the total number of HF patients cared for over a given 12-month period. The model is risk adjusted to account for the severity of the outpatient population, primarily using administrative coding data to estimate the expected number of HF hospitalizations by facility.

The approach to risk adjustment used is not causal. The model evaluated all available variables that were coded for most patients. These included age, sex, minimal insurance/SES variables, and marital status. It then includes every CCS code. CCS codes (about 260) were developed by AHRQ to collapse all ICD-10 codes into workable categories. Some of the CCS codes I noticed had large weights in the final logistic model, including factors like ectopic pregnancy. If a patient had an ectopic pregnancy, they were really unlikely to ever be admitted with HF (kind of obvious). Therefore, if a facility treats many patients with ectopic pregnancies, its HF population is seen as “healthier” and it should not have as many expected admissions. The c-statistic for this administrative model is ~0.8. Most HF readmission models of hospitalized HF patients have c-statistics around 0.6; there are no good predictors for who is likely to be rehospitalized at discharge.

I am troubled by this approach to risk adjustment. It is akin to dredging up meaningless, spurious associations that do not capture the concept of risk. The purpose of risk adjustment is to adjust for known factors that make disease management more difficult. When you dump in a bunch of garbage codes, you induce false associations that do not actually adjust for risk. If there are only 150 facilities, and they vary in how they code diagnoses, or if one has a busy “women’s clinic” with more ectopic pregnancies, then the model may just be selecting for differences between facilities that have nothing to do with the severity of HF seen in the clinic.

Anyway, I feel risk adjustment requires more theory to be done appropriately, and there also needs to be sufficient predictive ability or else the risk adjustment is arbitrary. I would appreciate others’ thoughts, both on the theory behind risk adjustment and on how to evaluate the adequacy of risk adjustment models. I think, like anything, we would look at model fit, log-likelihoods, c-statistics, etc.
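For concreteness, the fit summaries mentioned above are cheap to compute once you have predicted risks and outcomes. A minimal sketch on simulated data (all numbers hypothetical, not from the model under discussion):

```python
# Sketch (simulated outcome and predictions): two common summaries for a
# fitted risk model: the c-statistic (discrimination) and the
# log-likelihood (overall fit).
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

rng = np.random.default_rng(2)
p_hat = rng.uniform(0.05, 0.6, 5000)           # hypothetical predicted risks
y = rng.binomial(1, p_hat)                     # outcomes drawn from those risks

c_stat = roc_auc_score(y, p_hat)               # c-statistic = area under ROC
loglik = -log_loss(y, p_hat, normalize=False)  # model log-likelihood
print(c_stat, loglik)
```

Note these summaries describe predictive performance only; as discussed below, they say little about whether the adjustment is fair for comparing facilities.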


Excellent question @boback. I’m of two minds on this:

  • When comparing programs we need to very liberally adjust for a huge number of variables, with the main requirement being that the variables are somewhat accurate, and that they be measured at admission and not during the hospital stay (except: predicting readmission often uses “during admission” measurements). This is the kitchen sink approach.
  • Or use diagnostic codes only through established comorbidity methods such as Elixhauser and Schweiss. Use multiple comorbidity indexes in the model, and use all the Elixhauser diagnostic group indicators.

So I can’t say that they did anything wrong but when you see crazy regression coefficients on some categories it does give pause.

The c-index doesn’t have much to do with the adequacy of a risk-adjustment model for provider profiling. Adequacy means instead that no other model that is probably better than yours gives different provider rankings than yours (to oversimplify).

My thoughts on provider profiling and risk adjustment for same are here which includes a link to collected papers.


This is a really interesting topic. There is a very nice committee paper on the topic of using risk adjustment models for hospital performance comparisons here. Ultimately, the answer to the question of whether it’s appropriate to condition on particular types of information will partly depend on what the purpose of the risk adjustment is. For the sake of performance comparison, where interest is often in comparing something like the ratio observed/expected (comparing how you did relative to how we would expect you to do) you could argue it’s important to not condition on certain types of information to ensure that comparisons are fair. In the example you provide, SES variables and minimal insurance strike me as potentially problematic.


There are two specific issues:

The first is failing to treat death as a censoring event (death is a competing risk: if you die, you can’t have an HF hospitalization, HFH). With censoring, you have to use Poisson or Cox (Andersen-Gill) models for HFH. For Poisson, subtract post-death observation time from the offset. For Cox, censor at death. When you do this, covariates that are prognostic for death attenuate toward zero: ectopic pregnancy, pancreatitis, electrical storm, etc.

Say we have 1,000 ICD codes as possible risk-adjustment inputs. Is it bad to consider using all of these in a model? No. Provided they do not violate any inferential criteria (such as being a collider or mediator), can high-dimensional modeling generate hypotheses as to how these 1,000 codes predict HFH? Yes. Your common sense helped you realize ectopic pregnancy violates those inferential criteria.

A statistician might say to use a high-dimensional learning approach to handle those 1,000 codes. I’d make this an exploratory aim. An overfitted logistic model is not high-dimensional learning. Logistic models don’t play well with a high ratio of predictors to observations (biased coefficients).

So a) your colleague’s highly weighted risk-adjustment factors are just telling you that dead people spend a long time not being hospitalized for heart failure. SOLUTION: censor at death.
b) your colleague can’t pick which of 1,000 ICD codes should be used as risk adjustments. SOLUTION: prespecify a clinically sensible model using a DAG or good literature. Then fit a ridge GLM to see if the inference is all that different.
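A small sketch of the ridge idea, on simulated data with made-up dimensions (200 candidate codes, of which only 5 truly predict the outcome):

```python
# Sketch (simulated data): ridge-penalized (L2) logistic regression over
# 200 hypothetical candidate codes; shrinkage keeps the 195 null codes
# near zero instead of letting them pick up wild weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 1000, 200
X = rng.integers(0, 2, size=(n, p)).astype(float)  # binary code indicators
beta = np.zeros(p)
beta[:5] = 1.0                                     # only 5 codes truly matter
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-2.0 + X @ beta))))

ridge = LogisticRegression(penalty="l2", C=0.1, max_iter=1000).fit(X, y)
signal = np.abs(ridge.coef_[0][:5]).mean()   # truly predictive codes
noise = np.abs(ridge.coef_[0][5:]).mean()    # null codes, shrunk toward zero
print(signal, noise)
```

An unpenalized logistic fit with this predictor-to-observation ratio would scatter large coefficients across the null codes; the penalty is what makes the exploratory fit interpretable.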


In my experience, risk adjustment models of this type, especially for readmissions, almost always stratify or select based on the index condition, in this example HF (heart failure). One would think that the number of HF patients who ALSO had an ectopic pregnancy during the lookback period would be extremely low. Or at least, I would think that, but I am not a clinician. The implication of this predictor being included in the model is that, while its incidence rate is probably very low, when it does occur it has a powerful association with the dependent variable, readmissions. (Well, DUH.)

Using all of the diagnosis codes, boiled down into the many CCS categories, is a fairly common approach. While it might seem to be just throwing everything against the wall to see what sticks, there’s a plausible case to be made that it’s not quite that speculative. In theory, what should get coded as diagnoses on claims and discharge summaries is every condition the patient had that in some way influenced, or had to be considered in, the plan for the patient’s diagnosis and treatment. And each such condition represents some degree of additional risk. But it’s a very complex stew. Undoubtedly there are interactions, not all of these things will be linear add-ons, and the relationships will likely vary according to the dependent variable being modeled.

I look at the dx-related things as being fundamentally a feature reduction process. From the hundreds, what are the relatively few that really matter? (That we can tease out and approximate with some credible degree of statistical soundness.)

Best case is that, if the modeling process was done with robustness and generalizability in mind, the model has identified the most important of these dx conditions. Done poorly, there’s certainly the possibility of fitting a lot of noise.

Whether or not it’s appropriate to include factors like SES and insurance coverage depends on the use case. If I’m at a hospital system building a readmissions prediction model that will be used by population health case managers to identify high-risk patients who should get some extra attention, i.e., an “intervention” either during the discharge planning process or post-discharge, I would definitely use those. But if I’m doing public reporting of risk-adjusted provider outcomes, maybe not. My sense is that the general practice is to allow SES and insurance as proxies for access to timely care, but race is often not used so that any racial inequities are not totally masked.

It’s also worth noting that risk adjustment using data from encounters before the index period can only be credibly done by organizations that have an “all the claims” view of all of the patients/members involved. Provider systems should know about their own encounters with patients, but generally don’t have much usable information about patient encounters with unrelated doctors/hospitals/urgent-care-centers/pharmacies, etc.

The new Krumholz et al. paper, out this month, tried building dx-code-specific predictors rather than using broader dx categories (HCCs in this case, rather than CCS dx groups). Actually they did it both ways and compared results. That seemed to work pretty well. [Paper here]


What worries me is that coding isn’t accurate. There is a high prevalence of under- and over-coding in administrative health data. Features that may be prognostic in a non-selective model are likely strongly associated with facilities and their coding practices. If every facility varies in how it codes obscure conditions, then you are reinforcing rankings based on random adjustments. There are hierarchical adjustments for clustering, but that doesn’t resolve how biases might be introduced.

When CMS and the Yale group created their HF readmission risk-adjustment model, they went with a parsimonious model and had clinician input on which risk factors made caring for HF patients more difficult. They also dropped all negative coefficients from the model, meaning you aren’t penalized for coding certain conditions more than someone else does. Although the CMS model still led to unfair penalties for low-SES hospitals, with no clear benefit to patients (and potential harm) when paired with reimbursement policy, some of CMS’s approach to risk adjustment makes sense to me. Risk factors should be causal and associated with the measured outcome. I do not believe thousands of additional variables make risk adjustment any better, even if you avoid over-fitting, and you risk introducing many biases (colliders, mediators, etc.).
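The “drop negative coefficients” convention can be illustrated with a toy calculation (all weights hypothetical):

```python
# Sketch of the "drop negative coefficients" convention (all weights
# hypothetical): negative comorbidity weights are set to zero before
# computing expected risk, so coding an extra condition can never lower a
# facility's expected admission count.
import numpy as np

fitted = np.array([0.30, 0.12, -0.25, 0.05, -0.40])  # hypothetical log-odds weights
weights = np.clip(fitted, 0.0, None)                 # negative weights -> 0

patient = np.array([1.0, 0.0, 1.0, 0.0, 1.0])  # codes present for one patient
lp_raw = patient @ fitted    # raw linear predictor: 0.30 - 0.25 - 0.40
lp_adj = patient @ weights   # adjusted: only the 0.30 counts
print(lp_raw, lp_adj)
```

The adjusted linear predictor can only go up as more conditions are coded, which removes the incentive to game expected counts through extra coding.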


If I am going to play basketball against an unknown team, I want to know how well their players shoot, their heights, speed, etc. I don’t need to risk adjust my potential performance based on the quality of the popcorn in the stands or their jersey colors. Risk adjustment should follow causal thinking, in my opinion.

Can’t argue against the issue of coding accuracy. Or completeness. My sense is that there aren’t many hospitals that “code just enough to justify the reimbursement” anymore; there’s too much riding on outcomes reporting, risk-adjusted premiums, and the like for coding just the bare minimum to be a viable strategy. So the gap between the maximal coders and the minimal coders is shrinking. But it will never go away.

It’s always interesting/distressing that when you read a case study of how some facility improved its mortality performance, it seems like two-thirds of the time they moved the O/E denominator by improving their coding.


The ratio of observed to expected cannot be used to compare hospitals, because of Simpson’s paradox. See van de Mheen and Shojania 2014.
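A toy numeric illustration of the point (all rates and case-mix numbers invented): a hospital can beat its comparator within every risk stratum yet show a worse overall O/E because of case mix.

```python
# Toy illustration: hospital A has a LOWER observed/expected ratio than B
# within every risk stratum, yet its overall O/E looks worse because A
# treats mostly high-risk patients. All numbers are invented.
expected_rate = {"low": 0.10, "high": 0.40}      # stratum expected admission rates
stratum_oe = {"A": {"low": 0.8, "high": 1.1},    # A beats B in both strata
              "B": {"low": 0.9, "high": 1.2}}
case_mix = {"A": {"low": 10, "high": 90},        # patient counts per stratum
            "B": {"low": 90, "high": 10}}

def overall_oe(h):
    obs = sum(case_mix[h][s] * expected_rate[s] * stratum_oe[h][s]
              for s in expected_rate)
    exp = sum(case_mix[h][s] * expected_rate[s] for s in expected_rate)
    return obs / exp

print(overall_oe("A"), overall_oe("B"))  # A's overall ratio exceeds B's
```

The overall O/E is a case-mix-weighted average of the stratum-specific ratios, so two hospitals with different case mixes are being averaged over different weights; that is the aggregation problem behind the paradox.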


I’m also troubled by the approach taken here, but the ‘primary insult’ arises not from statistical technique, but from the philosophy of science that undergirds the whole project. This entire technocratic enterprise of ‘risk adjustment’ proceeds from something akin to a Machian idealism that supposes the phenomenalistic deliverance of medical billing codes sufficient to the task of centrally administering a health care system:

First, what do biostatisticians take as their phenomenalistic deliverance, analogous to the dials on Mach’s lab instruments? Here, it seems you have the data you tabulate, and quantities readily calculable from those data. This is, after all, your ‘direct experience’. 5/

— David C. Norris, MD (@davidcnorrismd) March 18, 2019

Beneath all of this seems to be the faith that we can measure “metaphorical oomphs” …

The primary focus in interpreting therapeutic clinical research data should be on the treatment (“oomph”) effect, a metaphorical force that moves patients given an effective treatment to a different clinical state relative to their control counterparts.

Mark DB, Lee KL, Harrell FE. Understanding the Role of P Values and Hypothesis Tests in Clinical Research. JAMA Cardiol. 2016;1(9):1048. doi:10.1001/jamacardio.2016.3312.

… the deeper (mechanistic/theoretical) probing of which is thought to be neither necessary nor even desirable.

By contrast, a genuinely scientific approach to risk adjustment would begin with deep theorizing about processes of disease progression and management at the individual-patient level, and would always recur to this level as providing a principal criterion of truth.

I put it to you that a risk-adjustment model for HF care is plainly invalid if it eschews theoretical knowledge sufficiently detailed for retrospective analysis to identify care process errors in particular cases. (For example: this patient’s self-recorded weights at home were clearly increasing for 4 days and should have triggered a nurse home visit.) Without such a concrete grounding in the care of the individual person, this discussion can only float off into the abstract realm of statistical method. There will be no realistic basis for correcting errors, or refining understanding.

Since a dispute with an economist prompted the OP, I’d like to advance a variation on the old economist-deflating jibe, “If you’re so smart, how come you ain’t rich?” In this case it’s “If your risk-adjustment models are so great, how come you asked a nurse to recommend a cardiologist for your parent with heart failure?”

As a rule, I generally regard survey research as belonging in the same do-not-read class as nutritional epidemiology. But I strongly suspect a survey of nurses would perform better than any of these risk-adjustment models, at least for identifying high-quality doctors within and across institutions.

We all know that, when it comes to things we sincerely care about (like care for ourselves and family), we use our personal connections and not statistical models to assess quality-of-care questions. These risk-adjustment models are by and large about the care that elites are willing to accord to ‘the other people’.


Here you go:


Boback, I agree that ‘treatment effects [harms & benefits] constitute substantial evidence for determining the efficacy of effects’, at least in the short term.

Insofar as theorizing goes, access to data is obviously key. Unless data are shared, theorizing suffers. The scientific qualifications needed to evaluate and forge theory are not yet well articulated, IMO.
