An example of ML methods used for predicting adverse selection risks in health care

This article, part of a PhD thesis, uses random forests (RF) and gradient boosting to predict the risk of adverse selection in health care.

Adverse selection in health care occurs when, for example, disproportionately many applicants with a pre-existing condition sign up for a policy. This affects the financial risk borne by the insurer.

What strikes me as novel is that economic risk calculation has traditionally been done with regression modeling. Although the methods studied are not yet applied in practice, the article does show that machine learning is starting to get noticed for financial risk calculation in health care as well.

What makes the case more credible is that the dataset is of considerable size, namely 17 million observations. At that scale many of the trade-offs associated with classical statistical modeling become more or less irrelevant. Or at least, that is the discussion.

Remarkably, the differences between the conventional approach and random forests or gradient boosting were not that large. Then again, with large sums of money at stake, small differences tend to matter.

For large-scale financial decision making I have myself been leaning more towards (Bayesian) statistical modeling, the motivation being that such a model is simpler to explain and can propagate all the uncertainty through the inference. But I suppose that if the features make sense a priori and the sample size is very large, explainability and proper uncertainty propagation can at some point lose out to a very good fit.

One missed opportunity in the article is that the differences between the modeling approaches were not tested statistically, for example using several runs of cross-validation or another method. This could have added statistical assurance on top of a machine learning method. Simply stating a difference without characterizing the distribution of that difference happens a lot in machine learning; it leaves the whole exercise open-ended as far as I am concerned.

From a decision-making point of view, what I learned from evaluating this article is that comparing models in a statistically valid way, using for example several runs of cross-validation followed by a permutation test or a Bayesian model, makes the properties of the underlying model less critical.
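The comparison I have in mind could be sketched roughly as follows: score both models over repeated cross-validation folds, then apply a paired sign-flip permutation test to the per-fold score differences. This is only an illustration on synthetic data (the dataset, features, and model settings here are stand-ins, not those of the article):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the spend data (hypothetical; not the article's data)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Several runs of cross-validation: 5 folds x 4 repeats = 20 paired scores
cv = RepeatedKFold(n_splits=5, n_repeats=4, random_state=0)
scores_lin = cross_val_score(LinearRegression(), X, y,
                             cv=cv, scoring="neg_mean_squared_error")
scores_gbm = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                             cv=cv, scoring="neg_mean_squared_error")

# Paired sign-flip permutation test on the per-fold score differences:
# under the null (no difference), each difference is equally likely to
# have either sign, so we compare the observed mean against sign-flipped means
diffs = scores_gbm - scores_lin
observed = diffs.mean()
rng = np.random.default_rng(0)
flips = rng.choice([-1.0, 1.0], size=(10_000, diffs.size))
perm_means = (flips * diffs).mean(axis=1)
p_value = (np.abs(perm_means) >= abs(observed)).mean()
print(f"mean score difference: {observed:.2f}, p = {p_value:.4f}")
```

The point is not the specific test: any procedure that puts the observed difference against its distribution under resampling would give the comparison the statistical footing I found missing.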

The article can be found here.


For this topic to be appropriate to datamethods you need to introduce the topic, explain why others should know about it, and possibly describe any input or discussion you are seeking. Please edit the top post accordingly.

Yes, done. I added my thoughts on the article; much better indeed.


I am not a credentialed statistician, but I have been involved in some large-ish scale risk models for health care.

I find it interesting that they chose to collapse the data into the unique combinations of the various risk factors. Let’s call each of those groups a “cohort.” And for each cohort, the dependent/predicted variable was the mean spend. PMPY (per member per year), presumably.
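That collapsing step might look something like the sketch below, with entirely hypothetical column names (the article's actual risk factors are not reproduced here). Note how the within-cohort spread, retained here as `spend_sd` only for illustration, disappears once `mean_spend` becomes the target:

```python
import pandas as pd

# Hypothetical case-level claims data; columns are illustrative only
cases = pd.DataFrame({
    "age_band": ["18-34", "18-34", "18-34", "65+", "65+"],
    "sex":      ["F", "F", "M", "M", "M"],
    "prior_dx": [1, 1, 0, 1, 1],
    "spend":    [120.0, 9800.0, 300.0, 4000.0, 4600.0],
})

# Collapse to one row per unique risk-factor combination ("cohort"),
# with mean spend as the modeling target
cohorts = (cases
           .groupby(["age_band", "sex", "prior_dx"], as_index=False)
           .agg(mean_spend=("spend", "mean"),
                spend_sd=("spend", "std"),
                n=("spend", "size")))
print(cohorts)
```

In this toy example the first cohort averages two wildly different spends (120 and 9800), which is exactly the kind of within-cohort variability that the mean target hides from the model.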

The lure of using machine learning methods like random forests and gradient boosting is that those methods can purportedly detect interactions that are unknown to the modelers and therefore not included in the build process for a typical regression model.

But within many of the cohorts, experience with health care spend data suggests that the spend amount is not homogeneous; there will be a very wide distribution of values.

By using cohorts that mask the true variability at the case level, they have allowed the ML models to detect interaction patterns that operate at the cohort level, but at the same time have prevented the algorithms from detecting any interactions that might help explain or predict the most extreme within-cohort residuals.

At least, that’s the way I see it.