From Statistical Thinking News: Untested AI models in clinical care

I found this on another page Dr. Harrell edits: Statistical Thinking News

This was a growing concern when I was in a direct patient care role, and it remains one. There has been growing pressure from non-medical administrators to substitute “Big Data” and “Machine Learning” for the expertise of clinical teams.

The problem: a lack of what Nassim Nicholas Taleb calls “skin in the game” among those who push for implementation of these models.

Those who question managerial competence to select algorithms for decisions that they are not legally permitted to make are generally weeded out via a process akin to natural selection.

The deployment of a largely untested proprietary algorithm into clinical practice—with minimal understanding of the potential unintended consequences for patients or clinicians—raises a host of issues.


I think the issue here isn’t that the “AI and ML models” are untested: they are typically trained using large samples with cross-validation that includes a randomly selected holdout test / validation group. To me, the issue is that there isn’t a clear process for labeling these in the FDA approval process that provides something akin to the Prescribing Information for a drug. Usually, when people say “AI” they mean “I don’t want to try to explain something that I don’t think you’ll understand.” That isn’t a level of transparency that is sufficient for FDA labels, but there are now more than 100 software devices with some form of “AI” in the label. The other issue is the representativeness and generalizability of the training data. Many of these algorithms include tens or hundreds of thousands of patients, but may be limited to a single clinical site or region. Nor do you typically see a CONSORT-style diagram showing how patients were included in (or excluded from) model training. So, there is often a generalizability question that is hard for the end user to answer. Still, the deluge of data relevant to clinical decision making has been growing exponentially, and the only way for a clinical team to make sense of it all will be assistance from clinical decision support algorithms.
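To make the CONSORT-style point concrete, here is a minimal sketch of the bookkeeping such a diagram would summarize for a training cohort. The `Patient` fields, the exclusion rules, and the toy records are all hypothetical and chosen only for illustration; a real cohort definition would be far more detailed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical fields, for illustration only.
    age: int
    site: str
    has_outcome_recorded: bool


def cohort_flow(patients, exclusions):
    """Apply exclusion rules in order and record how many patients each removes.

    Returns the retained patients and a list of
    (step label, number excluded at this step, number remaining).
    """
    remaining = list(patients)
    flow = [("Assessed for eligibility", 0, len(remaining))]
    for reason, rule in exclusions:
        kept = [p for p in remaining if not rule(p)]
        flow.append((reason, len(remaining) - len(kept), len(kept)))
        remaining = kept
    return remaining, flow


# Toy records, not real data.
patients = [
    Patient(67, "A", True),
    Patient(15, "A", True),
    Patient(54, "A", False),
    Patient(71, "A", True),
    Patient(80, "B", True),
    Patient(16, "B", False),
]

# Hypothetical exclusion criteria, applied in this order.
exclusions = [
    ("Age under 18", lambda p: p.age < 18),
    ("Outcome not recorded", lambda p: not p.has_outcome_recorded),
]

training_set, flow = cohort_flow(patients, exclusions)

# Sites represented in the final training set: a first hint about
# how far the model can reasonably be generalized.
site_counts = dict(Counter(p.site for p in training_set))

for label, n_excluded, n_remaining in flow:
    print(f"{label}: excluded {n_excluded}, remaining {n_remaining}")
print("Training set by site:", site_counts)
```

Publishing a table like `flow` alongside `site_counts` would let an end user see at a glance who was left out of training and where the retained patients came from.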

Whatever validation is being done, it apparently isn’t rigorous enough, because this isn’t the first prediction model to fail in practice. From another thread:

My concern is that your description of “AI” as “something I don’t want to explain” is largely correct. The danger is it grants financial decision makers and developers of these algorithms too much discretionary authority over clinical decision making, when the one responsible for negative outcomes is ultimately the medical team implementing it.

These are actuarial models, whether they are called that or not. Considering the trusted role insurance companies play in the economy, their decision procedures are scrutinized. Something more than internal validation needs to be done before widespread implementation of these models.


Interesting point about actuarial models. My view is we need a more unified framework for FDA review and approval of these models, which have been considered “under enforcement discretion” largely because the end user, a clinician who is responsible, is supposed to be able to understand and interpret the output and act accordingly.

Can you expand a little bit on what we can learn for deployment of actuarial models? This is a line of thought I haven’t heard before and it seems really interesting.

@timdisher @mshapiro123

In the U.S., the Federal government is the largest single payer for health expenditures (36% in 2020, according to this CMS document (PDF)).

It has long been a goal to rein in the growth rate in Federal spending on healthcare, as growth has been a multiple of the inflation rate.

Regardless of the merit of the premise, healthcare finance regulators operate under the assumption that fee for service incentivizes this excessive cost growth, and have been pushing a move to so-called “value based” healthcare, which really means shifting cost risk onto health service organizations for individual clinical decisions. The idea is to use “market incentives” to foster effective, cost efficient care, control “waste” and expand coverage. I alluded to the convergence of actuarial models with clinical prediction models in this thread:

My personal experience where the fusion of actuarial thinking and clinical thinking has great potential is in the accountable care model known as PACE (Programs of All-Inclusive Care for the Elderly). Economically, a PACE program might be thought of as a managed care organization “owned” by the participants (a regulatory term for eligible individuals who choose to have health care provided by the program) but paid for by a fixed fee per member by the government (a capitated managed care plan). There is no “profit” per se as these organizations are operated as non-profit charitable entities.

The organization must figure out how all participants’ individual needs can be met with the resources provided to it via the monthly fee it receives per member. There is some risk adjustment done by CMS that considers age, medical history, etc. in determining what the monthly premium should be.

These PACE programs service the most medically complex and at-risk populations by keeping those who might ordinarily need nursing home care, living in the community.

The front line clinicians in these settings are doing the best they can with very limited decision tools. In other settings (where cost reduction is directly related to profit) I worry these “AI” models might be used simply to cut costs by rigging the guidelines for care provision, which might adversely affect those who are simply expensive to care for.

My attitude is that clinical prediction models should be explicit, and assessed like insurance ratemaking models. Of particular interest is the dialogue that goes on between regulators and the insurance companies regarding the inputs to a model. The NAIC (National Association of Insurance Commissioners) has a few free publications that describe how insurance regulation balances the need for economic efficiency across state borders with autonomy at the local level.

Here is a recent document on the principles behind the calculation of insurance rates from the Casualty Actuarial Society on predictive models (pdf). This is the rigorous way of examining predictive models by entities and experts who have what Taleb refers to as “skin in the game.”

A more detailed discussion is found here (pdf) Insurance Rating Variables: What They Are and Why They Matter.

An actuarial ratemaking model has both a probability (predictive) component and a cost (utility) component. We might evaluate clinical prediction models as a subset of a full actuarial model, based upon how accurate the probability estimate is, without full consideration of the economics of the situation.
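As a rough illustration of that split, the sketch below scores the probability component on its own with the Brier score, then combines the same probabilities with a cost per event to get an expected cost. The `Case` fields, the cost figure, and the toy numbers are assumptions made up for the example; they do not come from any of the linked documents.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical fields, for illustration only.
    predicted_prob: float  # model's estimated probability of the event
    event: int             # 1 if the event occurred, 0 otherwise
    cost_if_event: float   # cost incurred when the event occurs


def brier_score(cases):
    """Mean squared difference between predicted probability and outcome.

    This evaluates the probability (predictive) component alone.
    """
    return sum((c.predicted_prob - c.event) ** 2 for c in cases) / len(cases)


def expected_cost(cases):
    """Sum of probability times cost: the probability and cost components combined."""
    return sum(c.predicted_prob * c.cost_if_event for c in cases)


def observed_cost(cases):
    """Cost actually incurred, for comparison with the expected cost."""
    return sum(c.event * c.cost_if_event for c in cases)


# Toy numbers, not real data.
cases = [
    Case(0.1, 0, 20_000.0),
    Case(0.8, 1, 20_000.0),
    Case(0.3, 0, 20_000.0),
    Case(0.6, 1, 20_000.0),
]

brier = brier_score(cases)
predicted_total = expected_cost(cases)
observed_total = observed_cost(cases)

print(f"Brier score: {brier:.3f}")
print(f"Expected cost: {predicted_total:.0f}, observed cost: {observed_total:.0f}")
```

A clinical prediction model assessed only on something like `brier_score` covers the first half of this picture; an actuarial review would also ask whether the expected and observed costs line up.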

Great discussion. A couple of random thoughts:

  1. The discussion is largely about machine learning, not AI
  2. Machine learning algorithms need to be strongly tested to ensure that they do not use any post-time zero inputs
  3. The effect on the outputs of measurement errors in the inputs needs to be quantified
  4. Adversarial testing with respect to extrapolation needs to be carried out

On the last point: multivariable models fitted to collinear predictors can give wild predictions when they are tested in independent samples with different collinearities, and machine learning algorithms need to be battle tested for the same problem.
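The collinearity problem can be shown with a small simulation. The sketch below is only an illustration under assumed settings (two predictors, a true model `y = x1 + x2 + error`, ordinary least squares without an intercept, and arbitrary sample sizes and seed); it is not a recipe for adversarial testing of a real model.

```python
import math
import random


def simulate(n, rng, collinear, noise_sd=0.005):
    """Draw n rows from y = x1 + x2 + error.

    When collinear is True, x2 is almost a copy of x1;
    otherwise x2 is drawn independently of x1.
    """
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        if collinear:
            x2 = x1 + rng.gauss(0.0, noise_sd)
        else:
            x2 = rng.gauss(0.0, 1.0)
        y = x1 + x2 + rng.gauss(0.0, 1.0)
        rows.append((x1, x2, y))
    return rows


def fit_ols(rows):
    """Least squares for y = b1*x1 + b2*x2 (no intercept), via the 2x2 normal equations."""
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    s1y = sum(x1 * y for x1, _, y in rows)
    s2y = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def rmse(rows, b1, b2):
    """Root mean squared prediction error of the fitted model on rows."""
    sq_err = sum((y - b1 * x1 - b2 * x2) ** 2 for x1, x2, y in rows)
    return math.sqrt(sq_err / len(rows))


rng = random.Random(2021)  # arbitrary seed, fixed for reproducibility
train = simulate(100, rng, collinear=True)
same_structure = simulate(1000, rng, collinear=True)
shifted_structure = simulate(1000, rng, collinear=False)

b1, b2 = fit_ols(train)
rmse_same = rmse(same_structure, b1, b2)
rmse_shifted = rmse(shifted_structure, b1, b2)

print(f"Fitted coefficients: b1={b1:.2f}, b2={b2:.2f} (true values are 1 and 1)")
print(f"RMSE, same collinearity as training:  {rmse_same:.2f}")
print(f"RMSE, predictors no longer collinear: {rmse_shifted:.2f}")
```

In the training data only the sum of the two coefficients is well determined, so the individual estimates can be far from 1 while a test sample with the same collinearity still looks fine. Once the predictors decouple, those poorly determined coefficients drive the predictions, which is the kind of failure an internal holdout split cannot reveal.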

Perhaps this belongs in another thread, but here is another example where ML and “big data” failed, even within the most credible organization (Google) and a huge data set.

I used to use Google’s Flu tracker when running influenza clinical trials and it was actually quite a useful and powerful tool at the time. My sense was that (in keeping with Frank’s comment above) it started to become less useful and less accurate as it started to influence people and then biased its own predictions based on that influence. However, I think the field understands how to do these types of models better now than it did in 2009. I think of this project as an intellectual forerunner to a bunch of Covid-19 symptom monitoring tools. For example, I ran Beat19, which surveyed 5,000 people daily about symptoms during 2020 in order to create a dataset about the time course of pre-symptomatic and symptomatic Covid-19 infection. Facebook did something similar with CMU, and there were several app developers who also used symptom trackers to provide some actionable information. People also consider Watson Health to be a “failure” and I have my own opinions about what they did right and wrong, but I strongly believe that there will be successors that will succeed where these early attempts did not live up to the promise (or hype).

I don’t think anyone argues there have been literally zero advances in ML, but my concerns are (1) the excess hype, and (2) agents making use of these models (i.e., administration) and then shifting the consequences onto less powerful groups who are too disabled to advocate for their interests, or too busy working to engage in PR.

There remains a lot of what Harry Crane calls “Naive Probabilism” in health science, but there are no incentives for eventual correction, as there are in the insurance and actuarial realm.

Added: MIT Technology Review article on ML prediction during pandemic.

But many tools were developed either by AI researchers who lacked the medical expertise to spot flaws in the data or by medical researchers who lacked the mathematical skills to compensate for those flaws.

Wynants asked one company that was marketing deep-learning algorithms to share information about its approach but did not hear back. She later found several published models from researchers tied to this company, all of them with a high risk of bias. “We don’t actually know what the company implemented,” she says.


Interesting follow-up. Some colleagues and I were recently asked to author a chapter in a textbook for clinicians on AI in medicine. This discussion has been helpful as I try to lay out all the traps and misaligned incentives that the practicing clinician faces when presented with algorithms. Any other cautionary or advisory ideas are welcome.

I have a few blog articles on the subject at - see especially the one about whether ML has mesmerized medicine.

Thanks. The Web site refresh looks good.

Here are a few stat articles worth studying. Conceptually the first article, by Hand, seems most important, as it describes (1) the diminishing marginal returns of incremental information and (2) how uncertainties introduced by the modelling process and changes in the population over time can render any incremental predictive accuracy irrelevant.

Hand, D. J. (2006). Classifier technology and the illusion of progress. Statistical Science, 21(1), 1–14.

Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., & Van Calster, B. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110, 12–22.

Austin, P. C., Harrell, F. E., & Steyerberg, E. W. (2021). Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N, small p” setting. Statistical Methods in Medical Research, 30(6), 1465–1483.