Predicting rare outcomes

The rate of breakdowns of racehorses in the US is 1.22 per 1000 horse starts. Numerous attempts are being made to predict which horses are on the verge of breakdown from diagnostic imaging methods, biomarkers and biometric variables, as the attached article mentions. What are the limitations of predicting such a low-rate event? Since we have difficulty predicting more common events, what are the chances that a useful model could be developed?

The biometric data that could predict racehorse injury | Pursuit by The University of Melbourne (

IMO, the biggest challenge you might face is a relatively small effective sample size resulting from the low number of injury outcomes (unless you have a massive number of horse starts). This will limit how precisely you can estimate predicted probabilities and model coefficients unless you can incorporate strong prior information with a Bayesian approach. This article outlines an approach you could use to determine a minimum sample size for the estimation or prediction goal at hand (
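
To make the effective-sample-size point concrete, here is a rough Python sketch of how many breakdowns you would expect for a given number of starts at the quoted rate of 1.22 per 1000, and how wide a 95% Wilson interval for the overall rate would be. The function names and the numbers of starts are my own illustrative choices. This only looks at the precision of the overall rate; it is not the minimum-sample-size calculation from the linked article.

```python
import math

def expected_events(n_starts, rate_per_1000=1.22):
    """Expected number of breakdowns for a given number of horse starts."""
    return n_starts * rate_per_1000 / 1000.0

def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = events / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Hypothetical study sizes, chosen only for illustration.
for n_starts in (10_000, 100_000, 1_000_000):
    events = round(expected_events(n_starts))
    low, high = wilson_interval(events, n_starts)
    print(f"{n_starts:>9} starts: ~{events} events, "
          f"95% CI {1000 * low:.2f} to {1000 * high:.2f} per 1000")
```

The number of events, not the number of starts, is what drives the interval width, which is why rare outcomes need so many starts.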

Thank you for the reference. I was wondering if it is worthwhile to develop a risk prediction model for such rare events. As an example, I don’t think there is a useful model for predicting sudden cardiac death in human athletes, which is an even rarer occurrence. If the prevalence/incidence of an event is low, isn’t a prediction of the adverse event likely to be a false positive?

Yes, the base rate fallacy is a real thing but I don’t think it really applies to a rigorously developed predictive model – keep in mind most derivations thereof assume some “test comes back positive” or “patient is told they tested positive” which implicitly hides some forced choice classification procedure. If the event is truly rare then (provided your model is well-calibrated) most of the predicted risks will still be quite low. So with a good enough sample size you can still get useful results.
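
For anyone who wants to see the forced-choice arithmetic behind the base rate concern, here is a minimal Python sketch using Bayes' rule. The 90% sensitivity and specificity are hypothetical numbers I picked for illustration; only the 1.22 per 1000 base rate comes from this thread.

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value of a yes/no test via Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

base_rate = 1.22 / 1000  # breakdowns per horse start, from the question

# Hypothetical test characteristics (assumptions, not from any study).
print(f"PPV: {ppv(base_rate, sensitivity=0.90, specificity=0.90):.4f}")
```

With these assumed numbers the PPV is only about 1%, so nearly all "positives" are false. That is a property of dichotomizing into positive/negative, though. A calibrated model would instead report a risk for each horse, and a risk several times the baseline can still be useful even when it is small in absolute terms.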

What are your thoughts on Firth’s correction for low event prediction models?

IJERPH | Free Full-Text | Bring More Data!—A Good Advice? Removing Separation in Logistic Regression by Increasing Sample Size | HTML (

It looks like the paper is subtly advocating a Bayesian approach: “In FC, the likelihood function is penalized by the Jeffreys invariant prior”, which is in line with what I’d probably go with personally for dealing with separation.
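
For reference, the penalization in that quote can be written out (the notation here is mine, not the paper's). With $\ell(\beta)$ the ordinary log-likelihood and $I(\beta)$ the Fisher information matrix, Firth's penalized log-likelihood is

$$
\ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2}\log\det I(\beta),
$$

which is the log-posterior (up to a constant) under the Jeffreys prior $\pi(\beta) \propto \det I(\beta)^{1/2}$, so the Firth estimate is the posterior mode under that prior.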

The Firth correction is a very mild form of penalization. Too mild.

Funny—this exact problem (only in athletes, not horses) is what I’m working on right now.

One alternative to trying to predict the event itself is to move a step or two back up the causal chain that leads to injury. It sounds like horse breakdown is the same fundamental problem as stress fractures in athletes: cumulative damage from biomechanical forces that eventually leads to bone fracture.

So, instead of trying to build a model to predict injuries, which are rare, you could try to build a model to predict the biomechanical forces that cause damage (or predict the amount of cumulative damage itself, e.g. after a given training session). Now, calculating biomechanical forces is not exactly straightforward, but it’s a lot more practical with a small sample size (maybe a few dozen horses + a biomechanics lab) vs. waiting around for years to get many thousands of horse injuries.

Trying to go directly from high-resolution sensor data to injury risk poses another problem: you find yourself in an extreme version of Gelman’s “garden of forking paths”. Even a run-of-the-mill sports watch can give you second-to-second measurements of an athlete’s speed, cadence, heart rate, and more, but injuries only happen after many hundreds of hours of training. So, even with thousands of athletes (or horses), you have millions of datapoints per athlete.

Without a framework for analyzing these data, there are so many ways to slice and dice the sensor data (“change in X in the last Y% of …”, “change from baseline in Z…”, “change in X starting Y weeks prior to injury…”) that the potential for false positives seems very high. I think these kinds of problems can only be approached with a framework that’s informed by the biological process that’s causing the event (in this case, cumulative fatigue failure, which is a well-studied phenomenon in engineering).

So, model the bone damage, and plug it into a probabilistic model of cumulative fatigue failure. Here’s a nice review from my field on this topic for anyone interested.
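
To give a flavor of what "cumulative damage" means in the engineering sense, here is a toy Python sketch of the classic Palmgren-Miner linear damage rule combined with a power-law stress-life curve. This is my own stand-in for illustration, not necessarily the formulation used in the review. The constants `a` and `b`, the stress units, and the session data are all hypothetical, and the probabilistic layer on top of the damage value is not shown.

```python
def cycles_to_failure(stress, a=1e12, b=8.0):
    """Power-law stress-life curve: cycles to failure at a given stress.

    a and b are hypothetical material constants, not fitted values.
    """
    return a * stress ** (-b)

def miner_damage(sessions, a=1e12, b=8.0):
    """Palmgren-Miner rule: sum of (cycles applied / cycles to failure).

    sessions is a list of (stress_amplitude, n_cycles) pairs.
    """
    return sum(n / cycles_to_failure(s, a, b) for s, n in sessions)

# Hypothetical training log: (stress amplitude in arbitrary units, strides).
training_log = [(10.0, 3000), (14.0, 1500), (18.0, 400)]
print(f"Accumulated damage fraction: {miner_damage(training_log):.3f}")
```

Under Miner's rule a damage fraction of 1.0 conventionally corresponds to failure. Note how the steep exponent makes a few high-stress strides count for far more than many easy ones. A realistic bone model would also have to account for repair between sessions, which this sketch ignores.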

I applaud your approach. I think that a realistic mixture of biomechanics expertise and empirical analysis will work. You can try to pre-specify analyses but there is a lot to explore. You can make the explorations safe by using the bootstrap to penalize for some of the exploration.
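
As a rough illustration of what penalizing for exploration with the bootstrap can look like, here is a small pure-Python sketch of an optimism-corrected bootstrap. The "exploration" step is a deliberately crude stand-in (pick the single best-looking feature out of many), the data are pure noise that I simulate, and all function names and sizes are my own assumptions. The key point is that the whole exploration is repeated inside every bootstrap resample.

```python
import random

def auc(scores, labels):
    """Probability that a random case scores higher than a random non-case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def explore(X, y):
    """Stand-in for exploration: choose the best-looking feature and its sign."""
    def col_auc(j):
        return auc([row[j] for row in X], y)
    best = max(range(len(X[0])), key=lambda j: abs(col_auc(j) - 0.5))
    sign = 1.0 if col_auc(best) >= 0.5 else -1.0
    return best, sign

def performance(model, X, y):
    """AUC of the selected (signed) feature on a given data set."""
    j, sign = model
    return auc([sign * row[j] for row in X], y)

def optimism_corrected(X, y, n_boot=50, rng=None):
    """Return (apparent AUC, optimism-corrected AUC)."""
    rng = rng or random.Random(0)
    n = len(y)
    apparent = performance(explore(X, y), X, y)
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if sum(yb) in (0, n):
            continue  # resample has no cases or no non-cases; skip it
        model = explore(Xb, yb)  # redo the exploration in the resample
        optimism.append(performance(model, Xb, yb) - performance(model, X, y))
    return apparent, apparent - sum(optimism) / len(optimism)

# Simulated pure-noise data: 150 horses, 20 candidate features, 15 injuries.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(150)]
y = [1] * 15 + [0] * 135
apparent, corrected = optimism_corrected(X, y, rng=rng)
print(f"apparent AUC {apparent:.2f}, optimism-corrected AUC {corrected:.2f}")
```

Because the features here are noise, the apparent AUC of the hand-picked feature looks better than chance, while the corrected value should fall back toward 0.5. With real data the same recipe applies as long as every data-driven choice is repeated inside the loop.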