Clustering in the denominator: non-independence of starts in racing fatality studies

I’d appreciate the group’s thinking on a methodological question that arises in equine racing fatality epidemiology, though the underlying issue generalizes to any setting where a rare terminal outcome is modeled using repeated exposure units nested within subjects.

Large observational studies of racing fatalities use the individual race start as the unit of analysis, with fatality (yes/no) as the outcome. A recent study reported approximately 3.85 million starts contributed by roughly 251,000 unique horses, with a median of 11 starts per horse (IQR 5–22). Multivariable logistic regression models are fit to identify risk factors for fatality.

When asked whether the non-independence of starts within a horse is accounted for, the investigators offered two arguments:

1. During the modelling process, the variation contributed by individual horses was verified to be small.

2. Horse-level clustering is unlikely to matter because each horse can only die once — the outcome does not repeat within a cluster.

I have concerns about both arguments and would welcome reactions.

On the first argument: I suspect this was assessed via the ICC from a random-intercept model. With an event rate on the order of 1 per 1,000, a small ICC seems nearly guaranteed: there is almost no outcome variance to partition at any level. Is the ICC even interpretable for rare binary outcomes, or is it mechanically driven toward zero regardless of whether meaningful horse-level heterogeneity exists? Would a comparison of robust versus model-based standard errors, or a likelihood ratio test of the variance component, be more informative?
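One detail worth checking is which scale the "small variation" was judged on. For a random-intercept logistic model the ICC is often reported on the latent (logistic) scale,

$$
\mathrm{ICC}_{\text{latent}} = \frac{\sigma_u^2}{\sigma_u^2 + \pi^2/3},
$$

where $\sigma_u^2$ is the horse-level intercept variance and $\pi^2/3 \approx 3.29$ is the residual variance of the standard logistic distribution. That quantity does not shrink mechanically as the event becomes rarer, whereas an ICC computed on the observed 0/1 scale does, so "the variation contributed by individual horses was verified to be small" could mean quite different things depending on which version was inspected.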

The second argument, about the terminal outcome, is the one I find most problematic. The non-recurrence of the outcome doesn't address the actual clustering concern. Starts within a horse share unmeasured characteristics: conformation, genetics, subclinical pathology, training regimen, the risk tolerance of connections. The non-independence lies in the exposure and covariate structure, not in the outcome. A hospital mortality analogy: each patient can die only once, but if patients contribute multiple admissions, we would still account for patient-level clustering, as sketched below.
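In the hospital analogy, the usual remedy doesn't require the outcome to recur within a patient; a GEE with an exchangeable working correlation (or cluster-robust standard errors on an ordinary logistic fit) is the standard device. A minimal sketch, with toy data and hypothetical column names:

```python
# A minimal GEE sketch for the hospital-admission analogy (toy data, hypothetical column
# names): admissions nested within patients, a terminal binary outcome, and an exchangeable
# working correlation to account for patient-level clustering.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

admissions = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    "age":     [70, 71, 72, 55, 56, 80, 63, 64, 65, 66],
    "death":   [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
})

gee = smf.gee(
    "death ~ age",                          # admission-level model for a terminal outcome
    groups="patient",                       # clustering unit: the patient
    data=admissions,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
).fit()
print(gee.summary())
```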

There is also a related selection issue. Horses that accumulate many starts are survivors by definition: the risk set at start 20 has already been purged of horses that died or were retired through injury. If frailer horses exit early and contribute fewer starts, the denominator is progressively enriched for lower-risk observations. This looks like informative censoring that could bias the fatality rate downward and attenuate covariate effects.

Am I right that the “can only die once” argument conflates the terminal nature of the outcome with the independence assumption on the denominator side? And is there a clean way to demonstrate, perhaps via simulation, that ignoring clustering in this setting can meaningfully affect inference even when the ICC appears negligible?
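For concreteness, here is roughly the simulation I have in mind. Every number in it (event rate, frailty SD, covariate effect, career lengths, the early exit of frail horses) is made up for illustration, and the column names are placeholders; it is a sketch, not a claim about the actual data.

```python
# Sketch: horses carry a latent frailty shared across their starts, frail horses tend to
# exit early, and death is terminal. We fit a naive start-level logit and compare
# model-based with cluster-robust standard errors. All parameters are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

n_horses = 20_000
base_logit = np.log(0.001)   # assumed baseline per-start fatality odds of roughly 1 in 1,000
sigma_u = 1.0                # assumed SD of horse-level frailty on the logit scale
beta_x = 0.3                 # assumed effect of an observed horse-level covariate

rows = []
for h in range(n_horses):
    x = rng.normal()                     # observed horse-level covariate (constant within horse)
    u = rng.normal(0.0, sigma_u)         # unobserved frailty shared by all of this horse's starts
    career = int(rng.integers(3, 40))    # planned career length, absent death or early exit
    if u > 0:                            # frail horses tend to exit early (informative exit)
        career = max(3, int(career * np.exp(-0.5 * u)))
    for start_no in range(1, career + 1):
        p = 1.0 / (1.0 + np.exp(-(base_logit + beta_x * x + u)))
        death = int(rng.random() < p)
        rows.append((h, start_no, x, death))
        if death:                        # death is terminal: no further starts for this horse
            break

df = pd.DataFrame(rows, columns=["horse", "start_no", "x", "death"])
print(f"{len(df):,} starts, {df['horse'].nunique():,} horses, fatality rate {df['death'].mean():.4%}")

# Naive start-level logistic regression with model-based standard errors
naive = smf.logit("death ~ x", data=df).fit(disp=False)

# Identical point estimates, but standard errors robust to clustering within horse
robust = smf.logit("death ~ x", data=df).fit(
    disp=False, cov_type="cluster", cov_kwds={"groups": df["horse"]}
)

print(f"beta_x estimate   : {naive.params['x']:.3f} (true value {beta_x})")
print(f"model-based SE    : {naive.bse['x']:.4f}")
print(f"cluster-robust SE : {robust.bse['x']:.4f}")

# Observed-scale analogue of the 'small variation' check
print(f"between-horse variance of mean(death): {df.groupby('horse')['death'].mean().var():.6f}")
```

I haven't run this at scale, so I don't want to prejudge the output; the point is simply that the relevant comparisons (model-based versus cluster-robust standard errors, and bias in the covariate effect relative to its true value under informative exit) can be made directly, whatever they show.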

Any pointers to relevant literature or analogous problems in other fields would be very welcome.

3 Likes

This rather avoids your excellent question, but it may be worth taking stock of the likely correlation structure. A random intercept induces a compound-symmetric correlation pattern that doesn’t respect the forward flow of time. A serial correlation pattern, e.g., a Markov process, may be more likely to hold. This comment would perhaps apply more to a non-death outcome.

As for death: it is a terminal event and is most elegantly handled as an absorbing state in a state-transition model, at least in other settings I’m more familiar with.
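To make that concrete, a toy version with numbers invented purely for illustration: a per-start transition matrix over racing/retired/dead, with the latter two absorbing, gives the cumulative probability of each exit state after k starts.

```python
# Toy absorbing-state illustration (illustrative probabilities only, not estimates):
# per-start transitions among {racing, retired, dead}, with retired and dead absorbing.
import numpy as np

p_death = 0.001    # assumed per-start fatality probability
p_retire = 0.03    # assumed per-start retirement probability

# rows = current state, columns = next state: [racing, retired, dead]
P = np.array([
    [1 - p_retire - p_death, p_retire, p_death],
    [0.0, 1.0, 0.0],   # retired is absorbing in this toy model
    [0.0, 0.0, 1.0],   # dead is absorbing
])

state = np.array([1.0, 0.0, 0.0])   # every horse begins in 'racing'
for k in (10, 20, 50):
    dist = state @ np.linalg.matrix_power(P, k)
    print(f"after {k:>2} starts: racing={dist[0]:.3f}, retired={dist[1]:.3f}, dead={dist[2]:.4f}")
```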

1 Like

I thought about this in light of Efron’s paper on the Stein paradox, which I love because baseball was really what got me interested in statistics. Batting averages and ERA are what we had growing up… it’s a much richer milieu of statistics now.

stein_whitepaper.pdf (869.9 KB)

steinsparadox.pdf (965.8 KB)

1 Like

I might be thinking of this wrong, but my first thought would be to treat each start as an exposure, with n horses contributing different numbers of races, so you’d end up with something like an exponential survival model? Something like death ~ XB + offset(log(starts))?
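Something like the following, if one is willing to assume a constant per-start hazard: collapse to one row per horse with total starts and a 0/1 death indicator, then fit a Poisson GLM with a log(starts) offset (which approximates an exponential survival model). The data frame and column names below are invented placeholders.

```python
# Horse-level Poisson regression with exposure entered as a log offset (toy data,
# hypothetical column names), approximating a constant-hazard survival model.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

horses = pd.DataFrame({
    "death":  [0, 0, 1, 0, 1, 0],
    "starts": [12, 30, 4, 22, 9, 15],
    "age":    [3, 5, 2, 4, 3, 6],
})

fit = smf.glm(
    "death ~ age",
    data=horses,
    family=sm.families.Poisson(),
    offset=np.log(horses["starts"]),   # exposure enters as log(starts)
).fit()
print(fit.summary())
```

With one row per horse the within-horse clustering question largely goes away, though any covariate that varies from start to start (distance, surface, field size) would then have to be summarized or dropped.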

1 Like

James, often your outlook on these problems seems to be that of a reviewer of a paper, which I think puts you at a disadvantage, rather like being faced with the Gordian Knot. Better simply to cut through the thing!

The whole problem set-up here reminds me of the efficient-market hypothesis. There are (I am guessing) many well-informed actors, trainers, vets, jockeys, making decisions about when, whether, and how to race a horse. These decisions are also made on the basis of private information, even something as informal as a trainer’s ‘gut feel’, I would suppose. So the risks a statistician might hope to detect have already been thoroughly ‘priced in’.

One way to cut the knot would be to posit a rational-actors model, and attempt to relate the risk of fatal MSI to the (time-dependent) economic value of the horse to its owners, who would be viewed as solving some kind of stochastic optimization problem.

3 Likes