I’d appreciate the group’s thinking on a methodological question that arises in equine racing fatality epidemiology, though the underlying issue generalizes to any setting where a rare terminal outcome is modeled using repeated exposure units nested within subjects.
Large observational studies of racing fatalities use the individual race start as the unit of analysis, with fatality (yes/no) as the outcome. A recent study reported approximately 3.85 million starts contributed by roughly 251,000 unique horses, with a median of 11 starts per horse (IQR 5–22). Multivariable logistic regression models are fit to identify risk factors for fatality.
When asked whether the non-independence of starts within a horse is accounted for, the investigators offered two arguments:
1. During the modelling process, the variation contributed by individual horses was verified to be small.
2. Horse-level clustering is unlikely to matter because each horse can only die once — the outcome does not repeat within a cluster.
I have concerns about both arguments and would welcome reactions.
I suspect this was assessed via the ICC from a random-intercept model. With an event rate of roughly 1 per 1,000, a small observed-scale ICC seems nearly guaranteed: the Bernoulli variance is tiny, so there is almost no outcome variance to partition at any level, even when horse-level heterogeneity on the latent (logit) scale is substantial. Is the ICC interpretable for rare binary outcomes, or is it mechanically driven toward zero regardless of whether meaningful horse-level heterogeneity exists? Would a comparison of robust versus model-based standard errors, or a likelihood-ratio test of the variance component, be more informative?
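To make the mechanics concrete, here is a minimal simulation sketch (all parameters are illustrative, not taken from the study) contrasting the latent-scale ICC, sigma^2 / (sigma^2 + pi^2/3), with the observed-scale within-horse correlation for a rare versus a common outcome under the same random-intercept variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(sigma_u, base_logit, n_horses=500_000):
    """Two starts per horse, sharing a horse-level random intercept."""
    u = rng.normal(0.0, sigma_u, size=n_horses)       # horse frailty, logit scale
    p = 1.0 / (1.0 + np.exp(-(base_logit + u)))       # per-start event probability
    return (rng.random((n_horses, 2)) < p[:, None]).astype(float)

sigma_u = 1.0  # substantial heterogeneity: ~2.7-fold odds spread per SD
latent_icc = sigma_u**2 / (sigma_u**2 + np.pi**2 / 3)  # prevalence-free

def within_horse_corr(y):
    """Observed-scale correlation between two starts of the same horse."""
    return np.corrcoef(y[:, 0], y[:, 1])[0, 1]

rare   = simulate(sigma_u, base_logit=np.log(0.001 / 0.999))  # ~1/1000 events
common = simulate(sigma_u, base_logit=0.0)                    # ~50% events

print(f"latent-scale ICC:       {latent_icc:.3f}")
print(f"observed corr (rare):   {within_horse_corr(rare):.4f}")
print(f"observed corr (common): {within_horse_corr(common):.4f}")
```

With the same frailty SD, the latent-scale ICC is about 0.23 in both scenarios, while the observed-scale correlation collapses toward zero as the event becomes rare. This is the sense in which a "small" ICC can be mechanical rather than evidence that horse-level heterogeneity is negligible.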
Regarding the second argument, this is the one I find most problematic. The non-recurrence of the outcome doesn't address the actual clustering concern. Starts within a horse share unmeasured characteristics: conformation, genetics, subclinical pathology, training regimen, and the risk tolerance of connections. The non-independence lives in the exposure and covariate structure, not in the outcome. A hospital-mortality analogy: each patient can only die once, but if patients contribute multiple admissions, we would still account for patient-level clustering.
There is also a related selection issue. Horses that accumulate many starts are survivors by definition: the risk set at start 20 has been purged of horses that died or were retired due to injury. If frailer horses exit early and contribute fewer starts, the denominator is progressively enriched with lower-risk observations. This looks like informative censoring that could bias the fatality rate downward and attenuate covariate effects.
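The depletion mechanism is easy to exhibit in a toy simulation (rates and frailty variance deliberately exaggerated for visibility; nothing here comes from the study): each horse's per-start risk is held constant, yet the crude rate falls with start number because death removes the frail from the risk set.

```python
import numpy as np

rng = np.random.default_rng(1)

n_horses, max_starts = 200_000, 20
u = rng.normal(0.0, 1.5, n_horses)                     # horse-level frailty, logit scale
p = 1.0 / (1.0 + np.exp(-(np.log(0.01 / 0.99) + u)))   # constant per-start fatality risk

alive = np.ones(n_horses, dtype=bool)
rate_by_start = []
for _ in range(max_starts):
    died = alive & (rng.random(n_horses) < p)          # deaths among those still racing
    rate_by_start.append(died.sum() / alive.sum())     # naive per-start fatality rate
    alive &= ~died                                     # dead horses exit the risk set

print(f"crude rate at start 1:  {rate_by_start[0]:.4f}")
print(f"crude rate at start 20: {rate_by_start[-1]:.4f}")
```

Under this data-generating process no surviving horse's risk ever changes, yet the crude rate declines across start numbers purely through depletion of the frail; adding injury-driven retirement (informative censoring for reasons short of death) would steepen the decline further.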
Am I right that the "can only die once" argument conflates the terminal nature of the outcome with the independence assumption on the denominator side? And is there a clean way to demonstrate, perhaps via simulation, that ignoring clustering in this setting can meaningfully affect inference even when the ICC appears negligible?
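One direct check is to simulate clustered rare-event data and compare the model-based SE with a cluster-robust (sandwich) SE from the same logistic fit. A hand-rolled sketch (all parameters hypothetical; Newton-Raphson fit and sandwich estimator written out so nothing is hidden in a library) might look like this; whether the robust/model SE ratio is materially above 1 will depend on the frailty SD, the cluster sizes, and whether the covariate varies within or between horses, so varying those inputs maps out when the "small ICC" reassurance does and does not hold.

```python
import numpy as np

rng = np.random.default_rng(2)

# --- simulate clustered rare-event data (all values illustrative) ---
n_horses, m = 100_000, 10                        # horses, starts per horse
sigma_u = 1.0                                    # frailty SD on the logit scale
beta = np.array([np.log(0.001 / 0.999), 0.5])    # intercept, horse-level covariate effect

x_h = rng.integers(0, 2, n_horses)               # horse-level binary covariate
u = rng.normal(0.0, sigma_u, n_horses)           # horse-level random intercept
g = np.repeat(np.arange(n_horses), m)            # cluster (horse) id for each start

X = np.column_stack([np.ones(n_horses * m), x_h[g].astype(float)])
eta = X @ beta + u[g]
y = (rng.random(n_horses * m) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

# --- ordinary logistic fit (ignores clustering) via Newton-Raphson ---
b = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ b))
    H = X.T @ (X * (p * (1 - p))[:, None])       # Hessian (Fisher information)
    b += np.linalg.solve(H, X.T @ (y - p))

p = 1.0 / (1.0 + np.exp(-X @ b))
H = X.T @ (X * (p * (1 - p))[:, None])
V_model = np.linalg.inv(H)                       # model-based variance

# --- cluster-robust sandwich: sum score vectors within each horse ---
scores = X * (y - p)[:, None]
S = np.zeros((n_horses, 2))
np.add.at(S, g, scores)                          # per-horse score sums
V_robust = V_model @ (S.T @ S) @ V_model         # A^{-1} B A^{-1}

se_model = np.sqrt(np.diag(V_model))
se_robust = np.sqrt(np.diag(V_robust))
print("coefficients:", b)
print("SE ratio (robust / model):", se_robust / se_model)
```

Note the fitted covariate coefficient will sit somewhat below the conditional 0.5 because an ordinary logistic fit targets the marginal (population-averaged) effect, which is attenuated relative to the horse-specific effect when there is a random intercept; that attenuation is itself part of the story about what ignoring clustering does to covariate estimates.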
Any pointers to relevant literature or analogous problems in other fields would be very welcome.