Estimating the probability of neonatal sepsis with plug-in incidence

samw235711 · September 10, 2025, 1:40am

Does anyone have experience with the neonatal early-onset sepsis model [1]? It takes variables like the mother’s highest temperature, Group B strep status, etc., and estimates the probability that the newborn will develop sepsis. If estimated correctly, this can help with the decision of whether to give the newborn antibiotics, which is otherwise difficult. Sepsis in the newborn is very bad, but these antibiotics can also have side effects, like ototoxicity.

In the calculator (here is a link), one can plug in a population-level incidence of disease based on, for example, CDC statistics. This seems a bit different to me. My understanding is that something like this would be done when it’s too expensive to collect variables on a random sample of the patients (which would allow the incidence to be estimated). In the derivation, they matched cases and controls. However, I’m not too familiar with this approach, so I hoped to ask others.

Overall, it is great that the authors developed this calculator. Newborns are an under-studied population. The paper is also interesting. However, this is a very high-stakes decision problem, so I think it is worth discussing.

[1] Puopolo KM, Draper D, Wi S, Newman TB, Zupancic J, Lieberman E, Smith M, Escobar GJ. Estimating the probability of neonatal early-onset infection on the basis of maternal risk factors. Pediatrics. 2011 Nov;128(5):e1155-63. doi: 10.1542/peds.2010-3464. Epub 2011 Oct 24. PMID: 22025590; PMCID: PMC3208962.

f2harrell · September 10, 2025, 11:43am

There is an old literature on retrospective vs. prospective logistic regression that shows how to apply a background incidence to correct for oversampling of outcomes. I was never convinced of the assumptions required, e.g., they probably assumed something like when all the covariates are at their means you get the population incidence. That’s a leap.

js592 · September 10, 2025, 9:24pm

Yes, AFAIK the literature on this comes from Prentice and Pyke (https://www.jstor.org/stable/2335158) in the case-control setting and Manski and Lerman (https://www.jstor.org/stable/1914121) from an econometrics “choice-based sampling” perspective. I find the original papers hard to disentangle, particularly with respect to the assumptions.

Scott and Wild (https://www.jstor.org/stable/3088796) give a somewhat more accessible treatment. From what I can tell, the critical assumptions are 1) that the true population prevalence is known; 2) that the logistic regression model is correctly specified; and 3) that the study really was done in a case-control sampling fashion. 1) is usually taken for granted, though perhaps it shouldn’t be. 2) is required because the “intercept-correction” factor relies on the score function/estimating equations. 3) is an important caveat given that uncontrolled observational data may not always be sampled in a case-control fashion, even if there appears to be a discrepancy between the sample case fraction and the population prevalence.
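
For readers who want the intercept correction written out, here is a sketch of the standard argument under the three assumptions above. The notation is my own and not taken from the cited papers: $\tau$ is the population prevalence, $\bar{y}$ the case fraction in the sample, $S = 1$ indicates being sampled, and $\pi_1$, $\pi_0$ are the sampling probabilities for cases and controls, assumed to depend on the outcome only. Matching on covariates would violate that last assumption, so this is not necessarily what the calculator does.

```latex
\begin{aligned}
\text{population model:}\quad
  & \operatorname{logit} P(Y = 1 \mid x) = \alpha + x^\top \beta \\
\text{sampled data:}\quad
  & \operatorname{logit} P(Y = 1 \mid x, S = 1)
    = \underbrace{\alpha + \log\frac{\pi_1}{\pi_0}}_{\alpha^{*}} + x^\top \beta \\
\text{sampling ratio:}\quad
  & \frac{\pi_1}{\pi_0} = \frac{\bar{y} / (1 - \bar{y})}{\tau / (1 - \tau)} \\
\text{corrected intercept:}\quad
  & \hat{\alpha} = \hat{\alpha}^{*}
    - \log\left[ \frac{\bar{y}}{1 - \bar{y}} \cdot \frac{1 - \tau}{\tau} \right]
\end{aligned}
```

Only the intercept is shifted; the slopes $\beta$ are taken directly from the case-control fit.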

samw235711 · September 11, 2025, 12:44am

@f2harrell @js592 thank you. These responses align with my intuition. This is problematic. The American Academy of Pediatrics (AAP) guidelines state that “blood cultures and empirical antibiotic therapy are recommended for infants with an [early onset sepsis] EOS risk estimated at ≥3 per 1000 live births [where EOS is based on the calculator].”

This decision rule does appear to have been studied [2], which is reassuring. I need to look more closely at the study though. Either way, if its output is being used at this scale, the model needs to be as correct as possible. It has been shown to decrease antibiotic use, but how do we know it’s doing it correctly?

I also came across Prentice’s paper [3] on retrospective logistic regression but am still reading it. It is also kind of difficult to get through. I think there should be a fairly straightforward way to describe at least the model behind this. Will look at the linked paper. Having just thought about it quickly, I think it’s almost like fixing the intercept by design (i.e., the design being 3:1 control:case), estimating the other coefficients, and then plugging in whatever intercept one wants when calculating the probability. If so, as you said, I think both the model and the variance estimators would depend on too many assumptions, and, as also noted, an incidence is difficult to estimate.
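
To make the "plug in the intercept" idea concrete, here is a minimal Python sketch. It assumes the simple offset correction for unmatched case-control sampling; I have not verified that the published calculator works this way. The function names, coefficients, covariate values, and incidences are all hypothetical and chosen only for illustration.

```python
import math

def corrected_intercept(sample_intercept, sample_case_fraction, incidence):
    """Shift an intercept fitted on case-control data to a chosen incidence."""
    sample_odds = sample_case_fraction / (1.0 - sample_case_fraction)
    population_odds = incidence / (1.0 - incidence)
    return sample_intercept - math.log(sample_odds / population_odds)

def predicted_risk(intercept, coefficients, covariates):
    """Inverse-logit of the linear predictor."""
    linear_predictor = intercept + sum(
        b * x for b, x in zip(coefficients, covariates)
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical values, NOT the published model's coefficients.
sample_intercept = -1.5    # intercept fitted to a 3:1 control:case sample
coefficients = [0.8, 1.2]  # slopes fitted to the same sample
covariates = [1.0, 0.5]    # one newborn's covariate values

# Two hypothetical plug-in incidences (0.3 and 0.6 per 1000 live births).
for incidence in (0.0003, 0.0006):
    intercept = corrected_intercept(sample_intercept, 0.25, incidence)
    print(incidence, predicted_risk(intercept, coefficients, covariates))
```

The slopes never change here. The user's choice of incidence moves every predicted risk up or down by the same amount on the logit scale.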

[2] Kuzniewicz MW, et al. A quantitative, risk-based approach to the management of neonatal early-onset sepsis. JAMA Pediatrics. 2017;171(4):365-371.
[3] Prentice R. Use of the logistic model in retrospective studies. Biometrics. 1976 Sep;32(3):599-606. PMID: 963173.

f2harrell · September 11, 2025, 11:24am

One thing I’ve never understood is how you can fix the intercept without involving covariate distributions in the population.

js592 · September 11, 2025, 4:16pm

Most of the econometrics literature actually focuses on the general case using the joint likelihood $p(y, x \mid \theta)$ for inference, so I think you’re probably correct. My guess is that the intercept correction method only works if the logistic regression model is correctly specified.
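
One way to probe this is a small simulation. The sketch below is a minimal check, not a proof. It assumes a correctly specified one-covariate logistic model, unmatched sampling that depends only on the outcome (all cases plus three random controls per case), and a known incidence. The true coefficients, population size, and seed are hypothetical choices, and the fit is a hand-rolled Newton-Raphson so that only the standard library is needed.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iterations=25):
    """Newton-Raphson fit of logit P(y=1|x) = a + b*x (one covariate)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = expit(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

random.seed(2025)
TRUE_A, TRUE_B = -6.0, 1.0   # hypothetical population coefficients
N = 200_000

# Simulate the full population from the true logistic model.
population = []
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    y = 1 if random.random() < expit(TRUE_A + TRUE_B * x) else 0
    population.append((x, y))

cases = [row for row in population if row[1] == 1]
controls = [row for row in population if row[1] == 0]
incidence = len(cases) / N   # treated as known

# Outcome-dependent sampling: every case, three random controls per case.
sample = cases + random.sample(controls, 3 * len(cases))
xs = [row[0] for row in sample]
ys = [row[1] for row in sample]

a_star, b_hat = fit_logistic(xs, ys)
sample_fraction = len(cases) / len(sample)
offset = math.log(
    (sample_fraction / (1.0 - sample_fraction)) / (incidence / (1.0 - incidence))
)
a_corrected = a_star - offset

print("slope:", b_hat, "raw intercept:", a_star, "corrected:", a_corrected)
```

Changing the data-generating step to something the fitted model cannot represent (for example, adding a squared term in x) is one way to explore how the correction behaves under misspecification.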

f2harrell · September 11, 2025, 9:22pm

Besides being correctly specified, I’m guessing that there are also assumptions about relative covariate distributions in the two populations.
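
One possible reading of "two populations" is the population the model was derived in versus the population where a different incidence is plugged in. Under that reading, the sketch below checks with Bayes' rule that shifting the logit by the difference in incidence logits is exact when the covariate distributions within cases and within controls are the same in both populations, and not otherwise. The three-level covariate and all numbers are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def posterior(prior, p_x_case, p_x_control):
    """Bayes' rule: P(y=1 | x) for each level of a discrete covariate."""
    return [
        prior * pc / (prior * pc + (1.0 - prior) * pn)
        for pc, pn in zip(p_x_case, p_x_control)
    ]

# Hypothetical covariate with three levels (low, medium, high risk).
p_x_given_case = [0.2, 0.3, 0.5]
p_x_given_control = [0.6, 0.3, 0.1]

source_incidence = 0.0005   # where the model was derived
target_incidence = 0.0020   # incidence plugged in by the user

source_risk = posterior(source_incidence, p_x_given_case, p_x_given_control)

# Plug-in approach: move every source logit by the same constant.
shift = logit(target_incidence) - logit(source_incidence)
shifted_risk = [expit(logit(r) + shift) for r in source_risk]

# Scenario A: target has the same within-group covariate distributions.
target_risk_same = posterior(target_incidence, p_x_given_case, p_x_given_control)

# Scenario B: covariates among target controls are distributed differently.
p_x_given_control_alt = [0.4, 0.3, 0.3]
target_risk_diff = posterior(target_incidence, p_x_given_case, p_x_given_control_alt)

for s, a, b in zip(shifted_risk, target_risk_same, target_risk_diff):
    print(f"shifted={s:.6f}  same-distribution={a:.6f}  different-distribution={b:.6f}")
```

In scenario A the shifted and true risks agree to floating-point precision. In scenario B they do not, which is one way to see why the covariate distributions matter.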