A JAMA paper on School Closing Timing and Covid-19 prevalence and mortality

A friend sent me a blog post that referenced this paper: https://jamanetwork.com/journals/jama/fullarticle/2769034

Given that there are only 50 states and the measures are rather noisy, I was shocked to see substandard statistical methodology, contrary to F. Harrell’s recommendations, in a paper that matters so much for public policy decisions. I would love it if someone like Prof. Harrell, with clout and more research experience than I have, could read and review this paper. I quote from the paper below:

SARS-CoV-2 testing rates varied by state and throughout the study period. To account for this variation, state-level COVID-19 testing [12] (calculated daily as the cumulative number of tests per 1000 residents) was modeled as a categorical variable. State measures of urban population density (a measure of the state’s population density combined with the percentage of residents living in urban areas) [11], percentage of the state’s population with obesity [13], percentage of the state’s population aged 15 years or younger [11], percentage of the state’s population aged 65 years or older [11], and the number of nursing home residents per 1000 people were included [14]. The CDC social vulnerability index also was included; this index accounts for multiple factors, including socioeconomic status, household composition, disability status, race and ethnicity composition, English-language proficiency, housing type, and transportation access, to assess a community’s preparedness for a natural disaster or illness outbreak [15]. Many of these factors have been associated with COVID-19 disease, mortality, or both. All covariates are described in detail in the eMethods in the Supplement.

Statistical Analysis

Interrupted time series analyses were used to compare the daily change in outcomes (daily COVID-19 incidence and mortality) before and after school closure. Acknowledging that school closure and other nonpharmaceutical interventions would not have immediate effects on COVID-19 incidence and mortality, estimates were used to determine when school-based exposure could be expected to lead to changes in COVID-19 incidence and associated mortality (eFigure in the Supplement). A time from exposure to symptom onset of 5 days was assumed per Lauer et al. [16] Given the early emphasis (and some state restrictions [17-25]) on limiting testing to hospitalized patients, time to diagnosis was defined as time between symptom onset and hospitalization (7 days) [26]. For school closure, given the low documented prevalence of COVID-19 in children, an additional period was included for a child to infect an adult, assuming a child exposed at school could expose an adult prior to symptom onset and within 4 days. The analyses for the mortality outcome assumed 17 days from symptom onset to death [27]. For details on lag period calculations, see the eMethods in the Supplement.

Daily COVID-19 incidence and mortality were modeled using negative binomial regression. Interactions between school closure and all included covariates were explored because school closure may affect at-risk communities differently [9]. Given the large number of covariates and interactions considered, a single parsimonious model for each outcome was created selecting covariates from the primary model using a stepwise regression approach, with entry and removal criteria specified as a P value of <.20. Because factors associated with COVID-19 incidence and mortality may vary with school closure, covariate selection was completed independently during and after the lag period (eMethods in the Supplement).
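
To make the lag assumptions in the quote concrete, here is a small tally of the lag periods they seem to imply. This is only my reading of the quoted text: I am assuming the components are simply added in sequence, and the constant and function names are my own. The eMethods may compute the lags differently.

```python
# Lag components in days, taken from the quoted Methods text.
CHILD_TO_ADULT = 4         # child exposed at school exposes an adult within 4 days
EXPOSURE_TO_SYMPTOMS = 5   # exposure to symptom onset
SYMPTOMS_TO_DIAGNOSIS = 7  # symptom onset to hospitalization (proxy for diagnosis)
SYMPTOMS_TO_DEATH = 17     # symptom onset to death


def lag_days(final_step):
    """Total lag, assuming the components add up in sequence (my assumption)."""
    return CHILD_TO_ADULT + EXPOSURE_TO_SYMPTOMS + final_step


incidence_lag = lag_days(SYMPTOMS_TO_DIAGNOSIS)  # 4 + 5 + 7
mortality_lag = lag_days(SYMPTOMS_TO_DEATH)      # 4 + 5 + 17
print(incidence_lag, mortality_lag)              # prints: 16 26
```

Every term in these sums is a point estimate, so the resulting windows carry no expression of how uncertain each component is.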

I don’t have time to study it in detail, but I will offer a few general comments.

  • categorization of variables is likely to result in residual confounding
  • stepwise variable selection is completely unreliable and likely to result in residual confounding
  • thinking that supervised learning (e.g. variable selection) effectively deals with the large number of covariates in any way is a big mistake. The parsimony apparently achieved is only a mirage.
  • the authors made some reasonable assumptions about things such as lag times but a Bayesian model that acknowledged uncertainty in these numbers would be more honest about what is unknown
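
The first point can be illustrated with a toy simulation. This is a minimal sketch with made-up data, not a reanalysis of the paper: a continuous confounder `z` drives both a binary exposure `x` and the outcome `y`, and the true effect of `x` on `y` is exactly zero. Adjusting for `z` through a few coarse categories leaves a spurious exposure "effect"; finer categories shrink it. The function name, effect sizes, and sample size are all hypothetical choices.

```python
import random
import statistics


def adjusted_difference(n_bins, n=20000, seed=1):
    """Exposure 'effect' on y after adjusting for z in n_bins equal-width bins.

    The true effect of x on y is zero by construction, so any nonzero
    result is confounding left over after the categorical adjustment.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random()                    # continuous confounder in [0, 1)
        x = 1 if rng.random() < z else 0    # exposure is more likely when z is high
        y = 10.0 * z + rng.gauss(0.0, 1.0)  # outcome depends on z only, not on x
        rows.append((z, x, y))

    # Average the within-bin exposed-minus-unexposed differences in mean y,
    # weighting each bin by its number of observations.
    total_weight = 0
    weighted_diff = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        exposed = [y for z, x, y in rows if lo <= z < hi and x == 1]
        unexposed = [y for z, x, y in rows if lo <= z < hi and x == 0]
        if exposed and unexposed:
            weight = len(exposed) + len(unexposed)
            diff = statistics.fmean(exposed) - statistics.fmean(unexposed)
            weighted_diff += weight * diff
            total_weight += weight
    return weighted_diff / total_weight


for bins in (1, 2, 20):
    print(bins, round(adjusted_difference(bins), 3))
```

With one bin there is no adjustment at all, and with two bins a substantial spurious difference remains because `z` still varies within each category. Only the fine-grained adjustment brings the estimate close to the true value of zero. Modeling the covariate as continuous and smooth (for example with a spline) avoids having to choose cutpoints at all.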