Looking for guidance on model design for a cohort study with recurrent events




I have data from a study that followed 100 children from birth until their 2nd birthday. Weekly stool samples were examined for presence of a virus and parents were asked whether the child had suffered diarrhea since the last visit.

Samples positive for the virus were not always linked to diarrhea, and diarrhea often occurred in the absence of the virus of interest.

We are interested in estimating parameters that would characterize the following:

  • Duration of viral infection
  • Duration of diarrhea
  • Frequency of diarrhea
  • How all of the above depends on the specific variant of the virus

Intuitively, I feel I should be modelling a baseline diarrhea/infection rate and a change in diarrhea rate/duration based on the presence of infection. However, I’m not sure how to get started. Incorporating the measurement uncertainty that comes from the weekly sampling intervals is also one of my goals, though I suspect this should be trivial compared to setting up the rest of the model.

I would appreciate any recommendations on how to approach this analysis. Perhaps there’s a common name for the implied model setup, a textbook or paper showing how to implement it? I am familiar with Stan and R in case it matters.


An excellent question. Besides having baseline assessments, you have a bivariate outcome: diarrhea and viral infection. It can be difficult to answer, but what are the concerns regarding lag times? That is, do you want to consider only the relationship between diarrhea “now” and viral infection status “now”, or do you think the analysis should build in some kind of lag? Also, do you have viral load quantified? The study may lack precision/power for a binary outcome.


It seems that your intuition about modeling rates and prevalences does not in fact relate to the main hypotheses concerning the duration of infection. Duration of viral infection could be estimated from the repeated test results that you obtain from the weekly stool samples, and duration of diarrhea similarly from the parental reports. With a time series of presence/absence indicators, there are a few approaches one may take. Fitting a repeated-measures logistic model with GEE or a GLMM can give you a rough geometric probability model for the duration of the outcome. This could be further refined by adding genotype indicators.
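
To make the geometric idea concrete, here is a minimal sketch, not a full GEE/GLMM fit. It assumes each child's record is a list of weekly 0/1 indicators, that the weekly persistence probability is constant, and that all children can be pooled. The function names and the toy data are made up for illustration. Under those assumptions, if p is the probability that the outcome is still present one week after a positive week, the episode length is geometric with mean 1 / (1 - p) weeks.

```python
def persistence_probability(series_by_child):
    """Pooled estimate of p = P(present at week t | present at week t-1)."""
    stayed = 0    # positive weeks that were followed by another positive week
    at_risk = 0   # positive weeks that were followed by any observed week
    for series in series_by_child:
        for prev, curr in zip(series, series[1:]):
            if prev == 1:
                at_risk += 1
                stayed += curr
    if at_risk == 0:
        raise ValueError("no positive week is followed by an observed week")
    return stayed / at_risk


def mean_duration_weeks(p):
    """Mean of a geometric episode length (weeks) with weekly persistence p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)


# Toy data (invented): two children, seven weekly indicators each.
toy = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 1, 0, 0],
]
p_hat = persistence_probability(toy)
print(p_hat, mean_duration_weeks(p_hat))
```

A logistic model of this week's status on last week's status generalizes this: the intercept-only version reproduces the pooled proportion, and covariates such as genotype would let p vary between episodes.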

Some considerations: diarrhea is a condition, and the stooling is a symptom. Diagnostic procedures are described here https://www.uptodate.com/contents/acute-diarrhea-in-children-beyond-the-basics. What most parents call “diarrhea” in their kids is just watery stool, and many watery stools do not amount to many episodes of diarrhea. So be clear about whether you are measuring the frequency of watery stools or the incidence of diarrhea. The latter requires that, in one child, one diarrhea episode must resolve and then recur to count as 2 incident diarrhea episodes.

Polyinfection: you must clearly state and describe the effect of simultaneous infection, a common problem from exposure to the types of unclean conditions that cause infection (Hep A + giardia + …). Confirming other possible diagnoses will greatly increase the precision of estimates and reduce possible biases.


Good point. As the virus has an incubation period of up to 4 days, it is reasonable to consider both the “now” and a one-week lag between diarrhea and infection status.
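
As a rough sketch of how the data could be laid out for that, the snippet below builds one row per child-week with both the current and the previous week's infection status. It assumes equal-length weekly 0/1 lists for one child; the function name and the example values are hypothetical.

```python
def add_lagged_infection(infection, diarrhea):
    """Rows of (week, diarrhea, infected_now, infected_last_week) for one child.

    Week 0 is dropped because it has no previous week to take the lag from.
    """
    if len(infection) != len(diarrhea):
        raise ValueError("infection and diarrhea must cover the same weeks")
    rows = []
    for week in range(1, len(diarrhea)):
        rows.append((week, diarrhea[week], infection[week], infection[week - 1]))
    return rows


# Invented example: infection detected in weeks 1-2, diarrhea in weeks 2-3.
rows = add_lagged_infection(infection=[0, 1, 1, 0], diarrhea=[0, 0, 1, 1])
for row in rows:
    print(row)
```

Both infection columns could then enter the diarrhea model as predictors, so the data decide whether the same-week or the lagged status matters more.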

We do have viral load measurements available. I imagine this would be a better measure than mere presence/absence.

Thanks for the suggestion. I’m not completely clear on how to specify such a model. If each measurement is modeled as a binary response, I don’t see how incident infections could be told apart from single sustained episodes. I suppose I could make a model for incidences and a separate one for outcome duration, where the latter only considers measurements where the outcome is present and an additional one at the end of each episode to signal its end.
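
To illustrate what I mean by splitting the data, here is a small sketch that turns one child's weekly 0/1 series into episodes. The number of episodes would feed the incidence model, and the lengths, together with a flag for whether the end was observed, would feed the duration model. The function and field names are my own invention, and lengths are counted in whole weeks, ignoring the uncertainty within each weekly interval.

```python
def extract_episodes(series):
    """Split a weekly 0/1 series into episodes of consecutive positive weeks.

    Each episode is a dict with its start week, its length in weeks, and
    whether its end was observed (False means it was still ongoing at the
    last visit, i.e. right-censored).
    """
    episodes = []
    start = None
    for week, present in enumerate(series):
        if present and start is None:
            start = week                      # a new episode begins
        elif not present and start is not None:
            episodes.append({"start": start, "weeks": week - start, "ended": True})
            start = None                      # the episode has resolved
    if start is not None:                     # still positive at the last visit
        episodes.append({"start": start, "weeks": len(series) - start, "ended": False})
    return episodes


# Invented example: two resolved episodes and one still ongoing at the end.
episodes = extract_episodes([0, 1, 1, 0, 0, 1, 0, 1, 1])
for episode in episodes:
    print(episode)
```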

Many thanks for your additional comments. I verified that a proper definition of diarrhea is being used and that indicators for coinfection are available.