We are studying a collection of reports that give the number of diagnosed cases of a disease at different hospitals before and during the Covid-19 pandemic. Cases are counted over comparable time intervals (for instance March-April 2019 and March-April 2020). Two variants of the disease (A and B) are reported. For instance, one report may say that 1000 patients were diagnosed before Covid-19 and 50 of them had type B, while 500 patients were diagnosed during Covid-19 and 10 had type B.
If we look at the ratio A/(A+B) before and during the pandemic, we see that the ratio has increased. However, this could be due to the absolute incidence of A increasing or the absolute incidence of B decreasing. We would like to know which is more likely the case.
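To make the ambiguity concrete, here is a minimal Python sketch using the example report above. It assumes that every diagnosed patient has either type A or type B, so that A = total - B; the variable names are only for illustration.

```python
# Counts taken from the example report in the text.
pre_total, pre_b = 1000, 50
during_total, during_b = 500, 10

# Assumption: every patient has exactly one of the two types.
pre_a = pre_total - pre_b              # 950
during_a = during_total - during_b     # 490

# Share of type A among all diagnoses in each period.
ratio_pre = pre_a / pre_total          # 0.95
ratio_during = during_a / during_total # 0.98

# Relative change in the absolute counts of each type.
change_a = during_a / pre_a - 1        # about -0.48
change_b = during_b / pre_b - 1        # -0.80

print(f"A/(A+B): {ratio_pre:.2f} -> {ratio_during:.2f}")
print(f"relative change in A: {change_a:+.1%}")
print(f"relative change in B: {change_b:+.1%}")
```

In this example the ratio rises even though both counts fall, simply because B falls proportionally more than A. This is why we want to model the absolute counts rather than the ratio alone.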
Our initial thought is to estimate a multilevel negative binomial regression model with “time” as a covariate representing pre/during, and an adaptive (varying) intercept, and maybe also a slope, for each study, i.e. (using the mean and dispersion parameterisation):
count of type A ~ negbin(mu,disp)
log(mu) = w0i + b0 + b1*time # + maybe w1i*time for an adaptive slope
w0i ~ normal(0,sd_w) # Adaptive intercept per study i, with one sd shared across studies
sd_w ~ normal+(0,1)
disp ~ normal+(0,1)
The same model can be estimated for the count of B, and then we can see which of the two counts has the greater associated change with the pre/during pandemic variable “time”.
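For readers who want to see the pieces of the model spelled out, here is a rough Python sketch of the likelihood and the linear predictor. It is not a fitting routine. It assumes that “time” is coded 0 for pre and 1 for during, and that disp is an inverse-overdispersion parameter with variance = mu + mu^2/disp (the convention used by Stan's neg_binomial_2); the function names and parameter values are hypothetical.

```python
import math

def negbin_logpmf(y, mu, disp):
    """Log-probability of count y under a negative binomial with mean mu.

    Assumption: disp is an inverse-overdispersion parameter, so that
    variance = mu + mu**2 / disp.
    """
    return (
        math.lgamma(y + disp) - math.lgamma(disp) - math.lgamma(y + 1)
        + disp * math.log(disp / (mu + disp))
        + y * math.log(mu / (mu + disp))
    )

def expected_count(b0, b1, time, w0=0.0):
    """Mean on the count scale: log(mu) = w0 + b0 + b1*time."""
    return math.exp(w0 + b0 + b1 * time)

# Hypothetical parameter values, for illustration only.
b0, b1, disp = math.log(900.0), -0.5, 5.0

mu_pre = expected_count(b0, b1, time=0)     # time = 0: pre-pandemic
mu_during = expected_count(b0, b1, time=1)  # time = 1: during pandemic

# Multiplicative change in the expected count; equals exp(b1).
rate_ratio = mu_during / mu_pre
print(f"rate ratio during/pre: {rate_ratio:.3f}")

# Log-likelihood contribution of one observed count from one study.
print(negbin_logpmf(490, mu_during, disp))
```

Under this coding, exp(b1) is the multiplicative change in the expected count from pre to during, so comparing the estimates of b1 from the model for A and the model for B is what the comparison described above amounts to.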
Our main concerns are:
- Some reports in our collection account for thousands of patients, while others account for fewer than 100. Should this be reflected in the model? Since there is no inherent uncertainty about the reported numbers (we want to study the number of diagnoses, and all diagnoses are accounted for), there is no reason to believe that large reports have less variation. In a meta-analysis of RCTs, for instance, it is common to give more weight to reports with large sample sizes, but we feel that this kind of thinking does not apply here.
- Should we be approaching this problem from a completely different angle?
Note that we are well aware that the estimate of “time” in our model is confounded, and we are not attempting to attribute the association to a direct causal effect of Covid-19 on the diagnosis count (that kind of reasoning will be left for later).
Any ideas, thoughts, and comments are welcome!
Thanks!