Estimating totals at group level

robinblythe · October 15, 2024, 2:10am

I have the opportunity to collect some very rich data from an institutional database. My main concern is that I’ll probably only get one shot at a data extract, and want to have a carefully specified analysis plan.

The primary research question is to estimate the average cost to the health system (outpatient, surgery, etc.) by group for around 80 groups. There will be multiple observations per individual in those groups, each corresponding to a single interaction with the health system. Some groups will have thousands of members, but some will likely have very few (probably fewer than 5). Individuals will likely only belong to a single group. I may be able to get an extract of individuals who aren’t members of any of those groups, to serve as controls.

My intention was to use the data to build a microsimulation to estimate the average lifetime cost for an individual in any given group compared to controls, using a regression model to estimate the mean cost per interaction by group. Each row in the dataset will constitute an individual’s interaction with the system, so I think it would be straightforward to turn those interactions into states within a simulation. However, I’m worried that with 80 groups, that will involve ~80 simulation models, and with only a handful of members in some groups there will be a ton of variance.

Would something like a mixed-effects model be more appropriate here? I think it would be fair to assume that the random effects would be normally distributed, but it’s difficult to see what the utility of such a model would be. I would also be worried about correlation between the individual level and group level variables, given individuals will probably only belong to one group.
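To make the possible utility concrete: the main thing a random-intercept model does is partial pooling, where the mean for a tiny group is pulled toward the overall mean while large groups are left almost untouched. The sketch below is only an illustration of that idea, not a full mixed-model fit. It assumes a simple method-of-moments (one-way ANOVA) estimate of the variance components, ignores the clustering of interactions within individuals, and treats costs as roughly normal. The function name and the toy costs are made up.

```python
from statistics import mean

def shrunken_group_means(costs_by_group):
    """Partial-pooling estimates of mean cost per group (illustrative only).

    costs_by_group: dict mapping group label -> list of costs per interaction.
    Assumes at least two groups, at least one group with 2+ observations,
    and some within-group variation.
    """
    groups = list(costs_by_group.values())
    all_costs = [c for costs in groups for c in costs]
    n_total, k = len(all_costs), len(groups)
    grand_mean = mean(all_costs)

    # Pooled within-group variance (sigma^2).
    ss_within = sum(sum((c - mean(costs)) ** 2 for c in costs) for costs in groups)
    sigma2 = ss_within / (n_total - k)

    # Between-group variance (tau^2) via the unbalanced one-way ANOVA estimator,
    # truncated at zero.
    ss_between = sum(len(costs) * (mean(costs) - grand_mean) ** 2 for costs in groups)
    n0 = (n_total - sum(len(costs) ** 2 for costs in groups) / n_total) / (k - 1)
    tau2 = max(0.0, (ss_between / (k - 1) - sigma2) / n0)

    estimates, weights = {}, {}
    for label, costs in costs_by_group.items():
        # Weight on the group's own mean: near 1 for large groups,
        # smaller for groups with few observations.
        w = tau2 / (tau2 + sigma2 / len(costs))
        weights[label] = w
        estimates[label] = w * mean(costs) + (1 - w) * grand_mean
    return estimates, weights

# Hypothetical toy data: group B has only two interactions.
toy = {
    "A": [100, 120, 110, 90, 105, 95, 115, 100, 110, 105],
    "B": [300, 260],
    "C": [150, 170, 160, 140, 155, 165],
}
est, weights = shrunken_group_means(toy)
for label in toy:
    print(label, round(mean(toy[label]), 1), round(est[label], 1), round(weights[label], 3))
```

The smallest group gets the lowest weight on its own mean, so its estimate moves furthest toward the overall mean. A real analysis would more likely fit this with a mixed-model package and add a second random effect for individuals.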

f2harrell · October 15, 2024, 12:05pm

This doesn’t answer your question but be sure to capture dates of all costs so you can account for time trends flexibly and be able to do any discounting you want.

robinblythe · October 22, 2024, 5:18am

Thanks and agreed, although I may only be able to get as granular as the year in which the costs were incurred. I suspect something like day or month will be considered ‘re-identifiable’.
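Year-level dates should still be enough for the discounting mentioned above. As a rough sketch, each cost can be converted to its value in a chosen base year with the usual formula cost / (1 + rate)^(year - base_year). The function name, the 3% rate, and the example rows are assumptions for illustration only, and adjusting historical costs for inflation would be a separate step.

```python
def discount_to_base_year(cost, year, base_year, rate=0.03):
    """Value of a cost incurred in `year`, expressed in `base_year` terms.

    rate is an assumed annual discount rate, not a recommendation.
    Costs after the base year are shrunk; costs before it are inflated.
    """
    return cost / (1.0 + rate) ** (year - base_year)

# Hypothetical rows: (cost, year the cost was incurred).
rows = [(1000.0, 2024), (1000.0, 2025), (1000.0, 2034)]
discounted = [discount_to_base_year(cost, year, base_year=2024) for cost, year in rows]
print([round(d, 2) for d in discounted])
```

Keeping the raw year in the extract, rather than pre-discounted costs, leaves the choice of rate and base year open for later.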