Multilevel modeling and appropriate use of clusters

michael_aj · March 5, 2021, 12:30am

Hi All,
Hopefully this is a fairly simple question for all of you, but my familiarity with multilevel models is a bit lacking.

I’ve been asked to work with a dataset with an outcome related to individual animals, some information on the individual animal and year, and then several variables describing the size and type of farm, as well as an individual identifier for farms. Many of these farms have contributed several animals to the dataset, with larger farms generally being more likely to have more observations. The objective is inferential modeling to estimate the temporal trend in the data.

Initially, I had planned to cluster on the farm identifier and include animal information, year, farm size, and farm type as fixed effects in the model. However, I’m concerned that this would not be appropriate because farm size and farm type are measured at the same level as the clustering.

Is a more reasonable approach to use a multilevel model that clusters first on farm size and farm type, then clusters within that by individual farms? Or would it be no better than clustering on individual farm without including farm size and type in the model?

Thanks!
Disclaimer: My program doesn’t discuss multilevel modeling much and sticks to frequentist statistics, so I’m still working on teaching myself these models and Bayesian statistics.

s_doi · March 5, 2021, 5:41pm

James Hodges has suggested that new-style random effects should not be used for inferential modelling. Given that ignoring clustering generally means that variances are underestimated, perhaps estimating the standard errors with Huber-White sandwich estimators is the answer. Many packages do this automatically, e.g. vce(cluster farm) in Stata, where farm is an identifier for the farm each animal came from. Whether farm type or size should be in the model depends on the goal of the analysis. If the goal is to make claims about the relationship between these variables and the dependent variable, then their inclusion should be based on an examination of the causal structure of the problem under analysis, perhaps using a DAG.
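
For anyone who wants to see what the cluster-robust sandwich estimator does under the hood, here is a minimal sketch in Python with NumPy. It is only an illustration, not the original poster's analysis. The data are simulated, and the variable names (`farm_ids`, `year`, `farm_size`), the effect sizes, and the linear outcome model are all assumptions made for the example. The finite-sample correction used is one common choice, and other software may scale the variance differently.

```python
import numpy as np


def cluster_robust_ols(X, y, groups):
    """OLS fit returning coefficients, cluster-robust SEs, and naive SEs.

    The cluster-robust covariance is the sandwich
    (X'X)^-1 [sum_g X_g' e_g e_g' X_g] (X'X)^-1,
    scaled by a common finite-sample correction.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    n, k = X.shape

    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta

    # "Meat": outer products of the per-cluster score sums.
    cluster_ids = np.unique(groups)
    meat = np.zeros((k, k))
    for g in cluster_ids:
        mask = groups == g
        score = X[mask].T @ resid[mask]
        meat += np.outer(score, score)

    n_clusters = len(cluster_ids)
    correction = (n_clusters / (n_clusters - 1)) * ((n - 1) / (n - k))
    cov_cluster = correction * (bread @ meat @ bread)
    se_cluster = np.sqrt(np.diag(cov_cluster))

    # Naive SEs that treat all animals as independent observations.
    sigma2 = resid @ resid / (n - k)
    se_naive = np.sqrt(np.diag(sigma2 * bread))
    return beta, se_cluster, se_naive


# Simulated (hypothetical) data: 40 farms, 5 to 15 animals each.
rng = np.random.default_rng(0)
n_farms = 40
animals_per_farm = rng.integers(5, 16, size=n_farms)
farm_ids = np.repeat(np.arange(n_farms), animals_per_farm)
n = farm_ids.size

farm_size = rng.normal(size=n_farms)    # farm-level covariate (standardized)
farm_effect = rng.normal(size=n_farms)  # unobserved farm-level variation
year = rng.integers(0, 10, size=n).astype(float)

y = (2.0 + 0.3 * year + 0.5 * farm_size[farm_ids]
     + farm_effect[farm_ids] + rng.normal(size=n))

# Design matrix: intercept, year (animal level), farm size (farm level).
X = np.column_stack([np.ones(n), year, farm_size[farm_ids]])

beta, se_cluster, se_naive = cluster_robust_ols(X, y, farm_ids)
for name, b, sc, sn in zip(["intercept", "year", "farm_size"],
                           beta, se_cluster, se_naive):
    print(f"{name:>10}: coef={b:.3f}  cluster SE={sc:.3f}  naive SE={sn:.3f}")
```

Note that the farm-level covariate sits in the model as an ordinary regressor while the clustering is on farm. The sandwich only changes the standard errors, not the coefficients. In a simulation like this one, the naive standard error for the farm-level covariate would typically be the one most understated, since that covariate does not vary within a farm.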