I am working on an exploratory analysis of multilevel Cox models. We have several different models with outcomes related to CVD, Obesity, and Cancers, but all models have a competing risk of death. We are using individual level EHR data and socioeconomic data averaged over the zip code level and linked to individuals based on address at baseline. To account for the zip code level covariates, we are using zip code as a cluster variable. Model n’s range from 450,000 to 850,000. These individuals comprise about 5,000 zip codes, with zip code clusters ranging from n=1 to 10,000+.

I’ve been tasked with calculating the C-index for these models, but I have several questions regarding how to handle the multilevel aspect.

Based on van Klaveren et al. (2014), I believe the best approach is to calculate the C-index within each cluster and then take some “average” of those cluster-level values.
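For concreteness, here is a minimal sketch of the within-cluster approach. The column names (`zip`, `time`, `event`, `risk`) are assumptions, and the pairwise C computation is a plain textbook implementation rather than any particular package's routine; it is O(n²) per cluster, so for very large clusters you would use an optimized library instead.

```python
import numpy as np
import pandas as pd

def harrell_c(time, event, risk):
    """Harrell's C: among usable pairs (the earlier time is an observed
    event), the proportion where the earlier failure has the higher risk
    score; ties in risk count 1/2."""
    time, event, risk = map(np.asarray, (time, event, risk))
    concordant = tied = usable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a pair is usable only if the shorter time is an event
        for j in range(n):
            if i == j:
                continue
            # j outlived i (or was censored at i's event time)
            if time[j] > time[i] or (time[j] == time[i] and not event[j]):
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / usable if usable else np.nan

def cluster_c_indices(df, min_n=10):
    """C-index within each zip cluster, skipping clusters below min_n;
    returns {zip: (c, cluster_size)}."""
    out = {}
    for zip_code, g in df.groupby("zip"):
        if len(g) < min_n:
            continue
        c = harrell_c(g["time"].values, g["event"].values, g["risk"].values)
        if not np.isnan(c):
            out[zip_code] = (c, len(g))
    return out
```

Note that the risk scores fed in here would still come from the full-sample model; only the pair comparisons are restricted to within-cluster pairs.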

I’m not sure how to handle the singleton/small clusters. We have many clusters with n<10. Maugen et al (2013) suggest that C-index values for clusters with n<10 are highly influenced by the proportion of censoring within the cluster.

Would I first need to re-run the Cox models excluding the small clusters? Or would I still use predictions from the full-sample model and evaluate them only within the clusters with n>10?

These papers also suggest either using meta-analysis or a weighted average to determine the overall C-index from the individual cluster values. Is there a specific approach you would suggest here?
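A sketch of both pooling styles, assuming you already have each cluster's C-index and an estimate of its sampling variance (e.g., from a within-cluster bootstrap, which I am taking as given here). The DerSimonian-Laird step is one common random-effects choice, not the only one:

```python
import numpy as np

def pool_c(c, var, random_effects=True):
    """Pool per-cluster C-indices by inverse-variance weighting.
    With random_effects=True, add a DerSimonian-Laird estimate of the
    between-cluster variance tau^2 (random-effects meta-analysis).
    Returns (pooled C, standard error)."""
    c, var = np.asarray(c, float), np.asarray(var, float)
    w = 1.0 / var
    c_fixed = np.sum(w * c) / np.sum(w)
    if not random_effects:
        return c_fixed, np.sqrt(1.0 / np.sum(w))
    # DerSimonian-Laird: tau^2 from the heterogeneity statistic Q
    k = len(c)
    q = np.sum(w * (c - c_fixed) ** 2)
    denom = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / denom)
    w_re = 1.0 / (var + tau2)
    return np.sum(w_re * c) / np.sum(w_re), np.sqrt(1.0 / np.sum(w_re))
```

The random-effects version widens the standard error when cluster C-indices are more heterogeneous than their within-cluster variances alone would predict, which seems likely with cluster sizes ranging from 10 to 10,000+.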

We would also like to compare our new multilevel model to a parsimonious model containing only individual-level covariates. Would it be appropriate to compare the averaged C-index from the clustered data to the C-index from the individual-level-only model?

Welcome to datamethods Kelsey. These are great questions that have not been sufficiently explored in the literature or in practice. There are some background issues that are probably worth discussing first.

Though hierarchical models incorporating random effects can explain more outcome heterogeneity, this is not without cost. Random effects are real parameters. With 5,000 clusters, the effective number of parameters you are estimating might be only, say, 50 once shrinkage is accounted for, but these still carry a cost in statistical efficiency (and certainly a cost in computational efficiency).

It would be very interesting to compare AICs for a model using certain fixed effects, e.g., region characteristics, against one that uses clustering. You’d have to be careful about the d.f. used in the AIC formula when random effects are present; you’d probably have to solve for the effective degrees of freedom implied by the estimated random-effects variance.
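One rough way to get at that effective d.f. is the shrinkage-factor approximation from Gaussian random-intercept models, where each cluster contributes between 0 and 1 d.f. depending on how much its intercept is shrunken. Treating `sigma2` as the residual-scale analog is itself an approximation when carried over to a Cox frailty model, so this is a sketch of the idea rather than the exact calculation:

```python
import numpy as np

def effective_re_df(cluster_sizes, tau2, sigma2):
    """Approximate effective d.f. of random intercepts: cluster i
    contributes its shrinkage factor n_i*tau2 / (n_i*tau2 + sigma2),
    so small or heavily shrunken clusters cost much less than 1 d.f."""
    n = np.asarray(cluster_sizes, float)
    return float(np.sum(n * tau2 / (n * tau2 + sigma2)))

def aic_with_re(loglik, p_fixed, re_df):
    """AIC with the random-effect d.f. folded into the penalty term."""
    return -2.0 * loglik + 2.0 * (p_fixed + re_df)
```

As tau2 goes to 0 the random effects cost 0 d.f. (complete pooling); as tau2 grows, the cost approaches one d.f. per cluster.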

Whether or not you use clustering, including region characteristics to the maximum extent is beneficial: it will greatly lower the variance of the random effects, which lowers their effective d.f., makes the model converge faster, and will probably lower some standard errors.

More directly to your main questions: there are times when the clustering represents a nuisance that you want to subtract out and not get predictive-discrimination “credit” for. This might be true, for example, in a multi-site clinical trial. In other cases, perhaps including when the underlying variable represents population density, local poverty, or vulnerability, you may want to get predictive-discrimination credit for the clusters or for variables that describe the clusters. Computing c-indexes within clusters removes the between-cluster contribution and so will lower the c-index, which is appropriate in the nuisance case but not when you want that credit.

To the extent that you want credit for outcome effects arising from geographic information, and to the extent that the estimated random effects (BLUPs?) are stabilized (shrunken) by the mechanism that results in a not-large estimate of the random-effects variance, including the random-effect estimates in the overall model will likely give you a very reasonable overall c-index estimate.
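If you go that route, the mechanics amount to adding each subject's shrunken cluster estimate to the fixed-effect linear predictor before computing a single overall c-index. The column names `lp` and `zip` and the `blup` mapping below are assumptions about how the fitted objects are stored:

```python
import numpy as np
import pandas as pd

def overall_risk(df, blup):
    """Per-subject risk score = fixed-effect linear predictor plus the
    estimated (shrunken) random effect of the subject's zip cluster;
    zips unseen at fit time fall back to the prior mean of 0."""
    return df["lp"].to_numpy() + df["zip"].map(blup).fillna(0.0).to_numpy()
```

The resulting scores would then go into whatever concordance routine you are already using, computed over the full sample rather than within clusters.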

Note that the c-index is quite useful for judging pure predictive discrimination when the censoring pattern does not greatly affect the index, but in general it’s not a full-information measure (it uses only ranks), so it is not always efficient. That is especially true when comparing two models. Measures such as the generalized pseudo R^2 described here are very useful, though a bit harder to interpret. I prefer the version that adjusts (penalizes) for effective predictor d.f. and uses the effective sample size for Y. If you can figure out how the cluster d.f. enters into this, that version should apply. Or, possibly better, omit the clusters and use multiple geographic variables as discussed above.
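As a sketch of one such adjusted measure: the exact formulas are at the linked page, and the form below, 1 - exp(-(LR - 2p)/m), is my assumption of the d.f.-penalized version, with m the effective sample size (roughly the event count for a censored-survival model):

```python
import math

def adj_pseudo_r2(lr_chi2, p, m):
    """Adjusted generalized R^2: 1 - exp(-(LR - 2p)/m), where LR is the
    model likelihood-ratio chi-square, p the effective number of model
    d.f. (the complexity penalty), and m the effective sample size."""
    return 1.0 - math.exp(-(lr_chi2 - 2.0 * p) / m)
```

The open question the reply raises is what to plug in for p when some of the d.f. come from shrunken random effects; the effective-d.f. idea discussed above is one candidate.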