Two-stage hierarchical model for multiple outcomes

In RCTs in cognition research, the primary outcomes are usually summary measures (e.g., a weighted average of several reaction time tasks) which putatively measure some latent cognitive ability, such as short term memory. It is sometimes the case that these summary measures inappropriately average over cognitive tests that measure distinct cognitive processes. So, as a secondary/exploratory analysis, it can be valuable to look at each of the cognitive tests individually.

An immediate problem with this approach is that, with a large number of tests (~20), the treatment effect estimates will be noisy, and spurious evidence of benefit/harm may emerge.

One solution is to use a hierarchical model as described by Gelman et al. (who specifically use multiple cognition outcomes as an example). This way, we estimate the standard deviation of the treatment effect across the different outcomes, and partially pool the estimates accordingly. Our treatment effect estimates are then less noisy and less error prone.

But this doesn’t seem workable when the outcomes are a mix of reaction time tasks, accuracy tasks, etc., each of which is distributed very differently and needs to be modeled in a different way. One overall hierarchical model isn’t an option.

So, my question is: is it possible to perform several independent models, retain the treatment effect estimates from each, and then partially pool them in a second modelling stage?
In other words, is it possible to take a vector of treatment effect estimates from different models and appropriately shrink them towards their common mean? Or is there a different/better option available?
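
To make the second stage concrete, here is a minimal sketch of one way such shrinkage could look. It is not a recommendation from this thread, just an illustration, and it assumes (a) every stage-one estimate has already been put on a common scale, (b) each comes with a standard error, and (c) the estimation errors are independent, which is doubtful when all outcomes come from the same participants. The function name `partial_pool` and the example numbers are hypothetical. It uses a DerSimonian-Laird moment estimate of the between-outcome variance followed by empirical-Bayes shrinkage.

```python
def partial_pool(estimates, std_errors):
    """Shrink stage-one estimates toward their pooled mean.

    Assumes at least two estimates on a common scale, with independent
    errors. Returns (pooled mean, between-outcome variance, shrunken estimates).
    """
    k = len(estimates)

    # Fixed-effect (inverse-variance) weights and mean.
    w = [1.0 / se ** 2 for se in std_errors]
    mu_fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)

    # DerSimonian-Laird moment estimator of the between-outcome variance.
    q = sum(wi * (y - mu_fixed) ** 2 for wi, y in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights and pooled mean.
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    mu = sum(wi * y for wi, y in zip(w_re, estimates)) / sum(w_re)

    # Each estimate keeps a fraction b of its distance from the pooled mean;
    # noisier estimates (larger se) get a smaller b and are shrunk more.
    shrunk = []
    for y, se in zip(estimates, std_errors):
        b = tau2 / (tau2 + se ** 2)
        shrunk.append(mu + b * (y - mu))
    return mu, tau2, shrunk


# Hypothetical stage-one results, for illustration only.
estimates = [0.5, -0.1, 0.2, 0.8]
std_errors = [0.3, 0.3, 0.3, 0.3]
mu, tau2, shrunk = partial_pool(estimates, std_errors)
print(mu, tau2, shrunk)
```

This plug-in version ignores the uncertainty in the between-outcome variance; a fully Bayesian second stage would put a prior on it and propagate that uncertainty into the shrunken estimates.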

The only two ways I know of that I would recommend are the following:

  • Peter O’Brien’s approach of ranking all the outcomes across subjects then averaging the ranks across outcome measures, then doing a Wilcoxon test
  • Coming up with an ordinal summary measure based on combining all the outcomes and analyzing this using a univariate ordinal response regression model
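
The first option can be sketched roughly as follows. This is only an illustration of the rank-then-average idea under stated assumptions: the function names are hypothetical, each outcome is first oriented so that larger means better, subjects have complete data on every outcome, and the Wilcoxon rank-sum test uses a normal approximation with a tie correction rather than exact p-values.

```python
from math import erfc, sqrt


def midranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def obrien_rank_scores(outcomes, higher_is_better):
    """Average, per subject, of that subject's rank on each outcome.

    outcomes: one list per outcome, each holding one value per subject
    (both arms pooled, same subject order in every list).
    """
    n = len(outcomes[0])
    totals = [0.0] * n
    for values, better in zip(outcomes, higher_is_better):
        oriented = values if better else [-v for v in values]
        for subject, rank in enumerate(midranks(oriented)):
            totals[subject] += rank
    return [t / len(outcomes) for t in totals]


def wilcoxon_rank_sum(x, y):
    """Two-sided rank-sum test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])
    mean_w = n1 * (n + 1) / 2
    tie_counts = {}
    for r in ranks:
        tie_counts[r] = tie_counts.get(r, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values()) / (n * (n - 1))
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term)
    z = (w - mean_w) / sqrt(var_w)
    return z, erfc(abs(z) / sqrt(2))


# Hypothetical data: 6 subjects, an RT outcome (lower is better) and an
# accuracy outcome (higher is better); the first 3 subjects are treated.
rt = [512, 498, 530, 560, 575, 541]
acc = [18, 19, 17, 15, 16, 17]
scores = obrien_rank_scores([rt, acc], higher_is_better=[False, True])
print(wilcoxon_rank_sum(scores[:3], scores[3:]))
```

With samples this small an exact test would be preferable; the normal approximation is used here only to keep the sketch dependency-free.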

When I have encountered such cognitive scores, I have displayed the estimates in a forest plot with the estimate for the aggregated scores at the bottom (usually there are several levels where aggregation occurs). Incidentally, as @f2harrell says, I think proportional odds modelling would be useful here, although you don’t see it in the literature. It is usually the difference between mean scores that is given, but even the subject matter experts don’t know exactly how to interpret the magnitude of the effect on that scale. Also, I thought these scores were designed to be aggregated and are not considered meaningful when teased apart?

“a mix of reaction time tasks, accuracy tasks etc” - I know these outcomes, and they are rarely analysed properly. Is it time to complete tasks and number of tasks completed? If so, is some joint model needed?


I had never heard of this approach. It makes a lot of sense for testing a ‘global’ treatment effect without relying on summary scores. I will read about O’Brien’s approach in more detail.

This is what I had done initially - combining mean differences and log odds ratios into a forest plot. What I had not done is include the aggregate score at the bottom. That’s a good idea as it will make clear that if there is no evidence for a benefit on the aggregate scale, large effects on individual tests can probably be considered somewhat suspect. Otherwise you do not make any formal adjustments for multiple comparisons?

Most commonly it is response time and the proportion of correct responses from N trials. There probably is a good generative joint model for this process, and I should look into that. People tend to treat the RT and accuracy components as independent, and model both as Gaussian. Obviously with accuracy scores this produces a terrible fit to the data, particularly when there are floor/ceiling effects. The shifted lognormal for RT and the beta-binomial for accuracy seem to fit the data very well, though this still (problematically) treats accuracy and RT as independent…
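
For readers unfamiliar with these two distributions, here is a minimal sketch of their log-likelihoods in plain Python. The parameter names (`shift`, `mu`, `sigma`, `alpha`, `beta`) and the helper `independent_loglik` are hypothetical choices for illustration, not taken from any particular package. The last function simply adds the two pieces, which is exactly the independence assumption flagged above.

```python
from math import lgamma, log, pi


def shifted_lognormal_logpdf(rt, shift, mu, sigma):
    """Log density of rt when log(rt - shift) ~ Normal(mu, sigma)."""
    if rt <= shift:
        return float("-inf")  # RTs at or below the shift have zero density
    z = (log(rt - shift) - mu) / sigma
    return -log(rt - shift) - log(sigma) - 0.5 * log(2 * pi) - 0.5 * z * z


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_logpmf(k, n, alpha, beta):
    """Log probability of k correct responses out of n trials."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def independent_loglik(rts, k, n, shift, mu, sigma, alpha, beta):
    """Log-likelihood for one subject's RTs and accuracy count.

    Summing the two parts treats RT and accuracy as independent.
    """
    rt_part = sum(shifted_lognormal_logpdf(rt, shift, mu, sigma) for rt in rts)
    return rt_part + beta_binomial_logpmf(k, n, alpha, beta)


# Hypothetical subject: three RTs in seconds and 17 of 20 trials correct.
print(independent_loglik([0.62, 0.71, 0.58], 17, 20,
                         shift=0.2, mu=-0.8, sigma=0.4, alpha=8.0, beta=2.0))
```

A joint model would need some extra structure linking the two parts, for example shared subject-level effects, which this sketch deliberately leaves out.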


Two ideas that come to mind:

  1. If you could imagine overall utility as some linear combination of partial utility functions for each outcome (no interactions), then stochastic multicriteria acceptability analysis might help you to cast things in terms of decision theory. It is relatively easy to output summaries like the relative importance of outcomes required to prefer one therapy over the other and how confident you can be in that ranking. See Tommi Tervonen (Tommi Tervonen's personal web space).

  2. Could you link the outcomes with a multivariate model of some sort (see merlin maybe) and then convert to a probability of meaningful benefit across multiple outcomes, or a probability of improvement across x without significant loss on y? You can explore this idea by post-processing your individual models based on ideas in this paper (Extensions of the probabilistic ranking metrics of competing treatments in network meta‐analysis to reflect clinically important relative differences on many outcomes), and if it makes sense you could do the work of finding an actual multivariate model that would work.
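
As a rough sketch of the post-processing in idea 2 (not the method of the cited paper), suppose each individual model gives you draws of its treatment effect, oriented so that positive means benefit. The function name, thresholds, and simulated draws below are all hypothetical. Pairing draws from separately fitted models treats the outcomes as independent, which is the gap a real multivariate model would fill.

```python
import random


def prob_joint_benefit(draws_x, draws_y, benefit_threshold, harm_margin):
    """Share of paired draws with a meaningful gain on x and no important loss on y.

    A pair counts when the effect on x is at least benefit_threshold and
    the effect on y is no worse than -harm_margin.
    """
    hits = sum(
        1
        for dx, dy in zip(draws_x, draws_y)
        if dx >= benefit_threshold and dy >= -harm_margin
    )
    return hits / len(draws_x)


# Stand-in draws from normal approximations to two hypothetical models.
rng = random.Random(2024)
draws_x = [rng.gauss(0.30, 0.10) for _ in range(4000)]
draws_y = [rng.gauss(-0.02, 0.08) for _ in range(4000)]
print(prob_joint_benefit(draws_x, draws_y, benefit_threshold=0.2, harm_margin=0.1))
```

The thresholds carry the clinical judgment here, so it is worth reporting how the probability changes over a range of plausible values.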

Both of these ideas miss out on the benefits of the Gelman approach of sharing information across outcomes and instead just try to turn your 20 outcomes into 1.

Something else that comes to mind is to adapt ideas from this other multivariate NMA paper (Network meta-analysis of multiple outcome measures accounting for borrowing of information across outcomes, BMC Medical Research Methodology). If you could express your treatment effect as some sort of ratio effect, then you could specify each outcome as an outcome-specific effect plus your ratio measure, with the ratio measure being shared across outcomes and benefiting from that shrinkage effect. I’ll get some side-eyes for this, but one option would be to convert everything to SMDs and use this model. SMDs aren’t perfect, but it’s an option if you are stuck (12.6.3 Re-expressing SMDs by transformation to odds ratio). You might also want to consider multiple levels where, e.g., some outcomes are more related than others (https://www.tandfonline.com/doi/abs/10.1080/10543406.2010.520181?journalCode=lbps20).
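
For the SMD route, the usual re-expression multiplies a log odds ratio by √3/π (about 0.55) to get an approximate SMD, and scales the standard error by the same factor. A minimal sketch follows, with hypothetical function names. It rests on the assumption that a continuous, roughly logistic latent variable with equal spread in both arms underlies the binary outcome, which may or may not be reasonable for accuracy data.

```python
from math import pi, sqrt

# Standard deviation of the standard logistic distribution.
LOGISTIC_SD = pi / sqrt(3)


def log_or_to_smd(log_or, se_log_or):
    """Approximate SMD and its standard error from a log odds ratio."""
    return log_or / LOGISTIC_SD, se_log_or / LOGISTIC_SD


def smd_to_log_or(smd, se_smd):
    """Inverse conversion: approximate log odds ratio from an SMD."""
    return smd * LOGISTIC_SD, se_smd * LOGISTIC_SD


# Hypothetical accuracy-task result: log OR of 0.40 with standard error 0.18.
print(log_or_to_smd(0.40, 0.18))
```

Once every outcome is on this scale, the converted estimates and standard errors can be fed to whichever shared-effect or shrinkage model you settle on.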
