Appropriate model for combined data from two RCTs

I am working on a secondary data analysis which uses individual patient data from two randomized controlled trials. For our study, randomization was broken by only including patients with at least 1-year follow-up.

Both studies used the same longitudinal outcome (same time points), same treatments, and had the same set of clinically relevant covariates.

The goal is to estimate the average treatment effect using the combined data. I don’t have specific training in this area, but I think my current approach can be considered a one-stage IPD meta-analysis.

What would be the most appropriate model choice for this scenario? I’m specifically wondering whether a random-intercept term with only two levels (the two RCT study groups) is OK to use in this situation. Some options I’ve considered:

  • Mixed model
    • outcome ~ outcome_baseline + treatment + time + covariates + (1 + time | study/id)
      • Random intercepts and random time slopes for subject ID nested within study group.
      • Only two study groups means the variance estimate between study groups is not informative, but if we ignore the variance estimate, does this still provide a relevant estimate of treatment effect?
    • outcome ~ outcome_baseline + treatment + time + covariates + study + (1 + time | id)
      • Even though we are not interested in its coefficient, treat study group as a fixed effect.
  • GLS model
    • formula = outcome ~ outcome_baseline + treatment + time + covariates + study
    • correlation = corCAR1(form = ~ as.numeric(time) | id)
    • weights = varIdent(form = ~ 1 | study)
      • Marginal model approach.
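
For concreteness, here is a minimal sketch of the second mixed-model option (study as a fixed effect, random intercept and time slope per patient). It uses nlme::lme on simulated data; the lme4 equivalent of the random part would be (1 + time | id). The data, effect sizes, and sample sizes are all made up for illustration, and the `covariates` placeholder is left out.

```r
library(nlme)  # recommended package shipped with standard R installs

set.seed(1)
n_id <- 120                                   # hypothetical: 60 patients per study
base <- data.frame(
  id               = factor(seq_len(n_id)),
  study            = factor(rep(c("A", "B"), each = n_id / 2)),
  treatment        = rbinom(n_id, 1, 0.5),
  outcome_baseline = rnorm(n_id, 50, 10)
)
u0 <- rnorm(n_id, 0, 3)                       # patient-specific intercept shifts
u1 <- rnorm(n_id, 0, 1)                       # patient-specific time-slope shifts

# Long format: 4 follow-up visits per patient
dat <- base[rep(seq_len(n_id), each = 4), ]
dat$time <- rep(1:4, times = n_id)
i <- as.integer(dat$id)
dat$outcome <- 10 + 0.8 * dat$outcome_baseline - 2 * dat$treatment +
  2 * (dat$study == "B") + (0.5 + u1[i]) * dat$time + u0[i] +
  rnorm(nrow(dat), 0, 2)

# Study enters as a fixed effect; only patients get random effects
fit_lme <- lme(
  outcome ~ outcome_baseline + treatment + time + study,
  random  = ~ 1 + time | id,
  data    = dat,
  control = lmeControl(opt = "optim")
)
summary(fit_lme)$tTable["treatment", ]
```

The last line pulls out the estimate, standard error, and test for the treatment coefficient, which is the quantity of interest here; the study coefficient is only an adjustment.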


I highly recommend the book on IPD meta-analysis by Richard Riley et al.


And note that the minimum number of studies for which random effects can be used is 4.

True. I forgot the reason though. Why exactly?

A key parameter is the variance of the random effects. It really takes 70 clusters to nail down the variance. The random effects are not identifiable with 1 cluster, not identifiable with 2 clusters unless the variance is known, and things are unstable with 3+ clusters. I hope someone can find a real reference about this.
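
Not a real reference, but a small simulation can illustrate how imprecise the between-cluster SD estimate is with few clusters. This sketch assumes a best case in which the cluster means are observed without any within-cluster noise, so the sample SD of k means estimates the true SD directly; the true SD of 1 and the choices of k are arbitrary.

```r
set.seed(42)
tau <- 1  # assumed true between-cluster SD

# Sampling distribution of the estimated SD from k cluster means
sim_sd <- function(k, n_sim = 2000) {
  replicate(n_sim, sd(rnorm(k, mean = 0, sd = tau)))
}

ks  <- c(2, 3, 5, 10, 70)
res <- sapply(ks, function(k) quantile(sim_sd(k), c(0.025, 0.5, 0.975)))
colnames(res) <- paste0("k=", ks)
round(res, 2)  # rows: 2.5%, 50%, 97.5% quantiles of the estimated SD
```

The spread between the 2.5% and 97.5% quantiles narrows as k grows. With k = 1, sd() returns NA, which matches the point that nothing can be learned about the variance from a single cluster. Real data add within-cluster noise on top of this, so the picture there is worse.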

An informative prior on the random effect SD could help with this issue.

Yes, if you don’t mind that prior being very influential. I don’t know the theory but I would tend to put study as fixed effects.

To Frank’s question, there was a good discussion about this general subject on CrossValidated (CV):

What is the minimum recommended number of groups for a random effects factor?

One of the references listed there is a question in Ben Bolker’s GLMM FAQ here:

Should I treat factor xxx as fixed or random?

Ben is the current maintainer of the R lme4 package.

Quoting from Ben’s FAQ:

One point of particular relevance to ‘modern’ mixed model estimation (rather than ‘classical’ method-of-moments estimation) is that, for practical purposes, there must be a reasonable number of random-effects levels (e.g. blocks) – more than 5 or 6 at a minimum. This is not surprising if you consider that random effects estimation is trying to estimate an among-block variance.

The other discussion on the CV link above, including a reference to Gelman and Hill (2007):

Data Analysis Using Regression and Multilevel/Hierarchical Models

would suggest that having fewer than 5 or 6 levels for a random effect may be no worse than adding the factor as a fixed effect in a classical model setting. It might still be worth running both as a sensitivity analysis, presuming that the mixed-effects model converges with only two study levels, as in the case being discussed here.


So the model which uses study-level random intercepts isn’t an option in this case.

I skimmed the IPD meta-analysis book. I’ll need to learn more about the two-stage method they discuss. For the one-stage method, they make the distinction between common, stratified, and random effects. A common effect is a fixed effect that is assumed to be the same across studies, and a stratified effect is a fixed effect that is allowed to differ across studies (so the term is interacted with study).

They seem to advocate for a common effect for the treatment effect and stratified effects for covariates other than the treatment effect. I don’t understand why they also discuss using a random effect for treatment: with a treated/control variable, that would seemingly lead us to the two-level random intercept that Frank ruled out. Or are they coding the treatment variable as numeric (0/1 or -0.5/0.5) and using it as a random slope? Is anyone familiar with these models who knows how to implement this using an R mixed-model formula?

When they say random effect for treatment, they mean a random slope for treatment together with a random intercept for study.
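
As a rough sketch of what that could look like in R: treatment is coded 0/1 and gets a random slope across studies, alongside a random study intercept. This is one reading of the description above, not necessarily the exact specification in the book. The example simulates 12 hypothetical studies with a single outcome per patient, because the slope variance could not be estimated from only two studies; all numbers are invented.

```r
library(nlme)

set.seed(7)
k <- 12    # hypothetical number of studies
n <- 200   # hypothetical patients per study
dat <- data.frame(
  study     = factor(rep(seq_len(k), each = n)),
  treatment = rbinom(k * n, 1, 0.5)      # numeric 0/1 coding
)
b0 <- rnorm(k, 0, 2)                     # study-specific intercept shifts
b1 <- rnorm(k, 0, 1)                     # study-specific treatment-effect shifts
s  <- as.integer(dat$study)
dat$outcome <- 20 + b0[s] + (-2 + b1[s]) * dat$treatment +
  rnorm(k * n, 0, 3)

# Random intercept for study plus random slope for treatment
fit_re <- lme(
  outcome ~ treatment,
  random  = ~ 1 + treatment | study,
  data    = dat,
  control = lmeControl(opt = "optim")
)
# lme4 formula for the same structure:
#   outcome ~ treatment + (1 + treatment | study)

fixef(fit_re)["treatment"]  # average treatment effect across studies
VarCorr(fit_re)             # between-study SDs for intercept and treatment slope
```

The fixed treatment coefficient is then the mean of the study-specific treatment effects, and the treatment row of VarCorr() is the between-study heterogeneity in that effect.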

I’m feeling more inclined to go the GLS route structured as

  • formula = outcome ~ treatment * time + study * (outcome_baseline + covariates)
  • correlation = corCAR1(form = ~ as.numeric(time) | id)
  • weights = varIdent(form = ~ 1 | study)
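
Put together as an nlme::gls call, the structure above might look like the following sketch. The data are simulated, `age` is a hypothetical stand-in for `covariates`, and time is already numeric here, so the as.numeric() wrapper is dropped from the correlation formula.

```r
library(nlme)

set.seed(11)
n_id <- 120
base <- data.frame(
  id               = factor(seq_len(n_id)),
  study            = factor(rep(c("A", "B"), each = n_id / 2)),
  treatment        = rbinom(n_id, 1, 0.5),
  outcome_baseline = rnorm(n_id, 50, 10),
  age              = rnorm(n_id, 60, 8)   # hypothetical covariate
)
dat <- base[rep(seq_len(n_id), each = 4), ]
dat$time <- rep(1:4, times = n_id)

# Serially correlated within-patient errors, with a larger SD in study B
ar1_errors <- function(n, phi, sd) {
  e <- numeric(n)
  e[1] <- rnorm(1, 0, sd)
  for (t in 2:n) e[t] <- phi * e[t - 1] + rnorm(1, 0, sd * sqrt(1 - phi^2))
  e
}
sd_by_study <- c(A = 2, B = 3)
err <- unlist(lapply(seq_len(n_id), function(j) {
  ar1_errors(4, phi = 0.6, sd = sd_by_study[[as.character(base$study[j])]])
}))

dat$outcome <- 10 + 0.8 * dat$outcome_baseline + 0.05 * dat$age +
  2 * (dat$study == "B") + 0.5 * dat$time -
  0.5 * dat$treatment * dat$time + err

fit_gls <- gls(
  outcome ~ treatment * time + study * (outcome_baseline + age),
  correlation = corCAR1(form = ~ time | id),      # continuous-time AR(1) within patient
  weights     = varIdent(form = ~ 1 | study),     # separate residual SD per study
  data        = dat
)
summary(fit_gls)$tTable[c("treatment", "treatment:time"), ]
```

With the treatment-by-time interaction in the model, the `treatment` coefficient is the treatment effect at time 0, so the effect at a given follow-up time is a contrast combining both rows.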

And compare that to the results from the two-stage method.
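
For the comparison, a bare-bones version of a two-stage analysis could be sketched as follows: fit the same model separately in each study, then pool the treatment estimates with inverse-variance weights under a common-effect assumption. The per-study model here (random patient intercept only) and the simulated data are simplifications chosen for illustration, not the book's recommended specification.

```r
library(nlme)

set.seed(3)
n_id <- 120
base <- data.frame(
  id               = factor(seq_len(n_id)),
  study            = factor(rep(c("A", "B"), each = n_id / 2)),
  treatment        = rbinom(n_id, 1, 0.5),
  outcome_baseline = rnorm(n_id, 50, 10)
)
u0 <- rnorm(n_id, 0, 3)                  # patient-specific intercept shifts
dat <- base[rep(seq_len(n_id), each = 4), ]
dat$time <- rep(1:4, times = n_id)
dat$outcome <- 10 + 0.8 * dat$outcome_baseline - 2 * dat$treatment +
  0.5 * dat$time + u0[as.integer(dat$id)] + rnorm(nrow(dat), 0, 2)

# Stage 1: fit the same model within each study, keep estimate and SE
stage1 <- t(sapply(split(dat, dat$study), function(d) {
  d   <- droplevels(d)
  fit <- lme(outcome ~ outcome_baseline + treatment + time,
             random = ~ 1 | id, data = d)
  summary(fit)$tTable["treatment", c("Value", "Std.Error")]
}))

# Stage 2: inverse-variance weighted (common-effect) pooling
w         <- 1 / stage1[, "Std.Error"]^2
pooled    <- sum(w * stage1[, "Value"]) / sum(w)
pooled_se <- sqrt(1 / sum(w))

stage1
c(estimate = pooled, se = pooled_se)
```

Because each study gets its own fit, every nuisance parameter (intercept, covariate effects, residual variance) is effectively stratified by study, which is what the study interactions and varIdent term are approximating in the one-stage GLS model.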