Mixed vs Random Effects ICC


Hi folks.

I’m looking for a little bit of help with different measures of reliability, and test-retest reliability specifically. I would be thrilled if anyone could help. And if anyone finds it easier to explain using R code, that would make things very much clearer for me.

When we do test-retest studies, our group, like the rest of my field (molecular imaging), has traditionally reported ICC(1,1) according to Shrout & Fleiss. A biostatistics reviewer on a paper that we’ve submitted made an incredible list of suggestions, including the point that ICC(1,1) would be a pretty bad choice for a test-retest study. That was pretty mindblowing, considering that ICC(1,1) appears in just about all the test-retest reliability papers that have ever been produced within my field. So I’ve been looking into the literature, and found a really nice flow chart in this article here (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4913118/), which agrees with the reviewer, and suggests that I should really be going for McGraw & Wong’s ICC(A,1), their two-way mixed effects, absolute agreement, single rater/measurement ICC. Great information to have!

So that’s all fine and well - next I want to calculate it to understand it better. There are formulas provided in the article, but there’s also a little description that I find myself quite skeptical of/confused by. I’ll quote it below:

“For the same ICC definition (eg absolute agreement), ICC estimates of both the 2-way random- and mixed-effects models are the same because they use the same formula to calculate the ICC (Table 3). This brings up an important fact that the difference between 2-way random- and mixed-effects models is not on the calculation but on the experimental design of the reliability study and the interpretation of the results.”

So this leaves me scratching my head for a few reasons:
A) If the definitions are exactly the same, then I’m not sure I understand why both exist as separate metrics. Apparently it comes down to study design, but I struggle to see why that would justify two differently named outcome measures. Is this an error in the paper, or is it really just this way?
B) Does mixed effects mean something different from a multilevel model here? I imagine running this with ANOVA vs running this with lme4 or nlme in R, but perhaps it’s a different definition that I don’t understand. Or do both methods just produce exactly the same estimates? This also seems a little weird to me…

So, perhaps just for my own sake, and to get some kind of an understanding: if one actually does calculate the mixed model ICC using a mixed model framework (in R, lme4 or nlme), then how would I go about doing that, and how would I go about calculating the new MS values from it? I could probably do it with a normal ANOVA (or even without having to actually run the model, by just calculating the MS values directly), but I’m a little at a loss for a mixed effects strategy. I’ve found a few packages in R which provide ICC estimates, so I could try to reverse-engineer their code, but none of them appear to actually implement a mixed-effects modelling framework (I looked at psych, psy, irr, irrNA).
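
To make the "calculate the MS values directly" route concrete, here is a minimal sketch in plain Python rather than R, so it does not depend on any package. It computes the mean squares for a subjects-by-measurements table and then three single-measurement ICCs using the usual mean-square formulas (Shrout & Fleiss for ICC(1,1), McGraw & Wong for ICC(A,1) and ICC(C,1)). The function names and the five test-retest pairs are made up for illustration. The last value rebuilds ICC(A,1) from method-of-moments variance components, assuming balanced data with no missing values.

```python
def mean_squares(table):
    """Return (MSR, MSC, MSE, MSW) for an n-subjects x k-measurements table."""
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between occasions
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_within = ss_total - ss_rows                           # within subjects
    ss_error = ss_within - ss_cols                           # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    msw = ss_within / (n * (k - 1))
    return msr, msc, mse, msw


def icc_estimates(table):
    """Single-measurement ICCs from the mean squares of a balanced table."""
    n, k = len(table), len(table[0])
    msr, msc, mse, msw = mean_squares(table)

    icc_1_1 = (msr - msw) / (msr + (k - 1) * msw)
    icc_a_1 = (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))
    icc_c_1 = (msr - mse) / (msr + (k - 1) * mse)

    # Method-of-moments variance components (balanced data assumed).
    var_subject = (msr - mse) / k
    var_occasion = (msc - mse) / n
    var_error = mse
    icc_a_1_vc = var_subject / (var_subject + var_occasion + var_error)

    return {
        "ICC(1,1)": icc_1_1,
        "ICC(A,1)": icc_a_1,
        "ICC(C,1)": icc_c_1,
        "ICC(A,1) from variance components": icc_a_1_vc,
    }


# Hypothetical test-retest data: one row per subject, columns = (test, retest).
scans = [[10, 13], [12, 13], [14, 17], [16, 17], [18, 21]]
result = icc_estimates(scans)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the retest values are systematically higher, and ICC(1,1) and ICC(A,1) still land close together (roughly 0.75 and 0.78), while the consistency version ICC(C,1) is much higher (roughly 0.94). The variance-component route reproduces ICC(A,1) exactly, which is the sense in which the two calculations coincide for balanced data. I would expect variance components from a fitted mixed model to agree in that balanced case, but that is not checked here.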

Also, in case you happen to know: in my experience, and judging by the confidence intervals around ICC estimates, ICC(1,1) and the other single-rater ICC estimates of absolute agreement match each other pretty well, and won’t easily be distinguishable except in absolutely enormous samples. If anyone happens to know of contradictory examples, that would help in letting me know that my intuition is incorrect.

Thank you so much in advance to anyone who can help!


This avoids your real question, but you did raise the point about absolute agreement. I prefer U-statistics that compute mean absolute discrepancies. This is easily done between raters or within raters, and the bootstrap can provide confidence intervals. U-statistics involve taking all possible relevant pairs of observations (when the kernel is a 2-vector). Details and a link to R functions may be found in Chapter 16 of BBR. This works a little better when the response being analysed is metric or possibly ordinal, but also works when it is binary.
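
As a rough illustration of the idea (not the BBR functions themselves), here is a minimal Python sketch of a within-subject mean absolute discrepancy with a percentile bootstrap interval. The function names, the resampling scheme (whole subjects, with replacement) and the five test-retest pairs are assumptions made for the example.

```python
import random
from itertools import combinations


def mean_abs_discrepancy(subjects):
    """Average |a - b| over all pairs of repeated measurements within each subject."""
    diffs = [abs(a - b)
             for repeats in subjects
             for a, b in combinations(repeats, 2)]
    return sum(diffs) / len(diffs)


def bootstrap_ci(subjects, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval, resampling whole subjects with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        resample = [rng.choice(subjects) for _ in subjects]
        stats.append(mean_abs_discrepancy(resample))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Hypothetical data: each tuple holds one subject's repeated measurements.
pairs = [(10, 13), (12, 13), (14, 17), (16, 17), (18, 21)]
estimate = mean_abs_discrepancy(pairs)
ci_low, ci_high = bootstrap_ci(pairs)
print(f"mean absolute discrepancy: {estimate:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

The result is in the units of the measurement itself, which is part of the appeal. With more than two repeats per subject, `combinations` picks up every within-subject pair automatically.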


Not really an answer, but it may be of some help…
The Stata manuals are generally very good for theoretical aspects as well as the practical use of Stata. They are also readily available on the internet. The one for ICC is here…maybe it helps…


Thank you both! This is fantastically helpful!

I will definitely give U statistics a more detailed look! I find all the metrics of agreement to have important flaws for specific use-cases, so having a wider range to call upon is really helpful, and I’m looking forward to digging in!

And thank you so much for the suggestion of the Stata manual! That’s really awesome - I had no idea that there was such detailed documentation available. This looks like it’ll be really helpful through the process of breaking it all down to figure it out, as they provide a bit more description of the intermediate steps.

I also asked this question in PsychMAP on Facebook, and got a really helpful answer from Craig Hedge from Cardiff University. I’ll quote it below:

I think the random- and mixed-effects ICCs do have different underlying definitions, it’s just that their calculation from the variance components in an ANOVA framework is identical. This can boil down to something similar to the distinction between a sample statistic and a population statistic (e.g. n vs N as the size of a sample or population respectively). If you look at the left column of Table 4 in McGraw and Wong (1996), the definitions for ICC(A,1) differ in how they denote column variance. The random effects ICC (Case 2A) uses sigma, denoting an estimate taken from random samples in the population. The mixed effects ICC (Case 3A) uses theta, reflecting the assumption that you’ve assessed all the raters you’re interested in. You calculate the ratio of the variance components in the same way, though!

The use of fixed and random is similar to the terminology in mixed-effects models - McGraw and Wong (1996) explain this a bit more on pg. 37 (under the heading “Column Variables Representing Fixed Versus Random Effects”). It is possible to calculate ICCs from variance components from (e.g.) an lmer model, but the traditional ICCs that McGraw and Wong cover are all done in an ANOVA framework. If you’re using the psych package in R, I think McGraw and Wong’s ICC(A,1) is what’s reported as ICC2 (Single random raters). It’s the same value that SPSS reports when selecting either two-way mixed or two-way random models, with type ‘absolute agreement’ and the ‘single measure’ value.

The variation in terminology does get a bit confusing!
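
To spell out the sigma-versus-theta point, here is a sketch of the two definitions next to the estimator they share. It drops the subject-by-occasion interaction term for simplicity, and the exact notation follows my reading of Table 4 in McGraw and Wong (1996), so treat it as an assumption rather than a quotation. Here $n$ is the number of subjects (rows), $k$ the number of measurements per subject (columns), and $MS_R$, $MS_C$, $MS_E$ are the row, column and error mean squares from the two-way ANOVA.

```latex
\begin{align*}
\text{Case 2A (random columns):}\quad
  \rho &= \frac{\sigma_r^2}{\sigma_r^2 + \sigma_c^2 + \sigma_e^2} \\[4pt]
\text{Case 3A (fixed columns):}\quad
  \rho &= \frac{\sigma_r^2}{\sigma_r^2 + \theta_c^2 + \sigma_e^2} \\[4pt]
\text{Shared estimator:}\quad
  \widehat{\mathrm{ICC}}(A,1) &=
  \frac{MS_R - MS_E}{MS_R + (k-1)\,MS_E + \frac{k}{n}\,\left(MS_C - MS_E\right)}
\end{align*}
```

The only difference between the first two lines is whether the column term is a variance of randomly sampled occasions or raters ($\sigma_c^2$) or a fixed quantity describing the specific ones studied ($\theta_c^2$). The number computed from the data is the same either way.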

I still feel a little confused about the random vs mixed terminology, since the psych package has an option to set lmer=TRUE and obtain the same metrics, while apparently the mixed effects ICC in McGraw & Wong wasn’t even calculated using a mixed effects model.

In any case, I’ll have some time to dig into the R code next week and see if I can start figuring out where all these methods differ and where they’re the same. Will post updates, unless someone else is able to solve everything in the meantime.

Thank you again to everyone!


Is Cohen’s kappa no longer considered useful?


While I’ve not used Cohen’s kappa before, I’m pretty sure it is used for nominal-scale data. As far as I recall, though, it is analogous to the ICC for categorical data. In my case, my data is ratio-scaled, so it’s not applicable.


Thank you. The primary outcome for an intervention that I look at in veterinary medicine is a subjective ordinal six-level scale (0-5) that does not have consistent intervals. I have seen studies looking at inter-rater and intra-rater agreement, and most of them used Cohen’s kappa. I was told it is very sensitive to sample size and prevalence variations. Since I was thinking of doing a meta-analysis on these studies, I wanted to be certain it was a valid outcome measure.
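
For anyone wanting to see how the statistic behaves on a 0-5 scale, here is a minimal Python sketch of Cohen's kappa with optional linear or quadratic disagreement weights. The function name, the `weight` argument and the ten pairs of ratings are all made up for illustration. Note that the weights treat the categories as equally spaced, which is itself an assumption for a scale without consistent intervals.

```python
def cohen_kappa(ratings_a, ratings_b, categories, weight="none"):
    """Cohen's kappa for two raters; weight is 'none', 'linear' or 'quadratic'."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint proportions and each rater's marginal proportions.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1 / n
    marg_a = [sum(observed[i]) for i in range(k)]
    marg_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if weight == "linear":
            return abs(i - j) / (k - 1)
        if weight == "quadratic":
            return ((i - j) / (k - 1)) ** 2
        return 0.0 if i == j else 1.0

    d_observed = sum(disagreement(i, j) * observed[i][j]
                     for i in range(k) for j in range(k))
    d_expected = sum(disagreement(i, j) * marg_a[i] * marg_b[j]
                     for i in range(k) for j in range(k))
    return 1 - d_observed / d_expected


# Hypothetical scores from two raters on a 0-5 ordinal scale.
rater_1 = [0, 1, 2, 3, 4, 5, 2, 3, 1, 4]
rater_2 = [0, 1, 2, 3, 4, 5, 3, 2, 2, 5]
scale = [0, 1, 2, 3, 4, 5]

kappa_unweighted = cohen_kappa(rater_1, rater_2, scale)
kappa_linear = cohen_kappa(rater_1, rater_2, scale, weight="linear")
print(f"unweighted: {kappa_unweighted:.3f}, linear weights: {kappa_linear:.3f}")
```

In this made-up example every disagreement is by a single level, so the weighted version comes out clearly higher than the unweighted one. Studies reporting "kappa" on the same scale may therefore not be directly comparable unless they state which version was used.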