Mixed vs Random Effects ICC

Hi folks.

I’m looking for a little bit of help with different measures of reliability, and test-retest reliability specifically. I would be thrilled if anyone could help. And if anyone finds it easier to explain using R code, that would make things very much clearer for me.

When we do test-retest studies, our group, and my field (molecular imaging) more broadly, has traditionally reported ICC(1,1) according to Shrout & Fleiss. A biostatistics reviewer on a paper we’ve submitted made an incredible list of suggestions, including the recommendation that ICC(1,1) is a pretty bad choice for a test-retest study. That was pretty mindblowing, considering that (1,1) appears in just about all the test-retest reliability papers that have ever been produced within my field. So I’ve been looking into the literature, and found a really nice flow chart in this article here (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4913118/), which agrees with the reviewer and suggests that I should really be going for McGraw & Wong’s ICC(A,1): their two-way mixed-effects, absolute-agreement, single rater/measurement ICC. Great information to have!

So that’s all fine and well - next I want to calculate it to understand it better. There are formulas provided in the article, but there’s also a little description that I find myself quite skeptical of/confused by. I’ll quote it below:

“For the same ICC definition (eg absolute agreement), ICC estimates of both the 2-way random- and mixed-effects models are the same because they use the same formula to calculate the ICC (Table 3). This brings up an important fact that the difference between 2-way random- and mixed-effects models is not on the calculation but on the experimental design of the reliability study and the interpretation of the results.”

So this leaves me scratching my head for a few reasons:
A) If the definitions are exactly the same, I’m not sure I understand why they exist as separate metrics. Apparently it comes down to study design, but I struggle to see why that would yield two different outcome measures. Is this an error in the paper, or is it really just this way?
B) Does mixed effects mean something different from a multilevel model here? I imagine running this with ANOVA vs running it with lme4 or nlme in R, but perhaps it’s a different definition that I don’t understand. Or do both methods just produce exactly the same estimates? This also seems a little weird to me…

So, perhaps just for my own sake and to get some kind of understanding: if one actually does calculate the mixed-model ICC using a mixed-model framework (lme4 or nlme in R), how would I go about doing that, and how would I calculate the MS values from the fitted model? I could probably do it with a normal ANOVA (or even without running a model at all, by calculating the MS values directly), but I’m a little at a loss for a mixed-effects strategy. I’ve found a few packages in R that provide ICC estimates, so I could try to reverse-engineer their code, but none of them actually appear to implement a mixed-effects modelling framework (I looked at psych, psy, irr, irrNA).

Also, in case you happen to know: in my experience, and based on the confidence intervals around ICC estimates, ICC(1,1) and the other single-rater absolute-agreement ICC estimates appear to match each other pretty well, and won’t easily be differentiable except in absolutely enormous samples. If anyone happens to know of contradictory examples, that would help in letting me know that my intuition is incorrect.

Thank you so much in advance to anyone who can help!


This avoids your real question, but you did raise the point about absolute agreement. I prefer U-statistics that compute mean absolute discrepancies. This is easily done between raters or within raters, and the bootstrap can provide confidence intervals. U-statistics involve taking all possible relevant pairs of observations (when the kernel is a 2-vector). Details and a link to R functions may be found in Chapter 16 of BBR. This works a little better when the response being analysed is metric or possibly ordinal, but it also works when it is binary.
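A minimal sketch of the idea (my own toy code, not the BBR functions): the within-rater mean absolute discrepancy as a U-statistic over all subject pairs, with a simple percentile bootstrap. The data are simulated purely for illustration.

```r
set.seed(42)
x <- rnorm(30, mean = 10, sd = 2)   # one rater's ratings of 30 subjects (simulated)

# U-statistic with kernel |a - b|: average over all (n choose 2) pairs
mean_abs_disc <- function(v) {
  d <- abs(outer(v, v, "-"))        # all pairwise absolute differences
  mean(d[upper.tri(d)])             # average over distinct pairs
}

est <- mean_abs_disc(x)

# nonparametric bootstrap over subjects, percentile CI
boot <- replicate(2000, mean_abs_disc(sample(x, replace = TRUE)))
ci <- quantile(boot, c(0.025, 0.975))
```

Between two raters rating the same subjects, the analogous quantity is simply `mean(abs(x - y))`.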


Not really an answer, but it may be of some help…
The Stata manuals are generally very good for theoretical aspects as well as the practical use of Stata. They are also readily available on the internet. The one for ICC is here…maybe it helps…


Thank you both! This is fantastically helpful!

I will definitely give U statistics a more detailed look! I find all the metrics of agreement to have important flaws for specific use-cases, so having a wider range to call upon is really helpful, and I’m looking forward to digging in!

And thank you so much for the suggestion of the Stata manual! That’s really awesome - I had no idea that there was such detailed documentation available. This looks like it’ll be really helpful in breaking it all down to figure it out, as they provide a bit more description of the intermediate steps.

I also asked this question in PsychMAP on Facebook, and got a really helpful answer from Craig Hedge from Cardiff University. I’ll quote it below:

I think the random- and mixed-effects ICCs do have different underlying definitions, it’s just that their calculation from the variance components in an ANOVA framework is identical. This can boil down to something similar to the distinction between a sample statistic and a population statistic (e.g. n vs N as the size of a sample or population respectively). If you look at the left column of Table 4 in McGraw and Wong (1996), the definitions for ICC(A,1) differ in how they denote column variance. The random effects ICC (Case 2A) uses sigma, denoting an estimate taken from random samples in the population. The mixed effects ICC (Case 3A) uses theta, reflecting the assumption that you’ve assessed all the raters you’re interested in. You calculate the ratio of the variance components in the same way, though!

The use of fixed and random is similar to the terminology in mixed-effects models - McGraw and Wong (1996) explain this a bit more on pg. 37 (under the heading “Column Variables Representing Fixed Versus Random Effects”). It is possible to calculate ICCs from variance components from (e.g.) a lmer model, but the traditional ICCs that McGraw and Wong cover are all done in an ANOVA framework. If you’re using the psych package in R, I think McGraw and Wong’s ICC(A,1) is what’s reported as ICC2 (Single random raters). It’s the same value that SPSS reports when selecting either two-way mixed or two way random models, with type ‘absolute agreement’ and the ‘single measure’ value.

The variation in terminology does get a bit confusing!

I still feel a little confused about the random vs mixed terminology, since the psych package has an option to set lmer=TRUE and obtain the same metrics, while apparently the mixed-effects model in McGraw & Wong wasn’t even calculated using a mixed-effects model!
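In case it helps anyone else, this is the comparison I’m planning to run: the same data through psych::ICC with and without the lmer backend, to see whether the estimates agree. The matrix dimensions and data are made up, and (if Craig is right) ICC(A,1) should appear as the ICC2 row, “Single_random_raters”.

```r
library(psych)

set.seed(1)
# toy data: 20 subjects rated by 4 raters (wide subjects-by-raters matrix)
dat <- matrix(rnorm(20 * 4), nrow = 20, ncol = 4)

icc_aov  <- ICC(dat, lmer = FALSE)  # classical ANOVA-based calculation
icc_lmer <- ICC(dat, lmer = TRUE)   # variance components estimated via lme4

icc_aov$results                     # six Shrout & Fleiss / McGraw & Wong forms
icc_lmer$results
```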

In any case, I’ll have some time to dig into the R code next week and see if I can start figuring out where all these methods differ and where they’re the same. Will post updates, unless someone else is able to solve everything in the meantime.

Thank you again to everyone!


Is Cohen’s kappa no longer considered useful?


While I’ve not used Cohen’s kappa before, I’m pretty sure it is used for nominal-scale data. Although as far as I recall, it is analogous to the ICC for categorical data. In my case, my data are ratio-scaled, so it’s not applicable.


Thank you. The primary outcome for an intervention that I look at in veterinary medicine is a subjective ordinal six-level scale (0-5) that does not have consistent intervals. I have seen studies looking at inter-rater and intra-rater agreement, and most of them used Cohen’s kappa. I was told it is very sensitive to sample size and prevalence variations. Since I was thinking of doing a meta-analysis on these studies, I wanted to be certain it was a valid outcome measure.

Thank you for this interesting discussion. We will also look into U-statistics in our future papers, but we have previously always relied on ICCs as the norm for measuring repeatability. I am often confused by the varying formulas, and am glad for the discussion above and the helpful article by Koo and Li for interpreting McGraw & Wong’s structure. Other authors have advocated that a minimal detectable change (MDC95) corresponds better to the real world. This is a bit odd, since the MDC95 could perhaps be viewed as an intermediary calculation towards an ICC, as it is essentially a standard error. I also wonder if, at a theoretical level, the MDC95 might be more similar to a U-statistic than to an ICC?

In our last paper, our biostatistician calculated the ICC with a linear mixed model on log-transformed data, using the lme function from R’s nlme package. I’m not sure this even fits into the 10 McGraw & Wong forms, which, as noted above, seem to be based on ANOVA. However, the results are really close to the two-way ANOVA for the ICC(A,1) case 2A/3A. Here is a link to the raw data and the lme results (R code not included).

We then used the ICC to calculate the MDC95. The results surprised me in that the L and R dorsal forearm, which had very different ICCs (0.49 and 0.85), ended up with very similar MDCs (90.4 and 84.0 N/m) due to the large difference in population standard deviation (calculation shown in the raw data link above). Going forward with longitudinal studies to track patient changes over time, I am now not sure whether we should assume that these two measurement sites have roughly equal reproducibility, given the similar MDC in N/m, or whether the site reliability is actually quite different, given the difference in ICC (which, admittedly, have overlapping confidence intervals due to the small sample size). Any insights would be greatly appreciated!

To me, the MDC95 seems to make the most sense since ultimately we want to determine whether we can be confident that a change in a patient’s measure over time was likely true biological change, or whether the change might simply fall under measurement error.

Also, the best explanation we found for calculating the MDC95 was Beaton’s 2000 article in Spine. Specifically, I quote the MDC95 equation here (where r is essentially the ICC, I believe):

When is Change Greater than Just Measurement Error?
Reliability data along with the standard deviation of the measure in that sample of patients can be combined to define the standard error of measurement (SEM = SD1 · √(1 − r)), where SD1 = standard deviation at baseline, and r = reliability coefficient—in this case, the test–retest reliability coefficient.
This is then put into a formula in which the lowest change that can confidently be considered as exceeding measurement error and noise is defined as Δ = 1.96 · √2 · SEM, or Δ = 2.77 · SEM. This is at a 95% confidence level (thus the use of 1.96, the z score for a 95% confidence interval). Stratford and Binkley have coined the term minimum detectable change (MDC) for this value. In radiology this is considered the smallest detectable change. Ravaud J Clin Epi 1999 [47]
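To make the arithmetic concrete, the two formulas above in R, with made-up numbers (r and SD1 here are hypothetical, not from any of the studies in this thread):

```r
r   <- 0.85    # test-retest reliability coefficient (hypothetical)
SD1 <- 25      # baseline standard deviation (hypothetical)

SEM   <- SD1 * sqrt(1 - r)        # standard error of measurement
MDC95 <- 1.96 * sqrt(2) * SEM     # i.e. approximately 2.77 * SEM
```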

Sorry to be slow to get back to this.

So, importantly, reliability describes how well your measurement can discriminate between individuals. In other words, can it do a good job of ranking individuals from high to low on the true thing that you are attempting to measure? When the measurement error is high relative to the differences between individuals, it will do a bad job. I recently had a discussion with someone who gave me an example intended to show that reliability is useless, and I believe it shows exactly the opposite. He said that the same thermometer would be more reliable in, say, Alaska than at the equator, and that reliability therefore says very little about the thermometer and more about the climate. However, it depends on what the goal is: if you wish to determine whether today is hot or cold relative to the other days where you live, then indeed the same thermometer will do a better job in Alaska. This is because reliability is expressly interested in comparisons between individuals. If you wish to know how accurate that thermometer is, then you have a different question. By the way, it is also worth mentioning that the ICC has error bars, and in small samples they can be enormous! There are also tools for performing power analysis when estimating the ICC. This could well explain your discrepancies.

Measurement error, on the other hand, is essentially the same thing as reliability, except that it is not normalised to the variance: rather, it is either the absolute error, or normalised to the mean (e.g. 5% error). If you wish to ask which individuals in your group got a relatively high or low score, this is irrelevant. However, if you want to assess whether an individual likely got better or worse from their first measurement to their second, then this is a much more relevant thing to look at. I like to refer to measurement error as a measure of repeatability, to differentiate it from reliability. The MDC95, for example, falls in this category.

Re how to calculate MDC and SEM, I have seen this formulation before, but it’s a slightly indirect way of calculating the measurement error. I go into some detail about these metrics in the paper that this question was actually referring to, which is now out here. Hopefully it’s helpful!


Thank you so much Granville for your thoughts, and congratulations on the article, which helped clarify some of these points. The bottom line, well illustrated in the figure in your paper, seems indeed to be that the ICC will depend on the population you happen to have at hand. Since in our research we are interested in knowing the device characteristics, and in what change within a single patient over time is real (above the noise level), we will focus our interpretation on the MDC95 (or SDD in your paper). However, we have always used Bessel’s correction for our standard deviation (i.e. n − 1), since we assume our measurements are a sample from, rather than the entire population of, measurements. I assume that in your paper “sample standard deviation” means the one with Bessel’s correction, rather than the direct calculation of the standard deviation of the sample itself?

Sorry for missing this. Yes, we use Bessel’s correction. Really happy that the paper was helpful! :)

I am aware this is an old question, but I guess it would be helpful to give a more conclusive answer because I had the same question last week.

The Shrout and Fleiss (1979) paper clearly defines each intraclass correlation.

Let x_{ij} denote the i-th rating on the j-th target for i = 1, \ldots, k and j = 1, \ldots, n. For one-way random effects, they consider the following model:

(1) x_{ij} = \mu + b_{j} + w_{ij}

where \mu is the overall mean, b_j \sim N(0, \sigma^2_B) is the difference between the overall mean and target j, and w_{ij} \sim N(0, \sigma^2_W) are the residual components containing the inseparable effects of the rater, the interaction effect rater \times target and the random error \epsilon_{ij}.

For the two-way random effects and mixed effects, they consider the following model:

(2, 3) x_{ij} = \mu + a_{i} + b_{j} + (ab)_{ij} + \epsilon_{ij}

where \mu and b_j are defined as in the previous model, a_i is the difference between the overall mean and rater i, and \epsilon_{ij} \sim N(0, \sigma^2) are the random errors. The differences between the models are:

(2) In the two-way random effects model, there is the additional assumption that a_i \sim N(0, \sigma^2_A).
(3) In the two-way mixed effects model, a_i is a fixed effect with the constraint \sum_{i = 1}^k a_i = 0. The parameter corresponding to \sigma^2_A is \theta^2 = \sum_{i = 1}^k a_i^2/(k - 1).

Furthermore, in the absence of repeated ratings by each rater on each target, the components (ab)_{ij} and \epsilon_{ij} cannot be estimated separately. Nonetheless, the interaction needs to be considered when deriving the ICC formulas:

(2) In the two-way random model, (ab)_{ij} \sim N(0, \sigma^2_I).
(3) In the two-way mixed model, we need the constraint \sum_{i = 1}^k(ab)_{ij} = 0 to have an identifiable model. A consequence of this constraint is that Cov((ab)_{ij}, (ab)_{ij'}) = -\frac{\sigma^2_I}{k - 1}.

Then, the following table with Mean Squares (MS) and Expected Mean Squares (EMS) is presented:

| Source of variation | MS | EMS: one-way | EMS: two-way random | EMS: two-way mixed |
|---|---|---|---|---|
| Between targets | MS_R | k\sigma^2_B + \sigma^2_W | k\sigma^2_B + \sigma^2_I + \sigma^2 | k\sigma^2_B + \sigma^2 |
| Within targets | MS_W | \sigma^2_W | \sigma^2_A + \sigma^2_I + \sigma^2 | \theta^2 + f\sigma^2_I + \sigma^2 |
| Between raters | MS_C | - | n\sigma^2_A + \sigma^2_I + \sigma^2 | n\theta^2 + f\sigma^2_I + \sigma^2 |
| Residual | MS_E | - | \sigma^2_I + \sigma^2 | f\sigma^2_I + \sigma^2 |

where f = k/(k - 1).

Intraclass definitions are given in Koo and Li (2016).
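For completeness (if I am reading Koo and Li's Table 3 correctly), the absolute-agreement, single-rater form discussed at the top of this thread is, in terms of these mean squares,

ICC(A,1) = (MS_R - MS_E) / (MS_R + (k - 1)MS_E + \frac{k}{n}(MS_C - MS_E))

and this same formula serves both the two-way random (2A) and two-way mixed (3A) cases, which is why the two models yield identical estimates.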

Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin. 1979 Mar;86(2):420-8.

Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine. 2016 Jun 1;15(2):155-63.

In R, using lme4 to reconstruct the MS values from the fitted variance components:


library(lme4)

set.seed(1)
n.subject <- 100
n.rater <- 100

## simulate subject (target) effects, rater effects and random error
random.subject <- rep(rnorm(n = n.subject, mean = 0, sd = 1), each = n.rater)
random.rater <- rep(rnorm(n = n.rater, mean = 0, sd = 1), n.subject)
random.error <- rnorm(n = n.subject*n.rater, mean = 0, sd = 1)

id <- as.factor(rep(1:n.subject, each = n.rater))
rater <- as.factor(rep(1:n.rater, n.subject))

### one-way random effects
response <- random.subject + random.error
dt <- data.frame(id, rater, response)

## fitting the model
fit <- lmer(response ~ 1 + (1|id), data = dt)
sm <- summary(fit)

## ICC(1,k); note that sm$sigma is the residual SD,
## so it is squared to obtain the residual variance
msr <- as.numeric(n.rater*sm$varcor$id + sm$sigma^2)
msw <- as.numeric(sm$sigma^2)
icc1k <- (msr - msw)/msr

### two-way random effects

## data: now a rater effect enters the response
response <- random.rater + random.subject + random.error
dt <- data.frame(id, rater, response)

## fitting the model
fit <- lmer(response ~ 1 + (1|id) + (1|rater), data = dt)
sm <- summary(fit)

## ICC(2,k)
msr <- as.numeric(n.rater*sm$varcor$id + sm$sigma^2)
mse <- as.numeric(sm$sigma^2)
msc <- as.numeric(n.subject*sm$varcor$rater + sm$sigma^2)
icc2k <- (msr - mse)/(msr + (msc - mse)/n.subject)

### two-way mixed effects

## data: same response, but rater is now treated as fixed
dt <- data.frame(id, rater, response)

## fitting the model
fit <- lmer(response ~ rater + (1|id), data = dt)
sm <- summary(fit)

## ICC(3,k)
msr <- as.numeric(n.rater*sm$varcor$id + sm$sigma^2)
mse <- as.numeric(sm$sigma^2)
icc3k <- (msr - mse)/msr