Hi folks.

I’m looking for a little bit of help with different measures of reliability, and test-retest reliability specifically. I would be thrilled if anyone could help. And if anyone finds it easier to explain using R code, that would make things very much clearer for me.

When we do test-retest studies, our group, and my field (molecular imaging) more generally, has traditionally reported ICC(1,1) according to Shrout & Fleiss. A biostatistics reviewer on a paper we’ve submitted made an incredible list of suggestions, including the point that ICC(1,1) would be a pretty bad choice for a test-retest study. That was pretty mindblowing, considering that ICC(1,1) appears in just about all the test-retest reliability papers that have ever been produced within my field. So I’ve been looking into the literature, and found a really nice flow chart in this article (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4913118/), which agrees with the reviewer and suggests that I should really be going for McGraw & Wong’s ICC(A,1): their two-way mixed-effects, absolute-agreement, single rater/measurement ICC. Great information to have!

So that’s all fine and well - next I want to calculate it to understand it better. There are formulas provided in the article, but there’s also a little description that I find myself quite skeptical of/confused by. I’ll quote it below:

“For the same ICC definition (eg absolute agreement), ICC estimates of both the 2-way random- and mixed-effects models are the same because they use the same formula to calculate the ICC (Table 3). This brings up an important fact that the difference between 2-way random- and mixed-effects models is not on the calculation but on the experimental design of the reliability study and the interpretation of the results.”
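
For reference, these are the two single-measurement formulas as I understand them, written in terms of ANOVA mean squares. The notation is my own shorthand, not necessarily the article's: n subjects, k measurements per subject, MS_R between subjects, MS_W within subjects (one-way model), MS_C between measurement occasions, and MS_E the residual (two-way model).

$$
\mathrm{ICC}(1,1) = \frac{MS_R - MS_W}{MS_R + (k-1)\,MS_W}
$$

$$
\mathrm{ICC}(A,1) = \frac{MS_R - MS_E}{MS_R + (k-1)\,MS_E + \frac{k}{n}\,(MS_C - MS_E)}
$$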

So this leaves me scratching my head for a few reasons:

A) If the two are calculated in exactly the same way, then I don’t quite understand why both exist as separate metrics. Apparently it comes down to study design, but I struggle to see why they would be reported as different outcome measures. Is this an error in the paper, or is it really just this way?

B) Does mixed effects mean something different from a multilevel model here? I imagine running this with ANOVA vs running this with lme4 or nlme in R, but perhaps it’s a different definition that I don’t understand. Or do both methods just produce exactly the same estimates? This also seems a little weird to me…

So, perhaps just for my own sake, and to get some kind of an understanding: *if* one actually *does* calculate the mixed model ICC using a mixed model framework (in R, lme4 or nlme), then how would I go about doing that, and how would I go about calculating the MS values from the fitted model? I could probably do it with a normal ANOVA (or even without actually running the model, by just calculating the MS values directly), but I’m a little at a loss for a mixed-effects strategy. I’ve found a few packages in R which provide ICC estimates, so I could try to reverse-engineer their code, but I can find none which actually appear to implement a mixed-effects modelling framework (I looked at psych, psy, irr, irrNA).
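
To make concrete what I mean by calculating the MS values directly, here is a minimal sketch of the plain ANOVA route (not the mixed-model one I'm asking about). It is in Python rather than R purely for illustration. The function name and the toy numbers are made up by me, and the formulas are my reading of Shrout & Fleiss and McGraw & Wong, so please correct me if I have them wrong.

```python
def icc_from_mean_squares(data):
    """Compute ICC(1,1) and ICC(A,1) from ANOVA mean squares.

    data: list of rows, one row per subject, one column per
    measurement occasion (no missing values assumed).
    """
    n = len(data)          # number of subjects
    k = len(data[0])       # measurements per subject
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between occasions
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_within = ss_total - ss_rows                           # one-way within
    ss_error = ss_within - ss_cols                           # two-way residual

    # Mean squares
    ms_rows = ss_rows / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    icc_1_1 = (ms_rows - ms_within) / (ms_rows + (k - 1) * ms_within)
    icc_a_1 = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    )
    return {"ICC(1,1)": icc_1_1, "ICC(A,1)": icc_a_1}


# Hypothetical test-retest data: 5 subjects, 2 scans each (invented numbers).
toy_scans = [
    [10.1, 10.9],
    [12.3, 12.8],
    [9.4, 10.3],
    [11.7, 12.1],
    [13.0, 13.9],
]
print(icc_from_mean_squares(toy_scans))
```

If I have the algebra right, the two estimates coincide exactly when MS_C equals MS_E, so any difference between them comes from a systematic shift between occasions.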

Also, in case you happen to know: in my experience, and judging by the confidence intervals around ICC estimates, ICC(1,1) and the other absolute-agreement ICC estimates for single-rater reliability seem to match each other pretty well, and won’t easily be distinguishable except in absolutely enormous samples. If anyone happens to know of contradictory examples, that would help in letting me know that my intuition is incorrect.

Thank you *so* much in advance to anyone who can help!