How to assess reliability/validity of a measurement based on differences in measurements?

Hanis_Kadir · August 1, 2025, 7:40am

Hi all,

I have a methodological question regarding reliability/validity testing for the Modified Weeks Test (MWT) in post-traumatic elbow stiffness assessment. The MWT involves measuring ROM differences before and after a preconditioning procedure (heat and passive stretch for 15 minutes).

In my study, two physiotherapist assessors measured elbow ROM in the following standardized position: patient sitting with back supported, upper arm at 90° shoulder flexion, and forearm in neutral rotation. We used a long-arm goniometer with the axis over the lateral epicondyle, the stationary arm along the humeral shaft, and the moving arm pointing towards the dorsal aspect of the distal radio-ulnar joint.

When analyzing inter-rater reliability, I found high ICC values for individual measurements (pre-conditioning “cold” and post-conditioning “warm”) but very low ICC for the calculated MWT differences. Similar results occurred with test-retest reliability.

Here’s an example from my data:

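| Participant ID | Assessor 1: pre-conditioning (°) | Assessor 1: post-conditioning (°) | Assessor 1: MWT (°) | Assessor 2: pre-conditioning (°) | Assessor 2: post-conditioning (°) | Assessor 2: MWT (°) |
|---|---|---|---|---|---|---|
| 1 | 24 | 20 | 4 | 28 | 17 | 11 |
| 2 | 47 | 44 | 3 | 55 | 41 | 14 |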

As shown, even when assessors’ pre and post measurements are relatively close, the calculated MWT differences can vary considerably. This might explain the low reliability for the MWT values.

My questions are:

1. Is it methodologically sound to conduct reliability and validity studies on measurements that are based on differences (since you don’t directly measure those differences)?
2. How should we approach reliability testing for such derived measurements?

Thanks!

davidcnorrismd · August 2, 2025, 1:24am

Could you tell us more about your experimental design? Are these paired assessments done on different performances, or on the same (e.g., videotaped) performance of the movement? Does each assessor independently position the patient, place the goniometer, etc? Are the 2 assessments done serially before, and then after, the very same conditioning procedure?

Hanis_Kadir · August 5, 2025, 7:32am

Thanks for the question.

These are paired assessments done on different live performances, not on the same performance. Each assessor enters the room separately and has the patient perform new movements for measurement. The first assessor positions the patient and marks the elbow and wrist landmarks; the second assessor then uses these same landmarks but independently places the goniometer and measures their own set of movements. Yes, the two assessments are done serially before and after the same conditioning procedure: both assessors complete their measurements, then the heat pack conditioning is applied, and then both assessors repeat their measurements in the same order.

Order of Assessment:

1. First assessor enters, positions patient, marks landmarks, measures ROM
2. Second assessor enters separately, measures ROM using same landmarks
3. Heat pack conditioning (15 minutes with arm in gravity-assisted position)
4. First assessor returns to measure post-conditioning ROM
5. Second assessor returns to measure post-conditioning ROM
6. Modified Weeks Score calculated

davidcnorrismd · August 5, 2025, 8:07am

So many sources of variation here, including within-patient variation between the several performances! I would tend to approach this with a hierarchical Bayesian model, and my gut tells me that an experiment large enough to separate the different sources would be too expensive to conduct.

But to your question,

I would say that a generative model that characterizes the basic sources of variation would also enable you to compute variation in derived quantities. Indeed, such derived quantities are often included in Bayesian models to obtain these computations.
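As a minimal sketch of how a generative model yields the variation of a derived quantity, assume (purely for illustration) that the reading for patient $i$ by assessor $j$ at occasion $t$ is a true angle plus independent error with a common variance. The symbols are assumptions introduced here, not part of the study, and this toy model deliberately ignores systematic assessor effects and the within-patient variation between performances mentioned above:

```latex
\begin{align}
Y_{ijt} &= \theta_{it} + \varepsilon_{ijt},
  \qquad \varepsilon_{ijt} \sim (0, \sigma_\varepsilon^2) \text{ independently}, \\
D_{ij} &= Y_{ij,\mathrm{pre}} - Y_{ij,\mathrm{post}}
  = \Delta_i + \left(\varepsilon_{ij,\mathrm{pre}} - \varepsilon_{ij,\mathrm{post}}\right),
  \qquad \Delta_i = \theta_{i,\mathrm{pre}} - \theta_{i,\mathrm{post}}, \\
\mathrm{ICC}_Y &= \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_\varepsilon^2},
  \qquad
\mathrm{ICC}_D = \frac{\sigma_\Delta^2}{\sigma_\Delta^2 + 2\sigma_\varepsilon^2}.
\end{align}
```

Here $\sigma_\theta^2$ is the between-patient variance of the true angle at a given occasion and $\sigma_\Delta^2$ is the between-patient variance of the true change. Under these assumptions the derived score $D$ carries twice the error variance of a single reading, and its ICC depends on how much patients truly differ in their change, which may be far less than how much they differ in absolute ROM.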

f2harrell · August 5, 2025, 11:33am

For a less generative approach, see BBR (Biostatistics for Biomedical Research), which uses all relevant pairwise absolute differences for each source of disagreement.
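A rough sketch of the pairwise-absolute-difference idea, applied to the two example participants from the first post. This is not code from BBR: the function name `mean_abs_pairwise_diff` and the data layout are assumptions made for illustration, and a real analysis would add an uncertainty interval (for example by bootstrapping participants).

```python
from itertools import combinations
from statistics import mean

# Readings in degrees for the two example participants in the first post,
# keyed by participant ID, one value per assessor (assessor 1, assessor 2).
pre = {1: [24, 28], 2: [47, 55]}
post = {1: [20, 17], 2: [44, 41]}
mwt = {1: [4, 11], 2: [3, 14]}


def mean_abs_pairwise_diff(scores_by_subject):
    """Mean absolute difference over every pair of assessors within each subject."""
    diffs = [
        abs(a - b)
        for scores in scores_by_subject.values()
        for a, b in combinations(scores, 2)
    ]
    return mean(diffs)


inter_pre = mean_abs_pairwise_diff(pre)
inter_post = mean_abs_pairwise_diff(post)
inter_mwt = mean_abs_pairwise_diff(mwt)

print(f"Mean absolute inter-assessor difference, pre:  {inter_pre} degrees")
print(f"Mean absolute inter-assessor difference, post: {inter_post} degrees")
print(f"Mean absolute inter-assessor difference, MWT:  {inter_mwt} degrees")
```

For these two participants the mean absolute inter-assessor difference is 6° for the pre-conditioning reading, 3° for the post-conditioning reading, and 9° for the MWT. The result is in the measurement's own units, so it can be judged against the size of change that would matter clinically. The same function would apply to test-retest pairs by passing repeated readings from one assessor instead.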

Hanis_Kadir · August 6, 2025, 9:04am

Thank you, I will consider the above methods.

What could be the reason for high ICC values for individual measurements (pre-conditioning “cold” and post-conditioning “warm”) but very low ICC for the calculated MWT differences?

davidcnorrismd · August 7, 2025, 5:39am

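I’m guessing that’s just a question of arithmetic? Consider as an extreme case two laser rangefinders that measure the distance between two points on opposite sides of some tectonic fault line. They probably agree almost perfectly on the absolute distance, but they still might disagree substantially on the tiny differences in that distance caused by minute tectonic movements. The same applies here: the MWT is a small difference between two much larger angles, so the errors of both readings carry over into it in full, while the between-patient spread that drives a high ICC largely cancels out.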
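The arithmetic can be illustrated with a small simulation. Everything below is an assumption chosen for illustration rather than an estimate from the study: the standard deviations, the sample size, and the use of ICC(2,1) (two-way random effects, absolute agreement, single rater), computed here with a hand-written helper `icc_2_1`.

```python
import random
from statistics import mean


def icc_2_1(ratings):
    """ICC(2,1) from a list of rows (one per subject, one column per rater)."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Assumed values, in degrees (hypothetical, not taken from the study).
N_SUBJECTS = 500
SD_TRUE_ROM = 15.0     # spread of true pre-conditioning ROM across patients
SD_TRUE_CHANGE = 2.0   # spread of the true pre-minus-post change across patients
SD_ERROR = 3.0         # independent measurement error on every single reading

rng = random.Random(0)
pre, post, diff = [], [], []
for _ in range(N_SUBJECTS):
    true_pre = rng.gauss(40.0, SD_TRUE_ROM)
    true_post = true_pre - rng.gauss(5.0, SD_TRUE_CHANGE)
    # Two assessors each take their own noisy reading before and after.
    pre_obs = [true_pre + rng.gauss(0.0, SD_ERROR) for _ in range(2)]
    post_obs = [true_post + rng.gauss(0.0, SD_ERROR) for _ in range(2)]
    pre.append(pre_obs)
    post.append(post_obs)
    diff.append([a - b for a, b in zip(pre_obs, post_obs)])

icc_pre, icc_post, icc_diff = icc_2_1(pre), icc_2_1(post), icc_2_1(diff)
print(f"ICC pre:  {icc_pre:.2f}")
print(f"ICC post: {icc_post:.2f}")
print(f"ICC MWT:  {icc_diff:.2f}")
```

With these assumed values the expected ICC of a single reading is about 15² / (15² + 3²) ≈ 0.96, whereas for the difference it is about 2² / (2² + 2·3²) ≈ 0.18, even though every reading has the same 3° error. Nothing is wrong with the individual measurements; the difference simply has little between-patient variance left to set against a doubled error variance.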