Is there an accepted methodology for determining how accurate a device that measures physiologic variables such as HR, HR variability, acceleration, stride length, etc. might be? Is it simply comparison to a “gold standard”? What if one does not exist? I have seen some approaches that merely calculate a mean and consider measurements 2 or more standard deviations away “abnormal”.
Using distance from the sample mean in standard deviation units to detect outliers is problematic because the mean (like the standard deviation itself) is highly sensitive to any single data point, so the values you are trying to detect distort the yardstick used to detect them.
(Section 4.4.2 describes the mathematical properties of the mean that make it unsuitable as a measure of location for detecting abnormal values.)
https://hbiostat.org/bbr/descript
Finding highly influential observations (i.e., “abnormal” values) requires either a procedure like the jackknife, to see how much the mean changes when an observation is removed, or a summary statistic that is less responsive to abnormal values, such as the median.
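To make the two options above concrete, here is a minimal Python sketch. The heart-rate values are made up for illustration, and the function names `jackknife_mean_influence` and `mad_scores` are hypothetical helpers, not from any package. The robust score uses the median and the median absolute deviation (MAD) scaled by 1.4826, which is one common choice and not necessarily what the linked text recommends.

```python
import numpy as np

def jackknife_mean_influence(x):
    """Change in the sample mean attributable to each observation.

    Returns mean(all data) minus mean(data with observation i removed).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    loo_means = (x.sum() - x) / (n - 1)  # leave-one-out means
    return x.mean() - loo_means

def mad_scores(x):
    """Distance from the median in units of scaled MAD (assumes MAD > 0)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return (x - med) / (1.4826 * mad)

# Hypothetical resting heart rates (bpm); the last value is an artifact.
hr = np.array([62, 64, 63, 61, 65, 63, 62, 140], dtype=float)

influence = jackknife_mean_influence(hr)
robust_z = mad_scores(hr)
classic_z = (hr - hr.mean()) / hr.std(ddof=1)

print("leave-one-out influence on mean:", np.round(influence, 2))
print("classic z-scores:", np.round(classic_z, 2))
print("median/MAD scores:", np.round(robust_z, 2))
```

In this toy sample the artifact inflates both the mean and the standard deviation enough that its classic z-score stays below 3, while its leave-one-out influence and its median/MAD score stand out clearly.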
An old thread discusses the issue of dealing with outliers. I particularly like the intro sections of Thomas O’Gorman’s books on adaptive procedures, which discuss how to think about the problem from a principled frequentist perspective. He has also provided simulation results comparing various textbook procedures with the adaptive procedures he derived, showing the benefits and drawbacks of each.
https://discourse.datamethods.org/t/provisions-for-outilers-and-strategies/4222/2?u=r_cubed
There is a bit going on here. First, you probably want to distinguish between HR/HRV and gait variables because the physics of measurement are quite different. For HR/HRV, different “devices” have different levels of reliability, e.g., field optical wrist measurement < field commercial chest strap < electrodes with conducting gel in a controlled environment. Gait measurements have a similar hierarchy: an indirect sensor located away from the legs/feet < foot pods < image tracking or direct accelerometer measurement in a lab.
I guess that was a long way of saying: yes, for the variables you mentioned there should be a “gold” standard, but one would probably also need to consider the environments the measurements are taken in.
General suggestion: Estimate the mean, median, and pseudo-median of the absolute differences between two ways of measuring the same thing, and get bootstrap confidence intervals for these overall discrepancy measures. See also https://hbiostat.org/bbr/obsvar.
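A rough Python sketch of the suggestion above, under several assumptions of mine: the paired data are simulated, the function names are hypothetical, the pseudo-median is taken to be the Hodges-Lehmann estimate (median of all pairwise averages, including each value with itself), and the intervals are plain percentile bootstrap intervals. It also assumes one pair of measurements per subject; with repeated measurements you would resample whole subjects instead.

```python
import numpy as np

def pseudo_median(x):
    """Hodges-Lehmann estimate: median of Walsh averages (x_i + x_j)/2, i <= j."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(x.size)  # all index pairs with i <= j
    return float(np.median((x[i] + x[j]) / 2.0))

def discrepancy_summary(a, b, n_boot=2000, level=0.95, seed=0):
    """Point estimates and percentile bootstrap CIs for |a - b|.

    Returns {name: (estimate, lower, upper)} for the mean, median,
    and pseudo-median of the absolute paired differences.
    """
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    rng = np.random.default_rng(seed)
    # Each row holds the indices of one bootstrap resample of the pairs.
    idx = rng.integers(0, d.size, size=(n_boot, d.size))
    stats = {"mean": np.mean, "median": np.median, "pseudo_median": pseudo_median}
    alpha = 1.0 - level
    out = {}
    for name, f in stats.items():
        boot = np.array([f(d[row]) for row in idx])
        lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
        out[name] = (float(f(d)), float(lo), float(hi))
    return out

# Simulated example: a reference HR and a device with bias and noise.
rng = np.random.default_rng(1)
reference = rng.normal(70, 8, size=60)
device = reference + rng.normal(1.0, 3.0, size=60)

summary = discrepancy_summary(reference, device)
for name, (est, lo, hi) in summary.items():
    print(f"{name}: {est:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

The same resampled indices are reused for all three statistics so that their intervals reflect the same bootstrap samples.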
Given a proposal to compare 5 IMU sensor systems whose algorithms are black boxes that simply output an ordinal risk of injury (low, medium, high), and given access to 500 horses in training, how would you design a study to see which of the sensors most accurately predicts musculoskeletal injuries in racehorses?