Proper methods for analysis of ordinal outcome measures

In various allied health disciplines, there is a growing push from academia to use “validated, evidence based” scales and outcome surveys.

A common one is the Disabilities of the Arm, Shoulder, and Hand – aka the DASH, as we therapists refer to it.

Essentially, there are 27 to 30 Likert items. An average is calculated, and this raw score is converted to a number from 0 (no perceived disability) to 100 (total perceived disability). A “minimum clinically important difference” is defined as a 12.75%–17.23% change in score.
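To make the scoring concrete, here is a minimal sketch of that conversion (assuming the standard DASH rule, which I believe is: 30 items each scored 1–5, at least 27 answered, and score = (mean item response − 1) × 25):

```python
def dash_score(responses):
    """Convert DASH item responses (each 1-5, or None if skipped)
    to a 0-100 disability score: (mean item response - 1) * 25.
    At least 27 of the 30 items must be answered for a valid score."""
    answered = [r for r in responses if r is not None]
    if len(answered) < 27:
        raise ValueError("too many missing items for a valid DASH score")
    mean = sum(answered) / len(answered)
    return (mean - 1) * 25

# All items "no difficulty" (1) -> 0; all "unable" (5) -> 100
print(dash_score([1] * 30))  # 0.0
print(dash_score([5] * 30))  # 100.0
```

Note that the 0–100 number is just a linear relabeling of the item average – everything said below about averaging Likert items applies to it unchanged.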

These are generally used in the context of providing evidence to third-party payers that therapy is effective in an individual case and should be continued (or ended when goals are met or progress stalls).

I’ve known for a long time that the use of parametric statistics on Likert-type data like this is controversial. But it appears that those who defend the practice and those who oppose it no longer write in the same publications. Those who teach research methods to clinicians are generally the supporters of parametric analyses in these scenarios.

Front-line therapists will never understand the measurement issues unless they have an inquiring, even contrarian, mind.

Some of my concerns generally:

  1. The obvious problem of regression to the mean (in the sense of return towards central tendency).
  2. The initial score and the outcome score can clearly come from non-normal distributions. Scores at initial entry into a clinic for an injury are going to be skewed towards high values, and at discharge, towards low ones.

It would seem to suffer all of the problems that changes from baseline have.

How does one know whether the finite-variance assumptions that justify parametric analyses are plausible here?

  1. Obviously, the distribution of possible total scores is approximately normal, but that is just an artifact of the scale, not of the underlying phenomenon.

Could this simply be a case where the assumptions of parametric analysis are violated, but the error is small enough not to worry about?

FYI – he is referring to this paper here:
Analyzing Ordinal Data with Metric Models: What Could Possibly Go Wrong?
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2692323

I have thought the information in the distribution of scores (i.e., the histogram) was interesting. Perhaps some sort of Wilcoxon signed-rank method would be preferable. I’m really not sure.
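For what it’s worth, a Wilcoxon signed-rank analysis of paired intake/discharge scores is a one-liner in SciPy. This is a sketch on made-up data (the scores and effect sizes below are invented, not real DASH figures):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical paired DASH-like scores for 20 patients at intake and discharge
intake = rng.integers(40, 90, size=20).astype(float)
discharge = intake - rng.integers(0, 30, size=20)  # most patients improve

# Wilcoxon signed-rank test: uses only the signs and ranks of the paired
# differences, so it leans far less on interval-scale assumptions than a t-test
res = stats.wilcoxon(intake, discharge)
print(f"statistic={res.statistic}, p={res.pvalue:.2g}")
```

(A caveat: ranking the *differences* still implicitly treats differences as comparable, which is why the ordinal-regression route discussed below may be cleaner.)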

Any suggestions?

The average is immediately a red flag. We’re throwing a bunch of information in the trash.

Mathematically, the average minimizes the sum of squared distances between μ and the data points (it is the median that minimizes the sum of absolute distances). But let’s be practical. Consider insurance claims data, or genetic data. Many, many gene-expression counts are 0, 1, or 2, and some are very high, in the tens of thousands. Likewise, many insurance claims are 0, and then we have a few observations that are very high (someone gets hit by a bus, etc.). If we take an average, we’re getting something somewhere in the middle. It’s not practically informative.
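A two-line illustration with made-up claims data (the numbers are invented for the example):

```python
import statistics

# Hypothetical insurance claims: most are zero, a few are catastrophic
claims = [0] * 95 + [1_200, 3_500, 8_000, 250_000, 900_000]

print(statistics.mean(claims))    # pulled far above the typical claim
print(statistics.median(claims))  # 0 -- what the typical policyholder actually costs
```

Neither number alone describes this distribution; the mean lands in a region where no observation sits.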

If we take an average, we’re not representing the data accurately at all. Many statistics classes teach that an average is a good way to summarize data, but here it isn’t.

Really, not all “parametric analyses” need to satisfy the normality assumption. If you’re performing a two-sample t-test, we need X to be normally distributed (and we need to perform another test beforehand so that we know the data are reasonably normal). However, for a regression model, this need not be satisfied in all cases.

Yes, most data is not normally distributed.

Not my area of expertise, but it’s still possible to do a parametric analysis.

Their expertise lies elsewhere. Extremely intelligent and competent people, but from my experience, they don’t spend time thinking about data. There are exceptions, but it’s rare.

~~

But to address the title of this post, I think it’s best to fit an ordinal regression model.

An excellent, reproducible, and explicit example of ordinal modeling, with code, is by Bürkner.
A reference is “Ordinal Regression Models in Psychology” https://journals.sagepub.com/doi/abs/10.1177/2515245918823199?journalCode=ampa

First, thanks for that reference. I will definitely take a look.

> Really, not all “parametric analyses” need to satisfy the normality assumption. If you’re performing a two-sample t-test, we need X to be normally distributed (and we need to perform another test beforehand so that we know the data are reasonably normal). However, for a regression model, this need not be satisfied in all cases.

No, but you would need a credible argument that the variance is finite (and that there is equal spacing between levels on the scale). Parametric analyses like t-tests are robust for validity (size is preserved) under non-normality with finite variance, but not robust for power. That is why techniques such as robust and adaptive methods exist.

As best I understand the issue: if a skeptic argues the error distribution could be Cauchy (infinite variance), the population mean is undefined. Of course, you can always calculate a sample mean. But in this scenario the sample mean (and standard deviation) are not sufficient statistics, and inferential techniques based on them are incorrect: neither size nor power is preserved. Quantiles still have meaning, however, and can be estimated.
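The Cauchy point is easy to see numerically (a quick simulation, not tied to any real data): the running mean never settles down, while the median homes in on the true center.

```python
import numpy as np

rng = np.random.default_rng(7)
draws = rng.standard_cauchy(100_000)

# The sample mean of Cauchy draws is itself Cauchy-distributed: it jumps
# around forever, no matter how large the sample gets. The median converges.
for n in (100, 1_000, 10_000, 100_000):
    print(f"n={n:>6}: mean={np.mean(draws[:n]):9.3f}, "
          f"median={np.median(draws[:n]):7.3f}")
```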

I would argue the metric information is an illusion of forcing the responses onto a finite scale. The percentage is meaningless.

(If any of the above is incorrect or in some sense not rigorous, someone please correct me.)

Yeah, I agree.

Sure, but when do we really have an infinite population? Particularly in clinical data we’re dealing with some subset. Does it even make sense to define an infinite population stratified on ethnicity?

Sure, I can see why this might be true from an asymptotic perspective, but ultimately we’re drawing inferences from finite populations. Please cite a reference and I’ll check it out.

I just threw this out as a trivial example.

In most real-life scenarios, we’ll get data that don’t satisfy normality assumptions. In my case, I’ll specify a conservative, low-variance prior on a regression coefficient, and if there’s not much power, we’ll get a draw from the prior. No need for power. We can’t always run “high power” studies in real life, for practical reasons.

The issues are very well described in this article:

> Finally, is it valid to assume that Likert scales are interval-level? I remain convinced by the argument of Kuzon Jr et al., which, if I may paraphrase it, says that the average of fair and good is not fair-and-a-half; this is true even when one assigns integers to represent fair and good!