Preprint: Analysis of Likert Type Data Using Metric Methods

Dennis Boos and Judy Chen (2022). Analysis of Likert-Type Data Using Metric Methods.


The authors defend the common practice of applying parametric models to ordinal data on two grounds: pragmatics (parametric methods are “easier” for researchers to understand), and perceived robustness, in the sense that parametric methods applied to empirical or simulated data sets are claimed not to be so inaccurate that skepticism about the validity or reproducibility of the results is justified.

They perform simulations under modest departures from normality that travel well-known paths for those familiar with the analytic and simulation results going back decades.

While the authors do discuss the important paper by Liddell and Kruschke, they do not reference a more recent paper by Simkovic and Trauble, who examine the issue from the perspective of measurement theory and come to different conclusions.

Building upon the theoretical work in the Schroder and Yitzhaki paper above, Jeffrey Bloem (below) is charitable in the extreme to the use of metric models: he develops a procedure to estimate the possible bias caused by scale mis-specification. He closes this interesting paper with:

The cardinal treatment of ordinal variables can lead to incorrect empirical findings. Although some empirical findings may be robust to alternative monotonic increasing transformations, many will not be.

Summary of Criticisms
The use of parametric methods on ordinal data is an overlooked contributor to the replication crisis in psychology, and to problems in a large, heterogeneous body of research on patient-reported outcomes in pain and various rehabilitation interventions.

The problems with their defense of parametric models on ordinal data only become apparent from the meta-analytic or information-synthesis perspective. Floor and ceiling effects at the local study level interact with selective reporting (publication bias), and disrupt the expected error-correcting process of scientific inquiry.
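As an illustrative aside, here is a minimal simulation (the latent-normal setup, cutpoints, and effect sizes are my assumptions, not from the paper) showing how a ceiling effect on a 5-point scale attenuates the group difference that a metric (mean-based) analysis sees:

```python
# Hypothetical simulation: when one group already sits near the top of a
# 5-point scale, discretization compresses the observable mean difference.
import random
from statistics import mean

random.seed(1)

def likert(latent, cut=(-1.5, -0.5, 0.5, 1.5)):
    """Discretize a latent score into ordinal categories 1..5."""
    return 1 + sum(latent > c for c in cut)

n = 20000
control = [random.gauss(1.0, 1.0) for _ in range(n)]  # already near ceiling
treated = [random.gauss(1.8, 1.0) for _ in range(n)]  # true latent shift = 0.8

latent_diff = mean(treated) - mean(control)
observed_diff = mean(map(likert, treated)) - mean(map(likert, control))

print(round(latent_diff, 2))    # close to the true 0.8 shift
print(round(observed_diff, 2))  # noticeably smaller: the ceiling compresses it
```

Because the attenuation depends on where the groups sit relative to the scale endpoints, different studies of the same intervention will shrink the effect by different amounts, which is exactly the kind of heterogeneity that disrupts synthesis.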

  1. The early critics of parametric models on ordinal data noted that arbitrary scale transformations could change the observed sign of the effect. Liddell and Kruschke demonstrate this in their paper, but it is not addressed by the authors. Simkovic and Trauble note detrimental effects on \beta and \alpha, which are compatible with the points made by Liddell and Kruschke. Increases in error probabilities drive the information gained from the experiment downward, possibly to 0 – i.e. no better than flipping a coin. From the meta-analytic perspective, this is a fatal defect that makes any evidence synthesis fruitless, as there is not even a p-value combination procedure that can consistently infer the sign of the effect. If a reader cannot trust the sign bit, why should they trust any of the other bits that represent the magnitude of the effect? Such an experiment is credible neither to a rational skeptic nor to a rational advocate. The implication is that no information is communicated by parametric models on ordinal data. Mathematical psychologist Joel Michell considers this practice (treating ordinal as metric), and its continued defense, “pathological science.” Meta-analysis only magnifies the problem.

  2. As the authors below point out (but do not emphasize), uninformative parametric analyses are more likely to be included in a meta-analysis, while more defensible ordinal analyses are likely to be ignored, even when it is possible to include them. There are also complexities in combining conceptually similar assessments that use different scales.

  3. Ordinal methods, by not depending on any particular scale, do provide a valid p-value and estimate (the Hodges-Lehmann estimate or the generalized odds ratio), based on the notion of stochastic ordering. This elegant model simply asks: “What is the probability that a randomly selected value from Y exceeds a randomly selected value from X?” This permits various methods of information synthesis, including more recently developed techniques based upon confidence distributions.
  4. These brute mathematical facts would seem to make the simulations an elaborate exercise in begging the question. The entire history of papers defending the practice either conducts simulations under charitable assumptions, or points to data sets where there is minimal difference in inference. A meta-analyst must take a skeptical, pre-data perspective. Given that there is no way to guarantee or verify the assumption of linearity, the interaction of floor and ceiling effects with effect size, variance, and departures from normality increases bias and uncertainty. Simulation studies should also be conducted under the least favorable distribution (e.g. the Cauchy). Analytic and simulation results would eliminate the t-test from consideration in this case. Parametric scale effects simply add a source of heterogeneity to the task of synthesis.

Decision-Theoretic Framework Justifying the Criticisms
Greenland (2020) proposes an information theoretic perspective for thinking about statistical methods.

Shafer proposed conceptualizing scientific communication as a betting game (see Testing by Betting: A Strategy for Statistical and Scientific Communication).

J. L. Kelly linked betting games to Shannon’s information theory in his historic 1956 paper, A New Interpretation of Information Rate. It provided a semantic model for the notion of an information advantage via gambling.

Betting arguments are also critical in proving the coherence of Bayesian methods. Bayesian decision theory and information theoretic considerations merge in the design of experiments, where maximizing utility is equivalent to maximization of information.

In an effort to improve scientific communication, Robert Matthews extended I. J. Good’s “method of imaginary results” to derive honest skeptic and advocate priors when presented with a frequentist interval. He calls this method the Bayesian Analysis of Credibility.

Working from the Bayesian perspective on the probability scale, the evidential value of any experiment is how much the posterior shifts from the prior in light of the data. Quantitatively, that is \frac{p(\theta \mid data)}{p(\theta)}, the relative belief ratio.
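A minimal numerical sketch of the relative belief ratio, under an assumed setup (uniform Beta(1,1) prior, binomial data; the numbers are illustrative only):

```python
# Relative belief ratio p(theta | data) / p(theta) for a binomial
# experiment with a conjugate Beta prior (stdlib only).
from math import gamma

def beta_pdf(theta, a, b):
    """Density of the Beta(a, b) distribution at theta."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * theta**(a - 1) * (1 - theta)**(b - 1)

def relative_belief(theta, successes, trials, a0=1.0, b0=1.0):
    """Posterior density over prior density at a single point theta."""
    prior = beta_pdf(theta, a0, b0)
    post = beta_pdf(theta, a0 + successes, b0 + trials - successes)
    return post / prior

# 7 successes in 10 trials: does the data favor theta = 0.7?
rb = relative_belief(0.7, 7, 10)
print(round(rb, 2))  # 2.94 -- belief in theta = 0.7 nearly triples
```

A ratio above 1 means the data moved belief toward that value of \theta; below 1, away from it.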

A Bayesian is free to use frequentist tools for computational simplicity.

We might think of an experiment as a computational procedure to decide a question of fact between an honest skeptic and an honest advocate. An advocate would object to any analysis method that reduces the power 1-\bar\beta, while a skeptic would object to any method that inflates the Type I error rate \alpha.

Bayarri, Benjamin, Berger, and Sellke link Bayesian concepts of evidence for hypothesis testing (Bayes factors) with frequentist error probabilities via the pre-experimental rejection odds:
O_{pre} = \frac{\pi_{1}}{\pi_0} \times \frac{1 - \bar\beta}{\alpha},
where \pi_1/\pi_0 is the prior odds of the alternative and (1-\bar\beta)/\alpha is what they call the rejection ratio.

Using the Bayesian rejection ratio, the skeptic can derive, in light of his or her prior odds on the plausibility of the null, the \alpha level at which an observed p-value is sufficiently surprising that the observed sign is accepted as real. This is better known as the “significance” level.
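A hedged sketch of that calculation (the skeptic's 1:9 prior odds and the 80% power figure are illustrative assumptions, not taken from the paper; prior_odds_alt below denotes \pi_1/\pi_0):

```python
# Pre-experimental rejection odds O_pre = (pi1/pi0) * (1 - beta) / alpha,
# and the alpha level a skeptic must demand to reach target odds.

def rejection_odds(prior_odds_alt, power, alpha):
    """Odds in favor of the alternative, given that the test rejects."""
    return prior_odds_alt * power / alpha

def required_alpha(prior_odds_alt, power, target_odds):
    """Alpha the skeptic demands so that a rejection yields target_odds."""
    return prior_odds_alt * power / target_odds

# A skeptic with 1:9 prior odds who wants 10:1 odds after a rejection
print(round(required_alpha(1 / 9, 0.8, 10), 4))  # 0.0089, far below 0.05

# Halving power (e.g. via a mis-specified metric analysis) halves the odds
print(round(rejection_odds(1 / 9, 0.8, 0.005), 1))
print(round(rejection_odds(1 / 9, 0.4, 0.005), 1))
```

The design choice here is deliberate: because O_pre is linear in power and inversely proportional to \alpha, any analysis method that degrades either quantity degrades the persuasive value of a rejection by exactly that factor.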

The evidential value of an experiment (before the data are seen) is proportional to a ratio of frequentist error probabilities. It can be seen that reductions in expected power (1 - \bar\beta) or increases in \alpha reduce the information value of an experiment, regardless of whether one takes a Bayesian or frequentist perspective.

Rafi and Greenland (2020) recommend transforming p-values into information measures via a log transform. The quantity -log_2(p) is the Shannon information provided by the experiment (i.e. “bits”), while the more common -ln(p) provides natural units (“nats”) of information. Either can be the basis of a method to aggregate the information contained in multiple data sets. An earlier paper by Greenland and Poole connects p-values to lower bounds on Bayesian posterior probabilities of sign errors.
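For concreteness, here is a small sketch of the bit conversion, together with one aggregation option (Fisher's method; choosing it here is my assumption, as Rafi and Greenland discuss several):

```python
# Converting p-values to information measures, and combining them.
from math import exp, factorial, log, log2

def s_value_bits(p):
    """Shannon surprisal -log2(p): bits of information against the null."""
    return -log2(p)

def fisher_combined_p(pvalues):
    """Fisher's method: -2 * sum(ln p) ~ chi-square with 2k df under the
    global null. Uses the closed-form survival function for even df."""
    stat = -2.0 * sum(log(p) for p in pvalues)
    k = len(pvalues)
    return exp(-stat / 2) * sum((stat / 2) ** i / factorial(i) for i in range(k))

print(round(s_value_bits(0.05), 2))               # 4.32 bits
print(round(s_value_bits(0.005), 2))              # 7.64 bits
print(round(fisher_combined_p([0.05, 0.05]), 3))  # 0.017
```

Note that Fisher's statistic is just twice the total information in nats, which is why the additivity of log-transformed p-values makes them a natural currency for synthesis.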

As noted above, using parametric models on ordinal data can decrease 1-\bar\beta, increase \alpha, or both. It may not even provide a reasonable p-value that correctly estimates the sign of the effect. Increasing error probabilities reduces the information from the experiment. These adverse consequences are avoidable by using ordinal models; the benefits hold not only at the individual study level, but also open up the possibility of a useful research synthesis.

Further resources
Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100(3), 398–407.

Explores the relationship between measurement scales and statistical procedures in 3 theories of measurement within psychology—the representational, the operational, and the classical. It is asserted that the representational theory implies a relation between measurement scales and statistics, although not the one mentioned by S. S. Stevens (1946) or his followers. The operational and classical theories, for different reasons, imply no relation between measurement scales and statistics, contradicting Stevens’s prescriptions. It is concluded that a resolution of this permissible-statistics controversy depends on a critical evaluation of these different theories.

Michell, J. (2000). Normal Science, Pathological Science and Psychometrics. Theory & Psychology, 10(5), 639–667.

A pathology of science is defined as a two-level breakdown in processes of critical inquiry: first, a hypothesis is accepted without serious attempts being made to test it; and, second, this first-level failure is ignored. Implications of this concept of pathology of science for the Kuhnian concept of normal science are explored. It is then shown that the hypothesis upon which psychometrics stands, the hypothesis that some psychological attributes are quantitative, has never been critically tested. Furthermore, it is shown that psychometrics has avoided investigating this hypothesis through endorsing an anomalous definition of measurement. In this way, the failure to test this key hypothesis is not only ignored but disguised. It is concluded that psychometrics is a pathology of science, and an explanation of this fact is found in the influence of Pythagoreanism upon the development of quantitative psychology.

Barrett, P. (2003). Beyond psychometrics: Measurement, non-quantitative structure, and applied numerics. Journal of Managerial Psychology, 18(5), 421–439.

A statement from Michell (Michell, J., “Normal science, pathological science, and psychometrics”, Theory and Psychology, Vol. 10 No. 5, 2000, pp. 639-67), “psychometrics is a pathology of science”, is contrasted with conventional definitions provided by leading texts. The key to understanding why Michell has made such a statement is bound up in the definition of measurement that characterises quantification of variables within the natural sciences. By describing the key features of quantitative measurement, and contrasting these with current psychometric practice, it is argued that Michell is correct in his assertion. Three avenues of investigation would seem to follow from this position, each of which, it is suggested, will gradually replace current psychometric test theory, principles, and properties. The first attempts to construct variables that can be demonstrated empirically to possess a quantitative structure. The second proceeds on the basis of using qualitative (non-quantitatively structured) variable structures and procedures…

Merbitz, C., Morris, J., & Grip, J. C. (1989). Ordinal scales and foundations of misinference. Archives of Physical Medicine and Rehabilitation.

Fundamental deficiencies in the information provided by an ordinal scale constrain the logical inferences that can be drawn; inferences about progress in treatment are particularly vulnerable. Ignoring or denying the limitations of scale information will have serious practical and economic consequences. Currently, there is a high demand for functional assessment scales within the rehabilitation community. It is hoped that such scales will satisfy the very real need for measures of function which reflect the impact of treatment on patient progress. Unfortunately, some commonly used evaluation instruments are not well suited to this task. The underlying rationale for clinical decision-making based on these scales is examined.

This study aims to contribute to the perpetual controversy on the parametric analysis of ordinal data, by giving a perchance long overdue examination of the widely held notion that sums of ordinal variables (e.g., Likert and summated rating scales) produce measures at ordinal level. In the present study, all 1,048,574 subscales of a well-known and widely applied sumscale, the 20-item CESD scale for depression, were assessed for their metrological properties. It was found that subscales consisting of less than 60% of the items of the original scale have lost all metrological properties of that scale, including ordinality as measured by Kendall’s tau. This result justifies concern about the robustness of measurement scale properties of (shortened) sumscales, and by implication, of the empirical findings based on such scales.

Here is a nice synthesis of Bayesian and frequentist nonparametric rank-based techniques that deserves close study.

Nice discussion of proportional odds and partial proportional odds models.

Related Threads

A very nice discussion on this in relation to patient reported outcomes starts here:

Some thoughts I’ve had on information theory and meta-analysis


What an amazing resource you’ve put together Robert! I’m linking to it from Resources for Ordinal Regression Models and including it in the next issue of Statistical Thinking News.


I need to clean up the argument a bit, but I think I’ve correctly argued this practice is not defensible from any statistical philosophy, whether Frequentist or Bayesian.

Applying decision theoretic principles to the design of experiments leads to the conclusion that maximizing utility coincides with maximization of information. Any practice that fails to do so when it is computationally feasible is irrational, or as Joel Michell might say “pathological.”

Some links connecting decision theory, information, and experimental design.