BBR Session 2: Probability and Descriptive Statistics

This is a topic for questions, answers, and discussions about session 2 of the Biostatistics for Biomedical Research web course airing on 2019-10-11. Session topics are listed here.


As requested, here is a repost of my question.

"If our goal is more description of the central tendency observed in the sample rather than inferences (where efficiency may matter more, assuming the formal meaning of efficiency relating to variance) then the median would still be worthwhile (i.e. describing survey responses for ordinal rating scales).

Is that fair or do you still think the arithmetic mean may be better if the scale has more than 3 ordered levels? I am curious to hear an example about how the interpretation may be better with the mean instead of the median for something such as a 7-point, discrete rating scale."

Thank you!

similarly, length of stay in hospital seems to be a common variable but little standard analysis decided upon? eg average, median (survival), mode, max, z-scores…

for likert 6 i might dichotomise it

Quantiles only work well with continuous distributions. When there are many ties in the data, the addition of one data point can drastically change the estimated quantile. That’s why I use the mean as often as I do for discrete ordinal variables.
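A tiny sketch of that instability (hypothetical data; R's `type = 1` inverse-ECDF quantile is used so the jump is easy to see):

```r
# 20 observations, heavily tied: ten 1s and ten 2s
y <- c(rep(1, 10), rep(2, 10))
unname(quantile(y, 0.5, type = 1))        # estimated median is 1

# adding one extra observation flips the estimated median
unname(quantile(c(y, 2), 0.5, type = 1))  # estimated median is 2
```

One added data point moved the quantile by a whole category, which is exactly the sensitivity described above.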


Instead of length of stay, think time to successful discharge, which lends itself to the use of right censoring in case of death. LOS is often best analyzed with the Cox PH model, whether you have censoring or not. From that model you can estimate quantiles, the mean, and the probability that LOS > x for any x. “Just say no” to z-scores.
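A minimal sketch of that setup with the survival package (hypothetical data; `los` in days, `event = 1` for successful discharge, `0` for death treated as right-censored):

```r
library(survival)

d <- data.frame(
  los   = c(3, 5, 2, 8, 4, 10, 6, 7, 1, 9, 4, 12),
  event = c(1, 1, 1, 1, 0, 1,  1, 1, 1, 0, 1, 1),
  tx    = factor(rep(c("A", "B"), 6))
)

fit <- coxph(Surv(los, event) ~ tx, data = d)
sf  <- survfit(fit, newdata = data.frame(tx = "A"))

summary(sf, times = 7)$surv         # P(LOS > 7) for group A
quantile(sf, probs = 0.5)$quantile  # median LOS for group A
```

From the one fitted model you get quantiles, exceedance probabilities, and (via the survival curve) the mean, with deaths handled by censoring rather than discarded.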


agreed, but i hardly ever see anyone do it


If I understand your position – you would consider the mean a valid descriptive tool on ordinal data. That brings 2 questions to my mind:

  1. Would you continue to feel comfortable using the mean difference as an inference tool in this situation?

  2. Can you also give an example of ordinal data where you would be comfortable using the mean difference?

At least in the rehabilitation literature, we use a multitude of subjective patient rating scales where it is simply assumed that the mean difference is valid. From my reading of the relevant literature in measurement theory, these attributes are best thought of (from a mathematical POV) as ordinal (e.g. pain, perceived exertion, perceived disability). A good discussion of the primary issues that relate measurement theory to all sorts of psychological testing can be found here and here.

Changes in these scales are used to justify various clinical decisions at the individual level (as well as to allocate resources). With the move to “value based” payment models that shift risk to clinicians and the organizations that employ them, I fear that these tools do not have the frequency properties that are assumed by decision makers, as is described in the paper you cited months ago – Analyzing Ordinal Data with Metric Models: What Could Possibly Go Wrong?

I ask because I suspect my experience might be too narrow to have good intuition in this area.


I am not comfortable using the mean difference. I would do formal comparisons on the model metric, e.g., odds ratios.
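For example, a comparison on the odds-ratio scale via a proportional odds model might look like this (simulated 7-point ratings; `MASS::polr` used purely as an illustration):

```r
library(MASS)

set.seed(42)
g <- factor(rep(c("A", "B"), each = 75))
# latent shift for group B, then cut into 7 ordered levels
latent <- ifelse(g == "B", 0.8, 0) + rnorm(150)
y <- ordered(cut(latent,
                 breaks = c(-Inf, -1, -0.5, 0, 0.5, 1, 1.5, Inf),
                 labels = 1:7))

fit <- polr(y ~ g, Hess = TRUE)
exp(coef(fit))  # odds ratio for B vs A: odds of a higher rating
```

The comparison is then stated on the model's own metric (an odds ratio for being in a higher category), rather than as a mean difference on the raw 1-7 scores.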


I would like to better understand the relationship between the geometric mean, the sample median, and the log transformation. @f2harrell mentioned something about the relationship between these concepts in BBR session 2 but didn’t go into detail.

(Aside: What is the difference between using a linear model with a log transformed outcome variable and a “log-linear” model?)

Suppose I would like to model healthcare cost data which has lots of zeros and is highly right skewed.

# generate some fake highly skewed data
x <- factor(sample(c("A", "B"), 120, replace = TRUE))
y <- c(rep(0, 20), rpois(100, lambda = 2) * 2^(abs(rnorm(100, 0, 4)))) * (1.8^(as.integer(x) - 1))
y <- round(y)

The naive approach I have tried is simply to log transform the outcome after adding a small constant to avoid log(0).

m1 <- lm(log(y + 0.1) ~ x)

However, the mean of the model’s predicted values is much closer to the sample median than the sample mean.

mean(y)
#> 144.5083
median(y)
#> 10
mean(exp(predict(m1)))
#> 6.937442

Is the mean of the exponentiated model predictions actually estimating the sample median? What is the relationship to the geometric mean? Is there a better way to model this sort of data (e.g. ordinal regression)? I would like my model to generate realistic predictions.

Per BBR session 1, dichotomising a variable is apparently a very bad idea. A six-level Likert scale would contain log2(6) ≈ 2.6 bits of information (right?) while a binary variable contains only 1 bit.


There are two weaknesses in using logs: (1) we don’t usually know that log is the right transformation, and (2) the little constant you add to avoid zeros is arbitrary and matters a lot.

We take logs in an attempt to get a symmetric distribution. If we succeed, then mean(log(X)) = median(log(X)), so the antilog of either one estimates the median of X on the original scale. Note that the antilog of a mean does not equal the mean of the antilogs; the median, by contrast, commutes with monotone transformations.
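A quick sketch of that relationship (simulated log-normal data, for which the log is exactly symmetric):

```r
set.seed(7)
x <- rlnorm(1e5, meanlog = 1, sdlog = 0.8)

exp(mean(log(x)))    # geometric mean = antilog of mean(log(x))
exp(median(log(x)))  # = median(x): the median commutes with monotone maps
median(x)

# but the antilog of the mean is not the mean of the antilogs:
mean(x)              # noticeably larger than the median for skewed data
```

All three of the first quantities estimate the same thing here (the median, exp(1) ≈ 2.72), while the arithmetic mean of X is pulled up by the right tail.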

Healthcare cost can be well modeled using robust semiparametric ordinal models such as the Cox proportional hazards model. See for example the talks I’ve given on analysis of cost.


Coming back to my question, the BBR Session 2 video reminded me of something very simple that would have answered my question initially: using ranks is best done without ties, so ranks (and quantiles) need continuous variables to work best. Dr. Harrell points this out in the session, which is a good reminder to revisit basic assumptions about variables when using certain procedures or summary statistics. Every rank-based test is initially introduced as a procedure where the data are uniquely valued, with no two observations sharing the same value/position, because what matters is the ordering; the tests were then adapted for scenarios with ties (at least in my reading, this is how I’ve always seen rank-based procedures presented).


Thanks for bringing this point up. This used to be the case, because we don’t have great methods for getting accurate p-values in the presence of heavy ties in a nonparametric (or parametric) test. But when you don’t use the rank test and instead use its generalization, semiparametric regression models, you no longer care about extensive ties. In the limit the proportional odds model becomes the binary logistic model. The proportional odds model handles ties extremely well, especially when using the likelihood ratio test from the model.
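A sketch of that workflow on simulated data with extensive ties (only 4 distinct levels); `MASS::polr` with a likelihood ratio test via `anova` stands in for the semiparametric model:

```r
library(MASS)

set.seed(5)
g <- factor(rep(c("A", "B"), each = 100))
# only 4 distinct outcome levels, so ties are extensive by construction
p_A <- c(0.40, 0.30, 0.20, 0.10)
p_B <- c(0.25, 0.25, 0.25, 0.25)
y <- ordered(c(sample(1:4, 100, TRUE, p_A), sample(1:4, 100, TRUE, p_B)))

full <- polr(y ~ g, Hess = TRUE)
null <- polr(y ~ 1, Hess = TRUE)
anova(null, full)  # likelihood ratio test for the group effect
```

With only four levels this is close to the "limit" mentioned above: the model degrades gracefully toward binary logistic regression as categories merge, which is why the heavy ties cause no trouble.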

This is in contrast with sample quantiles and quantile regression which assume a scarcity of ties.


Was hoping for some references on use of clinical overrides for ordinal data, as discussed in regards to Figure 4.6 in BBR during the last session. I’m just having difficulty seeing the utility of that approach. Thanks so much!

yes, the ‘do not dichotomise’ advice has been written everywhere for 15+ years, eg Senn’s “Disappointing Dichotomies” (2003). but i don’t believe the precision in a 7-point Likert, and you see everyone dichotomise it eg

I don’t have a general reference. There are just a lot of discrete ordinal scales in use that have the same spirit, but they don’t typically have a continuous component for most of the levels. The clinical utility seems clear to me so please follow-up with specific questions. Here are some of the thoughts:

  • A clinical override at the top levels of the ordinal scale gives you a placeholder so that you don’t need to ignore death and other bad outcomes
  • The approach is statistically efficient
  • The main assumption is that the clinical events are worse than any level of the other parts of the scale
  • The interpretation is clear, e.g., how does the probability of something worse than y happening to a patient depend on the treatment?

In next gen sequencing experiments, the readout of the experiment is the count of a particular RNA/DNA sequence. Many of the methods used to analyze the data use Poisson or negative binomial distributions for the modeling. I am intrigued by the idea of approaching this as an ordinal variable without any assumption as to the underlying distribution as noted in Sec 4.1.2.

Is anyone here aware of nonparametric/ordinal regression approaches (as suggested in Section 4.1.2) for the analysis/modeling of these sequencing datasets? Are there any references out there to such an approach?

It’s an approach that is fraught with difficulty. What do you consider a ‘small’ start? If you add 0.1, as you did, you are replacing zero with −2.3. However, if you chose a start of 0.01 or 0.001, you would be replacing it with −4.6 or −6.9. Choice of start can be very influential in your calculations.

Why alter known values at all?

There are other ways of tackling the problem. Have a look at Bill Gould’s entertaining blog post Use poisson rather than regress; tell a friend
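A small illustration of how much the start value matters (the replacement values are from the post above; the skewed data at the end are hypothetical):

```r
# what zero becomes under different starts
log(0.1)    # ≈ -2.3
log(0.01)   # ≈ -4.6
log(0.001)  # ≈ -6.9

# with many zeros, summaries of log(y + start) move with the choice
set.seed(9)
y <- c(rep(0, 20), rexp(100, rate = 0.1))
sapply(c(0.1, 0.01, 0.001), function(s) mean(log(y + s)))
```

The same data yield visibly different group means on the log scale depending only on an arbitrary constant, which is the core objection to the add-a-small-number fix.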


I don’t know of any direct comparisons with ordinal and Poisson. Would be a great graduate student project. A sequence count may work well with Poisson. I’m especially interested in getting away from Poisson when there are lots of zero counts or when the upper limit is huge.

There have been many smart questions above. Mine is much simpler, but I would like @f2harrell to clarify what is meant by ‘small’ when talking about the number of categories of ordinal random variables. According to the notes, simple proportions might be helpful for ordinal random variables when the number of categories is ‘small’. How many categories is that: 3-4?