Modelling HUI3 using beta regression

Hi there,

I’m currently working on a research project where we’re developing a predictive algorithm for the Health Utilities Index Mark 3 (HUI3). HUI3 is a continuous health-related quality of life measure whose range is the closed interval [-0.36, 1]. A value of 1 implies perfect health, a value of 0 implies death, and a value of -0.36 implies a state worse than death. A couple of points to note when modelling HUI3:

  1. It's highly skewed, with most values piled up near the upper bound (peaks at 1 and 0.97)
  2. Its range is limited to [-0.36, 1]

I was thinking of using a beta regression model to model the conditional mean of HUI3, using the betareg package in R.

  • I was going to use the transformation mentioned in the betareg paper to map HUI3 into the open interval (0, 1). This would be (y∗(n−1)+0.5)/n, where n is the sample size (Reference: Smithson M, Verkuilen J (2006). "A Better Lemon Squeezer? Maximum-Likelihood Regression with Beta-Distributed Dependent Variables." Psychological Methods, 11(1), 54–71.)
  • I was going to use the log-log link function due to the large number of values close to 1 (they recommend this in the paper)
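
As a rough illustration of the transformation in the first bullet: the formula (y∗(n−1)+0.5)/n squeezes values from [0, 1] into (0, 1), so HUI3 would first need a linear rescale from [-0.36, 1] to [0, 1]. The sketch below assumes that two-step approach; the function name and the example values are made up for illustration.

```python
def squeeze_to_open_unit(y, lower=-0.36, upper=1.0):
    """Map values in [lower, upper] into the open interval (0, 1)."""
    n = len(y)
    # Step 1 (assumed): linear rescale from [lower, upper] to [0, 1].
    unit = [(v - lower) / (upper - lower) for v in y]
    # Step 2: the squeeze quoted above, (y * (n - 1) + 0.5) / n.
    return [(u * (n - 1) + 0.5) / n for u in unit]


# Hypothetical HUI3 scores, including both boundary values.
hui3 = [-0.36, 0.0, 0.5, 0.97, 1.0]
squeezed = squeeze_to_open_unit(hui3)
```

Note that the result depends on the sample size n, so the same HUI3 score maps to a different value in a dataset of a different size.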

Some points that I’m concerned/thinking about:

  • What assumptions do I need to test for in a beta regression model and how do I test them?
  • Should I use a variable or fixed precision parameter? Or how would I go about deciding this?

Do you have any thoughts/have you worked with HUI3 before?

Thanks for your help!


Beta regression assumes a very specific distribution, and the transformation to [0,1] is way too arbitrary. I would use semiparametric regression, e.g., the proportional odds ordinal logistic regression model, with no binning of Y, and no pre-transformation of Y. This would also handle an arbitrary amount of ties in Y.


Thanks for your reply, Dr. Harrell. That was one of the models I was considering after reading about it in your RMS textbook, but I’ve never worked with it before. A few issues/questions I have:

  1. The model fits a unique intercept for every unique value of Y, of which HUI3 has “972,000 unique health states”. Is that a potential issue with the rms package?
  2. The dataset we have may not actually have a row for every unique state. Does that mean the model would never predict a state that does not exist in the development dataset for example in a new individual in the test set?
  3. Do you have an idea of what link function we would use? I think logit would be out since HUI3 has negative values.
  4. Our objective after the study is to implement the algorithm in a web tool, meaning that we would need to implement it in JavaScript. Your RMS textbook provides a formula for calculating the mean Y (shown below). Is there an example online of doing this calculation by hand? Or is there code you can point to in your rms package where you implement this?

Thanks a lot for taking the time to reply!

Formula for calculating mean Y given X in the RMS textbook:

\sum_{i=1}^{n} y_i \, \widehat{\mathrm{Prob}}[Y = y_i \mid X]


The pertinent number of distinct Y values to consider is the number of distinct observed values. If you have more than, say, 6000 observed distinct values you can do a tiny amount of binning to make things run. From the model fit you will not be able to predict the probability of having a Y value that was not observed, or you could say you predict that the probability is 0.0.
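
One possible way to do a "tiny amount of binning" is to round Y to progressively fewer decimal places until the number of distinct values is manageable. This is only a sketch of one assumed approach (rounding), not a prescribed method; the function name and the toy data are hypothetical, and the 6000 default simply echoes the rough figure mentioned above.

```python
def bin_until_manageable(y, max_distinct=6000, start_decimals=5):
    """Round y to fewer decimals until it has at most max_distinct distinct values."""
    decimals = start_decimals
    binned = [round(v, decimals) for v in y]
    # Coarsen one decimal place at a time; stop at whole numbers.
    while len(set(binned)) > max_distinct and decimals > 0:
        decimals -= 1
        binned = [round(v, decimals) for v in y]
    return binned, decimals


# Toy data with an artificially small limit so the effect is visible.
toy_y = [0.111, 0.112, 0.113, 0.121, 0.97, 1.0]
binned, used_decimals = bin_until_manageable(toy_y, max_distinct=4, start_decimals=3)
```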

The link function applies to the exceedance probability scale, and probabilities are always in [0,1], so any link function can possibly work. Plotting stratified empirical CDFs under various link transformations may guide you in selecting the link that yields parallelism.
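
A minimal sketch of that parallelism check, under assumptions: for two strata, compute the empirical exceedance probability P(Y ≥ c) at several cut points, apply a candidate link transformation, and look at the gap between strata. If the gap is roughly constant across cut points, the transformed curves are roughly parallel. The data, cut points, and helper names below are invented for illustration, and in practice one would plot the curves rather than only compare numbers.

```python
import math
from statistics import NormalDist

# Candidate transformations applied to an exceedance probability p in (0, 1).
LINKS = {
    "logit": lambda p: math.log(p / (1.0 - p)),
    "probit": lambda p: NormalDist().inv_cdf(p),
    "cloglog": lambda p: math.log(-math.log(1.0 - p)),
}


def exceedance(values, cut):
    """Empirical P(Y >= cut) within one stratum."""
    return sum(v >= cut for v in values) / len(values)


def link_gaps(group_a, group_b, cuts, link):
    """Difference between strata on the transformed scale at each cut point."""
    g = LINKS[link]
    gaps = []
    for c in cuts:
        pa, pb = exceedance(group_a, c), exceedance(group_b, c)
        # Skip cut points where a proportion is 0 or 1 (transform undefined).
        if 0.0 < pa < 1.0 and 0.0 < pb < 1.0:
            gaps.append(g(pa) - g(pb))
    return gaps


# Hypothetical HUI3 values for two strata of a covariate.
group_a = [0.2, 0.5, 0.7, 0.9, 0.97, 1.0, 1.0, 1.0]
group_b = [-0.1, 0.1, 0.3, 0.5, 0.7, 0.9, 0.97, 1.0]
cuts = [0.3, 0.6, 0.95]

# A smaller spread (max gap minus min gap) suggests closer-to-parallel curves.
spreads = {}
for name in LINKS:
    gaps = link_gaps(group_a, group_b, cuts, name)
    spreads[name] = max(gaps) - min(gaps)
```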

The Mean function in the rms package (full name Mean.orm) will derive an R function that computes the mean given X\hat{\beta}.
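
For a by-hand version of the mean calculation, the arithmetic can be sketched as follows. This is not the code inside Mean.orm. It assumes a logit link and the parameterization Prob[Y ≥ y_j | X] = expit(α_j + Xβ̂), with one intercept for every distinct Y value except the lowest. Cell probabilities come from differencing adjacent exceedance probabilities, and the mean is the probability-weighted sum of the distinct Y values. All names and numbers are hypothetical.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def mean_from_ordinal_fit(y_levels, intercepts, lp):
    """Mean of Y given the linear predictor lp = X * beta-hat.

    y_levels   : distinct observed Y values, sorted ascending (k values)
    intercepts : alpha_2..alpha_k, one per level above the lowest (k - 1 values)
    """
    if len(intercepts) != len(y_levels) - 1:
        raise ValueError("need one intercept per Y level above the lowest")
    # Exceedance probabilities P(Y >= y_j); P(Y >= lowest) = 1, and 0 past the top.
    exceed = [1.0] + [expit(a + lp) for a in intercepts] + [0.0]
    # Cell probabilities P(Y = y_j) by differencing adjacent exceedance values.
    probs = [exceed[j] - exceed[j + 1] for j in range(len(y_levels))]
    mean = sum(y * p for y, p in zip(y_levels, probs))
    return mean, probs


# Toy fit with three distinct Y values and made-up intercepts.
y_levels = [-0.36, 0.5, 1.0]
intercepts = [0.0, -math.log(3)]
mean_y, cell_probs = mean_from_ordinal_fit(y_levels, intercepts, lp=0.0)
```

Because this uses only exp, subtraction, and a weighted sum, it should port to JavaScript directly once the fitted intercepts, coefficients, and distinct Y values are exported from R.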