Checking the assumptions of ordinal models according to the RMS book

I am trying to develop an ordinal model, following the recommendations of the book Regression Modeling Strategies (Chapter 15).
The aim is to predict quality of life (EORTC ordinal scale, range 0–100) after breast cancer surgery. In particular, it would be relevant to identify women with good results, e.g., P(QoL > 75).
I have 217 observations.
I have built a model using some key variables, and I want to test the assumptions of several possible models.

I have obtained this plot, which is nowhere near as beautiful as the one in the book:

I have doubts about whether the log-log link is reasonable here. On the other hand, the output shows the slopes for several links, but I don’t know how to get the slopes for the loglog link (rather than the cloglog). I also think it would be useful to expand here on how these plots are interpreted.

This is the code:

f <- ols(qol ~ surgery + rcs(age, 3) + pol(EORTC), data=qol)
qolf <- fitted(f)

p <- function(fun, row, col) {
  f <- substitute(fun); g <- function(F) eval(f)
  z <- Ecdf(~ qol, groups=cut2(qolf, g=5),
            fun=function(F) g(1 - F),
            ylab=as.expression(f), xlim=c(20, 100), data=qol,
            label.curve=FALSE)
  print(z, split=c(col, row, 2, 2), more=row < 2 | col < 2)
}
p(log(F/(1-F)), 1, 1)
p(qnorm(F), 1, 2)
p(-log(-log(F)), 2, 1)
p(log(-log(1-F)), 2, 2)
# Get slopes of qolf for some cutoffs of Y
# Use glm complementary log-log link on Prob(Y < cutoff) to
# get log-log link on Prob(Y >= cutoff)
r <- NULL
for (link in c("logit", "probit", "cloglog"))
  for (k in c(40, 50, 60, 70, 80, 90)) {
    co <- coef(glm(qol < k ~ qolf, data=qol, family=binomial(link)))
    r <- rbind(r, data.frame(link=link, cutoff=k,
                             slope=round(co[2], 2)))
  }
print(r, row.names=FALSE)

… and the slopes:
> print(r, row.names=FALSE)
    link cutoff slope
   logit     40 -0.15
   logit     50 -0.13
   logit     60 -0.13
   logit     70 -0.15
   logit     80 -0.13
   logit     90 -0.18
  probit     40 -0.08
  probit     50 -0.08
  probit     60 -0.08
  probit     70 -0.09
  probit     80 -0.08
  probit     90 -0.09
 cloglog     40 -0.13
 cloglog     50 -0.11
 cloglog     60 -0.10
 cloglog     70 -0.10
 cloglog     80 -0.07
 cloglog     90 -0.07
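
If I read the comment in the code correctly, the “cloglog” rows above are effectively the slopes on the log-log scale of Prob(Y ≥ k), because a cloglog fit to Prob(Y < k) is the same as a log-log fit to Prob(Y ≥ k). A quick check of that equivalence (a sketch; qol and qolf as above, with k = 75 as an example cutoff):

k <- 75
# cloglog link on Prob(qol < k): the slope lives on log(-log(Prob(qol >= k))),
# i.e., the log-log scale for the event of interest
coef(glm(qol <  k ~ qolf, data=qol, family=binomial("cloglog")))
# a direct cloglog link on Prob(qol >= k), for comparison
coef(glm(qol >= k ~ qolf, data=qol, family=binomial("cloglog")))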


Estimating a distribution with the empirical cumulative distribution function takes at least 200 observations. I think you are seeing a lot of sampling variation, plus a fairly low predictive signal. To learn more about that, repeat the process by stratifying a uniform(0,1) random number covariate vector into quintiles and computing the ECDFs of your existing dependent variable.
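
For example (a sketch reusing the p() helper defined in the code above; qolf is overwritten with pure noise, so any spread between the quintile curves now reflects sampling variation alone):

set.seed(1)
qolf <- runif(nrow(qol))   # uniform(0,1) noise in place of predictions
p(log(F/(1-F)), 1, 1)
p(qnorm(F), 1, 2)
p(-log(-log(F)), 2, 1)
p(log(-log(1-F)), 2, 2)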

I will soon read this chapter and maybe then I can comment, although I don’t know R. I’m not sure what you mean when you say:

“The slopes for straight lines in the figure for each distribution”

I used a random variable as a covariate by changing these lines in the code example above.
qol$x1 <- runif(nrow(qol), 0, 1)   # pure-noise covariate
f <- ols(qol ~ x1, data=qol)
qolf <- fitted(f)

The result is as follows. What is the solution for choosing a distribution in noisy samples?

In small or noisy samples there are only two choices I know of:

  • Fit a Bayesian model that allows relaxed assumptions and put priors on the “relaxed” part to move the fit towards a simpler model. This is far and away the most rational approach, but is more complex.
  • Fit a standard semiparametric model and hope that the assumptions hold. It could be worse; you could be using a parametric model.

I’m learning; I’ve never used ordinal models before, and I’m a bit muddled about the terminology. I understand that the proportional odds model is the same as the cumulative logit link model. I understand that the other cumulative link models also require parallelism (not linearity), just on a different scale, and I think that’s why you don’t call them “proportional odds models”. I understand that all of them are semiparametric.
If what I said above is not wrong, I wonder whether the plot of the constancy of distances between symbols, used as a check of parallel slopes for PO models, can be extended to the rest of the cumulative link models.
For example, with the code:
f4 <- orm(qol ~ surgery + pol(age) + pol(FUN) + pol(SYM) + educ,
          family=loglog, data=qol, x=TRUE, y=TRUE)

sf <- function(y)
  c("Y ≥ 65" = log(-log(mean(y >= 65))),
    "Y ≥ 75" = log(-log(mean(y >= 75))),
    "Y ≥ 85" = log(-log(mean(y >= 85))))

s <- summary(qol ~ surgery + age + FUN + SYM + educ, fun=sf, data=qol)

plot(s, which=1:3, pch=1:3, xlab="loglog", vnames="names",
     main="", xlim=c(-4, 4))

I obtain:

My reasoning is that since most patients have a good quality of life (an asymmetric distribution), I may be able to justify the use of the loglog link on that basis. On the other hand, the plot shows a result that seems acceptable, at least for the higher percentiles of quality of life, which is where most women are. Does that sound right?

You’re exactly right that you just need to change the link function before plotting the cumulative proportions, and this method applies widely. Don’t choose a link, though, on the basis of the distribution of outcomes overall or in a given covariate group. The link specifies that you are assuming parallelism of link(CDF) for that link.


You mean don’t specify the link function of the model after checking the result, but choose it beforehand and then check the assumption without changing it?
Or do you mean that all link functions assume PO, and therefore this plot does not help to select the best approach?
My idea was that, since I have too much noise to choose the best link function, I would select it for clinical reasons (more patients at the high extreme of quality of life, which is what I am most interested in predicting, for which the log-log could be better), and then simply check the PO assumption with that plot. Is that OK?

No, I mean that the data don’t have the information to allow you to reliably specify the link function, so you have to go with gut instinct. By default I use the logit because odds ratios are easy to interpret; I can’t interpret coefficients from probit (inverse normal) links very well. A compromise: try two drastically different link functions (logit and log-log) and choose the one that globally fits better, if you can assess that.
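
A minimal sketch of such a comparison, reusing the model formula from f4 above (this assumes AIC/logLik apply to orm fits as they do to other rms fits; a lower AIC suggests a better global fit):

f.logit  <- orm(qol ~ surgery + pol(age) + pol(FUN) + pol(SYM) + educ,
                data=qol)                      # logit is the default link
f.loglog <- orm(qol ~ surgery + pol(age) + pol(FUN) + pol(SYM) + educ,
                family=loglog, data=qol)
c(logit = AIC(f.logit), loglog = AIC(f.loglog))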


My gut feeling suggests the loglog link. On the other hand, I have seen in RMS that there is quite an interesting connection between this model and the Cox PH model; indeed, the coefficients are similar. I wonder how to interpret the coefficients of a loglog model.
For example, radical mastectomy is associated with a worse quality of life, with a coefficient of 0.45, both in the ordinal loglog model and in the Cox model. Drawing a parallel with the survival model, I wonder whether the exponentiated coefficient of an ordinal loglog model means the same thing, or even whether it can be called a hazard ratio. Is it possible to formulate an intuitive parallel between the interpretation of the exponentiated coefficients in the two models?

Another thing I am curious about is whether you can use the frequencies of the responses to choose the link in ordinal models. For example, if the ordinal response goes from 0 to 100 and most responses are above 75, the log(-log(x)) function might offer more resolution, since it is asymmetric and has more range near 1. The opposite might happen with the cloglog link, perhaps better in theory when the responses are concentrated in the low range.

I don’t know if it would be correct to use this as an a priori criterion to select the link function, which would depend on the exact parameterization of the orm function in the rms package.

The link function has absolutely nothing to do with that. The link function is only for linking. Linking means linking one type of subject with another type of subject. It’s all relative. You are trying to think in absolutes.


OK, I think I see what you mean. Thank you.
Could you please explain the interpretation of the (exponentiated) coefficient in a proportional hazards model with a loglog link…

The only interpretation I know is the discrete hazard model interpretation: the ratio, between two groups, of the probabilities that Y = y given Y ≥ y. Not so appealing when time is not involved, perhaps.


For example, in my (loglog) proportional hazards model the response variable is quality of life, ordered from 0 to 100% (the best). Radical surgery has a coefficient of -0.49. Would the way to express this be that the intervention prevents a given quality-of-life level, or an even better QoL, with a hazard ratio of exp(-0.49) = 0.61 at each step? Is that correct? Can it be expressed more elegantly?

No, that’s not it. A problem with the discrete hazard log-log model is that it deals with probabilities of equalities, not probabilities of ≥. If Y = 1, 2, 3, 4, then the ratio you get for comparing group A with group B is the ratio of the odds that, e.g., Y = 3 given Y = 3 or 4, or that Y = 2 given Y = 2, 3, or 4. It may be best to just get predicted probabilities and quote those.
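
One way to get those, a sketch using ExProb() from rms with the f4 fit from earlier (f4 was fit with x=TRUE, y=TRUE; if I read the rms help correctly, ExProb builds a function returning Prob(Y ≥ y) from linear predictor values):

ex <- ExProb(f4)          # exceedance probability function for the orm fit
lp <- predict(f4)         # linear predictors for the women in the dataset
ex(lp, y=75)              # estimated Prob(QoL >= 75) for each observation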

I do as much as I can to stick with proportional odds because of interpretation. To relax the PO assumption you can use the Peterson–Harrell partial proportional odds model, as implemented in the R VGAM package and in the new rms blrm function (Bayesian models) to be released soon.
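
For reference, a sketch of what the VGAM version might look like (hypothetical specification for this thread’s variables; parallelism is relaxed only for surgery, and with a nearly continuous 0–100 outcome the response would first need to be grouped into a modest number of ordered levels):

library(VGAM)
library(Hmisc)                      # for cut2
qol$qolg <- cut2(qol$qol, g=5)      # group QoL into 5 ordered categories
g <- vglm(ordered(qolg) ~ surgery + age,
          family=cumulative(parallel = FALSE ~ surgery, reverse=TRUE),
          data=qol)                 # reverse=TRUE models Prob(Y >= j)
summary(g)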
