Sample size requirements for a proportional odds model

mwebb · March 12, 2021, 4:38pm

When I’m fitting a (binary) logistic regression model, sample size is a big concern for the stability of the model. Specifically, the number of events (or non-events, whichever is smaller) needs to be large enough and the number of events in each level of the categorical variables needs to be adequate.

Since the proportional odds models is a extension of logistic regression, it seems that adequate sample size in each level of the outcome variable would be important but difficult to achieve. Is this a valid concern for the proportional odds model?

f2harrell · March 12, 2021, 8:54pm

In a way it’s not a concern. The proportional odds (PO) model works fine for continuous Y with no ties, i.e., every category has only one observation. For an ordinal Y with k categories and category i having n_i observations, the effective sample size of Y is given by the formula below. Effective sample size = equivalent statistical information/efficiency in a sample of a given size from a continuous Y.

n - \frac{1}{n^{2}} \sum_{i=1}^{k} n_{i}^{3}

See this article for several examples of statistical information for various distributions of Y.

For sample size and power calculations see the nonparametric statistics chapter in BBR.

S_Chakraborty · March 13, 2021, 11:17am

The proportional odds (PO) model works fine for continuous Y with no ties, i.e., every category has only one observation.

If we were to use PO model to determine the impact of a covariate (X) on a certain variable (Y) where Y is a count with a relatively narrow range (say between 8 -18). It is likely that around half of the observations will have a Y of around 17 and the remaining will be dispersed with clumping at 8, 10, and 13. In this can we can consider Y to be an interval type of variable. Is PO a good bet or would a model like Poisson be better ?

f2harrell · March 13, 2021, 12:27pm

Any semiparametric model (including the PO model) is likely to fit better than Poisson or other highly parametric models. Semiparametric models allow for arbitrarily many ties at any Y value. Think of the limiting case. When there are only two distinct Y values, the PO model just reduces to binary logistic regression. PO models allow for floor and ceiling effects, bimodality, and any other thing regarding the distribution of Y at a single covariate setting.

S_Chakraborty · March 14, 2021, 3:34pm

I am trying to simulate possible sample sizes for a proportional odds model using the posamsize function available in the Hmisc package. The model will look at a Y (dependant variable) that has 11 possible levels. After a round of going around I finally came across a simple way to generate a probability distribution for these 11 levels (probability of the number of observations for each level) using the Dirichlet distribution (stackexchange answer). The code I am using is as follows

library(Hmisc)
library(tidyverse)
library(gtools)

df <- as.data.frame(rdirichlet(2000,alpha=c(1:11)))
df <-df %>% 
  rowwise(.) %>% 
  mutate(samplesize=posamsize(c_across(V1:V11),odds.ratio=2.0,power=0.8,alpha=0.05)$n)

This generates a data frame with 11 columns and 2000 rows where each column corresponds to the levels of the Y. The row sums are always 1. The sample size using this method seems to range between 198 - 205. This implies that the method is generating probabilities at each level which are quite narrow (for example in V11 the range is between 0.04 - 0.34 and V1 is 0 - 0.13)

I have previously tried a naive method using expand.grid to accommodate all possible combinations of probabilities ranging between 0 - 0.7 at 0.1 increments but ran out of memory.

The method does demonstrate what the BBR notes state that the frequency of observations in each category of Y does not influence that sample size that much - I have noted the largest sample size occurs when there are extreme distributions - e.g. one category gets 95% and another one gets 5% and rest are 0.

I would like to know if the method of simulation I am adopting is correct and if a better method is available that will show sample size for the extreme probability distributions of Y

Edit I modified the simulation code as below:

set.seed(100)

df = NULL

for (k in 1:200) {
  a <- as.data.frame(rdirichlet(100,alpha=floor(runif(n=11,min=0,max=10))))
  b <- as.data.frame(rdirichlet(100,alpha=floor(runif(n=11,min=1,max=3))))
  c <- as.data.frame(rdirichlet(100,alpha=floor(runif(n=11,min=1,max=11))))
  d <- as.data.frame(rdirichlet(100,alpha=floor(runif(n=11,min=1,max=1))))
  df <- rbind(df,a,b,c,d)
}

df <-df %>% 
  rowwise(.) %>% 
  mutate(samplesize=posamsize(c_across(V1:V11),odds.ratio=2.5,power=0.8,alpha=0.05)$n)

f2harrell · March 14, 2021, 4:24pm

Santam it would be good to lay out the design characteristics you have in mind. It looks as if you are trying to do a sample size calculation that takes into account uncertainty in the distribution of Y for control subjects. That’s a good goal because we usually pretend that the Y distribution estimated from previous studies is estimated without error. The sample size range you are seeing is narrow enough that I think you are not factoring in enough uncertainty. I don’t have deep enough working knowledge of the Dirichlet distribution to know exactly what it’s doing, and how the concentration parameter should be chosen.

In a simpler setting where the outcome is binary is time to a binary event, several authors have shown how to take uncertainty in the treatment effect into account. Some references on Bayesian sample size calculations may be found in this section of my Bayes course: Introduction to Bayes for Evaluating Treatments - 9 Bayesian Clinical Trial Design .

I do think uncertainties in the proportional odds model intercepts have been addressed in the literature, so I’m glad you’re working on this.

S_Chakraborty · March 15, 2021, 1:19am

Frank,
I have given some details of the design in the post Cross sectional study design and analysis technique for evaluating the impact of COVID 19 on radiotherapy quality of care indicators. Briefly my goal is to analyze the impact of the treatment period on the number of Quality Indicators adhered to.
For each treatment course, we will therefore have a count of QI adhered to - for some it may be 19, while others it may be 0. Realistically, however, it is likely the number will lie between 18 to 8. This is because some of the QIs are always adhered to in our setting (and unfortunately one is never).

I started out by doing a sample size calculation using the posamsize function with probability vectors like this : (note that the values in the vector represent the proportion of the treatment courses that will fall in each of the 11 possible counts)

p <- c(0.9,0,0,0,0,0,0,0.1,0,0,0) #Sample size = 415.5
q <- c(0.8,0,0,0,0,0,0,0.2,0,0,0) #Sample size = 233.7
r <- c(0.7,0,0,0,0,0,0,0.2,0.1,0,0) #Sample Size = 178.1

and so on but this quickly became tedious.

I noted though that the sample size increased as we had more extreme values of the probability and that it did not matter where the values were placed in the vector. For example:

x <- c(0.7,0.0,0.01,0.02,0.03,0,0.03,0,0.2,0,0.01) # Sample Size = 172.9
y <- c(0.6,0.0,0.01,0.02,0.03,0,0.03,0,0.3,0,0.01) # Sample Size = 148.9

As I have outlined in my last post I tried to use the expand grid function to generate probability but ran out of memory. Essentially I realized I needed a set of 11 probabilities where the sum would equal 1. So that is where the Dirichlet function came in (after some searching in StackExchange).

I put this up here as I also feel that the narrow range of sample sizes I am getting is unrealistic. However, I know that it is also unrealistic that we will have a probablity distribution like p for example. Most likely we will have something like x and y but what exactly will come I have no idea. Hence I decided to simulate this using random numbers.

After some playing around with the rdirichlet function I realized that the alpha parameter influences how the probablity - hence I decided to put random numbers in the alpha parameter in the code above.

After further searching I came across the RandVec which does the same thing - generate a set of vectors of values with a fixed sum (in this case 1).

library(Surrogate)
library(data.table)
library(Hmisc)
x <- c(0.6,0.0,0.01,0.02,0.03,0,0.03,0,0.3,0,0.01)
posamsize(x,odds.ratio=2.5)

l <- RandVec(a=0,b=1,s=1,n=11,m=100)$RandVecOutput
l <- as.data.frame(l)
l1 <- l %>% transpose(.)

l1 <- l1 %>%
  rowwise() %>% 
  mutate(ss=posamsize(c_across(V1:V11),odds.ratio=2.5,power=0.8,alpha=0.05)$n)

The resultant sample sizes are essentially similar.
I would love to learn more about how we should actually do these sample size calculations from you Frank.

Link to the protocol for the study being drafted is here : Link to protocol.

f2harrell · March 15, 2021, 12:25pm

Thanks for the additional information. I hope someone else can add to this discussion as I don’t have sufficient experience in doing realistic power calculations for ordinal Y. I am sure that your approach is better than just fixing the vector of probabilities. My only idea right now is to fix the concentration parameter at a series of values and for each value simulate probabilities and separately for each of the 11 simulated probabilities compute its 0.025 and 0.975 quantiles. Choose the concentration that yields the right amount of uncertainty by your subjective judgment.