Percentage as a measure of a predictor variable

trumanfrancis · February 22, 2024, 1:55pm

What is the best method for handling a predictor variable that is measured as a percentage? In this case the percentage of days that an individual receives an intervention.

f2harrell · February 23, 2024, 1:21pm

I don’t see why that would be treated any differently from other continuous variables, e.g., using splines or quadratics. It’s strange to see a dependent variable as an independent variable though.

trumanfrancis · February 23, 2024, 4:08pm

The variable is the administration of a legal medication to Thoroughbred racehorses. The researchers want to see if the administration of furosemide to horses on the day that they race predicts or is associated with time to retirement. They want to use the percentage of times the horse receives furosemide as the x variable and have suggested that it be divided into quartiles as a predictor. The hypothesis is that the horses in the highest quartile would have longer careers. I thought as you suggest it could be a continuous variable from 0,1. The data comes from an historical electronic database.

f2harrell · February 24, 2024, 1:06pm

Grouping a continuous variable is always a bad idea, and grouping by quantiles is even worse. Grouping by quantiles says that you believe that how many horses there are like a given horse is important in predicting the one horse’s outcome. Quantiles are used when you want to create a competition between individuals as in grading on the curve.

trumanfrancis · February 24, 2024, 3:36pm

Many thanks for your input. I am wondering if they could provide individual data on the horses and whether they receive furosemide or not for each start. In that case perhaps a model where treatment is a time varying exposure would be more useful.

f2harrell · February 25, 2024, 12:45pm

Yes; it’s almost always the case that a proportion computed on longitudinal data is better analyzed with a longitudinal binary model. A simple approach that allows time-dependent covariates is a Markov binary logistic model which is a simpler case of this.

technocrat · February 26, 2024, 6:57pm

I have a naive question: Although a percentage is a continuous variable as a matter of form, could the numerator be more important? Take as an example Secretariat, who had 21 races and lived to age 19. Assuming a standard dosage and frequency, he would have had a lower lifetime exposure than Whitmore, who had 40 starts and is currently 11. And Whitmore is a gelding, while Secretariat had a long second career as a stud. If your database is rich enough, you might benefit from approaching this as a problem that would benefit from directed acyclic graph analysis to better stratify?

f2harrell · February 27, 2024, 12:44pm

This also goes back to the previous question. Analyze the rawest form of longitudinal data.

trumanfrancis · June 21, 2024, 4:53pm

Would it be possible to implement a joint longitudinal model with the exposure being a count variable (number of doses of lasix). Joint modeling of time‐varying exposure history and health outcomes: Identification of critical windows - Wagner - 2020 - Alzheimer’s & Dementia - Wiley Online Library