Categorizing Continuous Variables

In this paper, which predicts the risk of cruciate ligament rupture (CR) from age at neutering in Labrador Retrievers, the following result is presented:

Risk of CR was increased in dogs neutered before 12 months of age (OR = 11.38; P = .01). Neutering before 6 months of age was not a significant factor (P = .17), nor was neutering between 6 and 12 months of age (OR = 3.11; P = .23). Overall neutering was also not a risk factor (OR = 1.8; P = .27).

It makes no sense that risk is increased in dogs neutered before 12 months of age when it is not increased for neutering at less than 6 months or between 6 and 12 months. Is this an issue with dichotomizing the age at neutering? Unfortunately the paper provides very little data to evaluate.
cruciaterupture.pdf (648.8 KB)

1 Like

This is a good example of misleading subgroup statistics after arbitrary categorization. Every analysis should start with (1) a high-resolution histogram of the data (here, age) to check regions of support, and (2) a smooth, non-overfitted relationship with uncertainty bands (using splines, nonparametric smoothers, fractional polynomials, etc.). To safeguard interpretations, the uncertainty bands should be simultaneous compatibility (confidence) intervals.
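
A minimal sketch of that workflow in R with the rms package, assuming a data frame d with a binary CR indicator cr and a continuous age at neutering neuter_age (both names hypothetical):

    require(rms)
    dd <- datadist(d); options(datadist = "dd")

    # (1) check regions of support for the continuous predictor
    hist(d$neuter_age, breaks = 50)

    # (2) smooth, low-dimensional fit with simultaneous uncertainty bands
    f <- lrm(cr ~ rcs(neuter_age, 4), data = d)
    p <- Predict(f, neuter_age, conf.type = "simultaneous")
    plot(p)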

1 Like

Can categorizing a continuous variable create Simpson’s Paradox?

Not sure but I think it’s possible that a form of the paradox could happen. The issue with the so-called paradox is the failure to condition on other relevant variables, and forcing something to be linear is similar to having omitted variables.

2 Likes

That is pretty weird. I don’t see a Data Availability statement in the paper, but what’s the ethos in the veterinary research community? Can you reach out to them to request data?

I think I might, but it doesn't seem to be routine.

I would encourage that! Just thinking about this in terms of dummy variables, intuitively it doesn't seem possible that the OR for the sum of 2 dummies (≤6 mos + 6–12 mos = ≤12 mos) wouldn't be some kind of weighted average of the ORs for the individual dummies.
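
A toy calculation (made-up counts, not the paper's data) illustrating why: for unadjusted comparisons against the same reference group, the pooled <12-month odds are a mixture of the two sub-category odds, so the pooled OR has to land between the two sub-category ORs:

    # hypothetical counts: events / total per neuter-age group
    counts <- data.frame(
      group  = c("<6mo", "6-12mo", ">=12mo"),
      events = c(8, 12, 10),
      n      = c(100, 80, 300)
    )
    odds   <- counts$events / (counts$n - counts$events)
    or_sub <- odds[1:2] / odds[3]                  # ORs for <6mo and 6-12mo vs. >=12mo
    pooled <- colSums(counts[1:2, c("events", "n")])
    or_pool <- (pooled["events"] / (pooled["n"] - pooled["events"])) / odds[3]
    round(c(or_sub, pooled = unname(or_pool)), 2)  # e.g. 2.52 and 5.12, pooled ~3.6 (in between)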

Dear @f2harrell,

Multiple studies in my field of interest—sciatica (leg ± back pain caused by a lumbar disc herniation)—have shown that the baseline dominant pain location, whether dichotomized as leg pain > back pain (yes/no) or categorized into leg pain > back pain, leg pain = back pain, leg pain < back pain, holds some predictive value for outcomes.

I’m exploring ways to model this relationship continuously and would greatly appreciate your advice.

Using the publicly available dataset from this RCT, I attempted to predict leg pain intensity at 26 weeks through several approaches:

  1. Separate modeling of leg and back pain at baseline:
    ols(leg_pain ~ rcs(leg_pain0, 5) + rcs(back_pain0, 5), data = d)
    Adjusted R²: 0.031

  2. Incorporating the dichotomized variable (leg pain > back pain: yes/no):
    ols(leg_pain ~ rcs(leg_pain0, 5) + legdom, data = d)
    Adjusted R²: 0.037

  3. Using the difference between baseline leg pain and back pain:
    ols(leg_pain ~ rcs(leg_pain0, 5) + rcs(paind, 5), data = d)
    Adjusted R²: 0.051
    (Correlation between leg_pain0 and paind: 0.34)

These results suggest that the relationship between back and leg pain may add predictive value. Would you recommend simply modeling the difference (e.g., leg pain - back pain) or exploring alternative approaches to capture this relationship more effectively?

Thank you in advance for your time and insights!

1 Like

If pain intensity is measured at more time points than 26w you may get a more sensitive analysis by analyzing longitudinally. To stick to 26w, do the residuals have a normal distribution with constant variance with respect to varying baseline values? Or do you need to use ordinal regression? To your main point, I would phrase the question as this: What is the best model for how baseline leg and back pain (continuous) relate to later leg pain (continuous)? If a continuous relationship seems to have a discontinuity with flatness on either side of the discontinuity, then you would validate what’s in the literature. That seldom happens. Focus on estimating the 3-D relationship between the variables, allowing for interaction between baseline leg and back pain.
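
For the residual checks, a rough sketch (assuming the first ols() fit of 26-week leg pain above, refit with x = TRUE, y = TRUE so residuals and the design matrix are stored):

    f <- ols(leg_pain ~ rcs(leg_pain0, 5) + rcs(back_pain0, 5), data = d,
             x = TRUE, y = TRUE)
    r <- resid(f)
    plot(predict(f), r); abline(h = 0)         # constant variance vs. predicted values?
    plot(f$x[, "leg_pain0"], r); abline(h = 0) # constant variance vs. baseline leg pain?
    qqnorm(r); qqline(r)                       # approximate normality?
    # if the assumptions look doubtful, a semiparametric ordinal fit is an option:
    # orm(leg_pain ~ rcs(leg_pain0, 5) + rcs(back_pain0, 5), data = d)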

When you want to add an interaction in the model, a key question is how many knots to put on the two individual variables. More knots means better fit but possibly more noise and too many interaction terms. Consider comparing AICs for these models:

  • y = x1 + x2 + x1*x2
  • y = rcs(x1, k) + rcs(x2, k) + rcs(x1, k) %ia% rcs(x2, k), for k = 3, 4, 5

For the model with the best AIC, do a chunk test for all the interaction terms (automatic with anova.rms) to gauge evidence for their importance. Plot the 3-D relationship with a wireframe, contour, or heatmap image.
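
One rough way to set that up with rms, assuming outcome y and baseline predictors x1 and x2 in a data frame d (placeholder names):

    require(rms)
    dd <- datadist(d); options(datadist = "dd")

    fits <- list(linear = ols(y ~ x1 * x2, data = d))
    for (k in 3:5) {
      fits[[paste0("rcs", k)]] <-
        ols(y ~ rcs(x1, k) + rcs(x2, k) + rcs(x1, k) %ia% rcs(x2, k), data = d)
    }
    sapply(fits, AIC)                     # compare; smaller AIC is better

    best <- fits[[which.min(sapply(fits, AIC))]]
    anova(best)                           # chunk test for all interaction terms
    bplot(Predict(best, x1, x2))          # image/contour display of the fitted surface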

1 Like

Interesting to consider what the anatomical correlates would be. Does disc herniation evolve in any typical way? Are there even distinct categories [gasp!] of evolution courses? Also, are there ‘global’ patterns of pain (with less distinct localization) that would introduce error-in-variables issues with the arbitrary categorization into ‘>’ vs ‘<’? (In that case, the clinical assessment might amount to ‘model averaging’.)

1 Like