Categorizing Continuous Variables

In this paper, which predicts the risk of cruciate ligament rupture (CR) from age at neutering in Labrador Retrievers, the following result is presented:

Risk of CR was increased in dogs neutered before 12 months of age (OR = 11.38; P = .01). Neutering before 6 months of age was not a significant factor (P = .17), nor was neutering between 6 and 12 months of age (OR = 3.11; P = .23). Overall neutering was also not a risk factor (OR = 1.8; P = .27).

It makes no sense that risk is increased in dogs neutered before 12 months of age when it is not increased for neutering at less than 6 months or between 6 and 12 months. Is this an issue with dichotomizing the age at neutering? Unfortunately the paper provides very little data to evaluate.
cruciaterupture.pdf (648.8 KB)

1 Like

This is a good example of misleading subgroup statistics after arbitrary categorization. Every analysis should start with (1) a high-resolution histogram of the data (here, age) to check regions of support, and (2) a smooth, non-overfitted relationship with uncertainty bands (using splines, nonparametric smoothers, fractional polynomials, etc.). To safeguard interpretations, the uncertainty bands should be simultaneous compatibility (confidence) intervals.
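
A minimal sketch of that workflow in R with the rms package, assuming a data frame d with a binary CR indicator cr and a continuous age at neutering neuter_age (both names hypothetical):

    require(rms)
    dd <- datadist(d); options(datadist = "dd")

    # (1) check regions of support for the continuous predictor
    hist(d$neuter_age, breaks = 50)

    # (2) smooth, low-dimensional fit with simultaneous uncertainty bands
    f <- lrm(cr ~ rcs(neuter_age, 4), data = d)
    p <- Predict(f, neuter_age, conf.type = "simultaneous")
    plot(p)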

1 Like

Can categorizing a continuous variable create Simpson’s Paradox?

Not sure but I think it’s possible that a form of the paradox could happen. The issue with the so-called paradox is the failure to condition on other relevant variables, and forcing something to be linear is similar to having omitted variables.

2 Likes

That is pretty weird. I don’t see a Data Availability statement in the paper, but what’s the ethos in the veterinary research community? Can you reach out to them to request data?

I think I might, but it doesn't seem to be routine.

I would encourage that! Just thinking about this in terms of dummy variables, intuitively it doesn't seem possible that the OR for the sum of 2 dummies (≤6 mos + 6–12 mos = ≤12 mos) wouldn't be some kind of weighted average of the ORs for the individual dummies.
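
A toy calculation (made-up counts, not the paper's data) illustrating why: for unadjusted comparisons against the same reference group, the pooled <12-month odds are a mixture of the two sub-category odds, so the pooled OR has to land between the two sub-category ORs:

    # hypothetical counts: events / total per neuter-age group
    counts <- data.frame(
      group  = c("<6mo", "6-12mo", ">=12mo"),
      events = c(8, 12, 10),
      n      = c(100, 80, 300)
    )
    odds   <- counts$events / (counts$n - counts$events)
    or_sub <- odds[1:2] / odds[3]                  # ORs for <6mo and 6-12mo vs. >=12mo
    pooled <- colSums(counts[1:2, c("events", "n")])
    or_pool <- (pooled["events"] / (pooled["n"] - pooled["events"])) / odds[3]
    round(c(or_sub, pooled = unname(or_pool)), 2)  # e.g. 2.52 and 5.12, pooled ~3.6 (in between)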

Dear @f2harrell,

Multiple studies in my field of interest—sciatica (leg ± back pain caused by a lumbar disc herniation)—have shown that the baseline dominant pain location, whether dichotomized as leg pain > back pain (yes/no) or categorized into leg pain > back pain, leg pain = back pain, leg pain < back pain, holds some predictive value for outcomes.

I’m exploring ways to model this relationship continuously and would greatly appreciate your advice.

Using the publicly available dataset from this RCT, I attempted to predict leg pain intensity at 26 weeks through several approaches:

  1. Separate modeling of leg and back pain at baseline:
    ols(leg_pain ~ rcs(leg_pain0, 5) + rcs(back_pain0, 5), data = d)
    Adjusted R²: 0.031

  2. Incorporating the dichotomized variable (leg pain > back pain: yes/no):
    ols(leg_pain ~ rcs(leg_pain0, 5) + legdom, data = d)
    Adjusted R²: 0.037

  3. Using the difference between baseline leg pain and back pain:
    ols(leg_pain ~ rcs(leg_pain0, 5) + rcs(paind, 5), data = d)
    Adjusted R²: 0.051
    (Correlation between leg_pain0 and paind: 0.34)

These results suggest that the relationship between back and leg pain may add predictive value. Would you recommend simply modeling the difference (e.g., leg pain - back pain) or exploring alternative approaches to capture this relationship more effectively?

Thank you in advance for your time and insights!

1 Like

If pain intensity is measured at more time points than 26w you may get a more sensitive analysis by analyzing longitudinally. To stick to 26w, do the residuals have a normal distribution with constant variance with respect to varying baseline values? Or do you need to use ordinal regression? To your main point, I would phrase the question as this: What is the best model for how baseline leg and back pain (continuous) relate to later leg pain (continuous)? If a continuous relationship seems to have a discontinuity with flatness on either side of the discontinuity, then you would validate what’s in the literature. That seldom happens. Focus on estimating the 3-D relationship between the variables, allowing for interaction between baseline leg and back pain.
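
For the residual checks, a rough sketch (assuming the first ols() fit of 26-week leg pain above, refit with x = TRUE, y = TRUE so residuals and the design matrix are stored):

    f <- ols(leg_pain ~ rcs(leg_pain0, 5) + rcs(back_pain0, 5), data = d,
             x = TRUE, y = TRUE)
    r <- resid(f)
    plot(predict(f), r); abline(h = 0)         # constant variance vs. predicted values?
    plot(f$x[, "leg_pain0"], r); abline(h = 0) # constant variance vs. baseline leg pain?
    qqnorm(r); qqline(r)                       # approximate normality?
    # if the assumptions look doubtful, a semiparametric ordinal fit is an option:
    # orm(leg_pain ~ rcs(leg_pain0, 5) + rcs(back_pain0, 5), data = d)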

When you want to add an interaction in the model, a key question is how many knots to put on the two individual variables. More knots means better fit but possibly more noise and too many interaction terms. Consider comparing AICs for these models:

  • y = x1 + x2 + x1*x2
  • y = rcs(x1, k) + rcs(x2, k) + rcs(x1, k) %ia% rcs(x2, k), for k = 3, 4, 5

For the model with the best AIC, do a chunk test for all the interaction terms (automatic with anova.rms) to gauge evidence for their importance. Plot the 3-D relationship with a wireframe, contour, or heatmap image.
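
One rough way to set that up with rms, assuming outcome y and baseline predictors x1 and x2 in a data frame d (placeholder names):

    require(rms)
    dd <- datadist(d); options(datadist = "dd")

    fits <- list(linear = ols(y ~ x1 * x2, data = d))
    for (k in 3:5) {
      fits[[paste0("rcs", k)]] <-
        ols(y ~ rcs(x1, k) + rcs(x2, k) + rcs(x1, k) %ia% rcs(x2, k), data = d)
    }
    sapply(fits, AIC)                     # compare; smaller AIC is better

    best <- fits[[which.min(sapply(fits, AIC))]]
    anova(best)                           # chunk test for all interaction terms
    bplot(Predict(best, x1, x2))          # image/contour display of the fitted surface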

1 Like

Interesting to consider what the anatomical correlates would be. Does disc herniation evolve in any typical way? Are there even distinct categories [gasp!] of evolution courses? Also, are there ‘global’ patterns of pain (with less distinct localization) that would introduce error-in-variables issues with the arbitrary categorization into ‘>’ vs ‘<’? (In that case, the clinical assessment might amount to ‘model averaging’.)

1 Like