In this paper, which predicts the risk of cruciate ligament rupture (CR) based on time of neutering in Labrador Retrievers, the following result is presented:
Risk of CR was increased in dogs neutered before 12 months of age (OR = 11.38; P = .01). Neutering before 6 months of age was not a significant factor (P = .17), nor was neutering between 6 and 12 months of age (OR = 3.11; P = .23). Overall neutering was also not a risk factor (OR = 1.8; P = .27).
It makes no sense that the risk is increased in dogs neutered before 12 months of age but not in dogs neutered before 6 months or between 6 and 12 months. Is this an issue with dichotomizing the age at neutering? Unfortunately, the paper provides very little data to evaluate this.
cruciaterupture.pdf (648.8 KB)
This is a good example of misleading subgroup statistics after arbitrary categorization. Every analysis should start with (1) a high-resolution histogram of the data (here, age) to check regions of support, and (2) a smooth, non-overfitted relationship with uncertainty bands (using splines, nonparametric smoothers, fractional polynomials, etc.). To safeguard interpretations the uncertainty bands should be simultaneous compatibility (confidence) intervals.
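As a rough illustration of step (1), here is a minimal Python sketch (not from the paper or this thread) that tabulates age at neutering month by month and flags months with no dogs. The ages are made up, and `month_histogram` and `sparse_months` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical ages at neutering, in months. NOT data from the paper.
ages_months = [4, 5, 5, 6, 6, 6, 7, 8, 11, 12, 12, 14, 18, 24, 24, 30, 36, 48]


def month_histogram(ages, lo, hi):
    """Count dogs at each whole month from lo to hi (inclusive)."""
    counts = Counter(ages)
    return {month: counts.get(month, 0) for month in range(lo, hi + 1)}


def sparse_months(hist, min_n=1):
    """Months with fewer than min_n dogs, i.e. little or no support."""
    return [month for month, n in hist.items() if n < min_n]


hist = month_histogram(ages_months, 0, 12)
for month, n in hist.items():
    print(f"{month:>2} | {'#' * n}")
print("months with no dogs:", sparse_months(hist))
```

If most dogs cluster in a few months, cutpoints at 6 and 12 months compare groups whose composition the reader cannot see, which is the kind of problem the histogram is meant to expose.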
Can categorizing a continuous variable create Simpson's Paradox?
Not sure, but I think it's possible that a form of the paradox could happen. The issue with the so-called paradox is the failure to condition on other relevant variables, and forcing something to be linear is similar to having omitted variables.
That is pretty weird. I don't see a Data Availability statement in the paper, but what's the ethos in the veterinary research community? Can you reach out to them to request data?
I think I might, but it doesn't seem to be routine.
I would encourage that! Just thinking about this in terms of dummy variables, intuitively it doesn't seem possible that the OR for the sum of 2 dummies (≤6mos + 6-12mos = ≤12mos) wouldn't be some kind of weighted average of the ORs for the individual dummies.
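One way this intuition can fail is if the dummies do not share a reference group. The excerpt does not say how the paper coded them, so the following is only a sketch under that assumption. It uses made-up counts, crude (unadjusted) odds ratios, and Python rather than whatever software the authors used. The group labels and the `odds_ratio` helper are hypothetical.

```python
# Hypothetical (cases, non-cases) per group. NOT taken from the paper.
counts = {
    "lt6": (5, 5),     # neutered before 6 months
    "6to12": (7, 5),   # neutered between 6 and 12 months
    "ref": (5, 45),    # neutered later or intact
}


def odds(groups):
    """Pooled odds of rupture across the listed groups."""
    cases = sum(counts[g][0] for g in groups)
    noncases = sum(counts[g][1] for g in groups)
    return cases / noncases


def odds_ratio(exposed, reference):
    """Crude odds ratio of the exposed groups against the reference groups."""
    return odds(exposed) / odds(reference)


# Coding 1: every dummy is compared with the same reference group.
or_lt6_common = odds_ratio(["lt6"], ["ref"])
or_6to12_common = odds_ratio(["6to12"], ["ref"])
or_lt12 = odds_ratio(["lt6", "6to12"], ["ref"])

# Coding 2: each dummy is compared with "all other dogs".
or_lt6_rest = odds_ratio(["lt6"], ["6to12", "ref"])
or_6to12_rest = odds_ratio(["6to12"], ["lt6", "ref"])

print("common reference:", or_lt6_common, or_6to12_common, or_lt12)
print("vs all others:   ", or_lt6_rest, or_6to12_rest, or_lt12)
```

With a common reference the pooled OR (about 10.8) lies between the two subgroup ORs (about 9 and 12.6), as the weighted-average intuition suggests. When each subgroup is instead compared with all other dogs, the subgroup ORs drop to about 4.2 and 7.0, both below the pooled OR, because each comparison group is contaminated by the other high-risk subgroup. Whether this is what happened in the paper cannot be told from the excerpt.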
Dear @f2harrell,
Multiple studies in my field of interest, sciatica (leg ± back pain caused by a lumbar disc herniation), have shown that the baseline dominant pain location, whether dichotomized as leg pain > back pain (yes/no) or categorized into leg pain > back pain, leg pain = back pain, leg pain < back pain, holds some predictive value for outcomes.
I'm exploring ways to model this relationship continuously and would greatly appreciate your advice.
Using the publicly available dataset from this RCT, I attempted to predict leg pain intensity at 26 weeks through several approaches:
- Separate modeling of leg and back pain at baseline:
  ols(leg_pain ~ rcs(leg_pain0, 5) + rcs(back_pain0, 5), data = d)
  Adjusted R²: 0.031
- Incorporating the dichotomized variable (leg pain > back pain: yes/no):
  ols(leg_pain ~ rcs(leg_pain0, 5) + legdom, data = d)
  Adjusted R²: 0.037
- Using the difference between baseline leg pain and back pain:
  ols(leg_pain ~ rcs(leg_pain0, 5) + rcs(paind, 5), data = d)
  Adjusted R²: 0.051
  (Correlation between leg_pain0 and paind: 0.34)
These results suggest that the relationship between back and leg pain may add predictive value. Would you recommend simply modeling the difference (e.g., leg pain - back pain) or exploring alternative approaches to capture this relationship more effectively?
Thank you in advance for your time and insights!
If pain intensity is measured at more time points than 26 weeks, you may get a more sensitive analysis by analyzing longitudinally. To stick to 26 weeks, do the residuals have a normal distribution with constant variance with respect to varying baseline values? Or do you need to use ordinal regression? To your main point, I would phrase the question as this: What is the best model for how baseline leg and back pain (continuous) relate to later leg pain (continuous)? If a continuous relationship seems to have a discontinuity with flatness on either side of the discontinuity, then you would validate what's in the literature. That seldom happens. Focus on estimating the 3-D relationship between the variables, allowing for interaction between baseline leg and back pain.
When you want to add an interaction in the model, a key question is how many knots to put on the two individual variables. More knots means better fit but possibly more noise and too many interaction terms. Consider comparing AICs for these models:
- y = x1 + x2 + x1*x2
- y = rcs(x1, k) + rcs(x2, k) + rcs(x1, k) %ia% rcs(x2, k) for k = 3, 4, 5
For the model with the best AIC, do a chunk test for all the interaction terms (automatic with anova.rms) to gauge evidence for their importance. Plot the 3-D relationship with a wireframe, contour, or heatmap image.
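To make the knot trade-off concrete, here is a small Python sketch (not rms code) that counts interaction terms. It assumes a restricted cubic spline with k knots contributes k - 1 columns (one linear, k - 2 nonlinear) and that a restricted interaction drops the products of two nonlinear columns. The `rcs_basis` function is my reading of the truncated-power form with the (last knot - first knot)² normalization, so treat it as an approximation of what `rcs` computes rather than a reference implementation. All function names and the knot locations are hypothetical.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline columns for one value x: [x, nonlinear terms...]."""
    t = knots
    k = len(t)

    def pos3(u):
        # Truncated cubic: (u)+^3
        return max(u, 0.0) ** 3

    norm = (t[-1] - t[0]) ** 2      # scaling so the columns stay on x's scale
    gap = t[-1] - t[-2]
    cols = [x]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[-2]) * (t[-1] - t[j]) / gap
                + pos3(x - t[-1]) * (t[-2] - t[j]) / gap)
        cols.append(term / norm)
    return cols


def n_interaction_terms(k, restricted):
    """Number of product terms between two k-knot splines."""
    full = (k - 1) ** 2
    doubly_nonlinear = (k - 2) ** 2
    return full - doubly_nonlinear if restricted else full


for k in (3, 4, 5):
    print(k, "knots:",
          n_interaction_terms(k, restricted=False), "full vs",
          n_interaction_terms(k, restricted=True), "restricted interaction terms")

print(rcs_basis(6.0, [1, 3, 5, 7, 9]))
```

Under these assumptions, going from 3 to 5 knots raises the full interaction from 4 to 16 terms but the restricted one only from 3 to 7, which is why the restricted form is attractive when comparing AICs across k.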
Interesting to consider what the anatomical correlates would be. Does disc herniation evolve in any typical way? Are there even distinct categories [gasp!] of evolution courses? Also, are there "global" patterns of pain (with less distinct localization) that would introduce errors-in-variables issues with the arbitrary categorization into ">" vs "<"? (In that case, the clinical assessment might amount to "model averaging".)