Comparing like variables in logistic regression -- variable selection issue vs. model selection issue?

mkvdp · March 23, 2020, 5:45pm

Hi there,

I’m working on a project in which I have been tasked with determining which of four specific variables are most associated with/predictive of a binary outcome variable (after adjusting for a set of confounding variables), which represents whether or not these patients have a specific clinical condition. The four variables reflect two different indexes (index type 1 and 2) of a separate clinical condition, with Xray and MRI versions of both indexes:

var_1 = Xray index type 1 (scale=0.5-2.0)
var_2 = MRI index type 1 (scale=0.5-2.0)
var_3 = Xray index type 2 (scale=0.0-2.5)
var_4 = MRI index type 2 (scale=0.0-2.5)

Basically, the question is: Of the four methods of measuring clinical condition 1, which is most associated with condition 2?

Because all four indexes measure condition 1, they are fairly highly correlated with each other (r = 0.5 - 0.7). My main question is whether all four variables should be entered into one model controlling for a set of confounders, or whether I should run four separate models, one per variable (adjusting for the same confounders), and compare them using AIC or something similar. I know that if they are entered into one model together, the interpretation of each variable of interest will be different (i.e., holding var_2, var_3 and var_4 constant, var_1’s specific contribution is X). Is that the best way of answering my question? If so, how should I deal with the multicollinearity issue?

Thanks in advance.

f2harrell · March 23, 2020, 6:00pm

As detailed in my RMS book and course notes it is far better to formulate a single model and stick with it. In this case I would

1. Fit a model with all 4 variables and compute the likelihood ratio \chi^2 chunk test for the importance of the whole group
2. Run a variable clustering algorithm (e.g. R Hmisc package varclus function) to identify collinearities among the 4
3. Run chunk tests of sets of these variables that are collinear and hence are competing with each other, to let them join forces instead of competing
4. Depending on the last 2 steps, don’t show individual p-values for competing variables among the 4
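
As a rough illustration of the steps above (not code from this thread), here is a minimal base-R sketch. The data are simulated, the single confounder and all coefficients are assumptions, and `hclust` on squared Spearman correlations is used only as a stand-in for the Hmisc `varclus` function mentioned in step 2. A "chunk test" here means one multi-degree-of-freedom test of a whole group of variables, obtained by comparing the model with and without that group.

```r
# Simulated data: names mirror the thread, but every value is made up.
set.seed(1)
n <- 245
z <- rnorm(n)                      # shared latent severity, so the 4 indexes correlate
mydata <- data.frame(
  var_1  = z + rnorm(n),
  var_2  = z + rnorm(n),
  var_3  = z + rnorm(n),
  var_4  = z + rnorm(n),
  conf_1 = rnorm(n)                # one hypothetical confounder
)
mydata$PI <- rbinom(n, 1, plogis(-0.5 + 0.5 * z + 0.3 * mydata$conf_1))

# Step 1: one pre-specified logistic model, and the same model without the 4 indexes
full    <- glm(PI ~ var_1 + var_2 + var_3 + var_4 + conf_1,
               family = binomial, data = mydata)
reduced <- update(full, . ~ . - var_1 - var_2 - var_3 - var_4)

# Chunk test: a single 4 d.f. likelihood ratio chi-square for the whole group
chunk_all <- anova(reduced, full, test = "LRT")
print(chunk_all)

# Step 2 (stand-in for varclus): cluster the indexes on squared Spearman correlation
rho2 <- cor(mydata[, c("var_1", "var_2", "var_3", "var_4")], method = "spearman")^2
clus <- hclust(as.dist(1 - rho2), method = "complete")
print(clus$merge)

# Step 3: chunk test for one competing subset, here (arbitrarily) the two MRI indexes
no_mri    <- update(full, . ~ . - var_2 - var_4)
chunk_mri <- anova(no_mri, full, test = "LRT")
print(chunk_mri)
```

Which subsets to test in step 3 would come from the clustering result in real data; the MRI pair is chosen here only to show the mechanics.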

mkvdp · March 23, 2020, 9:11pm

Thank you for your reply. Could you possibly elaborate on what you mean by “chunk tests”? That is a term I am not familiar with. Also, I should mention that I am running log-binomial models (Poisson with log link, robust standard errors, and independence correlation structure) because I’m not working with rare events. The way to do this in R is using the gee or geepack packages, which don’t appear to give me the typical regression output (no likelihood ratio statistics):

# geeglm output
Call:
geeglm(formula = PI ~ var_1 + var_2 + var_3 + 
    var_4 + conf_1 + conf_2 + conf_3, family = (poisson(link = "log")), 
    data = mydata, id = mr, corstr = "independence")

 Coefficients:
            Estimate  Std.err  Wald Pr(>|W|)    
(Intercept) -6.70180  1.06382 39.69  3.0e-10 ***
var_1        0.71228  0.55935  1.62    0.203    
var_2        0.83986  0.44909  3.50    0.061 .  
var_3        0.03722  0.42114  0.01    0.930    
var_4       -0.72873  0.24596  8.78    0.003 ** 
conf_1       0.03413  0.00628 29.58  5.4e-08 ***
conf_2      -0.00856  0.00611  1.96    0.161    
conf_3       0.78448  0.19683 15.89  6.7e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = independence 
Estimated Scale Parameters:

Number of clusters:   243  Maximum cluster size: 2 

# Anova results:
Analysis of 'Wald statistic' Table
Model: poisson, link: log
Response: PI
Terms added sequentially (first to last)

       Df   X2 P(>|Chi|)    
var_1   1 26.7   2.4e-07 ***
var_2   1  6.4    0.0114 *  
var_3   1  0.1    0.8215    
var_4   1 10.0    0.0015 ** 
conf_1  1 33.7   6.3e-09 ***
conf_2  1  1.3    0.2473    
conf_3  1 15.9   6.7e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Again, thank you for your advice. I sincerely appreciate it.

f2harrell · March 23, 2020, 10:28pm

Do you have multiple observations per person? If so please expand on the design. Otherwise I don’t see the need for GEE. Also why can’t you handle rare binary outcomes with a regular logistic model?

mkvdp · March 23, 2020, 10:55pm

Sorry - the observations are on shoulders. Two participants had both shoulders meeting criteria for inclusion, with the remaining 241 participants only having one shoulder meeting inclusion. So there are 245 shoulders total, and 243 actual participants total.

In my post above, I meant that the outcome/condition I’m looking at is not rare (38% of shoulders have it). Because of this, I was getting inflated ORs and 95% CIs (such as OR = 8.6 [95% CI, 2.3 to 64.5]) when I was using logistic regression with a logit link. As this paper demonstrates, log-binomial models are necessary for more accurate estimates when the outcome is common (>10% of the sample). As the author notes, GLMs with family=binomial and link=log rarely converge, so he suggests running the analysis using a modified log-Poisson approach:

library(gee)
pglm <- summary(gee(binaryoutcome ~ exposure,
                    family = poisson(link = "log"),
                    id = participantID, corstr = "independence"))

Of course, I’m more than open to suggestions in case there’s an alternative approach you would recommend. An epidemiologist colleague of mine pointed me in the direction of the log-binomial model.

f2harrell · March 23, 2020, 11:28pm

I don’t think this is correct. You must be thinking of the risk ratio being a poor approximation to the odds ratio when the outcome is rare. The odds ratio works fine for all levels of risk. There is nothing wrong with the logistic model at all in your case. And the repeated observations are so rare that you can ignore them and do a plain old logistic model. Log binomial models are for getting risk ratios. Risk ratios are not recommended. Log binomial models are seldom needed.
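
A small arithmetic illustration of the distinction, using hypothetical risks that do not come from this thread: when the outcome is common the odds ratio is numerically further from 1 than the risk ratio, but it is a different measure on a different scale, not an inflated estimate of the same thing. When the outcome is rare the two nearly coincide.

```r
# Hypothetical risks in an unexposed and an exposed group (assumed values).
odds <- function(p) p / (1 - p)

# Common outcome: risks of 0.30 vs 0.60
rr_common <- 0.60 / 0.30                 # risk ratio
or_common <- odds(0.60) / odds(0.30)     # odds ratio

# Rare outcome: risks of 0.01 vs 0.02
rr_rare <- 0.02 / 0.01
or_rare <- odds(0.02) / odds(0.01)

print(round(c(rr_common = rr_common, or_common = or_common,
              rr_rare = rr_rare, or_rare = or_rare), 3))
```

With these assumed risks the risk ratio is 2 in both settings, while the odds ratio is 3.5 for the common outcome and about 2.02 for the rare one.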

The paper you cited should have been in a psychology journal because it’s really about cognitive psychology, i.e., what happens when people misinterpret things. This isn’t statistics. Stick with odds ratios and you’ll avoid trouble.

mkvdp · March 24, 2020, 12:11am

I hear what you’re saying; however, no, I wasn’t thinking of the RR being a poor approximation to the OR when the outcome is rare. I was stating the opposite, that RRs are recommended when the prevalence of the outcome is greater than or equal to 10% of the full sample.

If you’re open to it, here’s a resource with more citations that have been used to justify using log-binomial models when binary Y is common. This particular article is frequently cited.

Again, thank you for taking the time to help me. Your suggestions are very helpful.

f2harrell · March 24, 2020, 1:46am

This is not correct. The RR has a mathematical constraint so that probabilities stay in [0,1]. For example, a risk ratio of 2 cannot apply to a base risk > 0.5. The RR is also highly dependent on the choice of the reference event vs. non-event class (switch the reference for an OR and you’ll just get the reciprocal of the original OR). RRs are not recommended in general and certainly not when base risk is higher. The use of log binomial models here is solving a problem that doesn’t exist and results in models with false interactions that exist just to keep probabilities in [0,1]. Choose a measure that does not have range restrictions. You have to realize why logistic models are so popular. The logit has no range restriction and does not induce false interactions just to make up for constraints. The convergence problem you mentioned is a clue about the underlying problem.
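
The two properties described above can be checked with a few lines of arithmetic. This is only a sketch with assumed numbers (a base risk of 0.6, and risks of 0.2 vs 0.5 for the reference-switching part), not an analysis of the data in this thread.

```r
odds <- function(p) p / (1 - p)

# 1. Range restriction: apply a ratio of 2 to an assumed base risk of 0.6
base_risk <- 0.6
risk_from_rr <- 2 * base_risk                    # exceeds 1, so not a valid probability
new_odds     <- 2 * odds(base_risk)
risk_from_or <- new_odds / (1 + new_odds)        # always stays inside (0, 1)

# 2. Switching the outcome from "event" to "non-event" (assumed risks 0.2 vs 0.5)
p0 <- 0.2
p1 <- 0.5
or_event    <- odds(p1) / odds(p0)
or_nonevent <- odds(1 - p1) / odds(1 - p0)       # exactly the reciprocal of or_event
rr_event    <- p1 / p0
rr_nonevent <- (1 - p1) / (1 - p0)               # not the reciprocal of rr_event

print(c(risk_from_rr = risk_from_rr, risk_from_or = risk_from_or))
print(c(or_event = or_event, or_nonevent = or_nonevent,
        rr_event = rr_event, rr_nonevent = rr_nonevent))
```

Here the risk ratio of 2 implies a "risk" of 1.2, while the odds ratio of 2 gives a risk of 0.75. For the second part, the odds ratios are 4 and 0.25, whereas the risk ratios are 2.5 and 0.625.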

These misunderstandings persist in some regions of epidemiology, unfortunately. I have lots of problems with the second article you cited, which I’ve personally communicated to Donna.