I have a binary response, which I would like to model in terms of (some subset of) six candidate predictor variables, so the natural choice was to fit a binary logistic regression model. I have two related questions.
Question 1 (model selection):
In order to identify the appropriate predictors to include in the regression model, I decided to perform an exhaustive search for the best subsets of the candidate predictor variables.
There is a total of 2^6 = 64 potential regression models, so for each of these models, I tested the global goodness-of-fit using the le Cessie–van Houwelingen–Copas–Hosmer unweighted sum of squares test.
Literature (including Harrell’s “Regression Modeling Strategies”) generally suggests that the le Cessie–van Houwelingen–Copas–Hosmer unweighted sum of squares test should be preferred to the orthodox Hosmer–Lemeshow test, so I decided to use that. Five of these potential models had a p-value > 0.95.
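For reference, this is roughly how I computed the test for each candidate model (a minimal sketch assuming the rms package; d and the variable names are placeholders):

```r
# Minimal sketch (d, x1, x2, x3 are placeholder names)
library(rms)

fit <- lrm(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)

# type = "gof" gives the le Cessie-van Houwelingen-Copas-Hosmer
# unweighted sum of squares test (statistic, Z value and p-value)
residuals(fit, type = "gof")
```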
I decided to present these five best-fitted models according to this criterion in a table (sorted by the p-value in descending order).
The first one (with three explanatory variables) has a p-value of 0.993. The second one (with two explanatory variables) has a p-value of 0.989. However, the Akaike information criterion (AIC) suggests that the second model should be preferred to the first one, since the AIC is 811.31 for the first and 749.17 for the second.
As far as I know, the standard approach would be simply to use AIC for model selection. Does the le Cessie–van Houwelingen–Copas–Hosmer unweighted sum of squares test have any particular advantages over AIC, and does my approach look reasonable?
Question 2 (assessment of model fit):
For binary logistic regression models, would it generally be sufficient to report p-values of the le Cessie–van Houwelingen–Copas–Hosmer unweighted sum of squares test for assessment of model fit? As mentioned in Harrell’s book, there are no distributional assumptions whatsoever for binary logistic regression models. Could I perhaps also include bootstrap overfitting-corrected lowess nonparametric calibration curves?
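For reference, the kind of calibration curve I have in mind (a sketch assuming the rms package; placeholder names as above):

```r
# Bootstrap overfitting-corrected calibration curve (placeholder names)
library(rms)

fit <- lrm(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)
cal <- calibrate(fit, method = "boot", B = 200)   # 200 bootstrap repetitions
plot(cal)                                         # smooth nonparametric calibration curve
```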
There are two issues going on. First, there’s the question of whether variable selection should even be done or you should pre-specify the model. I think you should do the latter. Variable selection can harm prediction and it definitely harms all other aspects of statistical inference. Second, if you were to really want to do variable selection, goodness of fit statistics are not the way to compare models. A log-likelihood-based index is the gold standard, including AIC.
For your second question, I prefer directed goodness-of-fit tests, e.g. tests of linearity and additivity. But the le Cessie et al test is a good one to report at the end of modeling, although not by itself. More important is the overfitting-corrected smooth calibration curve as you described.
Thank you very much for taking the time to answer my question! Unfortunately, I do not believe pre-specification of the model is an option in this case, since I do not have sufficient subject matter knowledge.
If I decide to use AIC for comparison and compute this criterion for each model, the “best model” (i.e. the model with the smallest AIC value) includes five of the six candidate explanatory variables, and surprisingly, all of them turn out to be significant in this new model. However, the le Cessie–van Houwelingen–Copas–Hosmer unweighted sum of squares test yields a p-value of 0.0057, which indicates quite a bad fit. So if I use AIC for model selection, I end up with a model that is bad according to the goodness-of-fit test. That sounds like a contradiction to me. Would you still argue that this model is the best one, since it has the smallest AIC value?
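For concreteness, the exhaustive AIC comparison I ran looks roughly like this (a sketch with placeholder names; the intercept-only model is omitted for brevity):

```r
# Sketch of the exhaustive AIC comparison (d and x1 ... x6 are placeholder names)
vars <- c("x1", "x2", "x3", "x4", "x5", "x6")
subsets <- unlist(lapply(1:6, function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)

aics <- sapply(subsets, function(v) {
  f <- reformulate(v, response = "y")          # build y ~ ... for this subset
  AIC(glm(f, family = binomial, data = d))
})

subsets[[which.min(aics)]]   # predictor subset with the smallest AIC
```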
Moreover, the calibration curve does not look particularly good to me:
I checked VIFs for the six candidate explanatory variables to be sure there was no multicollinearity issue, but they did not warrant further investigation (the maximum VIF is 1.54).
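The check was along these lines (placeholder names as before):

```r
# VIF check on the full model (placeholder names)
library(rms)

full <- lrm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = d)
vif(full)   # variance inflation factors; values near 1 indicate little collinearity
```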
The full model with 6 predictors is a superset of all of the models you are entertaining. So the model specification is automatically handled for you. Put all 6 variables in the model and be done.
From what I understand, adding more explanatory variables will always improve the fit of the model, but there is a trade-off between goodness of fit and complexity of the model. And I believe the fact that AIC is a criterion for balancing fit with complexity makes it appropriate for model selection?
Finding the right balance is indeed challenging, but I would like to build a more parsimonious model to prevent overfitting. Although I could include all six predictors, I am not convinced that it would be advantageous. How do you know that the full model is a valid and robust model?
Some people claim that modeling is an art, not a science, but could you elaborate on why you would suggest including all of them, even if the AIC increases, the p-value from the le Cessie–van Houwelingen–Copas–Hosmer unweighted sum of squares test only barely exceeds 0.05, and the new variable turns out not to be significant in the model? I do realize that the variable explains something in the response variable, since the coefficient is not literally zero, but couldn’t you keep on adding variables ad infinitum?
In fact, even more predictor variables were measured, but for this specific analysis, I decided to consider six of them as candidate predictor variables, since only these can be used in production in a sensible manner.
Besides making it easier to interpret and communicate the results, wouldn’t using a subset also prevent overfitting?
My book covers this in great detail. Here are the main issues:
- Parsimony and Occam’s razor only provide better results if the parsimonious model is completely pre-specified. For your purpose, AIC would only be useful if you compared two completely pre-specified models.
- Any attempt to force parsimony that comes from examining relationships between X and Y creates only the illusion of benefit and usually makes predictions worse.
- The data do not contain enough information for selecting the “right” model with any decent probability. The situation becomes worse if predictors are correlated with each other.
- Keeping non-significant variables in the model, if not using something like the lasso, is the only way to get accurate standard errors and accurate residual variance estimates.
- Your full model is so simple that trying to make it simpler seems to be an especially bad use of anyone’s time.
Look at my RMS course notes to see an example where I use the bootstrap to obtain 0.95 confidence limits for the importance rankings of predictors. Even in a high signal:noise example, the confidence intervals are wide. This is a good way to show the difficulty of the task before you.
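A rough sketch of the idea (not the exact code from the course notes; d and x1 ... x6 are placeholder names):

```r
# Bootstrap confidence intervals for predictor importance ranks (illustrative only)
library(rms)

rank_once <- function(data) {
  f <- lrm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = data)
  a <- anova(f)                 # partial Wald chi-square for each predictor
  rank(a[1:6, "Chi-Square"])    # higher rank = apparently more important
}

set.seed(1)
ranks <- replicate(500, rank_once(d[sample(nrow(d), replace = TRUE), ]))
apply(ranks, 1, quantile, probs = c(0.025, 0.975))   # 0.95 intervals for each rank
```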
Variable selection/model selection is not as valuable as model specification.
Thank you for your suggestions. As mentioned above, I do not have sufficient subject matter knowledge to pre-specify the model in this case. However, the idea of using logistic lasso regression for variable selection seems interesting to me.
If I use logistic lasso regression with 10-fold cross-validation, is there any way to assess the performance using calibration plots? Or could I use some measures of discrimination?
I used the function cv.glmnet from the R package glmnet to fit a lasso regression model, but an object of class cv.glmnet is returned instead of a glm object. When I tried to use the calibrate function from your rms package to produce a calibration plot, it says that fit must be from lrm or ols.
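Roughly what I have so far (a minimal sketch; x is my numeric predictor matrix and y the 0/1 outcome; the lowess curve at the end gives only an apparent calibration curve, not the overfitting-corrected one that rms::calibrate produces):

```r
# Lasso fit plus a naive apparent calibration plot (x, y as described above)
library(glmnet)

cvfit <- cv.glmnet(x, y, family = "binomial", nfolds = 10)
phat  <- as.vector(predict(cvfit, newx = x, s = "lambda.min", type = "response"))

plot(phat, y, xlab = "Predicted probability", ylab = "Observed outcome")
lines(lowess(phat, y, iter = 0), col = "blue")   # nonparametric calibration curve
abline(0, 1, lty = 2)                            # ideal 45-degree line
```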
I realised that no standard errors were provided, but it seems that there is no consensus on a statistically valid method of calculating standard errors for the lasso predictions. So they are probably not provided for a reason.
While I could easily fit a generalized linear model using the glm function or the lrm function from your package including only the variables chosen by the lasso method, I am not sure if that is a reasonable thing to do. Nevertheless, I suspect that might be what they did in this paper:
Variable selection and internal validation of the model were performed using the “glmnet” and the “rms” packages, respectively.
I could probably also report the p-value from the le Cessie–van Houwelingen–Copas–Hosmer unweighted sum of squares test, but again, I would need some lrm or ols object. Would it be OK to fit a generalized linear model using the glm function or lrm function including the variables chosen by the lasso method and then refer to it as “a logistic lasso regression model”?
You missed the main point of my earlier posts. If you don’t have sufficient subject matter knowledge to pre-specify the model, just fit the superset of all the models. You seem to be pulled by some force that makes you want to delete apparently non-contributing variables. There is no need to succumb to that force. The data cannot tell you reliably which variables are contributing.
You have hit on a limitation of frequentist penalized regression, as you don’t have a lot of inference to back up lasso or other penalized maximum likelihood estimation methods.
You also seem to be stuck on mixing goodness-of-fit tests with measures of model performance. Ignore goodness of fit for the moment.
It is completely inappropriate to use one method to select variables, then use unpenalized regression to fit the subset of variables. This approach does not inherit the correct amount of shrinkage.
If you use the lasso appropriately you don’t need to worry too much about calibration. But you do need to use the bootstrap to see if the lasso is selecting a stable list of variables. But again, don’t waste your time with variable selection in your case!
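A rough sketch of what I mean by checking stability (x and y as in your glmnet call; purely illustrative, not a full validation):

```r
# How often does the lasso keep each predictor across bootstrap resamples?
library(glmnet)

set.seed(1)
selected <- replicate(200, {
  i <- sample(nrow(x), replace = TRUE)
  cvfit <- cv.glmnet(x[i, ], y[i], family = "binomial")
  drop(as.matrix(coef(cvfit, s = "lambda.min")))[-1] != 0   # drop the intercept
})
rowMeans(selected)   # proportion of bootstrap samples in which each predictor is kept
```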
Now is the time to go back and read my book Regression Modeling Strategies where this problem is covered in detail.
Above all, remember that supervised learning (using Y) to select variables only seems to make the model smaller. There are phantom degrees of freedom that continue to haunt the analysis, and major issues with model uncertainty. Smaller models are not better.
I would be interested to know where you learned the idea that model selection is a good idea. You seem to have trouble shaking that idea. You must have had a convincing teacher who did not understand model uncertainty.
I agree with Frank’s comments, but would also like to respond to this:
I note you previously said this also:
So to me that implies you already applied your subject matter knowledge to choose the 6 variables from “even more predictor variables”. If that is the case, then these are the variables to include and report, in line with your a priori subject matter knowledge. As you state it, there is no a priori justification or reason to reduce the number of variables further, and so attempting to interrogate the data for such a reason is incompatible with the act of choosing those 6 variables from “even more” in the first place. In fact, not reporting any of those 6 variables chosen with your subject matter knowledge would be incorrect to my mind.
Nicely stated. I missed that. Note that it is imperative that the variables that were not chosen to be in the final six must not have been discarded because of any analysis that utilized Y.
Thank you very much for taking the time to share your thoughts on the matter. I realise that I applied some sort of subject knowledge to select the six variables, but whether or not they actually have a statistically significant effect on the response is another matter. That is what I wanted to investigate. However, admittedly, I am not sure why it is incorrect to apply methods, such as backward elimination, forward selection, bidirectional elimination, and best subsets regression, all using AIC. These methods, by the way, suggest the same subset of variables.
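For clarity, this is the kind of AIC-based procedure I mean (a minimal sketch; d and the variable names are placeholders, and the base step() function is used here only as an illustration):

```r
# AIC-based bidirectional stepwise search (placeholder names; illustrative only)
full <- glm(y ~ x1 + x2 + x3 + x4 + x5 + x6, family = binomial, data = d)
step(full, direction = "both", trace = 0)   # adds/drops terms to minimize AIC
```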
If these are generally bad, why would they be introduced in courses on regression? One such course is STAT 501 Regression Methods from PennState Eberly College of Science. Stepwise regression and best subsets regression are introduced here. And from what I understand, they are quite popular.
Since I did not really understand why the statement “For your purpose, AIC would only be useful if you compared two completely pre-specified models” was true, I discussed the matter with a friend of mine, who recently finished his PhD in multivariate statistical modelling. His comment was: “It is only his opinion.”
He mentioned that “there are a lot of strategies for model selection” and that “there is no consensus on which one should be used for a particular data analysis.”
While he mentioned that no strategy is perfect, he also mentioned that it does not mean they are not useful:
“Of course, there are critics of this strategy, however, all other strategies are as subjective as the one that you used. There are some strategies that in general work better and consequently are more popular.”
One concern I had was that the le Cessie–van Houwelingen–Copas–Hosmer unweighted sum of squares test indicated a poor fit. However, I am not interested in prediction (classification). I am interested in the regression coefficients, and consequently, the interpretation of the effects of the explanatory variables.
My friend mentioned that the low predictive power just means that there are a lot of other explanatory variables that I did not measure that also affect the response variable. Thus, I think it may be safe to ignore that p-value.
The fact that a huge number of methods that are known not to work are still taught in stat courses is a bit out of my control. I wrote the book Regression Modeling Strategies to try to correct the situation. As I’ve tried to say politely, your entire premise is ill-advised. Your situation is particularly ill-suited for variable selection because of the low dimensionality of your setting. And do you really have trouble interpreting regression coefficients that are small? You seem to be stuck on these issues and are not truly interested in advice. So it’s best to end the discussion here. If you become facile at doing simulations to support any ill-advised methods that you want to use you will soon learn what a disaster some methods are. If you don’t want to trust theory or simulations, it’s best that you work by yourself.
Note that variable selection using AIC is identical to variable selection using p-values, just with a higher nominal type I error than usual (about 0.157 instead of 0.05 for a single parameter). So AIC doesn’t solve any problem caused by stepwise methods, and if the number of candidate models exceeds 4, AIC is not reliable in choosing the “right” model.
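To see where that threshold comes from in the one-parameter case:

```r
# For one added parameter, AIC = -2*logLik + 2*df prefers the larger model
# exactly when its likelihood-ratio chi-square exceeds 2. The implied "alpha" is:
1 - pchisq(2, df = 1)   # ~ 0.157
```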
Very well put. I need to make a shorter version, but I have a lot of material here, especially the example where I bootstrap ranks of predictors, showing that it’s very hard even when there are only 12 candidate features in a high signal:noise ratio situation. The shortest answer I know of is that the data do not possess the information needed for them to tell you which elements of the data are important.
Of course I am truly interested in advice, but since I get somewhat different advice from another statistician, I cannot simply accept one opinion as the undisputed truth, and to the best of my belief, there is no such thing.
I do not by any means claim that you are wrong, but I am trying to understand why you think my situation is particularly ill-suited for variable selection, etc. I thought the low dimensionality would be an advantage?
I read an article on the pitfalls of automated variable selection by Sainani indicating that low dimensionality would actually be an advantage (since she suggests reducing dimensionality using PCA prior to stepwise regression):
Automatic selection procedures are less likely to cause overfitting if the sample size is large relative to the number of candidate predictor variables. (A common rule-of-thumb is to aim for at least 50 observations per candidate predictor in linear regression and 50 events per candidate predictor in logistic regression).
To reduce the pool of candidate predictors, researchers may apply data reduction techniques. For example, principal components analysis reduces a group of variables into a smaller set of uncorrelated variables called principal components. Applying stepwise regression to these principal components both reduces the risk of overfitting and the risk of selecting the wrong variables due to correlation between predictor variables (multicollinearity).
I have ~750 observations and 6 candidate predictor variables, so I would say the sample size is large relative to the number of candidate predictors, at least according to that rule of thumb.
I checked VIFs for the six candidate explanatory variables to be sure there was no multicollinearity issue, but they did not warrant further investigation (the maximum VIF is 1.54).
Moreover, Sainani wrote:
Unless the explicit aim of an analysis is to unearth novel hypotheses, researchers should restrict the variables under consideration to those that have a plausible link with the outcome.
This is exactly what I did.
Finally, I compared the reduced model to the full model using a likelihood ratio test. This test was indeed non-significant, indicating that the discarded variable did not significantly improve model fit. Is this also unreliable?
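Concretely, the comparison was along these lines (a sketch with placeholder object and variable names; the reduced model drops the one variable not chosen by the AIC-based procedures):

```r
# Likelihood ratio test of the dropped variable (placeholder names)
full    <- glm(y ~ x1 + x2 + x3 + x4 + x5 + x6, family = binomial, data = d)
reduced <- glm(y ~ x1 + x2 + x3 + x4 + x5,      family = binomial, data = d)
anova(reduced, full, test = "Chisq")   # compares deviances of nested models
```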
From what I understand, these methods are popular in the literature?
I spent an amazing amount of time researching the literature and doing simulation studies to find out what doesn’t work. I wish you had read the key modeling strategies chapter in my book because these issues were all covered, long ago. And this bootstrapping example in Chapter 5 directly addresses the issue of unreliability of feature selection:
You’ll get a lot of information from studying how much simulation work an advocate of a dangerous statistical method has done. The majority of those touting unpenalized stepwise regression have never done a simulation study to back up their beliefs.
Yes, discarding variables through statistical testing is simply unreliable. The likelihood ratio test you did is essentially testing whether the baseball team with the higher score won the game, after the fact. You are using circular reasoning.
Restricting attention to plausible variables is a really good idea but it cannot be based on supervised learning, which is like shooting the side of a barn then drawing a target around the “bullseye”.
Variable selection (when unpenalized by elastic net, etc.) does not help with overfitting and in fact often makes overfitting worse except in the special situation in which there is a large number of truly zero importance variables and the other variables have large importance.
Co-linearity does not affect prediction.
Yes stepwise methods are popular in the literature. Half of the people that live in my state don’t believe in evolution. Should I start listening to them because of their numbers?
The easiest way to sum this up is to note that the unbiased estimate of the residual variance (which goes into everything you are doing) divides the sum of squared errors by n-p-1, where p is the number of variables tested. The approach you are using results in a biased (low) \hat{\sigma}^2. Also, when you fit a model and some of the variables have small coefficients and/or large p-values, there is nothing wrong with that.
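In symbols, with SSE the sum of squared errors and p counting every variable examined (not just those retained):

$$\hat{\sigma}^2 = \frac{\text{SSE}}{n - p - 1}$$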
The original motivation for removing variables, out of an already small set, escapes me.
The popularity of a method is not directly correlated with its merit. Things are generally popular because they are easy, and/or because they leaked into the first one or two statistics courses that most non-statisticians take, got widely applied in their literature, and then remained there for years due to inertia more than anything. (For another example, p-values appear in Table 1 of nearly every RCT paper, despite further reflection showing that this is clearly silly.)
You are doing a good thing here by probing and asking questions in a forum designed to do this. I encourage you to continue doing so and retaining an open mind, rather than regurgitating concepts taught in one or two early statistics courses or assuming that whatever you’ve read in the literature is correct. Tons of people use stepwise regression; almost none of them can explain the issues described here; they are not doing it because they have carefully evaluated the method and concluded that it is best suited for their problem, they are doing it because they think it’s what they are supposed to do based on a very limited understanding of its strengths and limitations.
I work with orthopaedic surgeons and they generally have an idea of the predictors in the model.
In working with rare outcomes, the dataset does not contain enough samples to satisfy the 15 events per predictor rule of thumb for the prespecified variables. The calibration slope plot also looks way off. Should I be worried about overfitting even with pre-specified predictors?
If the dependent variable is binary or time to a rare binary event, then the needed sample size goes up. The papers by Richard Riley are the best ones for sample size estimation for predictive modeling. To your question, pre-specification of predictors is a great idea, but you need to limit the number that are pre-specified or serious overfitting can occur. Data reduction (unsupervised learning) is one of the better ways to reduce the problem to what the sample size will allow.
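As a rough illustration of the events-per-variable arithmetic (the numbers below are hypothetical, not taken from your data):

```r
# Rough events-per-variable check (hypothetical numbers)
n_events    <- 45        # number of outcome events observed
p_candidate <- 6         # count every candidate predictor considered
n_events / p_candidate   # compare with the 15 events-per-variable rule of thumb
```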
When we report, do we then need to present the univariate analysis? We have an a priori list of candidate variables and pre-specified models. Do we just need to present the multivariable model and its results? If there is no data-driven variable selection, then why show the univariate results? This is something I was asked by a friend who strongly believes in the incorrect method of p-value-based variable selection.