Why doesn't a significant variable improve model performance?

Akshay_Kumar · January 2, 2020, 1:08am

Hello Everyone,

I am new to this site and to clinical data analysis. I am an IT person and relatively new to statistics. Any simple explanation would really be helpful.

I have a binary classification problem with 5K records and 60+ features/columns/variables. The dataset is slightly imbalanced (or not?), with a 33:67 class proportion.

What I did was

1st ) Run a logistic regression (statsmodels API in Python) with all 60+ columns as input (meaning controlling for confounders) and find the significant risk factors (p<0.05) from the summary output. With this approach, I believe I don’t have to worry about confounders, because they are controlled for in the multivariable regression. I also need to know that my risk factors are significant, i.e. causal modeling: I want to build a predictive model on the basis of causal features. I say this because in a field like medical science/clinical studies, I believe it is also important to know the causal effect. If you wish to publish in a journal, can you just list the variables based on a feature importance approach, when the results differ for each feature selection (FS) approach? Of course, I do find some features in common across all the feature selection algorithms. But is this enough to justify that a feature is a meaningful predictor? Hence, I was hoping that a p-value would convince people and help them understand that it is a significant predictor. Any thoughts on this?

2nd ) Use the identified 7 significant risk factors to build a classification ML model

3rd ) It yielded an AUC of around 82%

Now my questions are

1 ) Out of the 7 significant factors identified, we already know 5 risk factors from domain experience and the literature. So we are considering the remaining 2 as new factors that we found. This might be because we had a very good data collection strategy (meaning we collected data for new variables which the previous literature didn’t have).

2 ) But when I build a model with the 5 already known features, it produces an AUC of 82.1. When I include all 7 significant features, it still produces an AUC of 82.1-82.3, or sometimes it even goes down to 81.8-81.9. Not much improvement. Why is this happening?

3 ) If they are of no use, how did statsmodels logistic regression identify them as significant features (with p<0.05)?

4 ) I guess we can look at any metric. As my data is slightly imbalanced (33:67 is the class proportion), I am using only metrics like AUC and F1 score. Should I be looking at any other metric?

5 ) Should I balance the dataset, given that I am using statsmodels logistic regression to identify the risk factors from the summary output? I use tree-based models later to do the classification, which can handle imbalance well, so I didn’t balance. Basically, what I am trying to know is: even for significant risk factor identification using statsmodels logistic regression, should I balance the dataset?

6 ) Can you let me know what is the problem here and how can I address this?

7 ) How much of an improvement in performance is needed for it to be considered a valid/meaningful new finding?

8 ) How can I justify that the new features we found are meaningful and really do explain the change in outcome?

Basically what I am trying to achieve is

a) prove that my new risk factors are meaningful and significant in influencing the outcome

b) Use those causal features for a prediction model

f2harrell · January 2, 2020, 1:04pm

Akshay, to understand the big picture, think of me as a statistician trying to master your IT job in an afternoon without studying the principles and methods that pertain to it. The statistical approach and statistical assumptions you are making are fraught with problems, and I suggest that if you want to be good at statistical modeling and quantifying predictive accuracy you should take several steps back. You might study my Regression Modeling Strategies book and course notes and the various blog articles I’ve posted on classification. Some of the many issues are these:

The dataset you are analyzing is not set up for causal inference
Logistic regression is for probability estimation, not classification
Balance is irrelevant once you use a proper accuracy scoring rule
See this for sound statistical ways to measure improvements in model performance

But study the general methods first. Also see hundreds of related questions (with discussion) at http://stats.stackexchange.com.