Hello Everyone,
I am new to this site and to clinical data analysis. I come from an IT background and am relatively new to statistics, so a simple explanation would be really helpful.
I have a binary classification problem with 5K records and 60+ features (columns/variables). The dataset is slightly imbalanced (or is it?), with a 33:67 class proportion.
What I did was:
1st ) Run a logistic regression (statsmodels API in Python) with all 60+ columns as input, so that confounders are controlled for, and find the significant risk factors (p < 0.05) from the summary output. My reasoning is that a multivariable regression adjusts for confounders, so I don't have to handle them separately. I also need to know that my risk factors are significant, i.e. I am after something like causal modeling: building a predictive model on the basis of causal features. I say this because in a field like medical science/clinical studies, I believe it is also important to know the causal effect. If you wish to publish in a journal, can you really just list variables based on a feature importance approach, when the results differ for each feature selection (FS) approach? Of course, I do find some features that are common across all the feature selection algorithms, but is that enough to justify calling them meaningful predictors? Hence I was hoping that a p-value would help convince people that a variable is a significant predictor. Any thoughts on this?
2nd ) Use the identified 7 significant risk factors to build a classification ML model
3rd ) It yielded an AUC of around 82%
Now my questions are:
1 ) Out of the 7 significant factors identified, we already know 5 as risk factors from domain experience and the literature, so we are treating the remaining 2 as new factors that we found. This might be because we had a very good data collection strategy (meaning we collected data on new variables that the previous literature didn't have).
2 ) But when I build a model with only the 5 already known features, it produces an AUC of 82.1. When I include all 7 significant features, it still produces an AUC of 82.1-82.3, or sometimes it even goes down to 81.8-81.9. Not much improvement. Why is this happening?
3 ) If the 2 new features are of no use, how did the statsmodels logistic regression identify them as significant (p < 0.05)?
4 ) I guess we can look at any metric. As my data is slightly imbalanced (33:67 class proportion), I am using only AUC and F1 score. Should I be looking at any other metrics?
5 ) Should I balance the dataset, given that I am using statsmodels logistic regression to identify the risk factors from the summary output? I use tree-based models later for the classification, and they can handle imbalance well, so I didn't balance. Basically, what I am trying to find out is whether I should balance the dataset even for the significant risk factor identification step using statsmodels logistic regression.
6 ) Can you let me know what the problem is here and how I can address it?
7 ) How much of an improvement in performance is needed for it to be considered valid/meaningful enough to count as a new finding?
8 ) How can I justify that the new features we found are meaningful and really do explain the change in outcome?
Basically, what I am trying to achieve is:
a) Prove that my new risk factors are meaningful and significant in influencing the outcome
b) Use those causal features for a prediction model