Hello kind people,
This is my first post here, though I have often searched the forum for (and found!) answers to many a statistical question.
I am doing research on how initial ECG-findings are associated with the outcome and cause of cardiac arrest, using data from a national registry. The analysis is mostly descriptive (demographics, comorbidities and outcome), grouped by initial ECG-findings.
However, we also have data on the cause of the arrest, which we are relating to the initial ECG-findings; the hypothesis is that certain causes are associated with certain initial ECG-findings.
Also, we are doing a RandomForest analysis to compare “variable importance” in predicting outcome (survival to hospital discharge) in the different ECG-subsets.
So to my two questions. The “cause”-variable has a missing data proportion of 43%. There is also a category called “unknown” which is very small, about 1-2%, which has led me to believe a lot of the unknowns are in the missing category.
- How would you treat this missingness in your RandomForest analysis?
First I imputed the variable, but later changed my mind because a) the data are probably NMAR (not missing at random) and b) 100% knowledge of the causes of all arrests is unrealistic.
Therefore, I recoded all missing values to a new category in the cause-variable called “missing”, which I understand can be called a “missing indicator approach” (?) if accompanied by a dummy variable indicating missingness.
I understand this approach can lead to bias in regression models; for decision tree models, however, I couldn’t find any advice.
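For illustration only, the recoding described above could be sketched like this in Python with pandas. The column names (`cause`, `survived`, `cause_missing`, `cause_recoded`) and the toy values are hypothetical and are not the registry data; the point is just to show the explicit “missing” level alongside the dummy variable, with “unknown” kept as its own category.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the registry; names and values are hypothetical.
df = pd.DataFrame({
    "cause": ["cardiac", np.nan, "respiratory", "unknown", np.nan, "cardiac"],
    "survived": [1, 0, 0, 1, 0, 1],
})

# Dummy variable flagging rows where the cause was not recorded.
df["cause_missing"] = df["cause"].isna().astype(int)

# Recode NaN to an explicit "missing" level, kept separate from "unknown".
df["cause_recoded"] = df["cause"].fillna("missing").astype("category")

print(df[["cause", "cause_missing", "cause_recoded"]])
```

Note that with a single recoded variable the dummy carries the same information as the “missing” level itself, so a tree-based model could in principle split on either.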
- I have read somewhere (although now I can’t seem to find where) that RandomForests overestimate the “variable importance” of categorical variables with many categories. My “cause”-variable has 33 categories and thus would be at risk of overestimation of importance. Is this true, and if so, how could one adjust for this “inflation”?
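To make the concern concrete, here is a minimal sketch on synthetic data (not the registry data) using scikit-learn, whose documentation warns that impurity-based importances can be inflated for high-cardinality features. It puts the impurity-based importance next to permutation importance computed on held-out data, which is one commonly suggested alternative. Everything here is an assumption for illustration: the variable names `signal` and `noise_33`, the sample size, and the simplification of feeding the 33-level variable to the forest as integer codes rather than as a true categorical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# One informative binary predictor and one pure-noise predictor with 33 levels.
signal = rng.integers(0, 2, size=n)
noise_33 = rng.integers(0, 33, size=n)

# The outcome depends only on `signal` (agreeing with it about 80% of the time).
y = np.where(rng.random(n) < 0.8, signal, 1 - signal)

X = np.column_stack([signal, noise_33])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importance, computed from the training data.
impurity = rf.feature_importances_

# Permutation importance, computed on data the forest has not seen.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)

for name, imp, p in zip(["signal", "noise_33"], impurity, perm.importances_mean):
    print(f"{name}: impurity={imp:.3f}, permutation={p:.3f}")
```

Comparing the two columns for `noise_33` shows how much importance the forest assigns to a variable that is unrelated to the outcome by construction.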