I repeated several feature selection procedures over multiple repeats of cross-validation (CV). To estimate feature importance, I ranked the features within each fold by the absolute magnitude of their coefficients (e.g., 5-fold CV repeated 30 times gives 150 possible β values per feature). Features where β = 0 are not selected and therefore do not have a rank.
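To make the ranking step concrete, here is a minimal sketch of what I mean. It assumes the coefficients are stored in a NumPy array with one row per fold and one column per feature; the function name `rank_nonzero` and the toy numbers are made up for illustration, and ties between non-zero coefficients are broken by column order, which is an assumption rather than a considered choice.

```python
import numpy as np

def rank_nonzero(coefs):
    """Rank features within each run by |coefficient| (1 = largest).

    Features with a coefficient of exactly 0 were not selected in that
    run, so they receive NaN instead of a rank.
    """
    abs_coefs = np.abs(np.asarray(coefs, dtype=float))
    n_runs, n_features = abs_coefs.shape
    # Column indices sorted by descending |coefficient| within each run.
    order = np.argsort(-abs_coefs, axis=1, kind="stable")
    ranks = np.empty_like(abs_coefs)
    rows = np.arange(n_runs)[:, None]
    # Scatter rank positions 1..p back to the original column layout.
    ranks[rows, order] = np.arange(1, n_features + 1)
    # Unselected features have no rank.
    ranks[abs_coefs == 0] = np.nan
    return ranks

# Toy example: 2 runs, 3 features (illustrative values only).
toy_coefs = np.array([[0.5, 0.0, -2.0],
                      [0.0, 0.0, 1.0]])
toy_ranks = rank_nonzero(toy_coefs)
```

With real data the array would have 150 rows (5 folds x 30 repeats), and each column of the result is the vector of ranks I want to summarise.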
I would like to use suitable summary statistics to describe the vector of ranks for each of the top 40 variables per feature selection technique, where the top 40 are the variables with the largest average absolute coefficient across all folds of all CV repeats.
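For reference, the top-40 selection could be sketched as follows. This assumes the same runs-by-features coefficient array and that the average is taken over all runs, including those where the coefficient is 0; `top_k_features` is a hypothetical helper name.

```python
import numpy as np

def top_k_features(coefs, k=40):
    """Return column indices of the k features with the largest
    mean absolute coefficient across all runs (zeros included)."""
    mean_abs = np.abs(np.asarray(coefs, dtype=float)).mean(axis=0)
    # Sort descending by mean |coefficient|; stable sort keeps column
    # order for ties.
    return np.argsort(-mean_abs, kind="stable")[:k]

# Toy example: 2 runs, 3 features (illustrative values only).
toy_coefs = np.array([[0.5, 0.0, -2.0],
                      [0.0, 0.0, 1.0]])
toy_top = top_k_features(toy_coefs, k=2)
```

Averaging only over the runs in which a feature was selected would give a different ordering, so that choice matters for which 40 features end up in the table.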
Many of the rank distributions are non-normal; features with larger average (absolute) coefficients tend to have more gamma-like rank distributions. Thus, I think reporting the median and the 5th to 95th percentile interval is appropriate.
However, since many of the features’ coefficients are 0 in some of the repeated folds, the length of the rank vector also varies between features. For example, the most selected feature is retained in almost every run, whereas the variable with the 40th largest absolute coefficient is only retained 20 times, giving 20 rank positions.
Assuming I will present this information in a table, I thought of adding an extra column “number of times selected” as a solution to convey frequency information. However, I wondered if there was a more formal way to do this.
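The table I have in mind could be produced along these lines. This is only a sketch of my current approach, assuming a runs-by-features array of ranks in which NaN marks runs where the feature was not selected; `summarise_ranks` and the column names are hypothetical.

```python
import numpy as np

def summarise_ranks(ranks, names):
    """Per feature: times selected, median rank, 5th and 95th percentiles.

    NaN entries (feature not selected in that run) are dropped before
    the statistics are computed.
    """
    ranks = np.asarray(ranks, dtype=float)
    table = []
    for j, name in enumerate(names):
        r = ranks[:, j]
        r = r[~np.isnan(r)]
        if r.size == 0:
            # Never selected: no rank statistics to report.
            q05 = med = q95 = np.nan
        else:
            q05, med, q95 = np.percentile(r, [5, 50, 95])
        table.append({"feature": name, "n_selected": int(r.size),
                      "median": med, "q05": q05, "q95": q95})
    return table

# Toy example: 3 runs, 2 features (illustrative values only).
toy_ranks = np.array([[1.0, np.nan],
                      [3.0, np.nan],
                      [2.0, 4.0]])
for row in summarise_ranks(toy_ranks, ["a", "b"]):
    print(row)
```

The weakness is visible in the toy example: feature "b" gets a median and an interval from a single observation, and only the `n_selected` column tells the reader how little data is behind them.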
If we had normally distributed continuous data and were reporting means, the effect of sample size (in this case, number of times a variable is not eliminated) would be captured in the width of the curve and location of the confidence intervals. Might there be an equivalent of this for rank data?
Finally, since I’m dealing with multiple feature selection techniques under different conditions and datasets, I’d like the procedure to be automatable. I mention this because some solutions I’ve come across while reading involve finding a known distribution similar to the observed one, which is infeasible under my circumstances.
(P.S., describing the coefficient sizes instead of the ranks is also not appropriate since I will also be comparing with other techniques that rank but do not eliminate features)