Summary statistics of rank data

I repeated some feature selection procedures over multiple repeats of cross-validation (CV). To estimate feature importance I sorted the results of each fold by the absolute magnitude of the coefficients (e.g., 5-fold CV repeated 30 times gives 150 possible β values per feature). Features where β = 0 are not selected and therefore do not receive a rank.
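For concreteness, here is a minimal sketch of the ranking step I described, using placeholder data and an arbitrary Lasso penalty (none of the names or values here are from my actual pipeline):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import RepeatedKFold

# Illustrative placeholder data: 200 samples, 50 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

rkf = RepeatedKFold(n_splits=5, n_repeats=30, random_state=0)
ranks = []  # one row per fold; NaN = eliminated (beta == 0), so no rank

for train_idx, _ in rkf.split(X):
    # alpha is an arbitrary illustrative penalty, not a tuned value
    beta = np.abs(Lasso(alpha=0.05).fit(X[train_idx], y[train_idx]).coef_)
    fold_ranks = np.full(X.shape[1], np.nan)
    selected = np.flatnonzero(beta > 0)
    # rank 1 = largest |beta| among the features retained in this fold
    order = selected[np.argsort(-beta[selected])]
    fold_ranks[order] = np.arange(1, order.size + 1)
    ranks.append(fold_ranks)

ranks = np.vstack(ranks)  # shape: (150, 50)
```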

I would like to use suitable summary statistics to describe the vector of ranks for each of the variables with the largest average (absolute) coefficients per feature selection technique (I determine the top 40 via average coefficient size across all folds for all CV repeats).

Many of the distributions of ranks are non-normal, with those with larger average absolute coefficients having more gamma-like distributions. Thus, I think the median and the 5th to 95th percentile interval are appropriate.

However, since many of the features’ coefficients are 0 across the repeated folds, the lengths of the rank vectors also vary. For example, the most selected feature is retained in almost every run, whereas the variable with the 40th largest absolute coefficient is retained only 20 times, giving 20 rank positions.

Assuming I will present this information in a table, I thought of adding an extra column, “number of times selected”, to convey the frequency information. However, I wondered if there was a more formal way to do this.
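As a concrete version of the table I have in mind (continuing the sketch above, where `ranks` has one row per fold and NaN marks an eliminated feature):

```python
import numpy as np
import pandas as pd

# `ranks` as built in the sketch above: (n_folds_total, n_features)
# features never selected give all-NaN columns, hence NaN summaries here
summary = pd.DataFrame({
    "median_rank":    np.nanmedian(ranks, axis=0),
    "rank_5th_pct":   np.nanpercentile(ranks, 5, axis=0),
    "rank_95th_pct":  np.nanpercentile(ranks, 95, axis=0),
    "times_selected": (~np.isnan(ranks)).sum(axis=0),  # the extra frequency column
})
print(summary.sort_values("median_rank").head(40))
```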

If we had normally distributed continuous data and were reporting means, the effect of sample size (in this case, number of times a variable is not eliminated) would be captured in the width of the curve and location of the confidence intervals. Might there be an equivalent of this for rank data?

Finally, since I’m dealing with multiple feature selection techniques under different conditions and datasets, I’d like the procedure to be automatable. I mention this because some solutions I’ve come across whilst reading involve finding distributions similar to the observed distribution, which is quite infeasible under my circumstances.

(P.S., describing the coefficient sizes instead of the ranks is also not appropriate since I will also be comparing with other techniques that rank but do not eliminate features)

The length of the rank vector cannot vary. You are doing conditional analyses, which will greatly underestimate the difficulty of the task. Use the bootstrap to get confidence intervals for ranks, where you include non-selected features in every resample and give them a “bad” rank. See Biostatistics for Biomedical Research, Chapter 20: Challenges of Analyzing High-Dimensional Data.

Start off with no variable selection and bootstrap an importance measure such as log likelihood or estimated regression coefficient (if all features are on the same scale).
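For illustration, a minimal sketch of this idea, assuming a plain linear model as the importance measure and reusing the placeholder `X`, `y` from above (the 500 resamples are arbitrary; features are assumed to be on a common scale):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import LinearRegression

n_boot, p = 500, X.shape[1]
boot_ranks = np.empty((n_boot, p))

rng = np.random.default_rng(0)
for b in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))      # bootstrap resample
    beta = LinearRegression().fit(X[idx], y[idx]).coef_
    # rank 1 = largest |beta|; no selection, so every feature gets a rank in every resample
    boot_ranks[b] = rankdata(-np.abs(beta), method="average")

# nonparametric 0.95 intervals for each feature's importance rank
rank_ci = np.percentile(boot_ranks, [2.5, 97.5], axis=0)
```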

Thanks for your time Professor Harrell. I’ve read various chapters in the link you provide - it’s all really excellent stuff. I will read the chapter you link again to see if I get anything new from it.

If you could spend a minute to explain what exactly is wrong with the length of the ranks varying, I’d greatly appreciate it. Since these methods are sparse (e.g., Lasso), some variables are eliminated, but - as you talk about often - which variables are retained or eliminated varies across repeats of cross-validation. (Thus, in my current situation, the length of the ranks also represents the non-zero frequency, which is informative but in a slightly different way.)

Since the majority of features are usually excluded, how could I assign ranks among them? This would mean hundreds of variables would be tied for last place in every fold and every repeat of cross-validation.

Thanks a lot, Professor.

Ranking only “winners” when different “winners” are selected at each bootstrap repetition means that your analysis is conditioning on “winning”. Think about how you would get an unbiased standard error of a coefficient with the bootstrap when doing stepwise variable selection: you’d treat non-selected variables as having β̂ = 0 and put those zeros in the vector of β̂ values when computing the standard deviation across bootstrap samples.
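For illustration, a minimal sketch of that bootstrap, using Lasso as the selection step and the placeholder `X`, `y` from above (the penalty and resample count are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso

n_boot = 500
boot_betas = np.zeros((n_boot, X.shape[1]))   # zeros stay in place for non-selected features

rng = np.random.default_rng(1)
for b in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))
    boot_betas[b] = Lasso(alpha=0.05).fit(X[idx], y[idx]).coef_

# the SD across resamples keeps the zeros from non-selection,
# so it reflects the selection step rather than conditioning on "winning"
se_hat = boot_betas.std(axis=0)
```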

Yeah, I guess that was an important part of my question then. In the case of estimating a coefficient, one would include instances where β = 0, and it is logical to me why (when calculating the average coefficient across resamples, I do include the 0 coefficients). But when transferring this to the case of ranks, it isn’t clear to me how to do it, since the majority of variables will be tied for last place.

Thus, is it suitable to set all variables whose coefficients are shrunk to 0 to rank = P (i.e., the lowest possible rank) per resample? My instinct tells me this wouldn’t be right, since we would encounter situations where the rank for a given variable is P for most resamples but occasionally a few hundred places better than P.
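Concretely, I mean something like this (continuing the placeholder setup, with P the total number of candidate features):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import Lasso

P = X.shape[1]
n_boot = 500
boot_ranks = np.empty((n_boot, P))

rng = np.random.default_rng(2)
for b in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))
    beta = np.abs(Lasso(alpha=0.05).fit(X[idx], y[idx]).coef_)
    r = np.full(P, float(P))             # eliminated features all tied at the worst rank P
    sel = np.flatnonzero(beta > 0)
    r[sel] = rankdata(-beta[sel], method="average")
    boot_ranks[b] = r
```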

Yes, having a lot of ties at the worst rank is OK. But variable selection in general is a bad idea; ranking of variable importance without screening of variables will work better and will expose whether variable selection could actually work.

Actually, inspired by much of your work, I’m trying to do exactly that - use a robust statistical procedure to demonstrate the difficulty of feature selection when n >> p. So, for that I think I should have at least some models that perform selection, rather than using only rank-based ones. I just need to make sure I implement them correctly and make them as comparable as possible, hence my question in this thread. I think I got the answers I was looking for. Thanks Professor.
