How to rank parameters and choose "the best"?

Maybe this is a simple, beginner question, but how do I find "the best" predictor out of a set of proteins?
I have around 30 proteins, all related to one biological system. I have longitudinal data and want to know which protein is the best predictor of my "event endpoint".
One idea I had was to fit 30 univariable models, rank them by their likelihood-ratio statistics, and choose the one with the highest value.
Instead of 30 univariable models, would 30 multivariable models also be an option?

Or would it be better to look at the p-values (adjusted for multiple testing) and take the lowest one?

Normally I have a main research question and don't do this type of "data exploration". The point here is that I think I already know which protein is the "most important" one for prediction, and however I proceed this will (hopefully) remain true, but I have to justify why I dive deeper into only this protein and not another one.

This is a very common type of research, and quite problematic because (1) the data seldom contain enough information to make these discernments, and (2) collinearities may make choosing almost impossible, resulting in essentially random choices. As demonstrated in Chapters 2 and 4 of RMS, you need to put uncertainty intervals around any ranks or metrics you compute, using the bootstrap or Bayesian methods, and you'll be quite sickened by the width of those intervals. That said, the LR \chi^2 is one of the best metrics to rank on.

This kind of problem calls for a drastically different approach that allows you to answer simpler questions with findings that will not just blow in the wind. Think of this more as a problem of data reduction (unsupervised learning): finding the dimensions/projections of the 30 proteins that capture most of the information, which prevents you from trying to separate proteins that are inseparable. Sparse principal components analysis and redundancy analysis, supplemented by simple variable clustering dendrograms, are worth considering. These are exemplified in case studies in RMS.
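As one illustration of the variable-clustering part of this advice (sparse PCA and redundancy analysis are not shown, and in practice the R rms/Hmisc tools would be the natural choice), here is a minimal Python sketch that clusters variables on the dissimilarity 1 − |r|. The data are simulated: three blocks of correlated "proteins", with all names and settings invented for the demo.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(7)
n = 200
# Simulate three blocks of correlated "proteins" (9 variables total):
# columns 0-2 share one latent signal, 3-5 another, 6-8 a third.
base = rng.normal(size=(n, 3))
X = np.hstack([base[:, [k]] + 0.4 * rng.normal(size=(n, 3)) for k in range(3)])

# Dissimilarity between variables: 1 - |Pearson correlation|
R = np.corrcoef(X, rowvar=False)
D = 1.0 - np.abs(R)
np.fill_diagonal(D, 0.0)

# Average-linkage hierarchical clustering on the condensed distance matrix
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")  # cut the dendrogram at 0.5
```

The `labels` vector groups inseparable variables together; a dendrogram plot of `Z` (e.g. via `scipy.cluster.hierarchy.dendrogram`) shows which proteins the data cannot tell apart.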

You might also do well to first run a 30 d.f. LR \chi^2 test to see whether there is any support for any of this. That test is perfectly multiplicity-corrected.
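To make the rank-with-uncertainty idea and the global chunk test concrete, here is a hedged Python sketch on simulated data (the "proteins", effect sizes, sample size, and number of bootstrap replicates are all invented for illustration; a real analysis would more likely use the R rms package). It computes univariable LR \chi^2 statistics by maximum likelihood, bootstraps the resulting ranks, and finishes with the all-predictors LR test.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2, rankdata

rng = np.random.default_rng(42)
n, p = 250, 8                      # 8 "proteins" to keep the demo fast
X = rng.normal(size=(n, p))
# Outcome weakly driven by the first three proteins; the rest are noise.
lin = 0.6 * X[:, 0] + 0.4 * X[:, 1] + 0.3 * X[:, 2]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))

def logistic_ll(X1, y):
    """Maximized log-likelihood of a logistic model (X1 includes intercept)."""
    def negll(b):
        z = X1 @ b
        return np.sum(np.logaddexp(0.0, z) - y * z)
    return -minimize(negll, np.zeros(X1.shape[1]), method="BFGS").fun

def lr_stats(X, y):
    """Univariable LR chi-square for each column of X."""
    ones = np.ones((len(y), 1))
    ll0 = logistic_ll(ones, y)     # intercept-only model
    return np.array([2 * (logistic_ll(np.column_stack([ones, X[:, j]]), y) - ll0)
                     for j in range(X.shape[1])])

# Observed ranking (rank 1 = strongest predictor)
obs = lr_stats(X, y)
obs_rank = rankdata(-obs)

# Bootstrap the entire ranking exercise to get uncertainty intervals on ranks
B = 100
ranks = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, n)
    ranks[b] = rankdata(-lr_stats(X[idx], y[idx]))
lo, hi = np.percentile(ranks, [2.5, 97.5], axis=0)

# Global chunk test: all p predictors at once, p degrees of freedom
ll_full = logistic_ll(np.column_stack([np.ones(n), X]), y)
ll_null = logistic_ll(np.ones((n, 1)), y)
global_lr = 2 * (ll_full - ll_null)
global_p = chi2.sf(global_lr, df=p)
```

Even in this toy setting, the `[lo, hi]` rank intervals for the weaker predictors are typically very wide, which is exactly the sobering point made above.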


People sometimes also use a nonparametric estimator such as a random forest and then assess variable importance by, for example, removing each protein in turn and seeing how much the removal degrades prediction in held-out data.

It could be an interesting additional investigation to complement the approaches described above.

As mentioned above, uncertainty is crucial. As I recall, random forests naturally bootstrap internally, but it would be worth bootstrapping the whole procedure in an outer loop as well.
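A rough Python sketch of this outer-loop idea using scikit-learn's permutation importance (the data are simulated, and the sample size, forest settings, and number of bootstrap replicates are arbitrary choices for the demo, not recommendations):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 240, 6
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

B = 15                                   # outer bootstrap replicates
imp = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, n)          # resample rows with replacement
    Xtr, Xte, ytr, yte = train_test_split(X[idx], y[idx], test_size=0.3,
                                          random_state=b, stratify=y[idx])
    rf = RandomForestClassifier(n_estimators=60, random_state=b).fit(Xtr, ytr)
    # Permutation importance on held-out data, inside the outer bootstrap
    imp[b] = permutation_importance(rf, Xte, yte, n_repeats=4,
                                    random_state=b).importances_mean

# Percentile intervals for each variable's importance across bootstraps
lo, hi = np.percentile(imp, [2.5, 97.5], axis=0)
```

Comparing `lo` and `hi` across variables usually shows heavily overlapping intervals, echoing the point that these analyses often tell us little about importance.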

When you add uncertainty intervals to those RF measures you’ll usually find that the analysis tells you almost nothing about variable importance.


Hi,
Thank you for this interesting discussion about variable importance.

One thing that I have struggled with when assessing variable importance is the interpretation of the term "variable importance". In Cox models, for example, I believe one can read "variable importance" as 1) the predictive capacity of a variable, or as 2) the impact of the variable on the hazard (i.e. the value of the coefficients in a regression using standardized covariates).

I understand that the LR \chi^2 addresses the predictive capacity of the variables, the same as the random forest method indicated by @amw235711.

Indeed, I checked both methods on the same dataset and sadly I got large confidence intervals… which just confirms that assessing variable importance in my case study is a difficult task.

With regard to interpretation 2) of variable importance, is it correct to assess it by looking at the coefficient values of each (standardized) variable? Is there any other method to assess it?

Standardized coefficients are slightly helpful, but I like to look at either the proportion of log-likelihood explained or the proportion of predicted values explained by a subset of predictors (or one predictor).
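To illustrate the proportion-of-log-likelihood idea, here is a hedged Python sketch with two simulated predictors (names, coefficients, and sample size are invented). The importance measure is the submodel's LR \chi^2 divided by the full model's LR \chi^2, i.e. the fraction of the explainable log-likelihood captured by one predictor.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(size=n)           # predictor of interest (strong)
x2 = rng.normal(size=n)           # competing predictor (weak)
lin = 1.0 * x1 + 0.3 * x2
y = rng.binomial(1, 1 / (1 + np.exp(-lin)))

def max_ll(cols):
    """Maximized logistic log-likelihood for intercept + the given columns."""
    X1 = np.column_stack([np.ones(n)] + cols)
    negll = lambda b: np.sum(np.logaddexp(0, X1 @ b) - y * (X1 @ b))
    return -minimize(negll, np.zeros(X1.shape[1]), method="BFGS").fun

ll_null = max_ll([])
lr_full = 2 * (max_ll([x1, x2]) - ll_null)   # full-model LR chi-square
lr_x1   = 2 * (max_ll([x1]) - ll_null)       # submodel with x1 only
adequacy = lr_x1 / lr_full                   # fraction explained by x1 alone
```

Here `adequacy` lands near 1 because `x1` carries most of the signal; the same ratio can be computed for any subset of predictors, and it handles multi-degree-of-freedom terms naturally.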


The OP states that she kind of already knows the answer. Let's assume we therefore conclude that there is a high signal-to-noise ratio and that for her laboratory work she does not need to find the single most likely candidate protein. What is the best formal way to go? Thirty regressions with one predictor left out each time, standardized regression coefficients, LASSO, or Bayesian regression with Laplace priors?

Sometimes we do not need to find the best set of predictors, just one of the better small ones. Drawing uncertainty intervals around the assigned ranks may be disappointing, but statisticians should not give up just because there is noise, right?

If uncertainty intervals around importance ranks, or better, uncertainty intervals on good measures of relative importance, are disappointing, this exposes the difficulty of the task, and the researcher should not be optimistic about using the data to make discoveries in this way. It's better to use the data to make discoveries that will be stable, e.g., finding interconnected themes that are predictive, using biological pre-specification or data reduction (unsupervised learning).


A very fun, just-published paper here on this fundamental trade-off between learning and error assessment.


Besides the problems highlighted by Frank, this is almost the same procedure Brad Efron used in this paper:

Efron, Bradley. “Prediction, Estimation, and Attribution.” International Statistical Review 88, no. S1 (2020): S28–59. https://doi.org/10.1111/insr.12409.

Can we use the importance scores for attribution, as with the asterisks in Table 1? In this case, the answer seems to be no. I removed gene1031 from the dataset, reducing the data matrix x to 102 × 6032, and reran the random forest prediction algorithm. Now the number of test set prediction errors was zero. Removing the most important five genes, the most important 10, …, the most important 348 genes had similarly minor effects on the number of test set prediction errors, as shown in Table 3.

At the final step, all of the genes involved in constructing the original prediction rule of Figure 5 had been removed. Now x was 102 × 5685, but the random forest rule based on the reduced dataset d = {x,y} still gave excellent predictions. As a matter of fact, there were zero test set errors for the realization shown in Table 3. The prediction rule at the final step yielded 364 "important" genes, disjoint from the original 348. Removing all 712 = 348 + 364 genes from the prediction set (so now x was 102 × 5321) still gave a random forest prediction rule that made only one test set error.

The “weak learners” model of prediction seems dominant in this example. Evidently there are a great many genes weakly correlated with prostate cancer, which can be combined in different combinations to give near-perfect predictions. This is an advantage if prediction is the only goal, but a disadvantage as far as attribution is concerned.

This is extremely interesting, but isn't it based on a discontinuous accuracy scoring rule? That tends to create misleading results.

The standardized coefficient should be greatly affected by the variable's distribution. If it is far from a standard normal distribution, is it reasonable to use the standardized coefficient? A colleague of mine used the Z score for a skewed distribution, which I thought was strange, but it seems that others don't think so. (I saw you mentioned standardized coefficients, so I am asking here; this question is not about choosing the best predictor.)

Most analysts automatically standardize by the standard deviation. The SD assumes a symmetric distribution; otherwise you have a different SD for below-the-mean observations than for those above. I don't trust standardization. Fortunately, with the log-likelihood function (or R^2 for linear models) this is a non-issue. The log-likelihood (actually the deviance) gives you the best importance measure, and it handles multiple degrees of freedom such as spline terms.
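A quick numeric illustration of the asymmetry point, using a simulated right-skewed (lognormal) variable; the two half-SDs that a single SD pretends to summarize turn out to be very different:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # right-skewed variable

m = x.mean()
sd_below = x[x < m].std()    # spread of the observations below the mean
sd_above = x[x >= m].std()   # spread of the observations above the mean
# For a symmetric variable these would roughly agree; here the heavy right
# tail makes sd_above several times larger than sd_below.
```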


Thank you very much.
I would like to confirm whether I have correctly understood your comments:
1. Standardizing by the standard deviation is defensible when a variable is approximately normally distributed, but it is not reliable, and is a bad choice, for skewed distributions.
2. It is better to use the log-likelihood function, as your paper mentioned. The adequacy index and relative explained variation are better ways to show predictor rankings and their contributed proportions (another good approach is principal components analysis, which aggregates variables into components using all variables rather than excluding some of them).