Ordinally modeling a probability distribution

cs1972 · December 29, 2020, 1:58pm

Hi everyone,
This is my first post on this site, and I hope my question is appropriate for this forum. I should also make clear from the start that I am not a statistician, so my question might be naïve and my understanding of possible answers incomplete.

My question: I would like to build an interpretable regression model for a response variable (Y) that is bounded in (0,1). For this, I have 100 continuous predictors and a sample of 300 subjects (with no missing values on any variable). I have seen a similar question in this thread (https://stats.stackexchange.com/questions/120671/probability-distribution-of-y-in-regression). From that thread, it seems that a semiparametric model would allow me to estimate such a distribution, so I consulted chapter 15 of Dr. Harrell’s book (Regression Modeling Strategies, 2nd edition). However, given my very limited knowledge of the subject, I have not been able to find the information I need to build such a model. I would also like to identify and rank the most relevant predictors of Y (although on pages 117-118 of the same book I have read about the difficulties with such rankings). I would therefore be grateful for any help you could provide.

Thank you in advance for any answers.

f2harrell · December 30, 2020, 1:13pm

Carla, your sample size is too small by a factor of at least 5 to do what you want. It is best to do analyses that live within the confines of the available information content. For your situation I think it’s best to spend a lot of time first doing data reduction (unsupervised learning that ignores Y). That way you can reduce the dimensionality of the data by no longer separating X’s that are hard to separate. Then summary scores that represent non-overlapping information can be used in a much smaller model to predict Y.
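One way to arrive at a factor of about 5, assuming a rule of thumb of roughly 15 subjects per candidate parameter (an assumption, not something stated in this thread), is the arithmetic below. The function names are hypothetical and the ratio is a heuristic, not a guarantee.

```python
# Rough sample-size arithmetic under an ASSUMED heuristic of
# ~15 subjects per candidate parameter (not stated in the thread).

def max_candidate_parameters(n_subjects, subjects_per_parameter=15):
    """Largest number of candidate parameters the heuristic would allow."""
    return n_subjects // subjects_per_parameter

def required_subjects(n_parameters, subjects_per_parameter=15):
    """Number of subjects the heuristic would ask for."""
    return n_parameters * subjects_per_parameter

n, p = 300, 100                       # sample size and predictors from the question
budget = max_candidate_parameters(n)  # parameters the sample can support
needed = required_subjects(p)         # subjects needed for all 100 predictors
shortfall = needed / n                # how many times too small the sample is

print(f"budget={budget} parameters, needed={needed} subjects, shortfall={shortfall:.1f}x")
```

Under that assumed ratio, 300 subjects support about 20 candidate parameters, which is also why a model built on roughly 10 summary scores is a reasonable target.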

Some good data reduction methods are variable clustering, principal components, and especially sparse principal components analysis. I demonstrate these in my RMS book and course notes.
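To make the idea of data reduction concrete, here is a minimal sketch in plain Python. It uses a crude greedy correlation-threshold grouping, which is a stand-in chosen for brevity and is not the variable clustering or sparse principal components procedure from the RMS book. The data are simulated, and the function names and the 0.5 threshold are assumptions. Note that Y is never used.

```python
import random
import statistics

def standardize(col):
    """Center and scale a column to mean 0 and population SD 1."""
    m = statistics.fmean(col)
    s = statistics.pstdev(col)
    return [(v - m) / s for v in col]

def corr(a, b):
    """Pearson correlation of two already-standardized columns."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

def greedy_clusters(columns, threshold=0.5):
    """Put each variable in the first cluster whose first member it
    correlates with (in absolute value) at or above the threshold."""
    clusters = []
    for j, col in enumerate(columns):
        for cluster in clusters:
            if abs(corr(col, columns[cluster[0]])) >= threshold:
                cluster.append(j)
                break
        else:
            clusters.append([j])
    return clusters

def summary_scores(columns, clusters):
    """One score per cluster: the mean of its standardized members,
    with negatively correlated members sign-flipped so they do not cancel."""
    scores = []
    for cluster in clusters:
        first = columns[cluster[0]]
        signs = [1.0 if corr(columns[j], first) >= 0 else -1.0 for j in cluster]
        scores.append([
            statistics.fmean([s * columns[j][i] for s, j in zip(signs, cluster)])
            for i in range(len(first))
        ])
    return scores

# Simulated predictors: two latent factors, three noisy copies of each.
rng = random.Random(0)
n = 300
f1 = [rng.gauss(0, 1) for _ in range(n)]
f2 = [rng.gauss(0, 1) for _ in range(n)]
raw = [[f + 0.5 * rng.gauss(0, 1) for f in latent]
       for latent in (f1, f1, f1, f2, f2, f2)]

columns = [standardize(c) for c in raw]
clusters = greedy_clusters(columns)
scores = summary_scores(columns, clusters)
print(clusters, len(scores))
```

The six simulated predictors collapse into two summary scores, which would then enter the outcome model in place of the original variables.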

Once you build a stable low-dimensional model (say using 10 summary scores) there are several ways to interpret the results on the original scales.

cs1972 · December 30, 2020, 2:15pm

Dear Dr. Harrell,
Thank you very much for your fast and clear answer.

I was afraid that my sample was too small for my goal. In this regard, I appreciate your suggestion of first reducing the dimensionality of the dataset by clustering methods. However, I am afraid that, in this particular case, such a strategy might be inconvenient because it would reduce the interpretability of the resulting model. I would therefore prefer to try selecting a reduced subset of predictors (around 10), although this strategy might not work. In any case, I will check your RMS book for possible solutions.

Again, thank you very much for your clear answer.

f2harrell · December 30, 2020, 3:06pm

Carla, you may be falling into the trap of “if it’s simple I can interpret it” when in fact the result would be interpretable only because it is wrong. The reduced subset of predictors will have a probability of zero of being the “right” variables. The result of feature selection in the low signal:noise ratio case is purely a mirage.
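A small simulation can make the "mirage" tangible. The sketch below is an illustration under stated assumptions: simulated data with 300 subjects and 100 predictors, every predictor given the same weak effect, an unbounded Y for simplicity, and a naive "top 10 by absolute correlation" rule standing in for feature selection. The same rule is applied to two bootstrap resamples of the one dataset and the selected sets are compared.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def top_k(rows, y, k=10):
    """Indices of the k predictors most correlated with y (absolute value)."""
    strength = []
    for j in range(len(rows[0])):
        col = [row[j] for row in rows]
        strength.append((abs(pearson(col, y)), j))
    strength.sort(reverse=True)
    return {j for _, j in strength[:k]}

rng = random.Random(1)
n, p = 300, 100

# Simulated data: all 100 predictors share the same small true effect,
# so no subset of 10 is the "right" one.
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [0.05 * sum(row) + rng.gauss(0, 1) for row in X]

def bootstrap_selection():
    """Resample subjects with replacement, then rerun the selection rule."""
    idx = [rng.randrange(n) for _ in range(n)]
    return top_k([X[i] for i in idx], [y[i] for i in idx])

first = bootstrap_selection()
second = bootstrap_selection()
overlap = len(first & second)
print(f"{overlap} of 10 selected predictors agree across two resamples")
```

If the selected variables change substantially from one resample to the next, the "top 10" list mostly reflects sampling noise rather than the importance of those variables.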

cs1972 · December 30, 2020, 4:29pm

Dear Dr. Harrell,
Thanks a lot for your comment. Without your advice, I could easily have fallen into that “trap”.
I guess I will have to follow the data reduction strategy that you suggested earlier and hope that the predictor clusters are more or less interpretable.
Again, thank you very much for taking the time to help me out.

f2harrell · December 30, 2020, 7:45pm

Good luck with the project. As discussed in RMS there are many descriptive tools you can use to interpret principal components, and there is a case study of sparse principal components in the book. One simple approach is to run stepwise variable selection just as a descriptive tool to predict one of the cluster summary scores.
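A minimal sketch of that last idea follows, under loud assumptions: the data are simulated, the cluster summary score is simply the mean of three predictors, and the forward-selection routine with an R² stopping goal of 0.9 is a hypothetical helper, not a function from the RMS book or its software. Because the target is a summary score and not Y, the selection serves only to describe which original variables approximate the score.

```python
import random

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def residualize(v, u):
    """Remove from v its least-squares projection onto u."""
    coef = dot(v, u) / dot(u, u)
    return [a - coef * b for a, b in zip(v, u)]

def forward_select(columns, target, r2_goal=0.9):
    """Forward stepwise selection: at each step add the variable giving the
    largest drop in residual sum of squares, until R^2 reaches r2_goal."""
    resid = center(target)
    total = dot(resid, resid)
    work = {j: center(c) for j, c in enumerate(columns)}
    chosen, r2 = [], 0.0
    while work and r2 < r2_goal:
        best = max(work, key=lambda j: dot(resid, work[j]) ** 2 / dot(work[j], work[j]))
        u = work.pop(best)
        resid = residualize(resid, u)
        # Orthogonalize remaining candidates against the chosen variable.
        work = {j: residualize(v, u) for j, v in work.items()}
        chosen.append(best)
        r2 = 1 - dot(resid, resid) / total
    return chosen, r2

# Simulated predictors: variables 0-2 share one latent factor, 3-5 another.
rng = random.Random(0)
n = 300
f1 = [rng.gauss(0, 1) for _ in range(n)]
f2 = [rng.gauss(0, 1) for _ in range(n)]
columns = [[f + 0.5 * rng.gauss(0, 1) for f in latent]
           for latent in (f1, f1, f1, f2, f2, f2)]

# Hypothetical cluster summary score: mean of the first cluster's members.
score = [(a + b + c) / 3 for a, b, c in zip(*columns[:3])]

chosen, r2 = forward_select(columns, score)
print(f"variables {chosen} reproduce the score with R^2 = {r2:.2f}")
```

The output reads as "this summary score is well approximated by these few original variables", which helps with naming and interpreting the score without making any claim about Y.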