# Ordinally modeling a probability distribution

Hi everyone,
This is my first post on this site. I hope my question is appropriate for this forum. I should also make clear from the outset that I am not a statistician (and that, therefore, my question might be naïve and my understanding of possible answers incomplete).

My question: I would be interested in building an interpretable regression model for a response variable (Y) that is bounded in (0,1). For this, I have 100 continuous predictors and a sample of 300 subjects (with no missing values in any variable). I have seen a similar question in this thread (https://stats.stackexchange.com/questions/120671/probability-distribution-of-y-in-regression). From that thread, it seems that a semiparametric model would allow me to estimate such a distribution. Therefore, I have consulted chapter 15 of Dr. Harrell’s book (Regression Modeling Strategies, 2nd edition). However, due to my very limited knowledge of the matter, I have not been able to find the right information to build such a model. Moreover, I would be interested in identifying and ranking the most relevant predictors of Y (although on pages 117-118 of the same book, I have read about the difficulties of such rankings). Therefore, I am kindly asking whether any of you could provide some help in this regard.

Carla, your sample size is too small by a factor of at least 5 to do what you want. It is best to do analyses that live within the confines of the available information content. For your situation I think it’s best to spend a lot of time first doing data reduction (unsupervised learning that ignores Y). That way you can reduce the dimensionality of the data by no longer separating X’s that are hard to separate. Then summary scores that represent non-overlapping information can be used in a much smaller model to predict Y.
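One way to read the "factor of at least 5" is through a subjects-per-parameter budget. The sketch below assumes a rule of thumb of roughly 15 subjects per candidate parameter for a continuous response. That figure is an assumption for illustration and is not stated in this thread, and the function names are hypothetical.

```python
def required_sample_size(n_params: int, subjects_per_param: int = 15) -> int:
    """Subjects needed under an ASSUMED subjects-per-parameter rule of thumb."""
    return n_params * subjects_per_param


def max_params(n_subjects: int, subjects_per_param: int = 15) -> int:
    """Largest number of candidate parameters the sample can support."""
    return n_subjects // subjects_per_param


n_subjects = 300    # from the question
n_predictors = 100  # from the question; one linear term per predictor

needed = required_sample_size(n_predictors)  # subjects needed for 100 terms
shortfall = needed / n_subjects              # how many times too small n is
budget = max_params(n_subjects)              # parameters that 300 subjects support

print(f"needed={needed}, shortfall factor={shortfall:.1f}, budget={budget}")
```

Under that assumption, 100 linear terms would call for about 1500 subjects, and 300 subjects support on the order of 20 parameters. Any nonlinear terms would use up the budget faster, which is consistent with "at least".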

Some good data reduction methods are variable clustering, principal components, and especially sparse principal components analysis. I demonstrate these in my RMS book and course notes.
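As a rough illustration of what data reduction that ignores Y can look like, here is a minimal sketch of variable clustering followed by one summary score per cluster. It is a simplified greedy stand-in written in plain Python, not the procedure from the RMS book. The 0.7 correlation threshold, the toy data, and all function names are assumptions made for the example.

```python
import random
from statistics import mean, pstdev


def standardize(col):
    """Center and scale a column (assumes the column is not constant)."""
    m, s = mean(col), pstdev(col)
    return [(v - m) / s for v in col]


def corr(a, b):
    """Pearson correlation of two already-standardized columns."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


def cluster_variables(columns, threshold=0.7):
    """Greedy clustering: a variable joins the first cluster whose seed it
    correlates with at |r| >= threshold; otherwise it starts a new cluster."""
    z = {name: standardize(col) for name, col in columns.items()}
    clusters = []  # each cluster is a list of names; the first is the seed
    for name in z:
        for members in clusters:
            if abs(corr(z[name], z[members[0]])) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters, z


def summary_scores(clusters, z):
    """One score per cluster: mean of standardized members, sign-aligned
    with the cluster seed so negatively correlated members do not cancel."""
    scores = []
    for members in clusters:
        seed = z[members[0]]
        aligned = []
        for name in members:
            sign = 1.0 if corr(z[name], seed) >= 0 else -1.0
            aligned.append([sign * v for v in z[name]])
        scores.append([mean(vals) for vals in zip(*aligned)])
    return scores


# Toy data (hypothetical): two latent factors, three noisy copies of each.
random.seed(0)
n = 300
f1 = [random.gauss(0, 1) for _ in range(n)]
f2 = [random.gauss(0, 1) for _ in range(n)]
columns = {}
for j in range(3):
    columns[f"a{j}"] = [v + random.gauss(0, 0.3) for v in f1]
    columns[f"b{j}"] = [v + random.gauss(0, 0.3) for v in f2]

clusters, z = cluster_variables(columns)  # Y is never used here
scores = summary_scores(clusters, z)
print(clusters)
print(len(scores), "summary scores of length", len(scores[0]))
```

The point of the sketch is that the outcome never enters the reduction step, so the summary scores can later go into a small model for Y without the reduction having been tuned to the response. Principal components or sparse principal components within each cluster would be a natural replacement for the simple mean used here.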

Once you build a stable low-dimensional model (say using 10 summary scores) there are several ways to interpret the results on the original scales.

Dear Dr. Harrell,

I feared that my sample would be too small for my goal. In this regard, I appreciate your suggestion of first reducing the dimensionality of the dataset with clustering methods. However, I am afraid that, in this particular case, such a strategy might be inconvenient, as it would reduce the interpretability of the resulting model. Therefore, I would prefer to try selecting a reduced subset of predictors (around 10), although this strategy might not work. In any case, I will check your RMS book for possible solutions.