Data transformation before fitting regularized time-to-event models

Elisabetta · May 23, 2022, 1:47pm

Dear all,
I am writing because I would like to fit a regularized model to a data set with a large number of biomarkers (microRNAs) relative to the number of observations and events, and a time-to-event endpoint. We performed a miRNome analysis using RNA-seq technology, filtered the data lightly before normalization, and finally normalized the data using the TMM method included in the edgeR R package and computed CPM.

To proceed with the mainstream statistical analyses, I would like to transform the CPM data using a variance-stabilizing technique to break the relationship between the variance and the mean that is still present.
Do you have any suggestions for doing that? Is there a package for doing so, given the type of data I have and the package I used to normalize the data?

Thank you very much in advance for any help you can give me.

Elisabetta

f2harrell · May 23, 2022, 2:02pm

Normalization choices seem to be quite arbitrary, with lots of people relying on standard deviations even for asymmetric data. One thing to check on: did the “filtering before normalization” involve any feature selection? Using regularization (better called penalization) methods requires the full candidate feature set, and doesn’t work if any features are not seen by the penalization method.

Elisabetta · May 23, 2022, 2:52pm

From the few articles I have read, especially those on the analysis of miR-seq data in which many miRNAs have very skewed distributions, the choice of transformation seems to be a critical step that affects the results in a non-trivial way. Moreover, sequencing data still show a dependence between the mean and the variance even after the raw counts are normalized. So I was wondering whether it would be better to apply a variance-stabilizing normalization in place of a log2(x+c) transformation.

However, using the edgeR package I could not find a way to apply such a transformation, whereas it is possible using the DESeq package.
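
To make the log2(x+c) option concrete, here is a minimal Python sketch using only the standard library. It assumes plain library-size scaling for CPM rather than TMM-adjusted library sizes, and the function names, toy counts, and offset c = 1 are illustrative assumptions, not values from this analysis. It is not a variance-stabilizing transformation in the DESeq sense; it only shows the simple offset-log alternative.

```python
import math

def cpm(counts, lib_sizes):
    """Counts per million; rows are features (miRNAs), columns are samples.

    Assumption: lib_sizes are plain library sizes, not TMM-adjusted ones.
    """
    return [[1e6 * c / n for c, n in zip(row, lib_sizes)] for row in counts]

def log2_offset(values, c=1.0):
    """Apply log2(x + c) element-wise; c is an arbitrary illustrative offset."""
    return [[math.log2(v + c) for v in row] for row in values]

# Toy data (hypothetical): 2 miRNAs measured in 3 samples.
counts = [
    [10, 20, 40],
    [0, 5, 10],
]
lib_sizes = [1_000_000, 2_000_000, 4_000_000]

cpm_values = cpm(counts, lib_sizes)
log_cpm = log2_offset(cpm_values, c=1.0)
```

The offset c keeps zero counts finite, but its value is a free choice, and small values of c stretch the low-count end of the scale, which is one reason the choice of transformation matters for low-expressed miRNAs.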

Filtering was applied before normalization in an unsupervised way, that is, by eliminating the miRNAs whose third quartile was less than 5 counts. Is this wrong?
In this kind of experiment, there are a lot of biomarkers showing very low counts for few individuals (in my case, with the filter above, more than 50%). I know that the choice of how to filter can be arbitrary, but I am not sure that keeping all the miRNAs is free from other problems. For example, if elastic net selects such miRNAs, their detection, and thus their validation in an independent set of data, may be problematic because of their very low expression.
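
For clarity, the filtering rule described above can be sketched in Python as follows. This is a minimal sketch, assuming raw counts stored per miRNA and a third quartile computed by linear interpolation between order statistics; the miRNA names and counts are hypothetical. Note that the rule looks only at the counts and never at the time-to-event outcome.

```python
from statistics import quantiles

def third_quartile(values):
    """Third quartile via linear interpolation (the 'inclusive' method)."""
    return quantiles(values, n=4, method="inclusive")[2]

def filter_low_counts(counts_by_feature, min_q3=5):
    """Keep features whose third quartile of raw counts is at least min_q3.

    The outcome is not used anywhere, so the filter is unsupervised.
    """
    return {
        name: row
        for name, row in counts_by_feature.items()
        if third_quartile(row) >= min_q3
    }

# Toy raw counts (hypothetical): 3 miRNAs measured in 8 samples.
raw_counts = {
    "miR-a": [0, 0, 1, 2, 3, 8, 9, 12],
    "miR-b": [0, 0, 0, 0, 1, 1, 2, 30],
    "miR-c": [4, 4, 5, 5, 5, 5, 6, 6],
}

kept = filter_low_counts(raw_counts, min_q3=5)
kept_names = sorted(kept)
```

In this toy example "miR-b" is dropped even though one sample has a high count, which illustrates how a quartile-based rule treats miRNAs expressed in only a few individuals.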
Any suggestions would be welcome.
E

f2harrell · May 23, 2022, 9:21pm

There are statisticians devoted solely to the analysis of complex mRNA data. I hope we hear from some of them. This stuff is pretty complex.