So in the world of omics (genomics, proteomics, metabolomics) the goals are often hypothesis generation rather than confirmation. Yet in this field p-values are used everywhere and there seems to be a massive over reliance on them because they are “simple”. The problem is when the data is itself a “black box” it is very difficult to verify assumptions like linearity in particular.

Its difficult to know whether one should log transform the gene/protein/biomarker or not for every single marker analyzed 1 at a time. I have also seen it give different sets of “significant” results. Unfortunately doing some sort of test on each to check the assumptions and automate the procedure would invalidate the actual model inference.

I like Bayesian methods but they are difficult to explain and computationally not as feasible for like tens of thousands of features (Xs). Usually we look at the features 1 at a time too. I have looked into GAMs as well, but the problem is these are also computationally expensive.

Is there any good way to analyze this sort of data or is it the case that high dimensional data like this does not have any good solutions? I also find it kind of ironic that standard linear or logistic models are used for “interpretability” but the disparity in results with different methods of transformation on the feature exposure of interest even within these is anything but interpretable even compared to just throwing the whole thing into a random forest and taking the top features.

There is a small literature, which unfortunately I do not have in front of me, where multiple variables are used to represent each candidate feature in contexts like yours. For example each variable can be expanded into a restricted cubic spline function or just use a quadratic function of log predictor values. Computationally it is fastest to test the multiple degree of freedom measures of association these induce by the use of Rao efficient score tests which do not require the use of an iterative maximum likelihood method.

The way I’ve handled this is to fit several models (e.g. with/without a variable, transformation, interaction, etc.) And then look for genes/proteins/metabolites that consistently end up at the top of the ranked lists. The point of these omics experiments was usually to select a few genes to take forward for further experiments, so we wanted the genes to be robust to the model specification. Your use case might be different. If results differ considerably between models for some genes then it’s worth plotting the data to see what’s going on, but this is just for a small subset of the data.

Another option is to calculate a few metrics for each gene to assess the quality of the fit, usually based on the residuals. For example, max(abs(range(resid))) to pick-up outliers, or the p-value from a Shapiro-Wilk test for normality. I treat the p-value as a rough description of the normality of the residuals. The metrics for each gene are then saved in the results table and it’s easy to check if a top gene has anything usual that requires further investigation. The distribution of the Shapiro-Wilk p-values is also useful to assess if transformations are required and/or working as intended.

I personally don’t like multi-stage methods, and as far as entertaining various transformations, that does not honestly report model uncertainty. I’d rather see a parameter devoted to everything we don’t know which is why I suggested using a multiple degree of freedom test for each feature. An even more rational approach would use ridge regression where there are two penalty parameters: one for \log(X) terms and one for [\log(X)]^2 terms, the latter penalty probably being higher.

It may not solve your specific problem, but Bradley Efron has devoted a lot of attention to the problems of large scale testing and estimation in an Empirical Bayes framework, that builds upon the frequentist tools like p values. It should help clarify the problem. He has co-authored 2 monographs (at the top of this list). A number of case studies involve genetics and biochemistry specifically.

Yea having several models and just picking the best that overlap is an option though doesn’t seem very rigorous.

The ridge approach is something I have thought about but is there any way at all to get p-values for the model without using bootstrap? G comp/marginal effects could be used to obtain an effect size at least

With the unpenalized multiple DoF test I haven’t heard of it so will have to look into that too, in this case is there any effect size it reports or just a p value? I guess one could calculate a marginal effect size but not sure if the pvalue for this would differ from that of the DoF test.

While this is primarily a statistical question, from the clinical point of view there are multiple problems with omics analyses that I believe make it very difficult to interpret results, even when the statistical approach is robust.
a) candidate selection for omics analysis is poorly discussed. If there are 10 papers with whole-exome sequencing, correlating with outcomes, they will usually be convenience samples rather than truly representative of the population.
b) even if the p is restricted, because of low n, most individual comparisons are underpowered and therefore useless as a triaging step towards future studies. Some less common but very useful biomarkers may never be identified in small cohorts.
c) most studies are exploratory/‘discovery’ without any a priori hypothesis. Therefore interpreting their p values are a challenge.
d) I doubt if the majority of reviewers of such papers spend time looking at statistical aspects of the analysis. They are usually scientists who are focusing on lab methods and literature comparison.
I wonder, as a clinician, if such studies should restrict themselves to clustering, and then use prospective studies to see the impact of the clusters on outcomes.

You summarized the problems well. The result is a combination of non-reproducible research and lack of clinical/public health usefulness. The low frequency with which researchers pay any attention whatsoever to false negative rates is stunning. They seem to be so proud of their control of false discovery rates that they forget that power is a huge issue.