So in the world of omics (genomics, proteomics, metabolomics), the goal is often hypothesis generation rather than confirmation. Yet p-values are used everywhere in this field, and there seems to be a massive over-reliance on them because they are "simple". The problem is that when the data is itself a "black box", it is very difficult to verify modeling assumptions, linearity in particular.
It's difficult to know whether one should log-transform the gene/protein/biomarker when every marker is analyzed one at a time, and I have seen the choice of transformation produce different sets of "significant" results. Unfortunately, running some assumption check on each marker to automate the choice would invalidate the downstream inference.
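To make that concrete, here is a minimal toy sketch of the kind of per-marker screen I mean (simulated data with made-up sizes, a plain Welch t-test per marker with Benjamini-Hochberg correction; not my actual pipeline). The same screen on raw vs. log-scale intensities can flag noticeably different sets of markers:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n, p = 60, 5000                                      # samples x features (hypothetical sizes)
group = rng.integers(0, 2, size=n)                   # binary condition / phenotype
X = rng.lognormal(mean=0.0, sigma=1.0, size=(n, p))  # skewed, omics-like intensities
X[group == 1, :50] *= 1.5                            # a handful of truly shifted markers

def screen(values):
    """Welch t-test per feature, then Benjamini-Hochberg at 5% FDR."""
    pvals = np.array([
        stats.ttest_ind(values[group == 0, j], values[group == 1, j],
                        equal_var=False).pvalue
        for j in range(values.shape[1])
    ])
    reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    return set(np.where(reject)[0])

hits_raw = screen(X)            # screen on the raw scale
hits_log = screen(np.log2(X))   # identical screen on log2 intensities
print(len(hits_raw), len(hits_log), len(hits_raw ^ hits_log))  # the sets rarely agree exactly
```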
I like Bayesian methods, but they are difficult to explain and not computationally feasible for tens of thousands of features (Xs), especially since we usually look at the features one at a time. I have looked into GAMs as well, but those are also computationally expensive at this scale.
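The kind of per-feature GAM loop I have in mind looks roughly like this (a sketch using pygam on simulated data; the sizes and the 20k extrapolation are just placeholders to show how the cost scales):

```python
import time
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(2)
n, p = 60, 200                        # small p just to time the per-feature loop
y = rng.normal(size=n)                # stand-in continuous outcome
X = rng.lognormal(size=(n, p))        # skewed, omics-like intensities

start = time.time()
# One smooth term per marker, fit independently, one marker at a time
fits = [LinearGAM(s(0)).fit(X[:, [j]], y) for j in range(p)]
per_fit = (time.time() - start) / p
print(f"~{per_fit:.2f} s per marker -> ~{per_fit * 20000 / 3600:.1f} h for 20k markers")
```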
Is there any good way to analyze this sort of data, or does high-dimensional data like this simply not have good solutions? I also find it somewhat ironic that standard linear or logistic models are chosen for "interpretability", yet the disparity in results across different transformations of the exposure of interest is anything but interpretable, even compared to just throwing everything into a random forest and taking the top features.
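For what it's worth, the random-forest comparison I'm alluding to is nothing fancier than this (sklearn, simulated data, made-up sizes). Because tree splits depend only on each feature's ordering, the top-feature list is essentially unaffected by whether I log the intensities first, which is exactly the stability the univariate screens above seem to lack:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n, p = 60, 2000
y = rng.integers(0, 2, size=n)        # binary phenotype
X = rng.lognormal(size=(n, p))        # skewed intensities
X[y == 1, :30] *= 1.5                 # a few truly shifted markers

def top_features(values, k=20):
    """Rank features by impurity importance and keep the top k."""
    rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(values, y)
    return set(np.argsort(rf.feature_importances_)[-k:])

# Same forest on raw vs. log2 intensities: the top-k sets coincide (or nearly so),
# since a monotone transform does not change any split ordering.
print(len(top_features(X) & top_features(np.log2(X))))
```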