I’m stuck on a consulting problem. I’ve been asked to help with some ‘data analysis’ on a large observational study with the following features:
- 1000s of noisy observations, each recording the outcome and 100s of variables thought to be associated with it in some way.
- Although there is a large number of observations, many variables are measured at a group level with only 10s of groups, so for those variables it is really 100s of variables with 10s of observations.
- Very little knowledge of the causal structure of the variables. Whenever I ask, it’s always ‘yes, X could plausibly affect Y (or Z, or W, etc.), but possibly not’.
- The outcome is not actually defined, and there are multiple possible outcomes (this is the least of my problems, however, and I won’t address it further).
- I have been asked to help ‘find associations to generate hypotheses’.
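To make the group-level concern concrete, here is a minimal simulated sketch. The sizes (20 groups, 100 observations per group, 200 group-level variables) are made up for illustration and are not taken from the study. Every variable is pure noise, unrelated to the outcome by construction, so any correlation it shows is spurious.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_per_group, n_vars = 20, 100, 200  # hypothetical sizes

# Group-level predictors: pure noise, unrelated to the outcome by construction.
X_group = rng.normal(size=(n_groups, n_vars))

# Individual outcomes: a group random effect plus individual noise.
groups = np.repeat(np.arange(n_groups), n_per_group)
group_effect = rng.normal(size=n_groups)
y = group_effect[groups] + rng.normal(size=groups.size)

# For a group-level variable the usable information is in the group means,
# so the effective sample size is n_groups, not len(y).
y_group = np.array([y[groups == g].mean() for g in range(n_groups)])

# Pearson correlation of each null variable with the group-mean outcome.
Xc = X_group - X_group.mean(axis=0)
yc = y_group - y_group.mean()
r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

max_abs_r = np.abs(r).max()
print(f"largest |r| among {n_vars} null variables: {max_abs_r:.2f}")
```

With so few groups and so many candidates, some null variables will look strongly associated with the outcome by chance alone, which is the practical meaning of ‘100s of variables with 10s of observations’.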
Question: what information can be extracted from a study of this sort?
- We have no specific hypotheses to test, so methods for finding adjustment sets given a particular exposure seem useless, unless we pick candidate exposures and use some kind of confounder selection method (DAG etc.).
- My suggestion is to use a shrinkage method, e.g. Lasso (Bayesian or not), on all variables. This is tricky because of the hierarchical nature of the data, the sheer number of variables, and the unknown associations between the variables.
- What I really want to say is that there’s not much beyond exploratory analysis that can be done here.
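For reference, the Lasso idea could be sketched roughly as follows, using scikit-learn on simulated data. Everything here is an assumption for illustration: the sizes, the two ‘true’ variables, and the choice of `GroupKFold` so that the penalty is tuned on held-out groups rather than held-out rows. It does not model the hierarchy itself (no random effects); it only stops members of one group from appearing in both training and validation folds.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_groups, n_per_group, n_vars = 40, 50, 100  # hypothetical sizes

groups = np.repeat(np.arange(n_groups), n_per_group)

# Group-level variables: one value per group, repeated for every member.
X = rng.normal(size=(n_groups, n_vars))[groups]

# Simulated outcome: only the first two variables matter,
# plus an unmodelled group effect and individual noise.
group_effect = rng.normal(scale=0.5, size=n_groups)[groups]
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + group_effect + rng.normal(size=groups.size)

# Standardise so the L1 penalty treats all variables on the same scale.
X_std = StandardScaler().fit_transform(X)

# Folds keep whole groups together, so the penalty is chosen by
# how well the model predicts groups it has not seen.
folds = list(GroupKFold(n_splits=5).split(X_std, y, groups))
model = LassoCV(cv=folds).fit(X_std, y)

selected = np.flatnonzero(model.coef_)
print("penalty chosen:", model.alpha_)
print("variables kept:", selected)
```

Even in this idealised setup the selected set is a list of candidates to follow up, not a set of established associations, and how many noise variables survive depends on the simulated draw.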
Am I too pessimistic? Are there good examples of papers based on this type of study design?
Thanks in advance.