I’m stuck on a consulting problem. I’ve been asked to help with some ‘data analysis’ on a large observational study with the following features:
Thousands of noisy observations of both the outcomes and hundreds of variables thought to be associated in some way with the outcome.
Although there are many observations, many of the variables are measured at a group level with only tens of groups, so effectively we have hundreds of variables with tens of observations.
Very little knowledge of the causal structure of the variables. Whenever I ask, it's always 'yes, X could plausibly affect Y (or Z, or W, etc.), but possibly not.'
Outcome not actually defined/multiple possible outcomes (this is the least of my problems, however, and I won't address it further).
I have been asked to help ‘find associations to generate hypotheses’.
Question: what information can be extracted from a study of this sort?
Some thoughts:
We have no specific hypotheses to test, so methods for finding adjustment sets given a particular exposure seem useless, unless we pick candidate exposures and use some kind of confounder-selection method (DAGs, etc.).
My suggestion is to use a shrinkage method, e.g. the Lasso (Bayesian or not), on all variables. This is tricky because of the hierarchical nature of the data, the sheer number of variables, and the unknown associations among them.
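To make the shrinkage suggestion concrete, here is a minimal sketch of a cross-validated Lasso over many candidate predictors, using simulated data. All variable names and the data-generating process are invented for illustration; in particular, the hierarchical (group-level) structure of the real study is not modeled here, and a mixed model or a grouped penalty would be needed to handle it properly.

```python
# Sketch: cross-validated Lasso as a variable-screening device.
# Simulated data only; the real study's grouping structure is NOT modeled.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 500, 100                            # many observations, many variables
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]     # only a few true signals
y = X @ beta + rng.normal(scale=2.0, size=n)

# Standardize so the penalty treats all variables comparably
Xs = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=0).fit(Xs, y)

selected = np.flatnonzero(lasso.coef_ != 0)
print(f"{len(selected)} of {p} variables kept")
```

Even in this idealized setting the Lasso typically keeps some pure-noise variables alongside the true signals, which is one reason the output should be read as hypothesis generation rather than confirmation.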
What I really want to say is that there’s not much beyond exploratory analysis that can be done here.
Am I too pessimistic? Are there good examples of papers based on this type of study design?
I think you’ve conceptualized the problem well and thought of the key issues. My take: Not once in my career has it been fruitful to work with an investigator who says to me “I have data and would like to get a paper out of it.”
I would begin by cleaning the data and running many data checks on them; this will get you thinking about the data and what might be worth exploring, or how to present them.
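As a rough illustration of the kinds of routine checks meant above, here is a toy example in pandas. The column names and plausibility thresholds are invented; the point is simply to tabulate duplicates, missingness, implausible values, and the group structure before any modeling.

```python
# Sketch: a few routine data checks on a toy data frame.
# Column names and the age cutoff are hypothetical examples.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "B", "B", "B"],
    "age": [34, 34, 29, 120, np.nan],      # 120 is an implausible value
    "outcome": [1.2, 1.2, 0.7, 3.4, 0.9],
})

checks = {
    "n_duplicate_rows": int(df.duplicated().sum()),
    "n_missing_age": int(df["age"].isna().sum()),
    "n_implausible_age": int((df["age"] > 110).sum()),
    "n_groups": int(df["site"].nunique()),
}
print(checks)
```

Writing such checks down as code (rather than eyeballing spreadsheets) also leaves an auditable record of what was inspected before the analysis plan was fixed.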
Why not send them away and ask them to come back with a testable and important hypothesis? Get them to justify the hypothesis on the basis of the existing literature. Check that the dataset has the right variables to have a chance of refuting it. You then help them refine the analysis plan. Then register the whole thing on AsPredicted.org or similar. Certify that you've not looked at the data, and get it date-stamped. Then set to work.
And don’t forget that @jimgthornton’s pre-registration idea is very important. At the very least have a written, dated, signed, statistical analysis plan before analysis begins.
As many have pointed out, you'd need to start with a research question/hypothesis. The first pillar of scientific inquiry, at least in the traditional sense, is to define the problem you are trying to answer. I have encountered similar scenarios in my line of work, where I am presented with a database or a method and asked to conduct a study. I always push back, remind them that things do not work that way, and explain why. In the majority of situations, people do understand. The majority want to do good research but do not know how. Hence an important part of our job is to educate.