Observational study with no hypotheses

Hi everyone,

I’m stuck on a consulting problem. I’ve been asked to help with some ‘data analysis’ on a large observational study with the following features:

  • Thousands of noisy observations of the outcome(s), plus hundreds of variables thought to be associated in some way with the outcome.
  • Although there are many observations, many variables are measured at the group level with only tens of groups, so effectively we have hundreds of variables with tens of observations.
  • Very little knowledge of the causal structure of the variables. Whenever I ask, it’s always ‘yes, X could plausibly affect Y (or Z, or W, etc.), but possibly not.’
  • The outcome isn’t actually defined; there are multiple possible outcomes (this is the least of my problems, though, so I won’t address it further).
  • I have been asked to help ‘find associations to generate hypotheses’.

Question: what information can be extracted from a study of this sort?

Some thoughts:

  1. We have no specific hypotheses to test, so methods for finding adjustment sets given a particular exposure seem useless, unless we pick candidate exposures and use some kind of confounder-selection method (DAG-based, etc.).
  2. My suggestion is to use a shrinkage method, e.g. the lasso (Bayesian or not), on all variables. This is tricky because of the hierarchical nature of the data, the sheer number of variables, and the unknown associations among them.
  3. What I really want to say is that there’s not much beyond exploratory analysis that can be done here.
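To make point 2 concrete, here is a minimal sketch of lasso-based variable screening using scikit-learn's `LassoCV` on simulated data. Everything here (the data, the effect sizes, the variable counts) is made up for illustration, and note that plain lasso ignores exactly the grouped/hierarchical structure the thread worries about — it treats all observations as independent.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 2000, 200  # thousands of observations, hundreds of candidate variables

# Simulated data: only the first 5 variables truly relate to the outcome
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [1.5, -1.0, 0.8, 0.6, -0.5]
y = X @ beta + rng.normal(scale=2.0, size=n)

# Standardize so the single penalty is comparable across variables,
# then let cross-validation choose the shrinkage strength
X_std = StandardScaler().fit_transform(X)
fit = LassoCV(cv=5, random_state=0).fit(X_std, y)

selected = np.flatnonzero(fit.coef_)
print("variables retained:", selected)
```

The cross-validated penalty typically retains the truly associated variables plus some noise variables, which is why lasso output here is at best a screening step for hypothesis generation, not a set of confirmed associations.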

Am I too pessimistic? Are there good examples of papers based on this type of study design?

Thanks in advance.


I’m no statistician, but this sounds a lot like data dredging.


This is exactly what I’m worried about. I’m used to hypotheses and direct, if complicated, questions.

I think you’ve conceptualized the problem well and thought of the key issues. My take: Not once in my career has it been fruitful to work with an investigator who says to me “I have data and would like to get a paper out of it.”


Thanks for that, I appreciate it.

FYI: I thought this recent publication was great; it might be worth passing along to the people you’re consulting for…


I would begin by cleaning the data and running many data checks on them. This will get you thinking about the data and about what might be worth exploring, or how to present them.
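As a concrete illustration of the kind of mechanical checks meant here, below is a small sketch using pandas on simulated data. The column names, the plausible-range rule for age, and the injected problems (one impossible value, one missing value) are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Simulated stand-in for the study data: 20 groups of 50 observations
df = pd.DataFrame({
    "group": np.repeat(np.arange(20), 50),
    "age": np.r_[rng.normal(50, 10, 999), [250.0]],  # one impossible value
    "outcome": rng.normal(size=1000),
})
df.loc[5, "age"] = np.nan  # one missing value

# Basic sanity checks: row counts, duplicates, missingness, range violations
checks = {
    "n_rows": len(df),
    "n_duplicate_rows": int(df.duplicated().sum()),
    "missing_per_column": df.isna().sum().to_dict(),
    "age_out_of_range": int(((df["age"] < 0) | (df["age"] > 120)).sum()),
    "n_groups": df["group"].nunique(),
}
print(checks)
```

Checks like these surface data-entry errors and structural quirks (e.g. how many groups you really have) before any modelling starts.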


Why not send them away and ask them to come back with a testable and important hypothesis? Get them to justify the hypothesis on the basis of the existing literature. Check that the dataset has the right variables to have a chance of refuting it. You then help them refine the analysis plan. Then register the whole thing on AsPredicted.org or similar. Certify that you’ve not looked at the data, and get it date-stamped. Then set to work.


Thanks. The analysis has already begun, but yes, I’m at the point of insisting that they give me a hypothesis to test.

And don’t forget that @jimgthornton’s pre-registration idea is very important. At the very least have a written, dated, signed, statistical analysis plan before analysis begins.

Tell your client the question (hypothesis) comes first, unless he/she is a big fan of the Texas Sharpshooter Fallacy…


As many have pointed out, you’d need to start with a research question/hypothesis. The first pillar of scientific inquiry, at least in the traditional sense, is to define the problem you are trying to answer. I have encountered similar scenarios in my line of work, where I am presented with a database or a method and asked to conduct a study. I always push back, remind them that things do not work that way, and explain why. In the majority of situations, people do understand. Most want to do good research but do not know how. Hence an important part of our job is to educate.

Good luck maneuvering this situation.