Cause selection while modeling a Graphical Causal Model

Ramesh_Iyer · January 17, 2021, 5:13pm

Hi There!

I have recently started toying around with causal models, specifically the kind from the Judea Pearl school of thought. In practical applications, however, I find that an outcome has numerous causes. Since the approach (and libraries such as DoWhy & ananke) recommends starting with a partial causal model, is it a good idea to model only the top few causes that have a strong correlation with the outcome? Are there any pitfalls to this approach?

Here is an example:
I want to measure the causal impact of a couple of key weather-related attributes on the number of customers visiting a store. The weather forecast has several attributes, such as:
‘date’, ‘week_no’, ‘temperature’, ‘dew_point’,
‘pressure’, ‘ground_pressure’, ‘humidity’, ‘clouds’, ‘wind_speed’,
‘wind_deg’, ‘rain’, ‘snow’, ‘ice’, ‘fr_rain’, ‘convective’,
‘snow_depth’, ‘accumulated’, ‘hours’, ‘rate’

If my data shows that, historically, ‘week_no’, ‘temperature’, ‘dew_point’, and ‘pressure’ have the strongest correlation with store traffic (the number of customers visiting a store), is it alright to create a graphical model using just these?

In reality, I have 23-30 additional causes apart from weather that would also have to be added to the model.

f2harrell · January 17, 2021, 9:11pm

How do you know if the correlation is strong or not? It’s not proper to use the data to determine that.

Ramesh_Iyer · January 18, 2021, 1:59pm

Thanks for your reply! I have edited my post with an example. Please can you confirm your opinion based on that?

f2harrell · January 18, 2021, 2:16pm

You are using the data to select the model. Don’t.
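
To illustrate why correlation-based selection can mislead, here is a minimal simulation sketch. It is not from the thread: the variable `season`, the causal structure, and all coefficients are made up for illustration. In this toy world `temperature` has no effect on `traffic` but is strongly correlated with it through `season`, while `rain` has a real effect but only a weak correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical structure (an assumption, not the poster's real data):
#   season -> temperature, season -> traffic, rain -> traffic
season = rng.normal(size=n)
temperature = 0.9 * season + rng.normal(scale=0.5, size=n)  # no effect on traffic
rain = rng.normal(size=n)
traffic = 2.0 * season - 0.3 * rain + rng.normal(size=n)

# Correlation screening: temperature looks important, rain looks negligible.
corr_temp = np.corrcoef(temperature, traffic)[0, 1]
corr_rain = np.corrcoef(rain, traffic)[0, 1]

# Naive regression of traffic on temperature alone (confounded by season).
X_naive = np.column_stack([np.ones(n), temperature])
naive_slope = np.linalg.lstsq(X_naive, traffic, rcond=None)[0][1]

# Regression that adjusts for the confounder chosen from the assumed graph.
X_adj = np.column_stack([np.ones(n), temperature, rain, season])
_, adj_temp, adj_rain, adj_season = np.linalg.lstsq(X_adj, traffic, rcond=None)[0]

print(f"corr(temperature, traffic) = {corr_temp:.2f}")
print(f"corr(rain, traffic)        = {corr_rain:.2f}")
print(f"naive temperature slope    = {naive_slope:.2f}")
print(f"adjusted temperature slope = {adj_temp:.2f}")
print(f"adjusted rain slope        = {adj_rain:.2f}")
```

Under these assumptions, screening by correlation would keep `temperature` (which does nothing) and drop `rain` (which matters), and a model without `season` would credit temperature with a large effect. The graph has to come from subject-matter knowledge; the data cannot tell these cases apart by correlation alone.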