Cause selection while modeling a Graphical Causal Model

Hi There!

I have recently started toying around with causal models, specifically the kind from the Judea Pearl school of thought. In practical applications, however, I find that an outcome has numerous causes. Since the approach (and libraries such as DoWhy and ananke) recommends starting with a partial causal model, is it a good idea to model only the top few causes that have a strong correlation with the outcome? Are there any pitfalls to this approach?

Here is an example:
I want to measure the causal impact of a couple of key weather-related attributes on the number of customers visiting a store. The weather forecast has several attributes, such as:
‘date’, ‘week_no’, ‘temperature’, ‘dew_point’,
‘pressure’, ‘ground_pressure’, ‘humidity’, ‘clouds’, ‘wind_speed’,
‘wind_deg’, ‘rain’, ‘snow’, ‘ice’, ‘fr_rain’, ‘convective’,
‘snow_depth’, ‘accumulated’, ‘hours’, ‘rate’

If my data shows that, historically, ‘week_no’, ‘temperature’, ‘dew_point’ and ‘pressure’ have the strongest correlation with store traffic (number of customers visiting a store), is it alright to create a graphical model using just these?

In reality I have 23-30 additional causes apart from weather, which have to be added to the model.

How do you know if the correlation is strong or not? It’s not proper to use the data to determine that.

Thanks for your reply! I have edited my post with an example. Could you please confirm your opinion based on that?

You are using the data to select the model. Don’t.
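
One possible pitfall of choosing causes by correlation strength can be sketched with synthetic data. Everything here is an assumption made for illustration: `z` is a hypothetical confounder, `t` a hypothetical cause of interest, `y` the outcome, and the coefficients are made up so that the true effect of `t` on `y` is 1.0. In this setup `z` is almost uncorrelated with `y`, so a correlation filter would drop it, yet leaving it out biases the estimated effect of `t`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic structure (assumed for illustration only):
#   z -> t, z -> y, t -> y, with a true effect of t on y equal to 1.0
z = rng.normal(size=n)                       # hypothetical confounder
t = z + rng.normal(size=n)                   # cause of interest, driven by z
y = 1.0 * t - 1.0 * z + rng.normal(size=n)   # z's direct effect cancels its path via t

# Marginal correlation between the confounder and the outcome
corr_zy = np.corrcoef(z, y)[0, 1]

def ols_slopes(columns, target):
    """Least-squares slopes of target on the given columns (intercept dropped)."""
    X = np.column_stack([np.ones(len(target))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef[1:]

effect_without_z = ols_slopes([t], y)[0]     # z left out of the model
effect_with_z = ols_slopes([t, z], y)[0]     # z adjusted for

print(f"corr(z, y)               = {corr_zy:.3f}")
print(f"effect of t, z omitted   = {effect_without_z:.3f}")
print(f"effect of t, z included  = {effect_with_z:.3f}")
```

In this constructed case the estimate with `z` omitted lands near 0.5 rather than 1.0, even though `z` looks irrelevant when ranked by its correlation with `y`. Real data will not cancel this neatly, but the sketch suggests why the graph is better drawn from domain knowledge than from a correlation ranking.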
