Confounding Variable and Spurious Correlation: Key Challenge in Making Causal Inference
The desire to solve problems is perhaps natural to all humans. The inability to identify the causes of a problem, particularly for issues relevant to our personal and social lives, creates some kind of discomfort within our minds. Regardless of the difficulty of the problem and our expertise in the area, we often come up with some cause-and-effect relationship (change in X causes change in Y), and then propose a solution such as: “by changing X (the cause), we can change Y (the outcome/effect).”
Whether in our personal or professional lives, one of the ways we attempt to identify the causes of a problem is by finding events that happen concurrently with the problem of interest. Of course, many things happen around us all the time, some more easily noticeable than others, but we tend to over-emphasize the events that are most immediately available and visible. Psychologists call this tendency the “availability heuristic”.
For example, if the number of crimes in a particular neighborhood increases over a period of time and a demographic shift (a more visible change) takes place within the same period, people may start assuming that the demographic change caused the increase in criminal activity. But is the apparent evidence enough to make such a strong causal inference?
In this article, we will illustrate how making a causal inference from a simple bivariate correlation/association can go horribly wrong in the presence of a confounding variable.
Confounding Variable and Spurious Correlation
A “Confounding Variable” (also known as a “Confounder”) is a variable that causes both the treatment of interest (also called the predictor, independent variable, or explanatory variable) and the outcome.
Due to the presence of a confounder (i.e., a common cause of the treatment and the outcome), we may stumble on a “Spurious Correlation” between the treatment and the outcome even though they have no causal relationship; and if they do have a causal relationship, we may over- or underestimate it.
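One way to see where the over- or underestimation comes from is the textbook linear case. This is only a sketch under assumptions that are not part of the example in this article: a linear model with treatment X, outcome Y, a single confounder Z, and noise ε that is uncorrelated with both X and Z (all of these symbols are introduced here purely for illustration).

```latex
Y = \beta X + \gamma Z + \varepsilon
\quad\Longrightarrow\quad
\frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
  = \beta + \gamma \, \frac{\operatorname{Cov}(X, Z)}{\operatorname{Var}(X)}
```

The left-hand side is the slope we would get from regressing Y on X alone. The second term on the right is the part contributed by the confounder: when β = 0 it is a purely spurious association, and when β ≠ 0 its sign decides whether the true effect is overestimated or underestimated.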
To simplify the concepts, let us look at a hypothetical example (a popular but bizarre one :D)!
A Hypothetical Example
The local government of the city of “Fancy Beach” is trying to solve the problem of shark attacks on swimmers. A data analyst has been appointed to do a data-driven analysis.
The analyst finds a dataset that, among many other variables, has data on the monthly count of shark attacks. He starts investigating the bivariate relationships between the monthly count of shark attacks and other variables one at a time, and suddenly comes up with the following results:
The above graph shows a positive association between the Monthly Ice Cream Sales and the Monthly Shark Attacks. In fact, for a particular month, the analyst can predict the count of shark attacks fairly accurately, given that he knows the amount of ice cream sold in that month.
But does this analysis have any practical utility? Not really, unless the analyst is trying to become a soothsayer!
Although the analyst can make a good prediction, what policy advice would he provide to the city authority? Yes, the model does say that an increase in ice cream sales is positively associated with an increase in shark attacks; however, it would be absurd policy advice for the analyst to suggest that the city can decrease the number of shark attacks by decreasing ice cream sales.
So, what is going on here? Let us look at another graph.
This graph illustrates the reason behind the positive association between the two variables. During the hottest month in the city (month 7 = July), both shark attacks and ice cream sales reach their peak. In general, an increase in temperature means more people visit the beaches, which plausibly explains the increase in the count of shark attacks on swimmers. An increase in temperature also makes people buy more ice cream.
Overall, we have the following situation in this case:
Although we find a strong positive statistical association between monthly ice cream sales and the monthly count of shark attacks, in reality they do not have a causal relationship; in other words, forcing a change in one of these variables would not cause a change in the other. The observed positive association can be attributed to the change in temperature, which is the confounding variable.
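This situation is easy to mimic with a small simulation. The sketch below is in Python and is not the R code or dataset used for the graphs in this article; the coefficients, noise levels, and variable names are all made-up assumptions. Temperature drives both variables, and neither variable appears in the other's equation, yet the two end up strongly correlated.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical data-generating process (all numbers are illustrative):
# temperature is the common cause of both variables.
temperature = [rng.uniform(10, 35) for _ in range(5000)]
ice_cream = [20 * t + rng.gauss(0, 50) for t in temperature]
shark_attacks = [0.3 * t + rng.gauss(0, 1) for t in temperature]

# Ice cream never enters the shark-attack equation, and vice versa,
# but the shared dependence on temperature produces a strong correlation.
r_spurious = pearson(ice_cream, shark_attacks)
print(f"corr(ice cream, shark attacks) = {r_spurious:.2f}")
```

Because we wrote the data-generating process ourselves, we know for certain that the correlation printed here reflects no causal link between the two variables.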
Okay, But, What is the Remedy?
Because a simple bivariate analysis cannot take into account the presence of a confounding variable, we need to consider more variables in the statistical analysis. The goal is to control for enough confounders to block all the backdoor paths between the treatment and the outcome (e.g., Ice Cream Sales ← Higher Temperature → Shark Attacks is the only backdoor path in the above example). Only if we still observe a statistically significant association between the treatment and the outcome after doing so should we make a causal inference from the results of the analysis (actually, making a valid causal inference requires some more assumptions, but let’s keep things simple for now).
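To make “controlling for” a confounder concrete, here is a minimal Python sketch on simulated data (again, not the article's R code or dataset; every number and name is an assumption made for illustration). It removes the part of each variable that is linearly explained by temperature and then correlates what is left, which is the partial correlation given temperature.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, xs):
    """Residuals from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(1)  # fixed seed so the run is reproducible

# Hypothetical process: temperature causes both, no direct link between them.
temperature = [rng.uniform(10, 35) for _ in range(5000)]
ice_cream = [20 * t + rng.gauss(0, 50) for t in temperature]
shark_attacks = [0.3 * t + rng.gauss(0, 1) for t in temperature]

# Raw bivariate correlation: the backdoor path through temperature is open.
r_raw = pearson(ice_cream, shark_attacks)

# Partial correlation: strip the temperature-explained part from each
# variable, then correlate the leftovers. This blocks the backdoor path.
r_adjusted = pearson(residuals(ice_cream, temperature),
                     residuals(shark_attacks, temperature))

print(f"raw correlation:      {r_raw:.2f}")
print(f"adjusted correlation: {r_adjusted:.2f}")
```

In this simulated setup the raw correlation is large while the adjusted one is close to zero, which is what we expect when temperature is the only backdoor path and the relationships are linear.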
How do we control for the effects of confounders? The most reliable approach is to conduct a randomized controlled experiment. If conducting such an experiment is not feasible, we can gather observational data (from natural experiments/surveys), do a multivariable regression analysis (a widely used method, but certainly not the only one), and make some reasonable theoretical assumptions in order to make causal inferences.
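The following Python sketch illustrates why randomization is so reliable. It uses a generic, made-up treatment rather than the ice cream example: the true effect (`TRUE_EFFECT`), the confounder's strength, and the way units pick their own treatment are all hypothetical assumptions chosen for illustration. The same outcome model is used twice, once with self-selected treatment and once with treatment assigned by a coin flip.

```python
import random


def diff_in_means(treated, outcome):
    """Mean outcome of treated units minus mean outcome of control units."""
    t = [y for d, y in zip(treated, outcome) if d == 1]
    c = [y for d, y in zip(treated, outcome) if d == 0]
    return sum(t) / len(t) - sum(c) / len(c)


rng = random.Random(2)  # fixed seed so the run is reproducible
n = 20000
TRUE_EFFECT = 2.0       # assumed causal effect of the treatment

confounder = [rng.gauss(0, 1) for _ in range(n)]


def outcome_for(d, c):
    """Outcome depends on the treatment, the confounder, and noise."""
    return TRUE_EFFECT * d + 3.0 * c + rng.gauss(0, 1)


# Observational world: units with a high confounder value are more
# likely to end up treated, so the confounder causes the treatment.
observed_treat = [1 if c + rng.gauss(0, 1) > 0 else 0 for c in confounder]
observed_y = [outcome_for(d, c) for d, c in zip(observed_treat, confounder)]

# Experimental world: a coin flip decides the treatment, so the
# confounder can no longer influence who gets treated.
random_treat = [rng.randint(0, 1) for _ in range(n)]
random_y = [outcome_for(d, c) for d, c in zip(random_treat, confounder)]

naive = diff_in_means(observed_treat, observed_y)
experimental = diff_in_means(random_treat, random_y)

print(f"true effect:              {TRUE_EFFECT:.2f}")
print(f"observational estimate:   {naive:.2f}")
print(f"randomized estimate:      {experimental:.2f}")
```

In the observational data the simple difference in means mixes the treatment effect with the confounder's effect, while under random assignment it lands near the true effect without our having to measure the confounder at all.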
Before we end, it is worth mentioning another type of variable — known as a “collider” — which, unlike a confounder, should not be controlled for! In fact, controlling for colliders can induce bias (endogenous selection bias) in an analysis. For a quick review of colliders, do visit my article on it:
Endogenous Selection Bias: Another Key Issue in Causal Inference
Caution: Try not to condition on a collider!
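As a quick illustration of that caution, here is a minimal Python sketch with simulated data (the variables, the noise level, and the selection threshold are hypothetical choices made for this example). Two variables are generated independently, and a third variable, the collider, is caused by both. Restricting the analysis to units with a high collider value is one way of conditioning on it.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(3)  # fixed seed so the run is reproducible
n = 20000

# x and y are generated independently of each other.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]

# The collider is a common effect of x and y (plus some noise).
collider = [a + b + rng.gauss(0, 0.5) for a, b in zip(x, y)]

# Without conditioning, x and y are essentially uncorrelated.
r_all = pearson(x, y)

# Conditioning on the collider by keeping only units above a threshold.
selected = [(a, b) for a, b, c in zip(x, y, collider) if c > 1.0]
r_selected = pearson([a for a, _ in selected], [b for _, b in selected])

print(f"correlation in full sample:     {r_all:.2f}")
print(f"correlation in selected sample: {r_selected:.2f}")
```

Within the selected sample a negative association appears between two variables that were generated independently, which is the kind of bias that conditioning on a collider can induce.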
If you are interested in the R Code (confounding.R) and the fake dataset (confounding.csv) used in the example, feel free to visit: https://github.com/vivekbd92/statblogs