Confounding Variable and Spurious Correlation: Key Challenge in Making Causal Inference
The desire to solve problems is natural to all humans. Being unable to identify the causes of a problem, particularly one that touches our personal, professional, or social lives, is uncomfortable. So, regardless of the difficulty of the problem or our expertise in the area, we often come up with a cause-and-effect relationship (a change in X leads to a change in Y) and then propose a solution such as: “by changing X (the cause), we can change Y (the effect).”
We often try to identify the causes of a problem by looking for events that occur at the same time as the problem. Many things happen around us all the time, some more noticeable than others, and we tend to overemphasize the events that are most immediately available and visible. Psychologists call this tendency the “availability heuristic.”
For example, if the number of crimes in a particular neighborhood increases over time and a demographic shift (a more visible change) occurs within the same period, people may start assuming that the demographic change caused the increase in crime. But is this apparent evidence enough to support such a strong causal inference?
In this article, I illustrate how making causal inferences based on a simple bivariate correlation/association can go horribly wrong in the presence of a confounding variable.
Confounding Variable and Spurious Correlation
A “confounding variable” (also known as a “confounder”) is a variable that causes both the treatment X (also called the predictor, independent variable, or explanatory variable) and the outcome Y (also called the dependent variable).
Because of a confounder (i.e., a common cause of the treatment and the outcome), we may find a “spurious correlation” between the treatment and the outcome even though neither causes the other. If they do have a causal relationship, the confounder can lead us to over- or underestimate it.
To make these concepts concrete, let us look at a hypothetical example (a popular but bizarre one 😃).
A Hypothetical Example
The local government of the city of “Fancy Beach” is trying to solve the problem of shark attacks on swimmers. A data analyst has been appointed to do a data-driven analysis.
The analyst finds a dataset that, among many other variables, has data on the monthly count of shark attacks. He starts investigating the bivariate relationships between the monthly count of shark attacks and the other variables, one at a time, and comes across the following result:
The above graph shows a positive association between Monthly Ice Cream Sales and Monthly Shark Attacks. In fact, the analyst can predict the count of shark attacks for a particular month fairly accurately, provided he knows the amount of ice cream sold that month.
But does this analysis have any practical utility? Not really, unless the analyst is trying to become a soothsayer.
Although the analyst can make a good prediction, what policy advice could he give the city authority? Yes, the model does say that an increase in ice cream sales is associated with an increase in shark attacks; however, it would be absurd for the analyst to advise the city to reduce shark attacks by reducing ice cream sales.
So, what is going on here? Let us look at another graph.
This graph illustrates the reason behind the positive association between the two variables. During the hottest month in the city (month 7 = July), both shark attacks and ice cream sales peak. In general, higher temperatures bring more people to the beach, which plausibly explains the increase in the number of shark attacks on swimmers. Higher temperatures also make people buy more ice cream.
Overall, we have the following situation in this case:
Although we find a robust positive association between monthly ice cream sales and the monthly count of shark attacks, they do not have a causal relationship; in other words, forcing a change in one of these variables would not cause a change in the other. The observed positive association can be attributed to the change in temperature, which is the confounding variable.
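To see how a common cause alone can produce such an association, here is a minimal simulation sketch. All numbers (the seasonal temperature curve, the coefficients, the noise levels) and the function name `simulate_fancy_beach` are made up for illustration and are not taken from the analyst's dataset; shark attacks are modeled as a continuous quantity for simplicity rather than as a true count.

```python
import numpy as np

def simulate_fancy_beach(n_months=240, seed=0):
    """Simulate monthly data in which temperature is the only common cause.

    Every coefficient below is a hypothetical choice for illustration.
    """
    rng = np.random.default_rng(seed)
    month = np.arange(n_months) % 12 + 1  # 1..12, with 7 = July
    # Assumed seasonal temperature that peaks in July.
    temperature = (20 + 10 * np.cos(2 * np.pi * (month - 7) / 12)
                   + rng.normal(0, 2, n_months))
    # Ice cream sales depend on temperature only.
    ice_cream_sales = 100 + 15 * temperature + rng.normal(0, 30, n_months)
    # Shark attacks depend on temperature only; ice cream does not appear here.
    shark_attacks = 0.5 * temperature + rng.normal(0, 2, n_months)
    return temperature, ice_cream_sales, shark_attacks

temperature, ice_cream_sales, shark_attacks = simulate_fancy_beach()

# Bivariate correlation between the two variables that share a cause.
r = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]
print(f"corr(ice cream sales, shark attacks) = {r:.2f}")
```

Ice cream sales never enter the line that generates shark attacks, yet the two series come out strongly correlated, purely because both follow temperature.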
Okay, but What Is the Remedy?
Because a simple bivariate analysis cannot account for a confounding variable, we need to include more variables in the statistical analysis. We should control for enough confounders to block all the backdoor paths between the treatment and the outcome (in the above example, Ice Cream Sales ← Higher Temperature → Shark Attacks is the only backdoor path). Only if a statistically significant association between the treatment and the outcome remains after that adjustment should we draw a causal inference from the results. In reality, making a valid causal inference requires some more assumptions, but let’s keep things simple for now.
How do we control for the effects of confounders? The most reliable approach is to conduct a randomized controlled experiment. If conducting such an experiment is not feasible, we can gather observational data (from natural experiments or surveys), run a multivariable regression analysis (a widely used method, but certainly not the only one), and make some reasonable theoretical assumptions in order to draw causal inferences.
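As a rough sketch of what controlling for the confounder looks like in practice, the snippet below fits two ordinary least squares regressions on simulated data: one with ice cream sales alone and one that also includes temperature. The data-generating numbers and the helper `ols_slopes` are assumptions made for this illustration, and by construction ice cream sales have no effect on shark attacks.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical data: temperature is the only common cause.
temperature = rng.normal(20, 7, n)
ice_cream_sales = 100 + 15 * temperature + rng.normal(0, 30, n)
shark_attacks = 0.5 * temperature + rng.normal(0, 2, n)  # no ice cream term

def ols_slopes(y, *predictors):
    """Return the OLS slope of each predictor (intercept fitted, not returned)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]

# Bivariate model: shark_attacks ~ ice_cream_sales
naive_slope = ols_slopes(shark_attacks, ice_cream_sales)[0]

# Adjusted model: shark_attacks ~ ice_cream_sales + temperature
adjusted_slope = ols_slopes(shark_attacks, ice_cream_sales, temperature)[0]

print(f"slope without adjusting for temperature: {naive_slope:.4f}")
print(f"slope after adjusting for temperature:   {adjusted_slope:.4f}")
```

In this setup the unadjusted slope is clearly positive, while the slope on ice cream sales shrinks to roughly zero once temperature is in the model, because the backdoor path through temperature is blocked.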
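The following sketch illustrates why randomization helps, using a deliberately absurd hypothetical: suppose the city could decide by coin flip whether each month is a "high ice cream sales" month. The numbers, the threshold, and the helper `diff_in_means` are all assumptions for illustration. In the observational world, high-sales months are mostly hot months; in the randomized world, the coin flip is independent of temperature.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Hypothetical data: shark attacks depend on temperature only.
temperature = rng.normal(20, 7, n)
shark_attacks = 0.5 * temperature + rng.normal(0, 2, n)

# Observational world: sales tend to be high when it is hot.
high_sales_observed = (temperature + rng.normal(0, 3, n)) > 20

# Randomized world: a coin flip sets sales, regardless of temperature.
high_sales_randomized = rng.random(n) < 0.5

def diff_in_means(outcome, treated):
    """Mean outcome in treated months minus mean outcome in untreated months."""
    return outcome[treated].mean() - outcome[~treated].mean()

observed_gap = diff_in_means(shark_attacks, high_sales_observed)
randomized_gap = diff_in_means(shark_attacks, high_sales_randomized)

print(f"observational difference in shark attacks: {observed_gap:.2f}")
print(f"randomized difference in shark attacks:    {randomized_gap:.2f}")
```

Under these assumptions, the observational comparison shows a large gap in shark attacks, while the randomized comparison shows a gap close to zero, since the coin flip breaks the link between the treatment and temperature.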
Before we end, it is worth mentioning another type of variable, called a “collider,” which should NOT be controlled for. In fact, controlling for a collider can induce bias (endogenous selection bias) in an analysis. For a quick introduction to colliders, read the following article: