# Why Is Correlation Neither Necessary Nor Sufficient for Causation?

## A Detailed Explanation with Toy Examples

# I. Correlation is “not sufficient” for causation

## (i.e., two “significantly” correlated variables may not have a causal relationship)

In non-experimental data, despite the fact that —

- X and Y are **strongly correlated/associated** (coefficient close to/equal to 1)
- X and Y are **(statistically) significantly associated** (p-value of the coefficient less than 0.05)
- You can almost **perfectly fit a straight line** through the scatter plot of X and Y
- The explanatory variable in your model (X) **explains almost all the variation** in the outcome variable (Y) (i.e., R² close to 1)

X and Y **may have no causal relationship**! Let’s look at two such scenarios:

# Scenario #1

**There are confounding variables (or simply, confounders) that your model did not control for.**

A famous example: *the ice cream sales in a month and the number of people attacked by sharks in the same month are strongly correlated, but there is no reason to believe one of these variables causes the other.*
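To make the confounding story concrete, here is a minimal simulation sketch. It assumes a confounder, monthly temperature, that drives both variables; the temperature variable and every number below are illustrative assumptions, not real data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_months = 500

# Assumed confounder (hypothetical): average monthly temperature.
temperature = rng.normal(loc=20.0, scale=8.0, size=n_months)

# Both variables depend on temperature only; neither depends on the other.
ice_cream_sales = 50.0 + 3.0 * temperature + rng.normal(0.0, 5.0, n_months)
shark_attacks = 2.0 + 0.2 * temperature + rng.normal(0.0, 0.5, n_months)


def residualize(y, z):
    """Return what is left of y after removing a straight-line fit on z."""
    slope, intercept = np.polyfit(z, y, 1)
    return y - (slope * z + intercept)


# Correlation without accounting for the confounder.
raw_corr = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]

# Correlation after removing the linear effect of temperature from both.
adjusted_corr = np.corrcoef(
    residualize(ice_cream_sales, temperature),
    residualize(shark_attacks, temperature),
)[0, 1]

print(f"raw correlation: {raw_corr:.2f}")
print(f"correlation after adjusting for temperature: {adjusted_corr:.2f}")
```

In this toy setup the raw correlation is strong, while the adjusted one hovers near zero, which is what we would expect when the association comes entirely from the common cause.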

I have written a few articles on this; therefore, I am not going to delve deeper into this here.

# Scenario #2

**You controlled for a “collider,” which is the opposite of a confounder: a confounder is a common cause of two variables, whereas a collider is their common effect. And guess what! Two variables, which neither cause each other nor are caused by a common factor, appear to be correlated. You induced bias by controlling for a collider, although your intention was to reduce bias by controlling for a confounder.** 🤦

Here is a simple example to understand the issue (known as endogenous selection bias/collider bias):
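One way to see collider bias in action is a small simulation. The sketch below is an illustrative assumption, not the author's original example: `x` and `y` are generated independently, `collider` is caused by both, and "controlling" is done in its simplest form, by looking only at cases where the collider is high.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible
n = 10_000

# Two independent variables (hypothetical): neither causes the other,
# and they share no common cause.
x = rng.normal(size=n)
y = rng.normal(size=n)

# The collider is a common effect of x and y, plus some noise.
collider = x + y + rng.normal(scale=0.5, size=n)

# In the full sample, x and y are (nearly) uncorrelated.
overall_corr = np.corrcoef(x, y)[0, 1]

# Conditioning on the collider: keep only cases with a high collider value.
selected = collider > 1.0
selected_corr = np.corrcoef(x[selected], y[selected])[0, 1]

print(f"correlation in the full sample: {overall_corr:.2f}")
print(f"correlation among selected cases: {selected_corr:.2f}")
```

Within the selected cases, a low `x` has to be offset by a high `y` (and vice versa) for the case to have been selected at all, so a negative association appears even though none exists in the full sample.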

# II. Correlation is “not necessary” for causation

## (i.e., two uncorrelated variables may have a causal relationship)

The first time I learned about it, I was like, “Good heavens 😲.”

Perhaps this fact is *even more unintuitive* to an ordinary human mind than the earlier one. In fact, I have heard some experienced people say that **“For one variable to cause another, there ‘must be’ a correlation between them.”**

It turns out that the statement is not always correct!

Let’s look at *three distinct scenarios* where **there can be a causal relationship between two variables, although they are not correlated/statistically associated.**

# Scenario #1

You are driving uphill. The slope is steep. There comes a point when you press the gas pedal harder and harder, but the car's speed remains the same.

Let’s make some arbitrary numbers and see the correlation between the force applied to the gas pedal and the speed of the car:

The correlation between force and speed is actually “undefined” here. Why? Because —

- The covariance between the two variables is 0
- The correlation between Force and Speed = (covariance between Force and Speed) / [(standard deviation of Force) × (standard deviation of Speed)]
- As the standard deviation of Speed = 0, we are dividing 0 by 0 in the correlation formula!
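The three steps above can be checked with a few lines of code. The force and speed readings below are made-up values (units left unspecified); the only feature that matters is that force keeps rising while speed stays constant.

```python
import numpy as np

# Hypothetical readings: force on the pedal rises, speed does not change.
force = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
speed = np.array([40.0, 40.0, 40.0, 40.0, 40.0])

# Covariance: average product of deviations from the means.
covariance = np.mean((force - force.mean()) * (speed - speed.mean()))

sd_force = force.std()
sd_speed = speed.std()  # zero, because speed never varies

# Guard the division so the 0/0 case is reported as "not a number".
denominator = sd_force * sd_speed
correlation = covariance / denominator if denominator != 0 else float("nan")

print(f"covariance = {covariance}")
print(f"sd of speed = {sd_speed}")
print(f"correlation = {correlation}")
```

Both the numerator and the denominator are exactly zero here, so the sketch reports `nan` rather than a number.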

Okay, we have an “undefined correlation” between two variables; how can they be causally related? 🤔

# To identify causation between two variables, we need to invoke *“counterfactual thinking.”*

**Imagine yourself in two different worlds. Everything is exactly the same in the two worlds — the same car, the same steepness of the road, and the same you — except in one world, you keep exerting more force on the gas pedal, and in the other, you keep the force constant.**

Here are some more numbers to help us think counterfactually:

Once you invoke the counterfactual worlds, it becomes clear to you that:

# Although in the real world, there was no correlation between the force you exerted on the gas pedal and the speed of the car — i.e., change in one thing, apparently, was not associated with the change in the other — the force did have a causal effect on the speed of the car!
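The two-world comparison can be written down as a tiny sketch. All numbers are hypothetical, including the assumption that the car gradually slows down when the pedal force is held constant on the slope.

```python
# World A: you keep pressing harder; the speed holds steady.
force_world_a = [10, 20, 30, 40, 50]
speed_world_a = [40, 40, 40, 40, 40]

# World B: you hold the pedal force constant.
# Assumed for illustration: the car loses speed on the steep slope.
force_world_b = [10, 10, 10, 10, 10]
speed_world_b = [40, 35, 30, 25, 20]

# Causal effect of the extra force at each moment:
# speed with extra force minus speed without it, everything else equal.
effects = [a - b for a, b in zip(speed_world_a, speed_world_b)]

print(effects)  # [0, 5, 10, 15, 20]
```

Within World A alone, speed never varies, so there is nothing to correlate. The effect only becomes visible when each moment is compared with its counterpart in World B.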

The other two scenarios are a bit more complicated!

Usually, in most social, behavioral, and health science research, we are interested in a treatment’s “average causal effect (ACE)” on an outcome. We estimate that using a linear model. Linear models — in many circumstances — work as a useful approximation of a complicated real-world phenomenon.

**Are there situations where the preference for a linear model and the reliance on ACE is not a good idea?** 🤔

# Scenario #2

Let’s pretend **“I want someone to love me”**.

In other words, I want to know *the causal effect of being loved by that someone on my suffering*.

Love in this world can be measured using a scale that stretches from -5 to +5, in which **-5 is a complete lack of love,** **0 is a neutral point** *(a point of non-attached love, i.e., both in love and not in love),* and **+5 is consummate love**.

Suffering can be measured using a scale that stretches from 0 to 25 in which 0 means a complete lack of suffering *(something like nirvana)* and 25 means utmost suffering *(something like the eternal hell).*

This is how it works in this “hypothetical” world:

- The less they love me, the more I suffer *(i.e., as love moves from 0 to -5, suffering increases from 0 to 25)*
- The more they love me, the more I suffer *(i.e., as love moves from 0 to +5, suffering increases from 0 to 25)*
- I suffer the least *(i.e., reach my Nirvana)* when their love is at the neutral point *(i.e., I am located at the point where love = 0 and suffering = 0)*

Clearly, there is a causal effect of their love on my suffering. However, when the observed Love values are spread symmetrically around 0, the correlation coefficient is 0 *(because the correlation coefficient measures only the strength of the linear relationship between two variables)*. Likewise, the coefficient of Love, estimated using a simple linear regression (ordinary least squares), is 0 *(because a horizontal line is the best linear fit, implying slope = coefficient = 0)*.

If we estimate a simple linear model and judge by the coefficient alone, we would conclude there is no relationship between their love and my suffering.

The *(purported)* true relationship is:

# Suffering = (Love)²

But for the sake of using linear regression, I estimated:

# Suffering = b0 + b1*Love + u

One may say it is indeed the case that, on average, there is no relationship between the two variables. But, we may argue: **why should we settle for the average and ignore the entire picture?**
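A short numerical check of this scenario, as a sketch: it assumes Love is observed once at each integer from -5 to +5 (an assumption made here so that the values are symmetric around 0) and that Suffering follows the purported true relationship exactly.

```python
import numpy as np

# Assumed observations: one at each integer love value from -5 to +5.
love = np.arange(-5, 6, dtype=float)
suffering = love ** 2  # the purported true relationship

# Pearson correlation between Love and Suffering.
correlation = np.corrcoef(love, suffering)[0, 1]

# Simple linear regression: Suffering = b0 + b1 * Love + u
b1, b0 = np.polyfit(love, suffering, 1)

# A quadratic fit, for comparison; coefficients come highest degree first.
quadratic_coefs = np.polyfit(love, suffering, 2)

print(f"correlation = {correlation:.3f}")
print(f"linear fit: b0 = {b0:.3f}, b1 = {b1:.3f}")
print("quadratic fit coefficients:", np.round(quadratic_coefs, 3))
```

The linear fit returns a flat line at the average level of suffering, while the quadratic fit recovers the squared term. The data are the same in both cases; only the assumed functional form differs.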

# Scenario #3

The last scenario is quite intriguing as well!

Let’s pretend you are running an experiment. You randomly assign your study participants into two groups — a treatment group (which receives a treatment X) and a control group (which receives no treatment). You are interested in the causal effect of treatment X on outcome Y.

The participant pool (N=8) has two types of people — green people (N=4) and red people (N=4). Because they are randomly assigned into two equally sized groups, each group — thanks to randomness — (somehow) has two green people and two red people.

This is how this hypothetical world works: *treatment X has a positive effect (+2 units) on red people; however, the same treatment has an equally negative effect (-2 units) on green people.*

As a consequence, the correlation between treatment X and outcome Y is 0. The average causal effect (ACE) — estimated by simple linear regression — is also 0.

This toy example shows that, due to **treatment effect heterogeneity**, *even though you ran an experiment and the treatment does affect the outcome*, if you base your judgment only on the correlation coefficient and/or the coefficient of the predictor in a simple linear model *(i.e., you do not have a theory of the possible heterogeneity between groups)*, you will conclude that the treatment has no effect on the outcome!
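The toy experiment can be reproduced in a few lines. The sketch assumes every participant has the same untreated outcome (a baseline of 10, chosen arbitrarily); the group sizes and the +2/-2 effects follow the description above. With a binary treatment, the slope of a simple linear regression equals the difference in group means, which is what the sketch computes.

```python
import numpy as np

# Eight participants: two red and two green in each arm.
color = np.array(["red", "red", "green", "green",
                  "red", "red", "green", "green"])
treated = np.array([1, 1, 1, 1, 0, 0, 0, 0])

baseline = 10.0  # assumed untreated outcome, identical for everyone
effect = np.where(color == "red", 2.0, -2.0)  # +2 for red, -2 for green
outcome = baseline + treated * effect

# Overall association between treatment and outcome.
correlation = np.corrcoef(treated, outcome)[0, 1]

# Average causal effect: mean outcome of treated minus mean of control.
ace = outcome[treated == 1].mean() - outcome[treated == 0].mean()


def subgroup_effect(c):
    """Difference in mean outcome, treated minus control, within one color."""
    mask = color == c
    treated_mean = outcome[mask & (treated == 1)].mean()
    control_mean = outcome[mask & (treated == 0)].mean()
    return treated_mean - control_mean


red_effect = subgroup_effect("red")
green_effect = subgroup_effect("green")

print(f"correlation = {correlation:.2f}, ACE = {ace:.2f}")
print(f"effect among red = {red_effect:+.1f}, among green = {green_effect:+.1f}")
```

The pooled numbers are zero, yet splitting by color recovers the two opposite effects, which is only possible if you thought to record and examine color in the first place.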

# To conclude, the above examples are quite extreme; nevertheless, they serve a useful purpose by highlighting some worst-case possibilities. Rather than keeping our eyes closed and pretending everything will work out just fine, an awareness of these issues can be extremely helpful for a credible investigation of a real-world phenomenon and useful knowledge generation.

*Footnotes:*

- Can you look at the data and decide whether a linear or non-linear model should be fitted? Unfortunately, if all you have is non-experimental data, going from data to model may not be a good idea. Think about the ice cream sales and shark attacks case. If you create a scatter plot, the two variables seem linearly related. That, just by itself, tells you either nothing or something completely wrong about how the world works. You need a theory first.
- A key difference between experimental and non-experimental data is that the former (in theory) automatically generates counterfactual worlds *(actually, I never realized it until I made an exodus to applied social sciences)*. In high school physics, we learned about Ohm’s law, which says that the current through a conductor between two points is directly proportional to the voltage across the two points. And we saw it first-hand by running experiments. We increased the voltage across two points, and the flow of current increased proportionately, resulting in a linear relationship. But remember two key things: we (1) ran the experiment on the same conductor, and (2) kept the temperature constant. This implies we changed only one thing: the voltage across the two points. In non-experimental data, say survey data on income and years of education, we may have 10,000 observations, but crucially, these are observations of different people. These people have not only different years of education and different income but also many other different characteristics, many of which affect both education and income. This is very different from the Ohm’s law situation, in which we observed the same conductor repeatedly. I do not quite understand why so many people plot two variables from non-experimental data (e.g., income and happiness), find a linear/quadratic relationship, and interpret the relationship as if it were as clear as Ohm’s law. 😔 I have written more on this in a separate article.
