Regression and Causal Inference: How Causal Graphs Helped Me Overcome 3 Key Misconceptions

Vivekananda Das
6 min read · Jun 11, 2021


Photo by Nick Fewings on Unsplash

Regression analysis is mainly used for two purposes: predictive modeling and causal inference. However, in my experience, introductory and intermediate statistics and data science courses focus almost entirely on applying regression for prediction. This makes sense as preparation for a career in the data science industry. After all, under relatively stable real-world conditions, as long as you can predict the value of Y from the value of X (usually several Xs) with reasonable accuracy, you may have little incentive to investigate whether X causes Y, Y causes X, or a common factor Z causes both X and Y.

Things get complicated when we estimate a statistical model without regard to the underlying causal relationships among the relevant variables and then try to interpret the coefficients of its predictors in causal terms. Such practices can lead to a persistent misunderstanding of a phenomenon and to ineffective or less effective policy suggestions.

Perhaps introductory and intermediate statistics and data science courses could do more to provide a fundamental (yet deeper) understanding of causal inference alongside the knowledge of statistical inference. In fact, some of the complicated concepts can be explained easily with causal graphs, also known as Directed Acyclic Graphs or DAGs, as shown in the book “Causal Inference in Statistics: A Primer” by Pearl, Glymour, and Jewell.

In this article, I explain how a basic understanding of causal graphs helped me overcome three misconceptions about causal inference based on regression analysis.

Example

Let’s assume a researcher is trying to investigate the causal effect of a treatment X on an outcome Y. She estimates the following model using some non-experimental data:

Y = b0 + b1 * X + error

Misconception #1: You need to control for all other causes of the Outcome (Y) other than the Treatment (X)

Somehow, many people think that to estimate the causal effect of a treatment X on an outcome Y, you need to control for all other causes of Y; and so, if you have not added all these variables to the model, you cannot suggest that the coefficient of X (b1) is an estimator of the causal effect of X on Y.

In reality, for causal inference, in our model, we need to add only the common causes of the treatment and the outcome. These variables are called “confounding variables” or simply “confounders.”

In our example, let’s pretend that there are only two common causes (A and B) of the treatment and the outcome, and the causal relationships among the variables of interest can be shown in the following causal graph:

Graph I: Illustrating Confounders

In Graph I, e represents all other causes of the outcome Y except A, B, and X. Controlling for A and B (i.e., adding A and B to the model as predictors along with X) closes the two non-causal paths (shown in red; also known as backdoor paths) from X to Y: X ← A → Y and X ← B → Y. Only the causal path X → Y remains open for information to flow from X to Y.

As long as our researcher has data on A, B, X, and Y, she can estimate the causal effect of X on Y (even if she does not have any data on e) from the following model:

Y = b0 + b1 * X + b2 * A + b3 * B + e (where e is the error term)

Given the causal graph is correct, in the above model, b1 is a consistent estimator of the causal effect of X on Y.
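To make this concrete, here is a minimal simulation sketch of Graph I. Everything numeric in it is an assumption made for illustration and does not come from real data: the linear relationships, the normal noise, the sample size, the coefficients, and the true effect of 2.0. The helper `ols` is a hypothetical convenience wrapper around NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process following Graph I:
# A and B each cause both X and Y; "other_causes" plays the role of e.
A = rng.normal(size=n)
B = rng.normal(size=n)
X = 0.8 * A + 0.6 * B + rng.normal(size=n)
other_causes = rng.normal(size=n)  # never observed by the researcher
true_effect = 2.0                  # assumed causal effect of X on Y
Y = true_effect * X + 1.5 * A + 1.0 * B + other_causes


def ols(y, *predictors):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


b1_naive = ols(Y, X)[1]           # backdoor paths through A and B left open
b1_adjusted = ols(Y, X, A, B)[1]  # both backdoor paths closed

print(f"naive b1:    {b1_naive:.3f}")
print(f"adjusted b1: {b1_adjusted:.3f}")
```

In this setup the adjusted coefficient should land close to the assumed true effect even though `other_causes` is never included in the model, while the naive coefficient is pushed upward by the open backdoor paths.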

Of course, we need to make more theoretical assumptions — for example, stable unit treatment value assumption (SUTVA), positivity, treatment effect homogeneity, etc. — for valid causal inference from a model estimated using observational data. For now, we will keep things simple here and focus only on how causal graphs can guide our selection of variables to be included in the model.

Misconception #2: Throw the kitchen sink at your model!

Sometimes, people suggest you add more, and more, and more “control variables” to your model. Apparently, the assumption is that all these control variables are confounders, and if you do not add all of them, you cannot estimate the causal effect.

But are all of these control variables really confounders? If they are, yes, you can add them to your model. However, you need to be cognizant of two other types of variables — mediators and colliders — controlling for which can make your causal estimate inconsistent.

Graph II: Illustrating Mediators

In Graph II, there are three causal paths from X to Y: two indirect paths (X → A → Y and X → B → Y) and a direct path (X → Y). If this causal graph is correct, then A and B are “mediators” (not confounders), and controlling for either A or B will lead to a biased and inconsistent estimation of the total causal effect of X on Y.
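A small simulation sketch of Graph II shows what goes wrong. The coefficients, the normal noise, and the sample size are all assumptions chosen for illustration; with them, the total effect of X on Y is 1.0 (direct) + 0.5 × 2.0 (through A) + 0.7 × 1.0 (through B) = 2.7. The helper `ols` is a hypothetical wrapper around NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical data-generating process following Graph II:
# X affects Y directly and also through the mediators A and B.
X = rng.normal(size=n)
A = 0.5 * X + rng.normal(size=n)
B = 0.7 * X + rng.normal(size=n)
Y = 1.0 * X + 2.0 * A + 1.0 * B + rng.normal(size=n)

direct_effect = 1.0
total_effect = 1.0 + 0.5 * 2.0 + 0.7 * 1.0  # = 2.7 under these assumptions


def ols(y, *predictors):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


b1_total = ols(Y, X)[1]             # no mediators in the model
b1_controlled = ols(Y, X, A, B)[1]  # mediators added as "controls"

print(f"b1 without mediators: {b1_total:.3f}")
print(f"b1 with mediators:    {b1_controlled:.3f}")
```

Without the mediators, b1 should be close to the total effect. With them, b1 shrinks toward the direct effect only, so it no longer answers the question about the total causal effect.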

Graph III: Illustrating Colliders

Another possibility, as shown in Graph III, is that both A and B are common effects of X and Y. In this case, A and B are called “colliders,” and interestingly, if we control for nothing, the non-causal paths (shown in yellow) X → A ← Y and X → B ← Y are automatically closed (that is, no information flows through these paths). However, if we control for A and B, the paths are opened, and we are highly likely to get a biased and inconsistent estimate of the causal effect of X on Y.
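The collider case can be sketched the same way. The linear forms, the coefficients, and the assumed true effect of 1.0 are illustrative assumptions, and `ols` is a hypothetical helper built on NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical data-generating process following Graph III:
# X causes Y, and both X and Y cause the colliders A and B.
true_effect = 1.0
X = rng.normal(size=n)
Y = true_effect * X + rng.normal(size=n)
A = X + Y + rng.normal(size=n)
B = X + Y + rng.normal(size=n)


def ols(y, *predictors):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


b1_unadjusted = ols(Y, X)[1]     # collider paths stay closed
b1_adjusted = ols(Y, X, A, B)[1]  # conditioning on colliders opens them

print(f"b1 controlling for nothing: {b1_unadjusted:.3f}")
print(f"b1 controlling for A and B: {b1_adjusted:.3f}")
```

With these particular coefficients, the unadjusted estimate should sit near the assumed true effect, while the estimate that controls for A and B is badly distorted and even changes sign.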

As a simple rule of thumb, we may remember: if we are interested in estimating the total causal effect of a treatment, we should control for variables that exist before treatment assignment (likely confounders) and should not control for variables that exist after treatment assignment (likely mediators and colliders). So, knowledge of temporal precedence can be helpful in deciding on which variables to control for.

However, there are situations where the above-mentioned rule will not work (for example, in the M-bias case). The problem is that whether a variable is a confounder or a mediator or a collider depends on the path in which it is located; that is, the same variable can be a confounder in one path and a collider/a mediator in another path!
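For intuition on the M-bias case, here is a sketch that assumes the usual M-shaped structure, which is not drawn in this article: two unobserved variables U1 and U2, where U1 causes X, U2 causes Y, and both cause a pre-treatment variable M. The variable names, coefficients, and the true effect of 1.0 are hypothetical, as is the `ols` helper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Assumed M-shaped structure: X <- U1 -> M <- U2 -> Y, plus X -> Y.
# M is measured before treatment, yet it is a collider on this path.
true_effect = 1.0
U1 = rng.normal(size=n)  # unobserved
U2 = rng.normal(size=n)  # unobserved
M = 2.0 * U1 + 2.0 * U2 + rng.normal(size=n)
X = 2.0 * U1 + rng.normal(size=n)
Y = true_effect * X + 2.0 * U2 + rng.normal(size=n)


def ols(y, *predictors):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


b1_unadjusted = ols(Y, X)[1]  # path through M is closed at the collider
b1_adjusted = ols(Y, X, M)[1]  # controlling for M opens that path

print(f"b1 controlling for nothing: {b1_unadjusted:.3f}")
print(f"b1 controlling for M:       {b1_adjusted:.3f}")
```

Here the pre-treatment variable M is exactly the kind of variable the temporal rule of thumb would tell us to control for, yet in this assumed structure doing so moves the estimate away from the true effect.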

The best idea is to figure out all the causal and non-causal paths from X (treatment) to Y (outcome) and then, depending on the situation, decide which variables to control for in the model. I discuss this issue succinctly in another blog post.

Misconception #3: You need to control for all the confounders

Earlier, we mentioned that common causes of the treatment and the outcome are confounders. Now, let’s look at the following causal graph:

Graph IV

In Graph IV, the two non-causal paths between X and Y are X ← A → Y and X ← A ← B → Y. We may claim that both A and B are confounders, as they are common causes of X and Y. Interestingly, in this case, controlling for only A is sufficient: even if we do not have any data on the variable B, we can still estimate the causal effect of X on Y. The reason is that both non-causal paths (shown in red) pass through A, so controlling for A closes both of them, and information from X to Y can flow only through the causal path (shown in green). The key takeaway from this example is that to close all the non-causal paths, we may not always need data on all the confounding variables.
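A simulation sketch of Graph IV illustrates the point. The structure follows the two paths described above (B → A, A → X, A → Y, B → Y, X → Y); the coefficients, the normal noise, and the true effect of 1.5 are assumptions for illustration, and `ols` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Hypothetical data-generating process following Graph IV.
true_effect = 1.5
B = rng.normal(size=n)  # treated as unobserved below
A = 0.9 * B + rng.normal(size=n)
X = 0.7 * A + rng.normal(size=n)
Y = true_effect * X + 1.0 * A + 2.0 * B + rng.normal(size=n)


def ols(y, *predictors):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


b1_naive = ols(Y, X)[1]      # both non-causal paths open
b1_a_only = ols(Y, X, A)[1]  # A alone blocks both paths; B is not needed

print(f"b1 controlling for nothing: {b1_naive:.3f}")
print(f"b1 controlling for A only:  {b1_a_only:.3f}")
```

Note that in this sketch the coefficient on A itself is not a clean causal effect, because A absorbs part of the influence of the omitted B. Only the coefficient on X is of interest here, and it should be close to the assumed true effect.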

If you are interested in getting more intuition on confounders and colliders, read the following!

Written by Vivekananda Das

Sharing synthesized ideas on Data Analysis in R, Data Literacy, Causal Inference, and Well-being | Assistant Prof @ UUtah | More: https://vivekanandadas.com
