Regression and Causal Inference: Which Variables Should Be Added to the Model?
Struggle and (Potential) Remedy
Often, the key purpose of data analysis is to provide insightful suggestions to some authority so that they can make changes to achieve a desirable outcome. For example, a supermarket owner may want to know whether offering a loyalty card to customers will increase sales. This type of question is relevant in academia and the public and private sectors. Such questions are inherently “causal”: by asking them, we try to understand the causal effect of implementing a policy or program on an outcome. And linear regression is the most common tool we use to answer them.
Regression is undoubtedly the most useful methodological tool I have learned in my academic life. However, ever since I was introduced to multiple regression, the most puzzling question has been: Other than the explanatory variable of my research interest, what additional variables should I add to the model?
The applied statistics courses I took, both online (on Coursera and edX) and in person, focused on estimating models with linear regression for “predictive” purposes. In these introductory/intermediate courses, I learned to conduct partial F-tests for model building and variable selection. Simply put, we fit two different models:
Full Model: Outcome = b0 + b1*Explanatory variable of Interest + b2*Additional Predictor + Error
Reduced Model: Outcome = b0 + b1*Explanatory variable of Interest + Error
Next, we do an F-test where,
the null hypothesis: b2 = 0 (the models do not differ), and
the alternative hypothesis: b2 ≠ 0 (the models differ).
If the p-value of the F-statistic is less than 0.05, we reject the null hypothesis in favor of the alternative and select the full model (i.e., we keep the “Additional Predictor” in the model). Conversely, if the p-value is greater than 0.05, we fail to reject the null hypothesis and select the reduced model (i.e., we drop the “Additional Predictor” from the model).
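As a concrete illustration, here is a minimal sketch of the partial F-test in Python using statsmodels. The data are simulated for the example; the variable names (x for the explanatory variable of interest, z for the additional predictor) are my own choices, not from any real dataset.

```python
# Partial F-test comparing a reduced and a full nested model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)                        # explanatory variable of interest
z = rng.normal(size=n)                        # additional predictor
y = 1.0 + 2.0 * x + 0.5 * z + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x": x, "z": z})

reduced = smf.ols("y ~ x", data=df).fit()     # Reduced Model
full = smf.ols("y ~ x + z", data=df).fit()    # Full Model

# anova_lm with two nested fitted models performs the partial F-test.
table = anova_lm(reduced, full)
print(table)
# If the p-value in the Pr(>F) column is below 0.05, we keep z.
```

Because the simulated z truly affects y here, the test rejects the null and the full model is selected.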
Although useful, this test does not provide a framework to answer my question. Should I do partial F-tests with whatever variable I find in a dataset and decide based on the p-value? If that is the case, how many variables should I try?
The instructors suggested adding variables based on theory/subject matter knowledge. And I wondered what using theory or subject matter knowledge to select variables actually implies. Should I conduct a thorough literature review to identify all the predictors of the outcome, add them one by one to the model, and decide whether they belong based on the p-values of partial F-tests?
Fortunately, last year, I took a course on causal inference and learned about the “Backdoor Path Criterion,” which provides a framework to answer my question. Below, I will briefly describe how to determine the set of additional variables (also known as control variables) that we need to add to a model to identify the causal effect of interest.
Step 1: Draw the Causal Graph
Let’s assume we are interested in identifying the causal effect of a treatment (X) on an outcome (Y). Suppose, at first, that we are dealing with non-experimental data. In that case, we have to figure out the possible paths through which statistical association can flow from X to Y. Association can flow along two kinds of paths: 1) causal paths and 2) non-causal paths.
If X has a direct causal effect on Y, the graph would simply be X → Y.
It is also possible that X causes M and M, in turn, causes Y (X → M → Y). In this case, M is a descendant of the treatment X and is known as a mediator.
Now, regardless of whether X has a direct or indirect causal effect on Y, an association can flow from X to Y through non-causal paths, such as backdoor paths.
For example, it is possible that both X and Y are caused by a common factor Z. Such a common factor is called a confounder, and an association can then flow from X to Y through the backdoor path X ← Z → Y.
Backdoor Path: Any path from X to Y that starts with an arrow pointing into X is a backdoor path.
Let’s pretend that we did a thorough literature review and/or applied our subject matter knowledge and came up with a complete graph of all the causal and non-causal paths from X to Y, with the causal edges X → Y and X → M → Y, plus the edges Z → X, Z → Y, Z1 → Z2 → X, Z1 → Z, Z3 → Z, and Z3 → Z4 → Y.
Next, we identify all the backdoor paths through which “unwanted” association can flow from X to Y:
X ← Z → Y
X ← Z2 ← Z1 → Z → Y
X ← Z ← Z3 → Z4 → Y
Is that all?
No! Actually, there is another potential backdoor path from X to Y:
X ← Z2 ← Z1 → Z ← Z3 → Z4 → Y
In this path, Z is not a confounder, but a collider. A collider is a variable that is a common effect of two other variables; here, Z is the common effect of Z1 and Z3.
Interestingly, when a collider is not controlled for, the backdoor path remains blocked, i.e., no unwanted association flows through it; but if the collider is controlled for, the backdoor path opens.
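The four backdoor paths above can also be found mechanically. Below is a small Python sketch; the edge list is my reconstruction of the graph from the paths listed in the text (the original figure is not reproduced here), and a backdoor path is taken to be any simple path from X to Y whose first edge points into X.

```python
# Enumerate all backdoor paths from X to Y in a directed graph.
edges = [
    ("X", "Y"), ("X", "M"), ("M", "Y"),        # causal paths
    ("Z", "X"), ("Z", "Y"),                    # confounder Z
    ("Z1", "Z2"), ("Z2", "X"), ("Z1", "Z"),    # Z1, Z2
    ("Z3", "Z"), ("Z3", "Z4"), ("Z4", "Y"),    # Z3, Z4
]

def backdoor_paths(edges, x, y):
    """All simple paths from x to y whose first edge points into x."""
    # Undirected adjacency, remembering the orientation of each step.
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, []).append((b, "->"))  # traverse a -> b forwards
        nbrs.setdefault(b, []).append((a, "<-"))  # traverse b <- a backwards
    paths = []

    def walk(node, visited, path, first):
        for nxt, arrow in nbrs[node]:
            if nxt in visited or (first and arrow != "<-"):
                continue
            step = path + [arrow, nxt]
            if nxt == y:
                paths.append(" ".join(step))
            else:
                walk(nxt, visited | {nxt}, step, False)

    walk(x, {x}, [x], True)
    return paths

for p in backdoor_paths(edges, "X", "Y"):
    print(p)
```

Running this prints exactly the four backdoor paths discussed above, including the collider path through Z.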
Step 2: Satisfy the Backdoor Path Criterion
To identify the causal effect of X on Y, the backdoor path criterion says we need to control for a set of variables that:
1. Contains no descendant of X
2. Blocks every backdoor path from X to Y
So, now, we have finally found a framework to decide which additional variables should be added to the model!
Let’s try to determine the set of additional variables (control variables):
- As M is a descendant of X, M cannot be part of the set.
- To block the backdoor path X ← Z → Y, we need to add Z to the set of control variables.
- To block the backdoor path X ← Z2 ← Z1 → Z → Y, we can add any one, two, or all of Z1, Z2, and Z to the set.
- To block the backdoor path X ← Z ← Z3 → Z4 → Y, we can add any one, two, or all of Z, Z3, and Z4 to the set.
Looking at the last three points above, we might initially assume that controlling for Z alone is sufficient to block all three backdoor paths. However, as Z is a collider on the backdoor path X ← Z2 ← Z1 → Z ← Z3 → Z4 → Y, controlling for Z opens that path.
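To see collider bias numerically, here is a small simulation using a toy linear model I am assuming purely for illustration: Z is a common effect of two independent causes Z1 and Z3, and “controlling for” Z is implemented by residualizing both variables on it.

```python
# Collider bias: conditioning on a common effect induces association.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z1 = rng.normal(size=n)
z3 = rng.normal(size=n)           # independent of z1 by construction
z = z1 + z3 + rng.normal(size=n)  # collider: caused by both z1 and z3

# Unconditionally, z1 and z3 are (nearly) uncorrelated...
print(np.corrcoef(z1, z3)[0, 1])

# ...but controlling for z (residualizing both on z) induces a
# clearly negative association between them.
r1 = z1 - np.polyval(np.polyfit(z, z1, 1), z)
r3 = z3 - np.polyval(np.polyfit(z, z3, 1), z)
print(np.corrcoef(r1, r3)[0, 1])
```

With these particular structural equations, the conditional correlation works out to about -0.5 even though z1 and z3 are marginally independent.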
Then, what is the solution?
We must control for Z, as that is the only way to block the backdoor path X ← Z → Y. And to block the newly opened backdoor path X ← Z2 ← Z1 → Z ← Z3 → Z4 → Y, we can control for any of Z1, Z2, Z3, or Z4.
So, the minimum sufficient adjustment set (i.e., the set of variables to control for by adding them to the model) that satisfies the backdoor path criterion can be S = {Z, Z1}, S = {Z, Z2}, S = {Z, Z3}, or S = {Z, Z4}.
To estimate the causal effect of X on Y, we can fit any of the following four models:
Y = b0 + b1*X + b2*Z + b3*Z1 + error
Y = b0 + b1*X + b2*Z + b3*Z2 + error
Y = b0 + b1*X + b2*Z + b3*Z3 + error
Y = b0 + b1*X + b2*Z + b3*Z4 + error
Obviously, if we have data on all of Z, Z1, Z2, Z3, and Z4, we can add all of them to the model as well:
Y = b0 + b1*X + b2*Z + b3*Z1 + b4*Z2 + b5*Z3 + b6*Z4 + error
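As a sanity check, here is a simulation sketch. I assume linear structural equations consistent with the graph described above (the coefficients are arbitrary choices of mine, and the mediator M is omitted for simplicity); the true causal effect of X on Y is set to 2.0. Adjusting for the set {Z, Z1} recovers it, while the naive regression of Y on X alone does not.

```python
# Simulated data from assumed linear structural equations for the graph.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
z1 = rng.normal(size=n)
z3 = rng.normal(size=n)
z2 = z1 + rng.normal(size=n)                 # Z1 -> Z2
z = z1 + z3 + rng.normal(size=n)             # Z1 -> Z <- Z3 (collider)
z4 = z3 + rng.normal(size=n)                 # Z3 -> Z4
x = z + z2 + rng.normal(size=n)              # Z -> X <- Z2
y = 2.0 * x + z + z4 + rng.normal(size=n)    # X -> Y, Z -> Y, Z4 -> Y

def ols_coef_on_x(*controls):
    """Coefficient on x from regressing y on x plus the given controls."""
    design = np.column_stack([np.ones(n), x, *controls])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]

print(ols_coef_on_x())       # naive Y ~ X: biased, noticeably above 2
print(ols_coef_on_x(z, z1))  # adjustment set {Z, Z1}: close to 2
```

Any of the four minimum sufficient adjustment sets listed above would work equally well here; {Z, Z1} is used only as a representative example.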
However, when working with non-experimental data (often from surveys), we frequently do not have access to all the variables. So, the key question is: Do we have data on the minimum set of variables needed to satisfy the backdoor path criterion?
If yes, we can identify and estimate the causal effect of interest using the multiple regression approach (technically known as selection on observables). Otherwise, we cannot rely on multiple regression alone and have to come up with an alternative strategy (e.g., difference-in-differences, instrumental variables, regression discontinuity, etc.).
Reference
Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal inference in statistics: A primer. John Wiley & Sons. http://bayes.cs.ucla.edu/PRIMER/