Regression and Causal Inference: Which Variables Should Be Added to the Model?
Struggle and (Potential) Remedy
Often, the key purpose of data analysis is to provide insightful suggestions to some authority so that they can make changes to achieve a desirable outcome. For example, a supermarket owner may want to know whether offering a loyalty card to customers will increase sales. This type of question is relevant in academia and the public and private sectors. Such questions are inherently “causal”: by asking them, we try to understand the causal effect of implementing a policy or program on an outcome. And linear regression is the most common tool we use to answer them.
Regression is undoubtedly the most useful methodological tool I have learned in my academic life. However, ever since I was introduced to multiple regression, the most puzzling question has been: Other than the explanatory variable of my research interest, what additional variables should I add to the model?
The applied statistics courses I took, both online (on Coursera and edX) and in person, focused on estimating models with linear regression for “predictive” purposes. In these introductory/intermediate courses, I learned to conduct partial F-tests for model building and variable selection. Simply put, we fit two different models:
Full Model: Outcome = b0 + b1*Explanatory variable of Interest + b2*Additional Predictor + Error
Reduced Model: Outcome = b0 + b1*Explanatory variable of Interest + Error
Next, we do an F-test where,
the null hypothesis: b2 = 0 (the models do not differ), and
the alternative hypothesis: b2 ≠ 0 (the models differ).
If the p-value of the F-statistic is less than 0.05, we reject the null hypothesis in favor of the alternative and select the full model (i.e., we keep the “Additional Predictor” in the model). Conversely, if the p-value is greater than 0.05, we fail to reject the null hypothesis and select the reduced model (i.e., we drop the “Additional Predictor” from the model).
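As a concrete illustration, here is a minimal sketch of the partial F-test in Python using statsmodels. The data are simulated for the example; the variable names (x for the explanatory variable of interest, z for the additional predictor) are my own choices, not from any real dataset.

```python
# Partial F-test comparing a reduced and a full nested model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)                        # explanatory variable of interest
z = rng.normal(size=n)                        # additional predictor
y = 1.0 + 2.0 * x + 0.5 * z + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x": x, "z": z})

reduced = smf.ols("y ~ x", data=df).fit()     # Reduced Model
full = smf.ols("y ~ x + z", data=df).fit()    # Full Model

# anova_lm with two nested fitted models performs the partial F-test.
table = anova_lm(reduced, full)
print(table)
# If the p-value in the Pr(>F) column is below 0.05, we keep z.
```

Because the simulated z truly affects y here, the test rejects the null and the full model is selected.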
Although useful, this test does not provide a framework to answer my question. Should I do partial F-tests with whatever variable I find in a dataset and decide based on the p-value? If that is the case, how many variables should I try?
The instructors suggested adding variables based on theory/subject matter knowledge. And I wondered what using theory or subject matter knowledge to select variables actually implies. Should I conduct a thorough literature review to identify all the predictors of the outcome, add them one by one to the model, and decide whether they belong based on the p-values of partial F-tests?
Fortunately, last year, I took a course on causal inference and learned about the “Backdoor Path Criterion,” which provides a framework to answer my question. Below, I will briefly describe how to determine the set of additional variables (also known as control variables) that we need to add to a model to identify the causal effect of interest.
Step 1: Draw the Causal Graph
Let’s assume we are interested in identifying the causal effect of a treatment (X) on an outcome (Y). Suppose, at first, that we are dealing with non-experimental data. In that case, we have to figure out the possible paths through which statistical association can flow from X to Y. Association can flow along two kinds of paths: 1) causal paths and 2) non-causal paths.
If X has a direct causal effect on Y, the graph would simply be X → Y.
It is also possible that X causes M and M, in turn, causes Y (X → M → Y). In this case, M is a descendant of the treatment X and is known as a mediator.
Now, regardless of whether X has a direct or indirect causal effect on Y, an association can flow from X to Y through non-causal paths, such as backdoor paths.
For example, it is possible that both X and Y are caused by a common factor Z. Such a common factor is called a confounder, and an association can then flow from X to Y through the backdoor path X ← Z → Y.
Backdoor Path: Any path from X to Y that starts with an arrow pointing into X is a backdoor path.
Let’s pretend that we did a thorough literature review and/or applied our subject matter knowledge and came up with a complete graph of all the causal and non-causal paths from X to Y, with the causal edges X → Y and X → M → Y, plus the edges Z → X, Z → Y, Z1 → Z2 → X, Z1 → Z, Z3 → Z, and Z3 → Z4 → Y.
Next, we identify all the backdoor paths through which “unwanted” association can flow from X to Y:
X ← Z → Y
X ← Z2 ← Z1 → Z → Y
X ← Z ← Z3 → Z4 → Y
Is that all?
No! Actually, there is another potential backdoor path from X to Y:
X ← Z2 ← Z1 → Z ← Z3 → Z4 → Y
In this path, Z is not a confounder, but a collider. A collider is a variable that is a common effect of two other variables; here, Z is the common effect of Z1 and Z3.
Interestingly, when a collider is not controlled for, the backdoor path remains blocked, i.e., no unwanted association flows through it; but if the collider is controlled for, the backdoor path opens.
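The four backdoor paths above can also be found mechanically. Below is a small Python sketch; the edge list is my reconstruction of the graph from the paths listed in the text (the original figure is not reproduced here), and a backdoor path is taken to be any simple path from X to Y whose first edge points into X.

```python
# Enumerate all backdoor paths from X to Y in a directed graph.
edges = [
    ("X", "Y"), ("X", "M"), ("M", "Y"),        # causal paths
    ("Z", "X"), ("Z", "Y"),                    # confounder Z
    ("Z1", "Z2"), ("Z2", "X"), ("Z1", "Z"),    # Z1, Z2
    ("Z3", "Z"), ("Z3", "Z4"), ("Z4", "Y"),    # Z3, Z4
]

def backdoor_paths(edges, x, y):
    """All simple paths from x to y whose first edge points into x."""
    # Undirected adjacency, remembering the orientation of each step.
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, []).append((b, "->"))  # traverse a -> b forwards
        nbrs.setdefault(b, []).append((a, "<-"))  # traverse b <- a backwards
    paths = []

    def walk(node, visited, path, first):
        for nxt, arrow in nbrs[node]:
            if nxt in visited or (first and arrow != "<-"):
                continue
            step = path + [arrow, nxt]
            if nxt == y:
                paths.append(" ".join(step))
            else:
                walk(nxt, visited | {nxt}, step, False)

    walk(x, {x}, [x], True)
    return paths

for p in backdoor_paths(edges, "X", "Y"):
    print(p)
```

Running this prints exactly the four backdoor paths discussed above, including the collider path through Z.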
Step 2: Satisfy the Backdoor Path Criterion
To identify the causal effect of X on Y, the backdoor path criterion says we need to control for a set of variables that:
1. Contains no descendant of X
2. Blocks every backdoor path from X to Y
So, now, we have finally found a framework to decide which additional variables should be added to the model!
Let’s try to determine the set of additional variables (control variables):
- As M is a descendant of X, M cannot be part of the set.
- To block the backdoor path X ← Z → Y, we need to add Z to the set of control variables.
- To block the backdoor path X ← Z2 ← Z1 → Z → Y, we can add any one, two, or all of Z1, Z2, and Z to the set.
- To block the backdoor path X ← Z ← Z3 → Z4 → Y, we can add any one, two, or all of Z, Z3, and Z4 to the set.
Looking at the last three points above, we might initially assume that controlling for Z alone is sufficient to block all three backdoor paths. However, as Z is a collider on the backdoor path X ← Z2 ← Z1 → Z ← Z3 → Z4 → Y, controlling for Z opens that path.
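To see collider bias numerically, here is a small simulation using a toy linear model I am assuming purely for illustration: Z is a common effect of two independent causes Z1 and Z3, and “controlling for” Z is implemented by residualizing both variables on it.

```python
# Collider bias: conditioning on a common effect induces association.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z1 = rng.normal(size=n)
z3 = rng.normal(size=n)           # independent of z1 by construction
z = z1 + z3 + rng.normal(size=n)  # collider: caused by both z1 and z3

# Unconditionally, z1 and z3 are (nearly) uncorrelated...
print(np.corrcoef(z1, z3)[0, 1])

# ...but controlling for z (residualizing both on z) induces a
# clearly negative association between them.
r1 = z1 - np.polyval(np.polyfit(z, z1, 1), z)
r3 = z3 - np.polyval(np.polyfit(z, z3, 1), z)
print(np.corrcoef(r1, r3)[0, 1])
```

With these particular structural equations, the conditional correlation works out to about -0.5 even though z1 and z3 are marginally independent.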
Then, what is the solution?
We must control for Z, as that is the only way to block the backdoor path X ← Z → Y. And to block the newly opened backdoor path X ← Z2 ← Z1 → Z ← Z3 → Z4 → Y, we can control for any of Z1, Z2, Z3, or Z4.
So, the minimum sufficient adjustment set (i.e., the set of variables to control for by adding them to the model) that satisfies the backdoor path criterion can be S = {Z, Z1}, S = {Z, Z2}, S = {Z, Z3}, or S = {Z, Z4}.
To estimate the causal effect of X on Y, we can fit any of the following four models:
Y = b0 + b1*X + b2*Z + b3*Z1 + error
Y = b0 + b1*X + b2*Z + b3*Z2 + error
Y = b0 + b1*X + b2*Z + b3*Z3 + error
Y = b0 + b1*X + b2*Z + b3*Z4 + error
Obviously, if we have data on all of Z, Z1, Z2, Z3, and Z4, we can add all of them to the model as well:
Y = b0 + b1*X + b2*Z + b3*Z1 + b4*Z2 + b5*Z3 + b6*Z4 + error
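As a sanity check, here is a simulation sketch. I assume linear structural equations consistent with the graph described above (the coefficients are arbitrary choices of mine, and the mediator M is omitted for simplicity); the true causal effect of X on Y is set to 2.0. Adjusting for the set {Z, Z1} recovers it, while the naive regression of Y on X alone does not.

```python
# Simulated data from assumed linear structural equations for the graph.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
z1 = rng.normal(size=n)
z3 = rng.normal(size=n)
z2 = z1 + rng.normal(size=n)                 # Z1 -> Z2
z = z1 + z3 + rng.normal(size=n)             # Z1 -> Z <- Z3 (collider)
z4 = z3 + rng.normal(size=n)                 # Z3 -> Z4
x = z + z2 + rng.normal(size=n)              # Z -> X <- Z2
y = 2.0 * x + z + z4 + rng.normal(size=n)    # X -> Y, Z -> Y, Z4 -> Y

def ols_coef_on_x(*controls):
    """Coefficient on x from regressing y on x plus the given controls."""
    design = np.column_stack([np.ones(n), x, *controls])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]

print(ols_coef_on_x())       # naive Y ~ X: biased, noticeably above 2
print(ols_coef_on_x(z, z1))  # adjustment set {Z, Z1}: close to 2
```

Any of the four minimum sufficient adjustment sets listed above would work equally well here; {Z, Z1} is used only as a representative example.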
However, when working with non-experimental data (often from surveys), we frequently do not have access to all the variables. So, the key question is: Do we have data on the minimum set of variables needed to satisfy the backdoor path criterion?
If yes, we can identify and estimate the causal effect of interest using the multiple regression approach (technically known as selection on observables). Otherwise, we cannot rely on multiple regression alone and have to come up with an alternative strategy (e.g., difference-in-differences, instrumental variables, regression discontinuity, etc.).
Reference
Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal inference in statistics: A primer. John Wiley & Sons. http://bayes.cs.ucla.edu/PRIMER/