Endogenous Selection/Collider Bias: Another Key Issue in Causal Inference

Are you analyzing the correct sample?

Vivekananda Das
4 min read · Jan 4, 2021

Endogenous selection bias emerges when we condition on a collider. A collider is a variable caused by two other variables, e.g., a model’s independent variable (treatment) and the dependent variable (outcome). This bias is also known as “collider bias.”

A Simple Example

In a hypothetical country, in most cases, top software companies hire employees who either have a very high college GPA or have won many prizes in coding contests during their college education.

Let’s assume that GPA and winning prizes in coding contests are uncorrelated in the true population of interest, which consists of all college graduates. To elaborate on the assumption, on average, getting a higher GPA neither increases nor decreases the number of prizes won in coding contests. The reverse is also true, i.e., an increase in the number of prizes won in coding contests, on average, neither increases nor decreases GPA. Finally, these two variables are not caused by any common factor.

Causal relationship in the true population. As the treatment (GPA) and the outcome (Number of Prizes Won in Coding Contests) neither cause each other nor share a common cause, no arrow exists between them.

An analyst wants to investigate the causal effect of GPA on the number of prizes won in coding contests. The analyst gathers data on the two variables, but only for the employees currently working at top software companies.

After gathering the data, the following model is estimated:

Number of Prizes Won = b0 + b1*GPA + error

Guess what! The analyst finds a statistically significant “negative” coefficient (b1) of GPA (although GPA and the number of prizes won are uncorrelated in the population, as we assumed) and wrongly concludes that a higher GPA leads to a lower number of prizes won in coding contests.

Due to conditioning on the collider, a spurious association arises between the treatment and the outcome (shown by the dashed line). I drew the graph using Dagitty (http://www.dagitty.net/dags.html#).

So, what went wrong?

In this example, “Employment at a Top Software Company” is a collider: a variable caused by both the treatment (GPA) and the outcome (Number of Prizes Won in Coding Contests).

Why do we find a negative association here?

Here is the logic: given someone is working at a top software company,

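1) If the person has a very high GPA, the GPA alone can explain why they were hired, so there is little reason to expect that they also won many prizes

2) On the contrary, if the person has a lower GPA, it is highly likely that they won many prizes, because otherwise they would probably not have been hired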

So, for this sub-sample, once we know someone’s GPA, we can make a pretty good prediction about the number of prizes won by that person in coding contests, and vice versa.

So, in this example, the analyst conditioned on a collider by running the analysis on a sub-sample of college graduates employed at top software companies. Hopefully, this example illustrates how important sample selection is for credible causal inference!
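To see the hiring example in numbers, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than real data: the GPA and prize distributions, the hiring rule (a GPA of at least 3.7 or at least 8 prizes), and the small helper ols_slope, which computes b1 from the model above.

```python
import random


def ols_slope(x, y):
    """Slope b1 of the simple regression y = b0 + b1*x + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 50_000

# Assumed population of all college graduates:
# GPA and prizes are generated independently of each other.
gpa = [min(4.0, max(0.0, rng.gauss(3.0, 0.5))) for _ in range(n)]
prizes = [rng.randint(0, 10) for _ in range(n)]

# Assumed hiring rule (the collider): very high GPA OR many prizes.
hired = [(g, p) for g, p in zip(gpa, prizes) if g >= 3.7 or p >= 8]

slope_population = ols_slope(gpa, prizes)
slope_hired = ols_slope([g for g, _ in hired], [p for _, p in hired])

print(f"b1, all graduates: {slope_population:.3f}")
print(f"b1, hired only:    {slope_hired:.3f}")
```

Under these assumptions, b1 is close to zero among all graduates and clearly negative among the hired employees, even though GPA has no effect on prizes anywhere in the simulated data.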

If we get a non-random sample (e.g., a convenience sample) from the population of interest, there exists a risk of running into this bias.

Sometimes, we get a random sample from the population, but we decide to run the analysis on a subset of respondents by conditioning on the outcome. For example, let’s say that we are interested in the effect of education on income and investigate the relationship among lower-income people (which is a classic example of conditioning on an outcome). Again, such a strategy can lead to endogenous selection bias.
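A small simulation can make the education and income case concrete. This is only a sketch: the data-generating process (a true effect of 5 income units per year of education, normally distributed noise) and the cutoff at the median income are assumptions chosen for illustration.

```python
import random


def ols_slope(x, y):
    """Slope b1 of the simple regression y = b0 + b1*x + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(7)  # fixed seed so the run is repeatable
n = 50_000
true_effect = 5.0  # assumed effect of one more year of education on income

# Assumed random sample from the population.
education = [rng.randint(8, 20) for _ in range(n)]
income = [10.0 + true_effect * e + rng.gauss(0, 20) for e in education]

# Conditioning on the outcome: keep only respondents below the median income.
cutoff = sorted(income)[n // 2]
low_income = [(e, y) for e, y in zip(education, income) if y < cutoff]

slope_full = ols_slope(education, income)
slope_low = ols_slope([e for e, _ in low_income], [y for _, y in low_income])

print(f"b1, full sample:      {slope_full:.2f}")
print(f"b1, lower-income only: {slope_low:.2f}")
```

In this setup, the full-sample estimate recovers the assumed effect, while the lower-income sub-sample yields a noticeably smaller coefficient, because selecting on income removes exactly the people for whom education paid off the most.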

Other than sample selection issues, this bias may arise if we add a collider variable to our empirical model as if it were a confounding variable. For a quick introduction to confounding variables, a separate introductory article on that topic is a good starting point.
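The case of controlling for a collider can also be sketched with simulated data. The variables below are hypothetical: the treatment and the outcome are independent standard normal draws (so the true effect is zero), and the collider is assumed to be their sum plus noise. To keep the sketch free of external libraries, the coefficient on the treatment in the model with the collider as a control is computed through the Frisch-Waugh-Lovell result: residualize both the treatment and the outcome on the collider, then regress one set of residuals on the other.

```python
import random


def ols_slope(x, y):
    """Slope b1 of the simple regression y = b0 + b1*x + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after a simple regression of y on x."""
    b1 = ols_slope(x, y)
    b0 = sum(y) / len(y) - b1 * sum(x) / len(x)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


rng = random.Random(3)  # fixed seed so the run is repeatable
n = 20_000

treatment = [rng.gauss(0, 1) for _ in range(n)]
outcome = [rng.gauss(0, 1) for _ in range(n)]  # independent: true effect is 0
# Assumed collider: caused by both the treatment and the outcome.
collider = [t + o + rng.gauss(0, 1) for t, o in zip(treatment, outcome)]

# Model 1: outcome ~ treatment
slope_without_control = ols_slope(treatment, outcome)

# Model 2: outcome ~ treatment + collider (coefficient on treatment via FWL)
slope_with_control = ols_slope(
    residuals(collider, treatment), residuals(collider, outcome)
)

print(f"b1 without the collider: {slope_without_control:.3f}")
print(f"b1 with the collider:    {slope_with_control:.3f}")
```

Under these assumptions, the coefficient is near zero without the collider and turns clearly negative once the collider is added as a control, which mirrors the spurious association in the hiring example.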

Ultimately, theoretical knowledge is the key to conducting a proper causal analysis. The analyst must clearly understand the relationships among all the variables considered (and the ones that are relevant but not considered due to lack of data availability) in the analysis.

If you are interested in reading more introductory articles on causal inference, try some of my other articles.

Reference

Elwert, F., & Winship, C. (2014). Endogenous selection bias: The problem of conditioning on a collider variable. Annual Review of Sociology, 40, 31–53. https://www.annualreviews.org/doi/abs/10.1146/annurev-soc-071913-043455
