Analyzing Data in Social Science vs. Engineering: Key Differences to Keep In Mind
A neophyte’s self-reflection
This week, as I wrap up another semester in the applied social sciences, I have been revisiting some of the key challenges I faced over the last thirty months of transitioning from engineering to social science.
In this article, I discuss some key differences between typical engineering data and social data, and how these differences require different approaches for modeling and interpreting the results.
[Before we begin: not all engineering data are experimental, and not all social data are non-experimental. Also, engineers do not always deal with data that follow from a law/theory. I am using simple (and perhaps too stereotypical) examples here for the sake of explaining the key points of this article.]
Let’s begin with Ohm’s law, a classic physics experiment that many of us conducted in high school. Let’s revisit the law:
Ohm’s law states that the current through a conductor between two points is directly proportional to the voltage across the two points.
Introducing the constant of proportionality, the resistance, one arrives at the usual mathematical equation that describes this relationship:
I = V/R
where I is the current through the conductor, V is the voltage measured across the conductor, and R is the resistance of the conductor
(Copied from Wikipedia)
Here are a few things to note from a data analyst’s perspective:
- The unit of analysis is “one conductor”
- When you are running the experiment, you are changing voltage across two points of the same conductor
- The conductor remains the same throughout the experiment
- You change voltage (independent/explanatory/treatment variable)
- You observe a change in the flow of current (dependent/outcome variable)
- You (perhaps unconsciously) assume that, for this conductor, at a particular voltage (e.g., 0.8 V), the observed current (e.g., 0.38 A) would be the same regardless of when you measure it. To elaborate, let’s say you ran an experiment in which you gradually increased voltage in this sequence: 0V, 1V, and 2V, and observed current in this sequence: 0A, 0.49A, and 0.98A. You are assuming that had you run the experiment in the reverse order, i.e., in this sequence: 2V, 1V, and 0V, you would have observed current in this sequence: 0.98A, 0.49A, and 0A. To summarize, you are assuming that there are pairs of voltage and current — for example, (0V and 0A), (1V and 0.49A), (2V and 0.98A) — that do not change depending on the sequence of experimentation.
Next, you create a plot like this:
- Because this data is experimental, you know the treatment variable (i.e., voltage) and the outcome variable (i.e., current). By treatment variable, we refer to the cause, and by outcome variable, we refer to the effect
- Working with data like the Ohm’s law one is a massive privilege — you are not stuck in a situation in which you have to ponder over whether X causes Y, or Y causes X, or both X and Y are caused by Z. Here, you clearly know that the change in voltage is the cause (treatment), and the change in current is the effect (outcome)
- You can make a counterfactual claim, such as “Had I not changed the voltage, the current wouldn’t have changed”
- Also, you can make a causal claim, such as “A change in voltage across the conductor causes a change in current through the conductor”
- There is reasonable homogeneity, i.e., whether you run the experiment in the U.S. in 1923 or Bangladesh in 2023, for the same type of conductor, you will get the same numbers, and they will fall on a line, ignoring negligible noise introduced by measurement errors, power loss, etc.
For this conductor, you can describe the relationship between current and voltage as follows:
I = 0.49 * V
So, your key conclusion would be: with a 1 unit increase in voltage across the conductor, the current through the conductor increases by 0.49 units.
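As a quick sanity check, this slope can be recovered from the three hypothetical measurements above with an ordinary least-squares fit through the origin (a minimal sketch in NumPy):

```python
import numpy as np

# The hypothetical measurements from the experiment above
voltage = np.array([0.0, 1.0, 2.0])    # V
current = np.array([0.0, 0.49, 0.98])  # A

# Fit I = slope * V with no intercept, matching the form of Ohm's law
coef, *_ = np.linalg.lstsq(voltage[:, None], current, rcond=None)
slope = float(coef[0])
print(round(slope, 2))  # 0.49, i.e., R = 1/0.49 ≈ 2.04 ohms
```

Forcing the fit through the origin is deliberate: Ohm’s law itself tells us the line has no intercept, which is exactly the kind of theory-driven modeling choice discussed next.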
Note that in the Ohm’s law case, we already had a law (I = V/R) that we tested using empirical data. Unfortunately, given the enormous complexity of the social world, there often isn’t a robust law/theory that we can test using empirical data. Sometimes we do the opposite, i.e., using empirical data, we try to understand social laws/propose social theories. To make things even more complicated, these social laws/theories may change across time and/or societies and require frequent updates.
Modeling Like an Engineer, but the Data is Social
Not all engineering datasets are as smooth as the Ohm’s law data. However, in general, they tend to be less messy than social data, which facilitates our modeling of the relationships between/among variables using mathematical formulae.
Sometimes, while modeling these relationships, someone trained as an engineer may disregard the underlying data-generating processes and focus too much on curve fitting (i.e., try to come up with a mathematical expression that can best fit the data).
Importantly, social data tend to be non-experimental more often than engineering data. The curve fitting practices, which work perfectly for experimental data, may not be the best idea for non-experimental data (especially if you are trying to propose a change to achieve a desirable outcome).
Let’s pretend I am working with a dataset that has data on the monthly spending by customers at a store (in $) and their satisfaction with the store (measured on a scale from 1 to 10). I create a scatter plot that looks like this:
This plot isn’t quite as smooth as the Ohm’s law one. But we understand that humans aren’t quite as homogeneous as conductors, right? So, we don’t get too disheartened by the relative messiness (e.g., not all dots falling on a line) in social data.
From the plot, it appears that a straight line (with an intercept) may be a pretty good fit for the data as customer satisfaction tends to increase (almost) linearly as monthly spending increases. Are you observing any similarity between this relationship and the voltage-current relationship? 🤔
I propose the following model to mathematically express the relationship:
Customer Satisfaction = 3.33 + 0.01*Monthly Spending
However, you propose a revision to the model: a quadratic term should be added. I accept your proposal and update the model as follows:
Customer Satisfaction = 3.2 + 0.01*Monthly Spending - 0.0001*(Monthly Spending)²
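Both fits can be sketched in a few lines of NumPy. The data below are invented for illustration (generated to roughly match the linear model in the text; not real customer data):

```python
import numpy as np

# Invented customer data: spending in $, satisfaction on a 1-10 scale
rng = np.random.default_rng(42)
spending = rng.uniform(0, 300, 200)
satisfaction = 3.33 + 0.01 * spending + rng.normal(0, 0.5, 200)

# Linear and quadratic least-squares fits
lin = np.polyfit(spending, satisfaction, 1)   # [slope, intercept]
quad = np.polyfit(spending, satisfaction, 2)  # [a, b, c] for a*x² + b*x + c

print(lin)  # slope near 0.01, intercept near 3.33
```

Note that `polyfit` will happily return coefficients for any degree you ask for; nothing in the mathematics tells you which model reflects how the data actually came about.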
Maybe someone else proposes some other mathematical model that fits the data even better. But wait a second!
Let’s stop worrying about mathematics for a bit and compare the Ohm’s law data and the customer data:
Given the above differences, it may be inappropriate to interpret the findings of this model the same way you interpreted the findings in the Ohm’s law case.
For example, solely based on the model you have, the following counterfactual and causal claims are possibly incorrect:
“Had the less-spending customers spent more, their satisfaction would have been higher”
“An increase in monthly spending causes an increase in customer satisfaction”
The correct interpretations are:
- Customers who spend more tend to be more satisfied
- Monthly spending is positively associated with/related to customer satisfaction
Although the above interpretations are technically correct, they are not helpful for your supervisor/manager/client who assigned you the data analysis responsibility to figure out possible ways to increase customer satisfaction. The interpretations are wishy-washy because you are just describing a passive observation, whereas they need an active policy solution.
The problem here is about more than proposing a model that fits the data; it is about identifying a policy change that would increase customer satisfaction.
You Need a Theory of How the World Works
Let’s pretend the process that generates our customer data can be expressed as follows:
So here is the theory/narrative to explain the data-generating process:
“Higher income customers tend to spend more at the store. Also, higher income customers tend to be more satisfied with the store (maybe because the store is a high-end one).”
Because I disregarded the data-generating process while modeling, based on the following model:
Customer Satisfaction = 3.33 + 0.01*Monthly Spending
I suggest that a $1 increase in customer spending will increase customer satisfaction by 0.01 points (equivalently, a $100 increase will raise it by 1 point).
This suggestion will not be helpful from a policy perspective, though. For example, if the store manager raises product prices to make customers spend more, their satisfaction may actually go down!
Interesting, isn’t it?
In the above example, we observed a case in which we can predict the value of an outcome variable based on the value of an explanatory variable. Let’s call this a factual prediction. However, for policy advice, we need a counterfactual prediction (i.e., we need to predict what would have happened had we actively changed the value of something keeping all else constant). Unfortunately, our model is not helpful for counterfactual prediction despite being pretty good at factual prediction.
How do we deal with this issue? The general solution is to figure out the common causes of the treatment variable and the outcome variable, and add them to your model as additional explanatory variables. I describe this briefly in another blog:
What Does It Mean to Control for a Variable in Regression?
An intuitive explanation
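To make this concrete, here is a minimal simulation of a hypothetical data-generating process of the kind described above (all coefficients invented): income drives both spending and satisfaction, while spending itself has no effect on satisfaction at all. A naive regression of satisfaction on spending still finds a positive "effect"; adding income as a control makes it vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data-generating process (all numbers invented):
# income drives both spending and satisfaction; spending has NO direct effect.
income = rng.normal(50, 10, n)                       # e.g., in $1000s/year
spending = 2.0 * income + rng.normal(0, 5, n)        # monthly spending ($)
satisfaction = 0.1 * income + rng.normal(0, 0.5, n)  # satisfaction score

def ols(y, *xs):
    """OLS coefficients [intercept, slopes...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols(satisfaction, spending)             # satisfaction ~ spending
adjusted = ols(satisfaction, spending, income)  # ... with income as a control

print(naive[1])     # spuriously positive: spending appears to raise satisfaction
print(adjusted[1])  # near zero: the "effect" disappears once income is held fixed
```

The naive slope is exactly the kind of coefficient that is fine for factual prediction but misleading as a basis for policy: changing spending directly would do nothing here, because income is doing all the work.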
Curve Fitting is Extremely Useful, Though
If you have read this far, you may wonder: why does almost all data science education focus on curve fitting if these practices may not be helpful for policy advice?
In many real-world applications, all you need to do is propose a model that can reliably predict the value of some Y based on the values of some Xs.
Let’s pretend you work for an insurance company. The insurer wants to determine which potential insurees are most likely to go bankrupt. This is a scenario in which you may be happy with a model that can make factual predictions (and not necessarily counterfactual predictions).
Assume that you have a dataset that has only two variables. You know the number of traffic violations someone has committed and whether they are bankrupt. Based on the two variables, you propose the following model:
Bankrupt = 0.05 + 0.2*Number of Traffic Violations
Let’s pretend that using the above model, you can reasonably predict the probability of an insuree going bankrupt.
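For instance, factual prediction with this hypothetical linear probability model is a one-liner (the cap at 1 is my addition, since a linear form can exceed valid probabilities for many violations):

```python
# Hypothetical linear probability model from the text:
# Bankrupt = 0.05 + 0.2 * Number of Traffic Violations
def bankruptcy_prob(violations: int) -> float:
    """Predicted probability of bankruptcy, capped at 1."""
    return min(1.0, 0.05 + 0.2 * violations)

print(bankruptcy_prob(0))            # 0.05
print(round(bankruptcy_prob(3), 2))  # 0.65
```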
How about proposing a policy solution based on the above model? It appears that the probability of bankruptcy would be lowest if we set the number of traffic violations to 0. However, it may not be great policy advice if we propose that by pardoning all traffic violations (i.e., setting the value to 0), we can drastically reduce the incidence of bankruptcy.
I wish someone had suggested the following at the beginning of my data science education
Before modeling any data,
First, ruminate over the data-generating process.
Second, ask yourself: why are we analyzing the data? Are we just trying to predict the value of an outcome variable as a passive observer, or are we trying to propose a piece of policy advice to an active policymaker?
Only when we have a clear understanding of these two issues should we think about the appropriate mathematical modeling.
In case you would like to read some more beginner-level articles on causal inference, check out the following:
Regression and Causal Inference: Which Variables Should Be Added to The Model?
Struggle and (Potential) Remedy
How to Explore the Effect of Doing Something? (Part 1)
Applied Causal Inference 101: Counterfactual Worlds and The Experimental Ideal
How to Explore the Effect of Doing Something? (Part 2)
Applied Causal Inference 101: Non-Experimental Data