Analyzing Data in Social Science vs. Engineering: Key Differences to Keep In Mind
This week, as I am wrapping up another semester in the applied social sciences, I have been revisiting some of the key challenges I faced while transitioning from engineering to social science.
In this article, I discuss some key differences between typical engineering and social data and how these differences require different quantitative modeling and interpretation approaches.
***Before we begin: not all engineering data are experimental, and not all social data are non-experimental. Also, engineers do not always deal with data that follow from a law/theory. I am using simple (and perhaps overly stereotypical) examples here to explain the key points of this article.***
Let’s begin with Ohm’s law, a classic physics experiment many of us conducted in high school. Let’s revisit the law:
Ohm’s law states that the current through a conductor between two points is directly proportional to the voltage across the two points.
Introducing the constant of proportionality, the resistance, one arrives at the usual mathematical equation that describes this relationship:
I = V/R
where I is the current through the conductor, V is the voltage measured across the conductor, and R is the resistance of the conductor.
Here are a few things to note from a data analyst’s perspective:
- The unit of analysis is “one conductor”
- When you are running the experiment, you are changing voltage across two points of the same conductor
- The conductor remains the same throughout the experiment
- You change voltage (independent/explanatory/treatment variable)
- You observe a change in the flow of current (dependent/outcome variable)
- You (perhaps unconsciously) assume that, for this conductor, the current observed at a particular voltage (e.g., 0.38 A at 0.8 V) would be the same regardless of when you measure it. To elaborate, let’s say you ran an experiment in which you gradually increased the voltage in the sequence 0 V, 1 V, 2 V and observed currents of 0 A, 0.49 A, and 0.98 A. You are assuming that had you run the experiment in the reverse order (2 V, 1 V, 0 V), you would have observed 0.98 A, 0.49 A, and 0 A. To summarize, you are assuming that the voltage-current pairs, such as (0 V, 0 A), (1 V, 0.49 A), and (2 V, 0.98 A), do not depend on the sequence of experimentation.
Next, you create a plot like this:
- Because these data are experimental, you know for certain which variable is the treatment (voltage) and which is the outcome (current). By treatment variable, we refer to the cause, and by outcome variable, we refer to the effect.
- Working with data like the Ohm’s law data is a massive privilege: you are not stuck pondering whether X causes Y, Y causes X, or both X and Y are caused by Z. Here, you clearly know that the change in voltage is the cause (treatment) and the change in current is the effect (outcome).
- You can make a counterfactual claim, such as “Had I not changed the voltage, the current wouldn’t have changed.”
- Also, you can make a causal claim, such as “A change in voltage across the conductor causes a change in current through the conductor.”
- There is reasonable homogeneity, i.e., whether you run the experiment in the U.S. in 1923 or Bangladesh in 2023, for the same type of conductor, you will get the same numbers, and they will fall on a line, ignoring negligible noise introduced by measurement errors, power loss, etc.
For this conductor, you can describe the relationship between current and voltage as follows:
I = 0.49 * V
So, your key conclusion would be: a 1 V increase in the voltage across the conductor increases the current through the conductor by 0.49 A.
Note that we already had a law (I=V/R), which we tested using empirical data. Put differently, in this case, we already knew the data-generating process (as described by Ohm’s law).
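As a minimal sketch of how the 0.49 coefficient can be recovered, the snippet below fits a line through the origin to the three readings quoted earlier (0 V, 1 V, 2 V). The use of NumPy and the variable names are my own choices for illustration; the implied resistance simply follows from I = V/R.

```python
import numpy as np

# Readings quoted in the article: voltage (V) and current (A).
voltage = np.array([0.0, 1.0, 2.0])
current = np.array([0.0, 0.49, 0.98])

# Least-squares slope for a line through the origin, I = k * V.
slope = float(voltage @ current / (voltage @ voltage))

# Ohm's law says k = 1 / R, so the fitted slope implies a resistance.
resistance = 1.0 / slope

# Reversing the order of the measurements gives the same estimate,
# which mirrors the "sequence does not matter" assumption.
v_rev, i_rev = voltage[::-1], current[::-1]
slope_reversed = float(v_rev @ i_rev / (v_rev @ v_rev))

print(f"slope = {slope:.2f} A/V, implied R = {resistance:.2f} ohm")
```

With these three readings, the fitted slope is 0.49 A/V, which corresponds to a resistance of roughly 2.04 ohms.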
Unfortunately, given the enormous complexity of the social world, there often isn’t a robust law/theory that we can test using empirical data. In other words, unless we are running a randomized experiment in a controlled environment, we rarely know the data-generating process.
Often, we instead use empirical data to try to understand social laws or propose social theories.
To make things even more complicated, these social laws/theories can change across time and/or societies and require frequent updates.
Modeling Like an Engineer, but the Data Are Social
Not all engineering datasets are as smooth as Ohm’s law data. However, generally, they tend to be less messy than social data, which facilitates our modeling of the relationships between/among variables using mathematical formulae.
Sometimes, while modeling relationships among variables using social data, someone trained as an engineer may disregard the underlying data-generating processes and focus too much on curve fitting (i.e., developing a mathematical expression that best fits the data).
Importantly, social data tend to be non-experimental more often than engineering data. Curve-fitting practices that work well for experimental data may not be the best idea for non-experimental data (especially if you propose a change in a policy/treatment to achieve a desirable outcome).
Let’s pretend I am working with a dataset that has data on the monthly spending by customers at a store (in $) and their satisfaction with the store (measured on a scale from 1 to 10). I create a scatter plot that looks like this:
This plot isn’t quite as smooth as the Ohm’s law one. But we understand humans aren’t quite as homogeneous as conductors, right? So, we don’t get too disheartened by the greater messiness of social data (e.g., not all dots falling on a line).
From the plot, it appears that a straight line (with an intercept) may be a pretty good fit for the data, as customer satisfaction tends to increase (almost) linearly as monthly spending increases.
Are you observing any similarity between the spending-satisfaction relationship and the voltage-current relationship? 🤔
I propose the following model to mathematically express the relationship:
Customer Satisfaction = 3.33 + 0.01*Monthly Spending
However, you propose a revision to the model: a quadratic term should be added. I accept your proposal and update the model as follows:
Customer Satisfaction = 3.2 + 0.01*Monthly Spending - 0.0001*(Monthly Spending)²
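To illustrate why this kind of revision can go on indefinitely, here is a rough sketch on simulated data. The numbers are hypothetical (this is not the dataset in the scatter plot), and the simulated relationship is deliberately a plain straight line. Even so, every extra polynomial term lowers the in-sample error a little, which says nothing about whether the richer model is a better description of how customers behave.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical customer data, simulated for illustration only.
spending = rng.uniform(0, 500, size=200)                        # dollars per month
satisfaction = 3.33 + 0.01 * spending + rng.normal(0, 1, 200)   # satisfaction score

def in_sample_sse(degree):
    """Sum of squared residuals of a polynomial fit of the given degree."""
    # Rescale spending to hundreds of dollars to keep the fit well conditioned.
    x = spending / 100.0
    coeffs = np.polyfit(x, satisfaction, deg=degree)
    residuals = satisfaction - np.polyval(coeffs, x)
    return float(residuals @ residuals)

# Linear, quadratic, and cubic fits to the same data.
sse = {degree: in_sample_sse(degree) for degree in (1, 2, 3)}

for degree, value in sse.items():
    print(f"degree {degree}: in-sample SSE = {value:.2f}")
```

Because each model contains the previous one as a special case, the in-sample error can only stay the same or shrink as terms are added.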
Maybe someone else proposes another mathematical model that fits the data even better. But wait for a second!
Let’s stop worrying about mathematics for a bit and compare Ohm’s law data and the customer data:
Given the above differences, it may be inappropriate to interpret the findings of your customer model the same way you interpreted the findings in the Ohm’s law case.
For example, solely based on the model you have, the following counterfactual and causal claims are possibly incorrect:
“Had the less-spending customers spent more, their satisfaction would have been higher”
“An increase in monthly spending causes an increase in customer satisfaction”
The correct interpretations are:
- Customers who spend more tend to be more satisfied
- Monthly spending is positively associated/correlated with customer satisfaction
Although the two interpretations mentioned above are technically correct, they are not helpful for your supervisor/manager/client who assigned you the data analysis responsibility to determine possible ways to increase customer satisfaction.
The interpretations are wishy-washy because you just described a passive observation, whereas they need an active policy solution.
The problem here is about more than proposing a mathematical model that fits the data; it requires identifying a policy change that would increase customer satisfaction.
You Need a Theory of How the World Works
Let’s pretend the process that generates our customer data can be expressed as follows:
So here is the theory/narrative to explain the data-generating process:
“Higher income customers tend to spend more at the store. Also, higher income customers tend to be more satisfied with the store (maybe because the store is high-end).”
Because I disregarded the data-generating process in the following model,
Customer Satisfaction = 3.33 + 0.01*Monthly Spending
I suggested that a $1 increase in monthly spending would increase customer satisfaction by 0.01 points (equivalently, a $100 increase would raise it by 1 point).
This suggestion may not be helpful from a policy perspective, though. For example, if the store manager raises product prices to make customers spend more, their satisfaction may actually go down!
Interesting, isn’t it?
In the above example, we observed a case in which we can predict the value of an outcome variable based on the value of an explanatory variable. Let’s call this a factual prediction. However, for policy advice, we need a counterfactual prediction (i.e., we need to predict what would have happened had we actively changed the value of something, keeping all else constant). Unfortunately, despite being pretty good at factual prediction, our model is not helpful for counterfactual prediction.
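The gap between factual and counterfactual prediction can be sketched with a small simulation. Everything here is an assumption made for illustration: income is the only common cause, the coefficients are made up, and spending is given no effect at all on satisfaction. Under those assumptions, the observed data still show a positive spending-satisfaction slope, but the slope vanishes once spending is set from outside, independently of income.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Assumed data-generating process (illustrative numbers only):
# income drives both spending and satisfaction, and spending itself
# has NO effect on satisfaction in this toy world.
income = rng.normal(60, 15, n)                          # thousands of dollars per year
spending = 5 * income + rng.normal(0, 40, n)            # dollars per month
satisfaction = 2 + 0.06 * income + rng.normal(0, 1, n)

# Factual prediction: slope of satisfaction on spending in the observed data.
observed_slope = np.polyfit(spending, satisfaction, 1)[0]

# Counterfactual world: spending is set from outside, independently of income,
# while satisfaction is still generated by income alone.
forced_spending = rng.uniform(100, 500, n)
satisfaction_after = 2 + 0.06 * income + rng.normal(0, 1, n)
intervention_slope = np.polyfit(forced_spending, satisfaction_after, 1)[0]

print(f"slope in observed data:   {observed_slope:.4f}")
print(f"slope under intervention: {intervention_slope:.4f}")
```

In this toy world, the first slope comes out close to 0.01 and the second close to zero, so a model that predicts well as a passive observer would mislead a manager who acts on it.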
How do we deal with this issue? The general solution is to figure out the common causes of the treatment and outcome variables and add them to your model as additional explanatory variables. I describe this briefly in another blog:
What Does It Mean to Control for a Variable in Regression?
A detailed explanation
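As a hedged sketch of what adding a common cause to the model looks like, the snippet below simulates a toy world in which income drives both spending and satisfaction, and spending has no effect of its own. The coefficients, the variable names, and the small `ols` helper are all hypothetical choices for illustration. It assumes income is the only common cause and that it is measured, which is rarely guaranteed with real data.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

# Assumed toy world: income is the only common cause, and the true
# effect of spending on satisfaction is zero.
income = rng.normal(60, 15, n)                          # thousands of dollars per year
spending = 5 * income + rng.normal(0, 40, n)            # dollars per month
satisfaction = 2 + 0.06 * income + rng.normal(0, 1, n)

def ols(columns, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

naive = ols([spending], satisfaction)             # [intercept, spending]
adjusted = ols([spending, income], satisfaction)  # [intercept, spending, income]

print(f"spending coefficient without income: {naive[1]:.4f}")
print(f"spending coefficient with income:    {adjusted[1]:.4f}")
print(f"income coefficient:                  {adjusted[2]:.4f}")
```

Once income enters the model, the spending coefficient drops toward zero (its true value in this simulation), while the income coefficient lands near the assumed 0.06.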
Curve Fitting is Extremely Useful, Though
If you have read this far, you may wonder why almost all data science education focuses on curve fitting if these practices may not be helpful for policy advice.
In many real-world applications, all you need to do is propose a model that can reliably predict the value of some Y based on the values of some Xs.
Let’s pretend you work for an insurance company. The insurer wants to determine which potential insurees will most likely go bankrupt. This is a scenario in which you may be happy with a model that can make factual predictions (and not necessarily counterfactual predictions).
Assume that you have a dataset that has only two variables. You know the number of traffic violations someone has committed in the previous year and whether they are bankrupt. Based on the two variables, you propose the following model:
Bankrupt = 0.05 + 0.1*Number of Traffic Violations
Let’s pretend that using the above model, you can reasonably predict the probability of an insuree going bankrupt.
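For a sense of how such a model is used for factual prediction, here is a minimal sketch that plugs a few values into the equation above. The function name is hypothetical, and capping the result at 1 is my own addition, since a linear model of a probability can otherwise exceed 1 for large inputs.

```python
def predicted_bankruptcy_probability(violations: int) -> float:
    """Predicted probability of bankruptcy from the article's linear model."""
    if violations < 0:
        raise ValueError("violations must be non-negative")
    raw = 0.05 + 0.1 * violations
    # Cap at 1.0: a safeguard added for this sketch, not part of the article's model.
    return min(raw, 1.0)

# Predictions for people with 0 to 3 traffic violations last year.
predictions = {k: round(predicted_bankruptcy_probability(k), 2) for k in range(4)}
print(predictions)  # {0: 0.05, 1: 0.15, 2: 0.25, 3: 0.35}
```

These are predictions for people as we observe them; nothing in the calculation says what would happen to anyone whose violation count was changed.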
How about proposing a policy solution based on the above model? It appears that the probability of bankruptcy would be lowest if we set the number of traffic violations to 0. However, proposing that we can drastically reduce the incidence of bankruptcy by pardoning all traffic violations (i.e., setting the value to 0) would hardly be great policy advice.
**I wish someone had suggested the following at the beginning of my data science education**
Before modeling any data,
First, it is a great idea to ruminate over the data-generating process.
Second, we should ask ourselves why we are analyzing the data. Are we just trying to predict the value of an outcome variable as a passive observer, or are we trying to propose a piece of policy advice to an active policymaker?
Once we clearly understand these two issues, then and only then should we think about the appropriate mathematical modeling.
If you would like to read some more beginner-level articles on causal inference, here are some suggestions:
Regression and Causal Inference: Which Variables Should Be Added to The Model?
Struggle and (Potential) Remedy
How to Explore the Effect of Doing Something? (Part 1)
Applied Causal Inference 101: Counterfactual Worlds and The Experimental Ideal
How to Explore the Effect of Doing Something? (Part 2)
Applied Causal Inference 101: Non-Experimental Data