Analyzing Survey Data in R: A Crash Course (Part 2)

A step-by-step lesson

Vivekananda Das
9 min read · Oct 21, 2023

Welcome to the second part of our crash course on survey data analysis in R!

**You can find the video lectures of this crash course on the course website 😊**

In this article, I am going to share with you the following lessons:

Task 1: Importing a .csv file from your local drive

Task 2: Merging multiple rounds/phases of a survey by common columns

Task 3: Creating a line chart

Task 4: Estimating an empirical model using linear regression

Task 5: Saving graphs and tables

Extra Task: Creating a grouped line chart

If you missed the first part of this course, here is the link to it:

In this article, I am going to use data from the Survey of Household Economics and Decisionmaking (SHED) conducted annually by the Federal Reserve Board. These datasets are publicly available here: https://www.federalreserve.gov/consumerscommunities/shed_data.htm

We will use data from the 2017–2022 waves of the SHED.

Let’s pretend that the goal of our analysis is to explore how people’s perceptions of economic conditions evolved between 2017 and 2022.

Task 1: Importing a .csv file from your local drive

Let’s follow these steps:

1. Download the 6 SHED files (2017–22) from the Federal Reserve Board website. These files come as zip folders; open each one and copy the .csv file inside.

2. Create a folder on your local drive and paste all the .csv files into it. Here is how it looks on my computer:

3. Next, create an R script and save it in the folder where all the SHED files are located. I named my script shed_code.

4. Next, click: Session → Set Working Directory → To Source File Location

5. Once you do the above, you will see the following in your console:

Copy the setwd(…….) part and paste the code into your script.

The next time you run your script, this line will point R to the right folder automatically (which means you won’t have to click through the menus again).

#Setting the working directory
setwd("E:/SHED/SHED_Analysis")

6. Import the 6 SHED files by running the following code:

#Import the 6 SHED datasets and name them 
shed2017<-read.csv("public2017.csv")
shed2018<-read.csv("public2018.csv")
shed2019<-read.csv("public2019.csv")
shed2020<-read.csv("public2020.csv")
shed2021<-read.csv("public2021.csv")
shed2022<-read.csv("public2022.csv")

Great! The 6 SHED files have been imported.
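By the way, if you prefer not to write one line per file, here is an optional sketch (assuming the files keep the public20XX.csv naming) that imports all 6 files in a loop:

#Optional: import all 6 SHED files in one go
#assumes files are named public2017.csv through public2022.csv
years <- 2017:2022
shed_list <- lapply(years, function(y) read.csv(paste0("public", y, ".csv")))
names(shed_list) <- paste0("shed", years)
#e.g., shed_list$shed2019 is the same as shed2019 above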

Task 2: Merging multiple rounds/phases of a survey by common columns

Now, we want to merge the 6 SHED files into one grand dataset. We will call this new dataset shed_data.

But how exactly do we do it? 🤔 Here is a plan 👇🏾

  1. Create a Year variable in each dataset and store the relevant year in it. We do this so that, once we merge the 6 datasets, we can still tell which year each respondent came from.
  2. Create a shed_weight variable in each dataset and copy each dataset’s survey weight into it. Why do we need this? One annoying thing is that the different SHED datasets use different names for their weight variables, so we create a common weight variable that we can later use to generate weighted statistics.
  3. Identify the columns that are common across the 6 datasets.
  4. Merge the 6 datasets using these common columns.

(1) Create a Year variable (i.e., column) in each dataset

#Create a Year variable
shed2017$Year <- 2017
shed2018$Year <- 2018
shed2019$Year <- 2019
shed2020$Year <- 2020
shed2021$Year <- 2021
shed2022$Year <- 2022

(2) Create a common weight variable in each dataset

#Create a common weight variable (shed_weight) based on the most commonly used weights in each SHED
shed2022$shed_weight <- shed2022$weight_pop
shed2021$shed_weight <- shed2021$weight_pop
shed2020$shed_weight <- shed2020$weight_pop
shed2019$shed_weight <- shed2019$weight_pop
shed2018$shed_weight <- shed2018$weight2b
shed2017$shed_weight <- shed2017$weight3b

(3) Identify the common columns

To identify the common columns, we run the following code:

#Identify the common columns across the 6 datasets
common_col_names <- Reduce(intersect, list(names(shed2017),
                                           names(shed2018),
                                           names(shed2019),
                                           names(shed2020),
                                           names(shed2021),
                                           names(shed2022)))

The above code may seem complicated but don’t worry at all!

It uses four functions: Reduce( ), intersect( ), list( ), and names( ).

You can easily reuse the same code for other survey datasets. All you need to change are the names of the datasets inside the names( ) function.
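If you are curious about what Reduce( ) and intersect( ) are actually doing, here is a tiny toy example (with made-up column names, just for illustration):

#Toy example: intersect( ) keeps the shared names; Reduce( ) applies it across the whole list
cols_a <- c("Year", "B7_b", "ppethm")      #hypothetical column names
cols_b <- c("Year", "B7_b", "weight_pop")
cols_c <- c("Year", "B7_b", "income")
Reduce(intersect, list(cols_a, cols_b, cols_c)) #returns "Year" "B7_b"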

(4) Merge by the common columns

We use the base R rbind( ) function together with the select( ) function from the dplyr package, and merge the 6 datasets by running the following code:

install.packages('dplyr')

library(dplyr)

#Merge all 6 datasets by the common columns
shed_data <- rbind(shed2017 %>% select(all_of(common_col_names)),
                   shed2018 %>% select(all_of(common_col_names)),
                   shed2019 %>% select(all_of(common_col_names)),
                   shed2020 %>% select(all_of(common_col_names)),
                   shed2021 %>% select(all_of(common_col_names)),
                   shed2022 %>% select(all_of(common_col_names)))

This is what I get:

So, now we have the 6 original SHED datasets and a merged dataset.
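To sanity-check the merge (a quick sketch; your exact numbers may differ slightly), you can look at the dimensions of shed_data and the number of respondents per year:

#Sanity-check the merged dataset
dim(shed_data) #rows = total respondents, columns = number of common columns
table(shed_data$Year) #respondents per survey year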

One final thing: in the merged dataset, some weights are NA (i.e., missing). We want to eliminate these cases because, without removing the missing weights, we won’t be able to generate weighted statistics.

nrow(shed_data %>% filter(is.na(shed_weight))) #this shows the total NAs

shed_data <- shed_data %>% filter(!is.na(shed_weight)) #this eliminates the NAs

So, we lost 260 of the 71,125 respondents. I suppose that’s fine.

Task 3: Creating a line chart

Now that we have the merged dataset, we want to create a line chart that will show how people’s perceptions of the state of the economy varied between 2017 and 2022.

First, let’s understand the B7_b variable by finding it in the codebook:

Let’s look at the distribution of this variable:
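A simple way to do this is with base R’s table( ) function:

#Distribution of B7_b
table(shed_data$B7_b)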

So, B7_b is a categorical variable with 4 substantive levels (Excellent, Good, Only fair, and Poor). Let’s drop the other two response categories, as they are not useful for our analysis.

shed_data <- shed_data %>%
  filter(B7_b %in% c("Excellent", "Good", "Only fair", "Poor"))

Let’s look at the distribution one more time to confirm that only the 4 levels remain:

Now, let’s create a binary/dichotomous variable (economy_poor) from this categorical variable (B7_b) using the ifelse( ) function. The economy_poor variable takes a value of 1 if B7_b == “Poor” and 0 otherwise.

#create a dummy indicating whether the respondent thinks the economic conditions are poor
shed_data$economy_poor <- ifelse(shed_data$B7_b=="Poor",1,0)

Let’s see the distribution of the economy_poor variable:
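Again, table( ) does the job:

#Distribution of the economy_poor dummy
table(shed_data$economy_poor)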

Now, let’s create a table that shows the proportion of respondents who perceived economic conditions to be poor in each year (2017–2022).

#create a table of proportion of respondents saying the economic conditions are poor
shed_data %>% group_by(Year) %>%
  summarise(Poor = round(mean(economy_poor)*100, digits = 2))

The above code returns the following output:

Let’s use the same kind of summary to create our basic line chart (note that this time we use weighted.mean( ) with shed_weight, so the chart shows population-weighted percentages):

install.packages('ggplot2')

library(ggplot2)

#create a line chart
shed_data %>% group_by(Year) %>%
  summarise(Poor = round(weighted.mean(economy_poor, shed_weight)*100, digits = 2)) %>%
  ggplot(aes(y = Poor, x = factor(Year), group = 1)) +
  geom_line(linewidth = 2) +
  labs(title = "% of Americans Saying Current Economic Conditions Are Poor", y = "Percent", x = "") +
  theme_bw()

And we have our basic line chart!

Task 4: Estimating an empirical model using linear regression

This one is fairly straightforward!

Let’s estimate a model that predicts the probability that a respondent says the current economic conditions are poor. We will use linear regression for the estimation and use the ppethm and Year variables as the predictors (of course, you can add many more predictors; I am just showing you a simple case).

I am going to use the feols( ) function from the fixest package. I prefer this function for multiple reasons, such as the ease of using heteroskedasticity-robust standard errors and of presenting results in tables.

I will estimate two models and present them side-by-side:

Model 1: Poor = b0 + b1*ppethm + b2*Year + e

Model 2: Poor = b0 + b1*ppethm + b2*Year + b3*ppethm*Year + e

install.packages('fixest')

library(fixest)

#estimate models using weighted least squares

model1 <- feols(economy_poor ~ factor(ppethm) + factor(Year), data = shed_data, weights = ~shed_weight)

model2 <- feols(economy_poor ~ factor(ppethm) + factor(Year) + factor(ppethm):factor(Year), data = shed_data, weights = ~shed_weight, se = "hetero") #using heteroskedasticity-robust standard errors

etable(model1,model2,digits = 2)

Task 5: Saving graphs and tables

Let’s save the above regression table as a .csv file in the folder where the R code and the SHED data files are located.

We do the following:

  1. Save the regression table as a dataframe (let’s name it reg_table)
  2. Use the write.csv( ) function to export the file

Let’s run the following code:

reg_table <- etable(model1, model2)

write.csv(reg_table, "reg_table.csv")

As we can see, the reg_table.csv file has been exported to my working folder.

We will end our course by exporting the line chart we created earlier.

To export the line chart, we use the ggsave( ) function immediately after the code used to generate the graph.

shed_data %>% group_by(Year) %>%
  summarise(Poor = round(weighted.mean(economy_poor, shed_weight)*100, digits = 2)) %>%
  ggplot(aes(y = Poor, x = factor(Year), group = 1)) +
  geom_line(linewidth = 2) +
  labs(title = "% of Americans Saying Current Economic Conditions Are Poor", y = "Percent", x = "") +
  theme_bw()

ggsave("economy_poor_line_chart.png", width = 13, height = 9, units = "in")

And, just like the table, the figure has been exported to my working folder:

Extra Task: Creating a grouped line chart

Let’s pretend you would like to investigate how the percentage of people saying the economic conditions are poor differs between Non-Hispanic White respondents and all other respondents.

In SHED, the ppethm variable shows people’s racial and ethnic identities.

Let’s make a non_hispanic_white dummy variable using the ifelse( ) function:

#create a Non-Hispanic White dummy
shed_data$non_hispanic_white <- ifelse(shed_data$ppethm == "White, Non-Hispanic", "Non-Hispanic White", "Other")
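To make sure the recoding worked as intended (a quick sketch), you can cross-tabulate the original variable against the new one; each ppethm category should fall under exactly one group:

#Check the recoding
table(shed_data$ppethm, shed_data$non_hispanic_white)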

Now, let’s create a line chart with two lines (one for Non-Hispanic White respondents and one for the Other respondents):

#create a grouped line chart
shed_data %>% group_by(Year, non_hispanic_white) %>%
  summarise(Poor = round(weighted.mean(economy_poor, shed_weight)*100, digits = 2)) %>%
  ggplot(aes(y = Poor, x = factor(Year), group = factor(non_hispanic_white), linetype = factor(non_hispanic_white))) +
  geom_line(linewidth = 2) +
  labs(title = "% of Americans Saying Current Economic Conditions Are Poor", y = "Percent", x = "", linetype = "Group") +
  theme(axis.text.x = element_text(size = 15),
        panel.background = element_blank(),
        plot.title = element_text(face = "bold", size = 25, hjust = 0.5),
        axis.text.y = element_text(size = 15),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        legend.text = element_text(size = 15),
        legend.position = "top",
        legend.title = element_text(size = 15))
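If you would like to export this grouped chart as well, ggsave( ) works exactly as before (the file name below is just a suggestion):

#save the grouped line chart
ggsave("economy_poor_grouped_line_chart.png", width = 13, height = 9, units = "in")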

Thank you so much for following along 😊

Here is the link to part 3 of this crash course:

** If you found this article helpful, please consider following me! Also, please consider becoming an Email Subscriber so that you receive an email next time I post something!**
