
Analyzing Survey Data in R: A Crash Course (Part 2)

A step-by-step lesson

8 min read · Oct 21, 2023


Photo by Nick Fewings on Unsplash

Welcome to the second part of our crash course on survey data analysis in R!

**You can find the video lectures of this crash course on the course website😊**

In this article, I share with you the following lessons:

Task 1: Importing a .csv file from your local drive

Task 2: Merging multiple rounds/phases of a survey by common columns

Task 3: Creating a line chart

Task 4: Estimating an empirical model using linear regression

Task 5: Saving graphs and tables

Extra Task: Creating a grouped line chart

If you missed the first part of this course, here is the link to it:

For these tasks, I use data from the Survey of Household Economics and Decisionmaking (SHED) 2017–2022 conducted by the Federal Reserve Board. These datasets are publicly available here: https://www.federalreserve.gov/consumerscommunities/shed_data.htm

The goal of our analysis is to explore how people’s perceptions of economic conditions evolved between 2017 and 2022.

Task 1: Importing a .csv file from your local drive

Let’s follow these steps:

1. Download the 6 SHED files (2017–22) from the Federal Reserve Board website. Each file comes as a zip folder; open it and copy the .csv file located inside.
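(If you prefer to script this step, base R’s unzip( ) can extract the .csv files once the zips are downloaded. Here is a minimal sketch, assuming the zips sit in your working directory; the zip file names below are hypothetical, so match them to your actual downloads.)

#Optional: extract the .csv files from the downloaded zip folders
#(hypothetical zip names -- rename to match your downloads)
zip_files <- c("SHED_2017.zip", "SHED_2018.zip", "SHED_2019.zip",
               "SHED_2020.zip", "SHED_2021.zip", "SHED_2022.zip")
for (z in zip_files) unzip(z)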

2. Create a folder on your local drive and paste all the .csv files into it.


3. Next, create an R script and save it in the folder where all the SHED files are located. I named my script shed_code.


4. Next, click: Session → Set Working Directory → To Source File Location


5. Once you do the above, R prints a setwd( ) command in your console.


Copy the setwd(…) part and paste it into your script.

The next time you run the script, this line will point R to the right working directory.
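For example, the top of your script might then look like this (the path below is hypothetical; use whatever your console printed):

setwd("C:/Users/yourname/Documents/SHED") #hypothetical path to your SHED folder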

6. Import the 6 SHED files by running the following code:

#Import the 6 SHED datasets and name them 
shed2017<-read.csv("public2017.csv")
shed2018<-read.csv("public2018.csv")
shed2019<-read.csv("public2019.csv")
shed2020<-read.csv("public2020.csv")
shed2021<-read.csv("public2021.csv")
shed2022<-read.csv("public2022.csv")

Great! The 6 SHED files have been imported.
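As an optional sanity check, you can confirm that each file loaded as expected by looking at its dimensions and column names:

#Optional: quick sanity check on one of the imported datasets
dim(shed2017) #number of rows and columns
head(names(shed2017)) #first few column names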

Task 2: Merging multiple rounds/phases of a survey by common columns

Now, we want to merge the six SHED files and combine them into one grand dataset. We call this new dataset shed_data.

But how exactly do we do it? 🤔 Here is a plan: 👇🏾

  1. Create a Year variable in each dataset and store the relevant year in it. We are doing this so that once we merge these 6 datasets, we can identify which year each respondent belongs to.
  2. Create a shed_weight variable in each dataset and copy that dataset’s survey weight variable into it. Why do we need this? Annoyingly, different SHED datasets use different names for their weight variables, so we create a common weight variable to generate weighted statistics.
  3. Identify the columns that are common across the 6 datasets.
  4. Merge the 6 datasets using these common columns.

(1) Create a year variable (i.e., column) in each dataset

#Create a Year variable
shed2017$Year <- 2017
shed2018$Year <- 2018
shed2019$Year <- 2019
shed2020$Year <- 2020
shed2021$Year <- 2021
shed2022$Year <- 2022

(2) Create a common weight variable in each dataset

#Create a common weight variable (shed_weight) based on the most commonly used weights in each SHED
shed2022$shed_weight <- shed2022$weight_pop
shed2021$shed_weight <- shed2021$weight_pop
shed2020$shed_weight <- shed2020$weight_pop
shed2019$shed_weight <- shed2019$weight_pop
shed2018$shed_weight <- shed2018$weight2b
shed2017$shed_weight <- shed2017$weight3b

(3) Identify the common columns

To identify the common columns, run the following code:

#Identify the common columns across the 6 datasets
common_col_names <- Reduce(intersect, list(names(shed2017),
                                           names(shed2018),
                                           names(shed2019),
                                           names(shed2020),
                                           names(shed2021),
                                           names(shed2022)))

The above code may seem complicated, but do not worry at all!

It uses four functions: Reduce( ), intersect( ), list( ), and names( ).

You can easily use the same code for other survey datasets. All you need to change is the name of the datasets inside the names( ) function.
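If you are curious how the pieces fit together, here is a toy example with made-up column names. Reduce( ) applies intersect( ) across the list one pair at a time, leaving only the names shared by all vectors:

#Toy example of Reduce() + intersect()
v1 <- c("id", "age", "income")
v2 <- c("id", "age", "state")
v3 <- c("id", "age", "weight")
Reduce(intersect, list(v1, v2, v3)) #returns: "id" "age"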

(4) Merge by the common columns

We use base R’s rbind( ) together with the select( ) function from the dplyr package and merge the 6 datasets by running the following code:

install.packages('dplyr')

library(dplyr)

#Merging all 6 datasets by the common columns
#all_of() tells select() that common_col_names is an external vector of column names
shed_data <- rbind(shed2017 %>% select(all_of(common_col_names)),
                   shed2018 %>% select(all_of(common_col_names)),
                   shed2019 %>% select(all_of(common_col_names)),
                   shed2020 %>% select(all_of(common_col_names)),
                   shed2021 %>% select(all_of(common_col_names)),
                   shed2022 %>% select(all_of(common_col_names)))

So, now we have the 6 original SHED datasets plus the merged dataset, shed_data.

One final thing. The problem is that in the merged dataset, some weights are NA (i.e., missing). We want to eliminate these cases because, without doing so, we cannot generate weighted statistics.

nrow(shed_data %>% filter(is.na(shed_weight))) #this shows the total NAs

shed_data <- shed_data %>% filter(!is.na(shed_weight)) #this eliminates the NAs

So, we lost 260 out of 71,125 respondents. I suppose that is fine.


Task 3: Creating a line chart

Now that we have the merged dataset, we want to create a line chart that shows how people’s perceptions of the state of the economy varied between 2017 and 2022.

First, let’s understand the B7_b variable by finding it in the codebook. It captures how respondents rate the current economic conditions in the country.


Let’s look at the distribution of this variable:
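One quick way (among others) to inspect a categorical variable is the table( ) function:

#Distribution of B7_b
table(shed_data$B7_b)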

So, B7_b is a categorical variable with 5 levels (Excellent, Good, Only fair, Poor, and Refused). Let’s keep the first four categories, as the last one (i.e., Refused) is not useful for our analysis.

shed_data <- shed_data %>%
  filter(B7_b=="Excellent"|B7_b=="Good"|B7_b=="Only fair"|B7_b=="Poor")

Let’s run table(shed_data$B7_b) one more time to confirm that only the four categories remain.


Now, let’s create a binary/dichotomous variable (economy_poor) from this categorical variable (B7_b) using the ifelse( ) function. The economy_poor variable takes a value of 1 if B7_b == “Poor” and 0 otherwise.

#create a dummy indicating whether the respondent thinks the economic conditions are poor
shed_data$economy_poor <- ifelse(shed_data$B7_b=="Poor",1,0)

Let’s see the distribution of the economy_poor variable:
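Again, table( ) gives a quick look, and wrapping it in prop.table( ) converts the counts to proportions:

#Distribution of the economy_poor dummy
table(shed_data$economy_poor)
prop.table(table(shed_data$economy_poor)) #as proportions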

Now, let’s create a table that shows the proportion of respondents who perceived the economic condition to be poor in each year (2017–2022).

#create a table of the proportion of respondents saying the economic conditions are poor
shed_data %>% group_by(Year) %>%
summarise(Poor=round(mean(economy_poor)*100,digits=2))

The above code returns a table with the (unweighted) percentage for each year.
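Note that this table uses the unweighted mean. Since the line chart below uses the survey weights, you may want a weighted version of the table as well, for example:

#Weighted version of the same table
shed_data %>% group_by(Year) %>%
  summarise(Poor=round(weighted.mean(economy_poor,shed_weight)*100,digits=2))

The chart below recomputes exactly this weighted table before plotting.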

Let’s use this table to create our basic line chart:

install.packages('ggplot2')

library(ggplot2)

#create a line chart
shed_data %>% group_by(Year) %>%
  summarise(Poor=round(weighted.mean(economy_poor,shed_weight)*100,digits=2)) %>%
  ggplot(aes(y=Poor, x=factor(Year), group=1)) +
  geom_line(size=2) +
  labs(title="% of Americans Saying Current Economic Conditions Are Poor", y="Percent", x="") +
  theme_bw()

And we have our basic line chart!

Task 4: Estimating an empirical model using linear regression

This one is fairly straightforward!

Let’s estimate a model that predicts the probability that a respondent says the current economic conditions are poor.

We use linear regression for the estimation process, using the ppethm and Year variables as predictors.

(Of course, you can add many more predictors. I am showing you a simpler example).

I use the feols( ) function from the fixest package. I prefer this function for multiple reasons, including the ease of using heteroskedasticity-robust standard errors and of presenting results in tables.

I estimate two models and present them side-by-side:

Model 1: Poor = b0 + b1*ppethm + b2*Year + e

Model 2: Poor = b0 + b1*ppethm + b2*Year + b3*ppethm*Year + e

install.packages('fixest')

library(fixest)

#estimate models using weighted least squares

model1 <- feols(economy_poor~factor(ppethm)+factor(Year), data=shed_data, weights=shed_data$shed_weight)

model2 <- feols(economy_poor~factor(ppethm)+factor(Year)+factor(ppethm):factor(Year), data=shed_data, weights=shed_data$shed_weight, se="hetero") #using heteroskedasticity-robust standard errors

etable(model1,model2,digits = 2)

Task 5: Saving graphs and tables

Let’s save the above regression table as a .csv file in the folder where the R code and the SHED data files are located.

We do the following:

  1. Save the regression table as a dataframe (let’s name it reg_table)
  2. Use the write.csv( ) function to export the file

Let’s run the following code:

reg_table <- etable(model1,model2)

write.csv(reg_table,"reg_table.csv")

As we can see, the reg_table.csv file has been exported to my working folder.

To export the line chart we created earlier, we use the ggsave( ) function immediately after the code used to generate the graph.

shed_data %>% group_by(Year) %>%
  summarise(Poor=round(weighted.mean(economy_poor,shed_weight)*100,digits=2)) %>%
  ggplot(aes(y=Poor, x=factor(Year), group=1)) +
  geom_line(size=2) +
  labs(title="% of Americans Saying Current Economic Conditions Are Poor", y="Percent", x="") +
  theme_bw()

ggsave("economy_poor_line_chart.png", width = 13, height = 9, units = "in")
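By default, ggsave( ) saves the most recently displayed plot. If you prefer to be explicit, you can grab the plot as an object and pass it via the plot argument:

#Alternative: save the plot explicitly via an object
p <- last_plot() #or assign the ggplot pipeline above to p directly
ggsave("economy_poor_line_chart.png", plot = p, width = 13, height = 9, units = "in")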

And, just like the table, the figure has been exported to my working folder:


Extra Task: Creating a grouped line chart

Suppose you would like to investigate how the percentage of people saying the economic conditions are poor differs between Non-Hispanic White respondents and Other respondents.

In SHED, the ppethm variable shows people’s racial and ethnic identities.


Let’s make a non_hispanic_white dummy variable using the ifelse( ) function:

#create a Non-Hispanic White dummy
shed_data$non_hispanic_white <- ifelse(shed_data$ppethm=="White, Non-Hispanic","Non-Hispanic White", "Other")

Now, let’s create a line chart with two lines (one for Non-Hispanic White respondents and one for the Other respondents):

#create a line chart
shed_data %>% group_by(Year,non_hispanic_white) %>%
  summarise(Poor=round(weighted.mean(economy_poor,shed_weight)*100,digits=2)) %>%
  ggplot(aes(y=Poor, x=factor(Year), group=factor(non_hispanic_white), linetype=factor(non_hispanic_white))) +
  geom_line(size=2) +
  labs(title="% of Americans Saying Current Economic Conditions Are Poor", y="Percent", x="", linetype="Group") +
  theme(axis.text.x = element_text(size=15),
        panel.background = element_blank(),
        plot.title = element_text(face = "bold", size = 25, hjust = 0.5),
        axis.text.y = element_text(size=15),
        axis.title.x = element_text(size=15),
        axis.title.y = element_text(size=15),
        legend.text = element_text(size=15),
        legend.position = "top",
        legend.title = element_text(size=15))

Thank you so much for following along 😊

Here is the link to the third and final part of this crash course:

**If you found this article helpful, please consider following me! Also, please consider becoming an Email Subscriber so that you receive an email next time I post something!**
