Analyzing Survey Data in R: A Crash Course (Part 3)

A step-by-step lesson

Vivekananda Das
Dec 25, 2023
Photo by Towfiqu barbhuiya on Unsplash

Welcome to the third and final part of our crash course on survey data analysis in R!

**You can find the video lectures of this crash course on the course website😊**

In this article, I am going to share with you the following lessons:

Task 1: Merging multiple segments of a survey by common columns

Task 2: Summarizing a continuous variable

If you missed the first two parts of this course, here are the links to them:

In this article, I am going to use data from the American Time Use Survey (ATUS) conducted annually by the U.S. Bureau of Labor Statistics. These datasets are publicly available here: https://www.bls.gov/tus/data.htm

We will use data from ATUS 2022.

Let’s pretend that the goal of our analysis is to get a sense of the duration of the walking trips made by Americans in 2022.

**To avoid confusion, note that the unit of analysis here is time spent walking “per trip” (not “per person per day”)**

Task 1: Merging multiple segments of a survey by common columns (e.g., Respondent IDs)

On the webpage that contains the ATUS 2022 data, you can find the following:

Note that there are multiple segments of the ATUS data. For example, ATUS 2022 Respondent file, ATUS 2022 Roster file, etc.

To merge multiple segments of ATUS into one dataframe, let’s follow these steps:

1. Download the ATUS 2022 Respondent file (zip) and the ATUS 2022 Activity file (zip). Open each zip folder and copy the .DAT file located inside it.

[You can merge other segments of the ATUS as well. However, I will keep things simple and create a combined dataset using only these two files.]

2. Create a folder in your local drive. Paste the two .DAT files in it. Here is how it looks on my computer:

3. Next, create an R script and save it in the folder where all the ATUS files are located. I named my script atus_analysis_code

4. Next, click: Session → Set Working Directory → To Source File Location

5. Once you do the above, you will see the following in your console:

Copy the setwd(…….) part and paste the code into your script.

Next time you run the code, it will direct R to the right place (which means you won’t have to click on anything).

#Setting the working directory
setwd("F:/Survey_data_analysis_in_R/ATUS_data_analysis")

6. Load the dplyr and ggplot2 packages:

#Load packages
library(dplyr)
library(ggplot2)

7. Import the two ATUS files by running the following code:

#Importing the two segments of ATUS 2022
act <- read.csv("atusact_2022.DAT")
resp <- read.csv("atusresp_2022.DAT")

8. Amazing! The two ATUS files have been imported.

9. Merge the two dataframes using the merge( ) function.

Note that the only common column/variable between the two dataframes is TUCASEID. Therefore, we want to combine them based on this common column.

#Merging the two segments of ATUS 2022 by the common column
data <- merge(resp, act, by="TUCASEID")

If you had multiple common columns — for example, let’s say X and Y other than TUCASEID — you would have used the c( ) function in the code to include them all in the merging process. Below, I share the code applicable to that hypothetical scenario:

data <- merge(resp, act, by=c("TUCASEID","X","Y"))

10. Great! Now, I can see the merged dataframe, called data, in the global environment.
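Because each respondent reports many activities, this is a one-to-many merge: every activity row picks up the columns of its respondent. The sketch below uses made-up toy data (the IDs, values, and the names toy_resp, toy_act, and toy_merged are hypothetical, not taken from ATUS) to show what to expect and one way to sanity-check the result:

```r
# Toy stand-ins for the Respondent and Activity files (values are invented)
toy_resp <- data.frame(TUCASEID   = c(101, 102, 103),
                       TUDIARYDAY = c(1, 4, 7))
toy_act  <- data.frame(TUCASEID = c(101, 101, 102, 102, 102),
                       TUACTDUR = c(10, 25, 5, 60, 8))

# One-to-many merge: each activity row gains its respondent's columns
toy_merged <- merge(toy_resp, toy_act, by = "TUCASEID")

# By default, merge() keeps only IDs found in both dataframes,
# so respondent 103 (no activities in the toy data) is dropped
nrow(toy_merged) == nrow(toy_act)   # TRUE: one row per activity
anyDuplicated(toy_resp$TUCASEID)    # 0 means respondent IDs are unique
```

With the real files, the same two checks (row count against act, and uniqueness of TUCASEID in resp) are a quick way to confirm the merge behaved as intended.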

Task 2: Summarizing a continuous variable

To investigate our research question, we need three variables from the ATUS, which are TEWHERE, TUACTDUR, and TUDIARYDAY.

Below, I present a detailed description of these variables by taking screenshots from the ATUS codebook.

Create a Subset that Has Observations on Walking Trips

First, let’s create a subset that contains observations on walking trips (i.e., TEWHERE==14):

#Creating a subset of walking trips
walking <- data %>% filter(TEWHERE==14)
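Each row of walking is one walking trip, so a respondent who walked three times contributes three rows. A small base R sketch with invented numbers (toy_walking and per_person are hypothetical names, not ATUS data) contrasts the per-trip view used in this lesson with per-person daily totals:

```r
# Invented example: respondent 101 made two walking trips, 102 made one
toy_walking <- data.frame(TUCASEID = c(101, 101, 102),
                          TUACTDUR = c(5, 15, 8))

# Per-trip view (what this lesson summarizes): one value per row
median(toy_walking$TUACTDUR)

# Per-person view (NOT used here): total walking minutes per respondent
per_person <- aggregate(TUACTDUR ~ TUCASEID, data = toy_walking, FUN = sum)
per_person
```

The two views can give quite different answers, which is why it helps to fix the unit of analysis before summarizing.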

Five Number Summary

We can use the fivenum( ) function to get summary statistics (minimum, 1st Quartile, Median, 3rd Quartile, and Maximum) of the TUACTDUR variable, which records the duration of each walking trip (in minutes):

#Five Number Summary of Time Spent per Walking Trip
fivenum(walking$TUACTDUR)

Based on the above, we learn the following:

Minimum=1; 1st Quartile=5; Median=8; 3rd Quartile=15; Maximum=362 (in minutes)
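Strictly speaking, fivenum( ) reports Tukey's hinges, which can differ slightly from the quartiles that quantile( ) and summary( ) compute by default. A quick sketch on a made-up vector (toy_dur is hypothetical, not ATUS data) shows the difference:

```r
# Made-up durations in minutes
toy_dur <- c(1, 5, 8, 15, 30, 362)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
fivenum(toy_dur)    # 1, 5, 11.5, 30, 362

# Default quantiles (type 7) interpolate between observations
quantile(toy_dur)   # 1, 5.75, 11.5, 26.25, 362
```

On a large sample such as the walking trips, the two usually agree closely, so either is fine for a first look.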

Histogram

Next, we will create a histogram by using functions from the ggplot2 package.

#Histogram

ggplot(walking, aes(x=TUACTDUR))+
geom_histogram(binwidth = 1)+
theme_classic()+
labs(x="Duration (in minutes)")

It looks like the tail of the histogram is really long although there are hardly any observations above 100 minutes. Therefore, we can limit the x-axis values between 0 and 100 by using the xlim( ) function:

#Histogram

ggplot(walking, aes(x=TUACTDUR))+
geom_histogram(binwidth = 1)+
theme_classic()+
labs(x="Duration (in minutes)")+
xlim(0,100)

We can play around with the binwidth argument of geom_histogram( ). For example, below I present a histogram using binwidth = 5.

#Histogram

ggplot(walking, aes(x=TUACTDUR))+
geom_histogram(binwidth = 5)+
theme_classic()+
labs(x="Duration (in minutes)")+
xlim(0,100)

Depending on the context of your research, you should pick your binwidth. It is difficult to provide a generic response on what the ideal binwidth is.
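If you want a starting point rather than pure trial and error, one common heuristic is the Freedman-Diaconis rule, which sets the binwidth to 2 * IQR / n^(1/3). This is only one option and is not part of the original analysis; the sketch below uses a made-up vector (toy_dur is hypothetical) in place of walking$TUACTDUR:

```r
# Made-up durations in minutes (stand-in for walking$TUACTDUR)
toy_dur <- c(1, 2, 5, 5, 8, 8, 10, 15, 20, 30, 45, 60)

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3)
fd_binwidth <- 2 * IQR(toy_dur) / length(toy_dur)^(1/3)
fd_binwidth

# The result can be rounded and passed to geom_histogram(binwidth = ...)
```

Treat the number as a suggestion: a rounder value such as 1 or 5 minutes is often easier for readers to interpret.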

Boxplot

Next, we can create a boxplot by using the following code:

[Note the similarity between the code for a boxplot and the code for a histogram]

#Boxplot

ggplot(walking,aes(x=TUACTDUR))+
geom_boxplot()+
theme_classic()+
labs(x="Duration (in minutes)")+
xlim(0,100)+
theme(axis.text.y=element_blank(), #eliminates y axis text
axis.ticks.y=element_blank()) #eliminates y axis ticks

The above boxplot gives us a sense of the distribution of the TUACTDUR variable for the whole sample (i.e., all walking trips in the walking dataframe).
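One caveat: xlim( ) does not just zoom in. It drops observations outside the limits before the boxplot statistics are computed, so trips longer than 100 minutes no longer influence the median and quartiles. If you would rather keep all trips in the calculation and only zoom the view, coord_cartesian( ) is an alternative. Here is a sketch with made-up data (toy_walking and p_zoom are hypothetical names):

```r
library(ggplot2)

# Made-up trip durations, including two trips longer than 100 minutes
toy_walking <- data.frame(TUACTDUR = c(5, 8, 15, 50, 150, 362))

# What a boxplot with xlim(0, 100) summarizes: only trips within the limits
median(toy_walking$TUACTDUR[toy_walking$TUACTDUR <= 100])   # 11.5

# What a boxplot with coord_cartesian() summarizes: all trips
median(toy_walking$TUACTDUR)                                # 32.5

# Zooms the x-axis to 0-100 without dropping any observations
p_zoom <- ggplot(toy_walking, aes(x = TUACTDUR)) +
  geom_boxplot() +
  theme_classic() +
  labs(x = "Duration (in minutes)") +
  coord_cartesian(xlim = c(0, 100))
```

In the ATUS data very few trips exceed 100 minutes, so the difference is likely small, but it is worth knowing which version you are showing.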

Now, let’s pretend that we want to incorporate the TUDIARYDAY variable in the boxplot and get separate boxplots for each day (i.e., Sunday, Monday, Tuesday, etc.). We can do that by using the following code:

#Boxplot

ggplot(walking,aes(x=TUACTDUR,fill=factor(TUDIARYDAY)))+
geom_boxplot()+
theme_classic()+
labs(x="Duration (in minutes)",fill="Diary Day")+
xlim(0,100)+
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank())

Looks good!
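To put numbers next to the picture, you can compute the same kind of summary by diary day. Below is a minimal base R sketch with invented values (toy_walking and day_medians are hypothetical; with the real data you would use the walking dataframe instead):

```r
# Invented trips: diary day 1 = Sunday, 2 = Monday
toy_walking <- data.frame(TUDIARYDAY = c(1, 1, 1, 2, 2),
                          TUACTDUR   = c(5, 10, 30, 8, 20))

# Median trip duration for each diary day
day_medians <- tapply(toy_walking$TUACTDUR, toy_walking$TUDIARYDAY, median)
day_medians
```

Swapping median for fivenum in the tapply( ) call returns the full five-number summary for each day.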

Lastly, we can bring one more change to the above boxplot.

To make it more professional, we can change the levels of the TUDIARYDAY variable from numbers to the names of the days so that our readers understand the boxplot more easily.

To do that, we run the following code:

#Boxplot

#Recoding the levels of the TUDIARYDAY variable
walking$TUDIARYDAY <- recode(walking$TUDIARYDAY,"1"="Sunday",
"2"="Monday",
"3"="Tuesday",
"4"="Wednesday",
"5"="Thursday",
"6"="Friday",
"7"="Saturday")

#Reordering the levels of the TUDIARYDAY variable and creating the boxplot
walking %>% mutate(TUDIARYDAY = factor(TUDIARYDAY, levels = c("Sunday",
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday")))%>%
ggplot(aes(x=TUACTDUR,fill=factor(TUDIARYDAY)))+
geom_boxplot()+
theme_classic()+
labs(x="Duration (in minutes)",fill="Diary Day")+
xlim(0,100)+
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank())
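As an aside, the recoding and the reordering can also be done in a single step with base R's factor( ), by supplying the numeric codes as levels and the day names as labels. This is an alternative sketch, not the approach used above, and it has to be applied to the original numeric codes (toy_days is a hypothetical stand-in for the unrecoded TUDIARYDAY column):

```r
# Hypothetical numeric diary-day codes (1 = Sunday, ..., 7 = Saturday)
toy_days <- c(7, 1, 3, 1)

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# levels = codes in calendar order, labels = names shown in the legend
toy_days_f <- factor(toy_days, levels = 1:7, labels = day_names)

levels(toy_days_f)         # Sunday through Saturday, in calendar order
as.character(toy_days_f)   # "Saturday" "Sunday" "Tuesday" "Sunday"
```

Because the levels are set explicitly, days that happen to be missing from the data still keep their place in the ordering.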

This is the end of our crash course. I hope you found these lessons useful!

Please let me know if you have any questions.

Thank you again for following along 😊

**If you found this article helpful, please consider following me! Also, please consider becoming an Email Subscriber so that you receive an email next time I post something!**

Vivekananda Das

Sharing synthesized ideas on Data Analysis in R, Data Literacy, Causal Inference, and Wellbeing | Ph.D. candidate @UW-Madison | More: https://vivekanandadas.com