Analyzing Survey Data in R: A Crash Course (Part 1)

A step-by-step lesson

Vivekananda Das
10 min read · Oct 13, 2023
Photo by Prateek Katyal on Unsplash

In this crash course, I will share 15 common tasks for survey data analysis in R. I plan to share these tasks in three parts.

This course is most suitable for beginners who want to quickly learn coding in R to begin survey data analysis for their research.

**You can find the video lectures for this crash course on the course website 😊**

I am assuming you know some fundamentals of R. For example, you know the following operators:

<- (assignment)

= (assignment)

== (equal)

> (greater than)

< (less than)

>= (greater than or equal)

<= (less than or equal)

!= (not equal)

| (or)

& (and)

Before we get started

Throughout this course, I will use publicly available real-world datasets. Also, I am assuming you have RStudio installed on your computer. In case you don’t, please use the cloud version of RStudio.

This article discusses the following:

Task 1: Importing a .csv file directly from the web

Task 2: Creating a subset (Selecting specific rows)

Task 3: Selecting specific columns

Task 4: Creating a binary (a.k.a. dummy) variable

Task 5: Renaming the levels of a categorical variable

Task 6: Creating a new categorical variable

Task 7: Creating a summary statistics table

Task 8: Creating a barplot

Extra Task: Creating a grouped/stacked barplot

Task 1: Importing a .csv file directly from the web

In this article, I am going to use data from the National Financial Well-Being Survey, conducted by the Consumer Financial Protection Bureau.

To download any .csv file openly available on the internet, we can use the read.csv( ) function. We do two things here:

  1. Put the URL inside the read.csv( ) function.
  2. Name the dataframe. For example, I named it data.

data <- read.csv("https://www.consumerfinance.gov/documents/5614/NFWBS_PUF_2016_data.csv")
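Before moving on, it can help to confirm that an import worked as expected. The following is only a minimal sketch: it reads a tiny made-up CSV from a text string so that it runs without an internet connection. The names toy_csv and toy and all the values are hypothetical, not taken from the survey. The same inspection functions can then be applied to data.

```r
# A tiny made-up CSV (hypothetical values, not the real survey data)
toy_csv <- "ID,PPINCIMP,FWBscore
1,2,45
2,7,61
3,4,52"

# read.csv() can read from a character string through its text argument
toy <- read.csv(text = toy_csv)

dim(toy)    # number of rows and number of columns
names(toy)  # column names
str(toy)    # type of each column
head(toy)   # first few rows
```

For the real dataset, dim(data) and head(data) give a quick first impression of what was downloaded.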

Task 2: Creating a subset (Selecting specific rows)

Oftentimes, the target population of our research interest is a subgroup, rather than the whole population.

For example, let’s pretend we are interested in households with less than $50,000 in annual income. In other words, we want to analyze data from only those respondents who lived in a household with less than $50,000 in annual income.

To get this subset, first I need to look at the survey codebook and see how the income variable has been coded.

So, in this dataset, the income variable has been named PPINCIMP. Also, selecting levels 1, 2, 3, and 4 would provide us with the responses for the respondents living in households with less than $50,000 annual income.

To get this subset, we are going to use the filter( ) function from the dplyr package. Let’s install the package, load it, and run the code:

install.packages("dplyr") #Installs the package

library(dplyr) #Loads the package

income50k <- data %>% filter(PPINCIMP<=4) #Gets the <$50k income subset

Once you run the above code, you will see the following:

So, now, we have the main dataframe (data) and the subset that has only lower-income respondents (income50k).

You can find the income categories inside this new dataframe:

table(income50k$PPINCIMP)

The income50k dataset only has respondents with less than $50,000 annual household income.
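As a cross-check, the same kind of subset can be built with the subset( ) function from base R and then verified with a couple of quick sanity checks. This is a sketch on a small made-up data frame (toy and toy_income50k are hypothetical names), not a replacement for the dplyr code above.

```r
# Made-up data (hypothetical values): PPINCIMP holds income codes
toy <- data.frame(ID = 1:6,
                  PPINCIMP = c(1, 4, 5, 9, 3, 2))

# Keep only the rows whose income code is 4 or lower
toy_income50k <- subset(toy, PPINCIMP <= 4)

# Sanity checks: no code above 4 should remain, and the row count should drop
max(toy_income50k$PPINCIMP)
nrow(toy)
nrow(toy_income50k)
```

On the real data, max(income50k$PPINCIMP) and nrow(income50k) serve the same purpose.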

Task 3: Selecting specific columns

Sometimes, we may want to reduce the number of columns in our dataset. In other words, to explore our research questions, we may need only certain variables, and not others.

For example, let’s pretend that in the income50k dataset, the variables that we are interested in are: PPGENDER, PPHHSIZE, PPINCIMP, FWBscore, and finalwt.

To select the above columns/variables, we can do the following:

#Select specific columns in R
income50k <- income50k %>% select(PPGENDER,PPHHSIZE,PPINCIMP,FWBscore,finalwt)

Task 4: Creating a binary (a.k.a. dummy) variable

Let’s pretend we are not interested in all the education categories that our dataset provides. Rather, we want to compare people with less than a college degree and people with a college degree or above.

In other words, we are thinking about a binary education variable which would take a value of 1 if a respondent has a college degree or above and a 0 if their educational attainment is less than a college degree.

Let’s find the levels of the PPEDUC variable:

table(data$PPEDUC)

Let’s understand what these numbers refer to based on the information provided in the codebook:

Okay, here is the plan:

  1. Name the new variable COLLEGE.
  2. COLLEGE takes a value of 1 if PPEDUC >= 4; COLLEGE takes a value of 0 if PPEDUC < 4.

Let’s use the ifelse( ) function to create this new variable:

data$COLLEGE<- ifelse(data$PPEDUC>=4,1,0)

Let’s see the levels of COLLEGE:
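One way to run this check is to tabulate the new variable and cross-tabulate it against the original one. The sketch below uses a small made-up data frame (toy is a hypothetical name, and the education codes are invented for illustration); on the real data, the equivalent call would be table(data$COLLEGE).

```r
# Made-up education codes (hypothetical values, not the survey data)
toy <- data.frame(PPEDUC = c(1, 2, 3, 4, 5, 4, 2))

# Same rule as in the plan: 1 if PPEDUC >= 4, otherwise 0
toy$COLLEGE <- ifelse(toy$PPEDUC >= 4, 1, 0)

# Counts for each level of the new variable
table(toy$COLLEGE)

# Cross-tabulation: each PPEDUC code should fall under exactly one COLLEGE value
table(toy$PPEDUC, toy$COLLEGE)
```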

Awesome! 😊

Task 5: Renaming the levels of a categorical variable

Let’s check the PPINCIMP variable’s coding one more time:

Note that the levels of the PPINCIMP variable are 1, 2, …, 9, which aren’t very informative on their own. Therefore, we would like to rename the levels to make our analyses easier to read.

To rename the levels of this categorical variable, we can use the recode( ) function from the dplyr package.

Here is the plan:

  1. Inside the recode( ) function, first, we enter the name of the dataframe (data), then $, then the name of the column (PPINCIMP). The same expression appears on the left-hand side of the assignment operator (<-), because we want to overwrite the PPINCIMP column of the data dataframe with its renamed levels.
  2. Next, we follow this pattern: “Old name of a level” = “New name of the level”. We do the same for each level.

data$PPINCIMP <-  recode(data$PPINCIMP,"1"="Less than $20,000", 
"2"="$20,000 to $29,999",
"3"="$30,000 to $39,999",
"4"="$40,000 to $49,999",
"5"="$50,000 to $59,999",
"6"="$60,000 to $74,999",
"7"="$75,000 to $99,999",
"8"="$100,000 to $149,999",
"9"="$150,000 or more")

table(data$PPINCIMP)

It worked! 😊
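An alternative worth knowing, shown here only as a sketch and not as the approach used in this article, is the factor( ) function from base R. It attaches labels to codes and also remembers the order of the categories. The names toy_income, income_labels, and toy_income_f are hypothetical, and the codes are made up.

```r
# Made-up income codes (hypothetical values)
toy_income <- c(1, 9, 4, 2, 9)

income_labels <- c("Less than $20,000", "$20,000 to $29,999", "$30,000 to $39,999",
                   "$40,000 to $49,999", "$50,000 to $59,999", "$60,000 to $74,999",
                   "$75,000 to $99,999", "$100,000 to $149,999", "$150,000 or more")

# levels lists the original codes; labels lists the new names in the same order
toy_income_f <- factor(toy_income, levels = 1:9, labels = income_labels)

# Categories are listed from lowest to highest income, including empty ones
table(toy_income_f)
```

The design difference is that recode( ) applied to numbers returns plain text here, while factor( ) returns a categorical variable whose levels keep the order given in labels.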

Task 6: Creating a new categorical variable

Now let’s pretend we are interested in groups defined by the intersection of respondents’ gender identity and the generation they belong to. Let’s have a quick look at these two variables individually.

table(data$generation)
table(data$PPGENDER)

So, generation has four levels. Let’s check the codebook to learn what these numbers mean.

Again, let’s check the codebook to learn what the levels of PPGENDER mean.

Okay, so the generation variable has 4 levels and the PPGENDER variable has 2 levels. At the intersection of the two variables, there will be 4 x 2 = 8 levels.

However, the dataset doesn’t provide us with a separate variable with these 8 levels. Let’s create it ourselves!

This time, I am going to use the ifelse( ) function again, nesting several calls inside one another.

**Note that the following code may seem a bit challenging. But there is nothing to worry about! You don’t have to memorize anything. Just copy and paste the following code, tweak the variable names depending on the dataset you are using, and you will be all good!**

data$GENERATION.GENDER <- ifelse(data$PPGENDER==1 & data$generation==1, 'Male, Pre-Boomer',
ifelse(data$PPGENDER==1 & data$generation==2, 'Male, Boomer',
ifelse(data$PPGENDER==1 & data$generation==3, 'Male, Gen X',
ifelse(data$PPGENDER==1 & data$generation==4, 'Male, Millennial',
ifelse(data$PPGENDER==2 & data$generation==1, 'Female, Pre-Boomer',
ifelse(data$PPGENDER==2 & data$generation==2, 'Female, Boomer',
ifelse(data$PPGENDER==2 & data$generation==3, 'Female, Gen X',
'Female, Millennial')))))))

Here is a summary of my code shown above:

  1. I created a new variable named GENERATION.GENDER (that’s why I have data$GENERATION.GENDER at the beginning of the code).
  2. Because we have 8 categories at the intersection of the two variables (i.e., generation and PPGENDER), we write ifelse( ) conditions 7 times.
  3. The final category doesn’t need a condition because if the 7 conditions, referring to 7 categories, do not match for a respondent, they are going to be assigned to the final category (which is ‘Female, Millennial’).

Let’s check the levels of the GENERATION.GENDER variable:

table(data$GENERATION.GENDER)
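For comparison, here is a shorter way to build the same kind of labels in base R: look up each label by its numeric code and join the two pieces with paste( ). This is only a sketch on made-up data (toy, gender_labels, and generation_labels are hypothetical names), and it assumes the codes are consecutive whole numbers starting at 1, as they are in the ifelse( ) code above.

```r
# Made-up codes (hypothetical values): PPGENDER 1-2, generation 1-4
toy <- data.frame(PPGENDER   = c(1, 2, 2, 1, 2),
                  generation = c(1, 4, 2, 3, 4))

# Labels in code order, matching the categories used in the ifelse() version
gender_labels     <- c("Male", "Female")
generation_labels <- c("Pre-Boomer", "Boomer", "Gen X", "Millennial")

# Index each label vector by the numeric code, then join the two pieces
toy$GENERATION.GENDER <- paste(gender_labels[toy$PPGENDER],
                               generation_labels[toy$generation],
                               sep = ", ")

table(toy$GENERATION.GENDER)
```

The nested ifelse( ) version spells out every category explicitly, which some readers find easier to audit; the lookup version is shorter but depends on the label vectors being in the right order.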

Task 7: Creating a summary statistics table

Now, let’s do some real data analysis! 😁

Let’s pretend we want to know how financial well-being, indicated by a variable (FWBscore) on a scale ranging from 0 to 100, differs at the intersection of generation and gender.

Here is the plan:

  1. Use the group_by( ) function from the dplyr package. We want to group respondents based on the levels of the GENERATION.GENDER variable we created.
  2. We want these statistics for each level of GENERATION.GENDER: the number of respondents, mean FWBscore, median FWBscore, and standard deviation of FWBscore. We use these functions: n( ) → number of respondents; mean( ) → mean; median( ) → median; sd( ) → standard deviation.
  3. We use the summarise( ) function to get the above summary statistics.
  4. We use the round( ) function to round numbers to one place after the decimal.
  5. We put the table inside a dataframe called table.

Here is the code:

table <- data %>% group_by(GENERATION.GENDER) %>%
summarise(Count=n(),
Mean.FWBScore=round(mean(FWBscore),digits=1),
Median.FWBScore=round(median(FWBscore),digits=1),
SD.FWBScore=round(sd(FWBscore),digits=1)
)

The above table is interesting, right?

It looks like the oldest male respondents report the highest average financial well-being, whereas the youngest female respondents report the lowest. The intersection, apparently, matters.
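A quick way to double-check group means like these is to recompute them with the tapply( ) function from base R. The sketch below uses a made-up data frame (toy and group_means are hypothetical names, and the scores are invented), so the numbers are not survey results.

```r
# Made-up scores (hypothetical values, not the survey data)
toy <- data.frame(
  GENERATION.GENDER = c("Male, Boomer", "Male, Boomer",
                        "Female, Gen X", "Female, Gen X"),
  FWBscore          = c(60, 50, 40, 45)
)

# Mean of FWBscore within each group, rounded to one decimal place
group_means <- round(tapply(toy$FWBscore, toy$GENERATION.GENDER, mean),
                     digits = 1)
group_means
```

If the values from tapply( ) on the real data match the Mean.FWBScore column, the grouping and rounding worked as intended.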

A better way to present the above table would be to show the same in a barplot.

Let’s create a barplot!

Task 8: Creating a barplot

We are going to use the table we created earlier to make a barplot.

Here is the plan:

  1. Use the ggplot( ) function from the ggplot2 package.
  2. Inside the ggplot( ) function, first, put the name of the dataframe. The name of the dataframe here is table (as I named it earlier).
  3. Inside the aes( ) function, after “x = ”, put the name of the variable that defines the groups. In this case, the variable is GENERATION.GENDER, and therefore, I type x=GENERATION.GENDER. Similarly, after “y = ”, put the name of the outcome variable. In this case, the variable is Mean.FWBScore.
  4. Inside the geom_bar( ) function, type stat = "identity".

install.packages("ggplot2") #Installs the package

library(ggplot2) #Loads the package

library(dplyr) #Loads dplyr, in case it is not already loaded

ggplot(table, aes(x=GENERATION.GENDER, y=Mean.FWBScore)) +
geom_bar(stat = "identity")

Okay, here is how my basic barplot looks:

I want to improve it by doing the following:

  1. Flip the graph using the coord_flip( ) function.
  2. Change the theme using the theme_light( ) function.
  3. Change the y-axis label.
  4. No need for the x-axis label as the categories are self-explanatory.

I do the above by running the following code:

ggplot(table, aes(x=GENERATION.GENDER, y=Mean.FWBScore)) + 
geom_bar(stat = "identity")+
coord_flip()+
theme_light()+
labs(y="Average Financial Well-Being Score", x=" ")

Looks fairly decent!
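As a side note, ggplot2 also provides geom_col( ), which plots the y values exactly as given and so does the same job as geom_bar(stat = "identity"). The sketch below assumes ggplot2 is installed and uses a made-up summary table (toy_table and p are hypothetical names, and the means are invented).

```r
library(ggplot2)

# Made-up summary table (hypothetical values, not the survey results)
toy_table <- data.frame(
  GENERATION.GENDER = c("Male, Boomer", "Female, Boomer", "Female, Gen X"),
  Mean.FWBScore     = c(58.1, 57.4, 52.3)
)

# geom_col() uses the y values as bar heights, like geom_bar(stat = "identity")
p <- ggplot(toy_table, aes(x = GENERATION.GENDER, y = Mean.FWBScore)) +
  geom_col() +
  coord_flip() +
  theme_light() +
  labs(y = "Average Financial Well-Being Score", x = " ")

p  # printing the object draws the plot
```

Storing the plot in an object first is optional; it simply makes it easier to add more layers later.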

Extra Task: Creating a grouped/stacked barplot

Sometimes, we want to create a grouped or a stacked barplot.

For example, we may want to bring the household income (PPINCIMP) variable into the barplot we created above. To do that, first, we need to add PPINCIMP to the grouping used for the table:

table <- data %>% group_by(GENERATION.GENDER,PPINCIMP) %>%
summarise(Count=n(),
Mean.FWBScore=round(mean(FWBscore),digits=1),
Median.FWBScore=round(median(FWBscore),digits=1),
SD.FWBScore=round(sd(FWBscore),digits=1)
)

Next, we do the following:

ggplot(table, aes(x=GENERATION.GENDER, y=Mean.FWBScore,fill=factor(PPINCIMP))) +
geom_bar(stat="identity",position = "dodge")+
coord_flip()+
labs(x=" ",y="Average Financial Wellbeing Score",fill="Household Income")+
theme_minimal()

To convert the above grouped barplot into a stacked barplot, we make a small change to the code: we change position = "dodge" to position = "stack".

ggplot(table, aes(x=GENERATION.GENDER, y=Mean.FWBScore,fill=factor(PPINCIMP))) +
geom_bar(stat="identity",position = "stack")+
coord_flip()+
labs(x=" ",y="Average Financial Wellbeing Score",fill="Household Income")+
theme_minimal()

The stacked version of the barplot is harder to interpret: stacking adds the group means on top of one another, and a sum of means has no natural meaning here. For this analysis, I would therefore prefer to present the grouped barplot.

Before we end, you may have noticed that the income categories aren’t ordered properly. This is because the levels of factors are, by default, organized alphanumerically.

Below, I present the code to recreate the grouped barplot with income categories shown from lowest to highest:

#Reordering the levels of the PPINCIMP variable
table <- table %>% mutate(PPINCIMP = factor(PPINCIMP, levels = c("Less than $20,000",
"$20,000 to $29,999",
"$30,000 to $39,999",
"$40,000 to $49,999",
"$50,000 to $59,999",
"$60,000 to $74,999",
"$75,000 to $99,999",
"$100,000 to $149,999",
"$150,000 or more")))

#Creating a grouped barplot
ggplot(table, aes(x=GENERATION.GENDER, y=Mean.FWBScore,fill=PPINCIMP)) +
geom_bar(stat="identity",position = "dodge")+
coord_flip()+
labs(x=" ",y="Average Financial Wellbeing Score",fill="Household Income")+
theme_minimal()

A “smarter” way to write the above (i.e., chaining both steps with the pipe operator, %>%, which the dplyr package makes available) can be the following:

table %>% mutate(PPINCIMP = factor(PPINCIMP, levels = c("Less than $20,000",
"$20,000 to $29,999",
"$30,000 to $39,999",
"$40,000 to $49,999",
"$50,000 to $59,999",
"$60,000 to $74,999",
"$75,000 to $99,999",
"$100,000 to $149,999",
"$150,000 or more"))) %>%
ggplot(aes(x=GENERATION.GENDER, y=Mean.FWBScore,fill=PPINCIMP)) +
geom_bar(stat="identity",position = "dodge")+
coord_flip()+
labs(x=" ",y="Average Financial Wellbeing Score",fill="Household Income")+
theme_minimal()

Below, I present the links to parts 2 and 3 of this crash course:

** Except otherwise noted, all the pictures used in this article are screenshots from RStudio based on the author’s code**


Vivekananda Das

Sharing synthesized ideas on Data Analysis in R, Data Literacy, Causal Inference, and Wellbeing | Ph.D. candidate @UW-Madison | More: https://vivekanandadas.com