# Analyzing Survey Data in R: A Crash Course (Part 1)

## A step-by-step lesson


In this crash course, I will share the **ten most common tasks for survey data analysis in R**. I plan to share these tasks in two parts.

This course is most suitable for beginners who want to quickly learn coding in R to begin survey data analysis for their research.

I am assuming you know some fundamentals of R. For example, you know the following operators:

- `<-` (assignment)
- `=` (assignment)
- `==` (equal)
- `>` (greater than)
- `<` (less than)
- `>=` (greater than or equal)
- `<=` (less than or equal)
- `!=` (not equal)
- `|` (or)
- `&` (and)
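As a quick refresher, here is a minimal sketch of these operators at work. The numbers and the names `income` and `threshold` are made up for illustration and are not part of the survey data.

```r
# Toy values, made up for illustration
income <- c(15000, 42000, 58000, 97000)
threshold = 50000          # "=" also assigns, though "<-" is more common in R code

income < threshold         # TRUE TRUE FALSE FALSE
income >= threshold        # FALSE FALSE TRUE TRUE
income == 42000            # FALSE TRUE FALSE FALSE
income != 42000            # TRUE FALSE TRUE TRUE

# "|" and "&" combine conditions element by element
low_or_high <- income < 20000 | income > 90000    # TRUE FALSE FALSE TRUE
middle      <- income >= 20000 & income <= 90000  # FALSE TRUE TRUE FALSE
```

Comparisons like these are what we will put inside functions such as filter( ) and ifelse( ) in the tasks below.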

# Before we get started

Throughout this course, I will use publicly available real-world datasets. Also, I am assuming you have RStudio installed on your computer. In case you don’t, please use the cloud version of RStudio.

This article discusses the following:

- Task 0: Importing a .csv file directly from the web
- Task 1: Creating a subset
- Task 2: Creating a binary (aka dummy) variable
- Task 3: Renaming the levels of a categorical variable
- Task 4: Creating a new categorical variable
- Task 5: Creating a summary statistics table
- Task 6: Creating a barplot

# Task 0: Importing a .csv file directly from the web

In this article, I am going to use data from the National Financial Well-Being Survey, conducted by the Consumer Financial Protection Bureau.

To download any .csv file openly available on the internet, we can use the **read.csv( )** function. We do two things here:

- Put the URL inside the **read.csv( )** function.
- Name the dataframe. For example, I named it **data**.

`data <- read.csv("https://www.consumerfinance.gov/documents/5614/NFWBS_PUF_2016_data.csv")`
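After importing, it helps to take a quick look at what arrived. The sketch below uses a tiny made-up CSV written to a temporary file (the values and the name `toy` are illustrative assumptions, not the survey data), because read.csv( ) is called the same way with a local path as with a URL.

```r
# Toy CSV, made up for illustration (not the survey data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("PUF_ID,PPINCIMP,PPEDUC",
             "1,3,2",
             "2,7,4",
             "3,9,5"), csv_path)

toy <- read.csv(csv_path)  # same call as with a URL, just a local path

dim(toy)      # number of rows and columns
names(toy)    # column names
head(toy, 2)  # first two rows
str(toy)      # column types at a glance
```

Running the same four inspection calls on **data** is a reasonable first check that the download worked.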

# Task 1: Creating a subset

Oftentimes, the target population of our research interest is a subgroup, rather than the whole population.

For example, let’s pretend we are interested in households with less than $50,000 in annual household income. In other words, we want to analyze data from only those respondents who lived in a household with less than $50,000 in annual income.

To get this subset, first I need to look at the survey codebook and see how the income variable has been coded.

So, in this dataset, the income variable has been named **PPINCIMP**. Also, selecting levels 1, 2, 3, and 4 would provide us with the responses for the respondents living in households with less than $50,000 annual income.

To get this subset, we are going to use the **filter( )** function from the **dplyr** package. Let’s install the package, load it, and run the code:

`install.packages("dplyr") # Installs the package`

`library(dplyr) # Loads the package`

`income50k <- data %>% filter(PPINCIMP <= 4) # Gets the <$50k income subset`

Once you run the above code, you will see the following:

So, now, we have the main dataframe (**data**) and the subset that has only lower-income respondents (**income50k**).

You can check the income categories inside this new dataframe:

`table(income50k$PPINCIMP)`

The **income50k** dataset only has respondents with less than $50,000 annual household income.
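If you would like to see the same idea without dplyr, here is a minimal sketch using base R's subset( ) function. The toy dataframe and the name `low_income` are made up for illustration; comparing row counts before and after is a simple sanity check.

```r
# Toy data, made up for illustration
toy <- data.frame(PUF_ID = 1:6, PPINCIMP = c(1, 4, 5, 9, 2, 4))

# Base R equivalent of: toy %>% filter(PPINCIMP <= 4)
low_income <- subset(toy, PPINCIMP <= 4)

nrow(toy)                   # rows before filtering
nrow(low_income)            # rows after filtering
table(low_income$PPINCIMP)  # only levels 1 to 4 should remain
```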

# Task 2: Creating a binary (aka dummy) variable

Let’s pretend we are not interested in all the education categories that our dataset provides. Rather, we want to know something about *people with less than a college degree* and *people with a college degree or above*.

In other words, we are thinking about a binary education variable which would take a value of *1 if a respondent has a college degree or above* and a *0 if their educational attainment is less than a college degree*.

Let’s check the levels of the **PPEDUC** variable:

`table(data$PPEDUC)`

Let’s understand what these numbers refer to based on the information provided in the codebook:

Okay, here is the plan:

- Name the new variable COLLEGE.
- COLLEGE takes a value of 1 if PPEDUC >= 4; COLLEGE takes a value of 0 if PPEDUC < 4.

Let’s use the **ifelse( )** function to create this new variable:

`data$COLLEGE<- ifelse(data$PPEDUC>=4,1,0)`

Let’s see the levels of **COLLEGE**:

Awesome! 😊
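One way to double-check a new binary variable is to cross-tabulate it against the original one. The sketch below does this on a made-up toy dataframe (the values, including the NA standing in for a missing answer, are assumptions for illustration).

```r
# Toy data, made up for illustration; NA stands in for a missing answer
toy <- data.frame(PPEDUC = c(1, 2, 3, 4, 5, 4, NA))
toy$COLLEGE <- ifelse(toy$PPEDUC >= 4, 1, 0)

table(toy$COLLEGE)                               # counts of 0s and 1s
table(toy$PPEDUC, toy$COLLEGE, useNA = "ifany")  # rows: PPEDUC, columns: COLLEGE
```

In the cross-tabulation, education levels 1 to 3 should appear only under 0 and levels 4 and 5 only under 1. Note that ifelse( ) leaves a missing value as NA rather than assigning it a 0 or a 1.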

# Task 3: Renaming the levels of a categorical variable

Let’s check the PPINCIMP variable’s coding one more time:

Note that the levels of the **PPINCIMP** variable are 1, 2, ..., and 9, which aren’t very helpful. Therefore, we would like to rename the levels to make our analyses more comprehensible.

To rename the levels of this categorical variable, we can use the **recode( )** function from the **dplyr** package.

Here is the plan:

- Inside the **recode( )** function, first, we enter the name of the dataframe (**data**), then `$`, then the name of the column (**PPINCIMP**). This is what we have on the left-hand side of the assignment operator (`<-`) as well. We are doing this because we want to rename the levels of the **PPINCIMP** variable, which is located inside the **data** dataframe as a column.
- Next, we follow this pattern: “Old name of a level” = “New name of the level”. And we do the same for each level.

`data$PPINCIMP <- recode(data$PPINCIMP, "1" = "Less than $20,000",`

`"2" = "$20,000 to $29,999",`

`"3" = "$30,000 to $39,999",`

`"4" = "$40,000 to $49,999",`

`"5" = "$50,000 to $59,999",`

`"6" = "$60,000 to $74,999",`

`"7" = "$75,000 to $99,999",`

`"8" = "$100,000 to $149,999",`

`"9" = "$150,000 or more")`

`table(data$PPINCIMP)`

It worked! 😊
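A possible alternative, sketched below on made-up toy data, is base R's factor( ) function with the levels and labels arguments. Text labels are typically tabulated in alphabetical order, whereas a factor keeps the order you give it. The column name `INCOME.LABEL` is hypothetical; writing to a new column also keeps the numeric codes available for code like the filter in Task 1.

```r
# Toy data, made up for illustration
toy <- data.frame(PPINCIMP = c(1, 8, 2, 9, 8))

income_labels <- c("Less than $20,000", "$20,000 to $29,999", "$30,000 to $39,999",
                   "$40,000 to $49,999", "$50,000 to $59,999", "$60,000 to $74,999",
                   "$75,000 to $99,999", "$100,000 to $149,999", "$150,000 or more")

# New column (hypothetical name) so the numeric codes stay available
toy$INCOME.LABEL <- factor(toy$PPINCIMP, levels = 1:9, labels = income_labels)

table(toy$INCOME.LABEL)  # rows appear in income order, including empty levels
```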

# Task 4: Creating a new categorical variable

Now let’s pretend we are interested in people at the intersection of their gender and generation. Let’s have a quick look at these two variables individually.

`table(data$generation)`

`table(data$PPGENDER)`

So, **generation** has four levels. Let’s check the codebook to learn what these numbers mean.

Again, check the codebook to learn what the levels of **PPGENDER **mean.

Okay, so the **generation** variable has 4 levels and the **PPGENDER** variable has 2 levels. At the intersection of the two variables, there will be 4 x 2 = 8 levels.

However, the dataset doesn’t provide us with a separate variable with these 8 levels. Let’s create it ourselves!

Once again, I am going to use the **ifelse( )** function, this time nesting one call inside another.

*Note that the following code may seem a bit challenging. But nothing to worry about! You don’t have to memorize anything. Just copy and paste the following code, tweak the variable names depending on the dataset you are using, and you will be all good!*

`data$GENERATION.GENDER <- ifelse(data$PPGENDER==1 & data$generation==1, 'Male, Pre-Boomer',`

`ifelse(data$PPGENDER==1 & data$generation==2, 'Male, Boomer',`

`ifelse(data$PPGENDER==1 & data$generation==3, 'Male, Gen X',`

`ifelse(data$PPGENDER==1 & data$generation==4, 'Male, Millennial',`

`ifelse(data$PPGENDER==2 & data$generation==1, 'Female, Pre-Boomer',`

`ifelse(data$PPGENDER==2 & data$generation==2, 'Female, Boomer',`

`ifelse(data$PPGENDER==2 & data$generation==3, 'Female, Gen X',`

`'Female, Millennial')))))))`

Here is a summary of my code shown above:

- I created a new variable named **GENERATION.GENDER** (that’s why I have data$GENERATION.GENDER at the beginning of the code).
- Because we have 8 categories at the intersection of the two variables (i.e., **generation** and **PPGENDER**), we write **ifelse( )** conditions 7 times.
- The final category doesn’t need a condition because if the 7 conditions, referring to 7 categories, do not match for a respondent, they are going to be assigned to the final category (which is ‘*Female, Millennial*’).

Let’s check the levels of the **GENERATION.GENDER** variable:

`table(data$GENERATION.GENDER)`
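If the nested calls feel hard to read, a possible alternative is to build the label with paste( ) and two lookup vectors. This is only a sketch on made-up toy data; the names `gender_labels` and `generation_labels` are hypothetical, and it assumes the codes run from 1 upward exactly as described in the codebook.

```r
# Toy data, made up for illustration
toy <- data.frame(PPGENDER   = c(1, 2, 2, 1),
                  generation = c(1, 4, 2, 3))

# Lookup vectors: position i holds the label for code i
gender_labels     <- c("Male", "Female")
generation_labels <- c("Pre-Boomer", "Boomer", "Gen X", "Millennial")

# Pick each respondent's two labels by code, then join them with ", "
toy$GENERATION.GENDER <- paste(gender_labels[toy$PPGENDER],
                               generation_labels[toy$generation],
                               sep = ", ")

table(toy$GENERATION.GENDER)
```

One difference in behavior: with this approach, an unexpected code shows up as the text "NA" inside the label instead of silently falling into the final category.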

# Task 5: Creating a summary statistics table

Now, let’s do some real analysis! 😁

Let’s pretend we want to know how financial well-being, indicated by a variable (**FWBscore**) on a scale of 0 to 100, differs at the intersection of generation and gender.

Here is the plan:

- Use the **group_by( )** function from the **dplyr** package. We want to group respondents based on the levels of the **GENERATION.GENDER** variable we created.
- We want these statistics for each level of **GENERATION.GENDER**: the number of respondents, mean FWBscore, median FWBscore, and standard deviation of FWBscore. We use these functions: **n( )** → number of people; **mean( )** → mean; **median( )** → median; **sd( )** → standard deviation.
- We use the **summarise( )** function to get the above summary statistics.
- We use the **round( )** function to round numbers to one place after the decimal.
- We put the table inside a dataframe called **table**.

Here is the code:

`table <- data %>% group_by(GENERATION.GENDER) %>%`

`summarise(Count = n(),`

`Mean.FWBScore = round(mean(FWBscore), digits = 1),`

`Median.FWBScore = round(median(FWBscore), digits = 1),`

`SD.FWBScore = round(sd(FWBscore), digits = 1)`

`)`

The above table is interesting, right?

It looks like the oldest male respondents are financially the happiest, whereas the youngest female respondents are financially the least happy. The intersection matters.
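One caution before trusting group means: survey files often store non-response as special codes (for example, a negative number for "refused"), and those codes would drag an average down. Whether FWBscore contains such codes is something to check in the codebook. The sketch below assumes it might, using made-up toy data in which -1 is a hypothetical non-response code, and drops those rows before summarising.

```r
library(dplyr)

# Toy data, made up for illustration; -1 stands in for a hypothetical "refused" code
toy <- data.frame(
  GENERATION.GENDER = c("Male, Boomer", "Male, Boomer", "Female, Gen X",
                        "Female, Gen X", "Female, Gen X"),
  FWBscore          = c(60, 50, 55, -1, 45)
)

toy_table <- toy %>%
  filter(FWBscore >= 0) %>%            # drop non-response codes before averaging
  group_by(GENERATION.GENDER) %>%
  summarise(Count = n(),
            Mean.FWBScore = round(mean(FWBscore), digits = 1))

toy_table
```

Comparing Count with and without the filter( ) step shows how many respondents were dropped.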

A better way to present the above table would be to show the same in a barplot.

Let’s create a barplot!

# Task 6: Creating a barplot

We are going to use the table we created earlier to make a barplot.

Here is the plan:

- Use the **ggplot( )** function from the **ggplot2** package.
- Inside the **ggplot( )** function, first, put the name of the dataframe. The name of the dataframe here is **table** (as I named it earlier).
- Inside the **aes( )** function, after “x = ”, put the name of *the variable based on which your groups are categorized*. In this case, the variable is GENERATION.GENDER, and therefore, I type x=GENERATION.GENDER. Similarly, after “y = ”, put the name of the *outcome variable*. In this case, the variable is **Mean.FWBScore**.
- Inside the **geom_bar( )** function, type stat = "identity".

`install.packages("ggplot2") # Installs the package`

`library(ggplot2) # Loads the package`

`library(dplyr) # Already loaded in Task 1; harmless to load again`

`ggplot(table, aes(x=GENERATION.GENDER, y=Mean.FWBScore)) +`

`geom_bar(stat = "identity")`

Okay, here is how my basic barplot looks:

I want to improve it by doing the following:

- Flip the graph using the **coord_flip( )** function.
- Change the theme using the **theme_light( )** function.
- Change the y-axis label.
- Drop the x-axis label, as the categories are self-explanatory.

I do the above by running the following code:

`ggplot(table, aes(x=GENERATION.GENDER, y=Mean.FWBScore)) +`

`geom_bar(stat = "identity") +`

`coord_flip() +`

`theme_light() +`

`labs(y="Average Financial Well-Being Score", x=" ")`
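One optional tweak to the code above is to sort the bars by their mean score, which can make group differences easier to read. The sketch below uses a made-up toy summary table (the numbers are illustrative, not results from the survey) together with reorder( ) from base R and geom_col( ), which is ggplot2's shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Toy summary table, made up for illustration
toy_table <- data.frame(
  GENERATION.GENDER = c("Male, Boomer", "Female, Gen X", "Female, Millennial"),
  Mean.FWBScore     = c(58.2, 54.1, 51.3)
)

# reorder() sorts the categories by their mean score
p <- ggplot(toy_table, aes(x = reorder(GENERATION.GENDER, Mean.FWBScore),
                           y = Mean.FWBScore)) +
  geom_col() +
  coord_flip() +
  theme_light() +
  labs(y = "Average Financial Well-Being Score", x = NULL)

p
```

Setting x = NULL in labs( ) removes the axis label entirely, as an alternative to passing a blank string.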

The improved barplot looks fairly decent. And this is all we have for today!

Part 2 of the crash course is available here:

*Unless otherwise noted, all the pictures used in this article are screenshots from RStudio based on the author’s code.*
