# Analyzing Survey Data in R: A Crash Course (Part 1)

## A step-by-step lesson

In this crash course, I will share the ten most common tasks for survey data analysis in R. I plan to share these tasks in two parts.

This course is most suitable for beginners who want to quickly learn coding in R to begin survey data analysis for their research.

I am assuming you know some fundamentals of R. For example, you know the following operators:

`<-` (assignment)

`=` (assignment)

`==` (equal)

`>` (greater than)

`<` (less than)

`>=` (greater than or equal)

`<=` (less than or equal)

`!=` (not equal)

`|` (or)

`&` (and)
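If any of these feel rusty, here is a quick refresher on a made-up vector of income codes (the values are invented for illustration and are not taken from the survey). Each comparison returns one TRUE or FALSE per element, which is exactly what we rely on later when filtering rows.

```r
# Toy vector of income codes (made-up values, not from the survey)
income_code <- c(1, 4, 5, 9)

x = 3  # `=` also assigns, although `<-` is the more common R style

is_four  <- income_code == 4                        # equal to 4?
not_four <- income_code != 4                        # different from 4?
low      <- income_code <= 4                        # 4 or below?
extremes <- income_code < 2 | income_code > 8       # below 2 OR above 8
middle   <- income_code >= 4 & income_code <= 5     # between 4 AND 5 inclusive

low
extremes
middle
```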

# Before we get started

Throughout this course, I will use publicly available real-world datasets. Also, I am assuming you have RStudio installed on your computer. In case you don’t, please use the cloud version of RStudio.

Task 0: Importing a .csv file directly from the web

Task 1: Creating a subset

Task 2: Creating a binary (aka dummy) variable

Task 3: Renaming the levels of a categorical variable

Task 4: Creating a new categorical variable

Task 5: Creating a summary statistics table

Task 6: Creating a barplot

# Task 0: Importing a .csv file directly from the web

In this article, I am going to use data from the National Financial Well-Being Survey, conducted by the Consumer Financial Protection Bureau.

To download any .csv file openly available on the internet, we can use the read.csv( ) function. We do two things here:

1. Put the URL inside the read.csv( ) function.
2. Name the dataframe. For example, I named it data.

`data <- read.csv("https://www.consumerfinance.gov/documents/5614/NFWBS_PUF_2016_data.csv")`
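After importing, it is worth taking a quick look at what you actually got. The sketch below does not need an internet connection: it reads a tiny made-up CSV through the text argument of read.csv( ) and then inspects it. The id column and all the values are invented for illustration; only the PPINCIMP and PPEDUC names are borrowed from the survey.

```r
# A tiny made-up CSV standing in for the survey file (hypothetical values)
csv_text <- "id,PPINCIMP,PPEDUC
1001,3,2
1002,7,5
1003,1,4"

# read.csv() can read from a string via the `text` argument
toy <- read.csv(text = csv_text)

dim(toy)     # number of rows and columns
names(toy)   # column names
str(toy)     # type of each column
head(toy)    # first few rows
```

The same four inspection calls work on the real data dataframe once the download has finished.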

# Task 1: Creating a subset

Oftentimes, the target population of our research interest is a subgroup, rather than the whole population.

For example, let’s pretend we are interested in households with less than \$50,000 in annual household income. In other words, we want to analyze data from only those respondents who lived in a household with less than \$50,000 in annual income.

To get this subset, first I need to look at the survey codebook and see how the income variable has been coded.

So, in this dataset, the income variable has been named PPINCIMP. Also, selecting levels 1, 2, 3, and 4 would provide us with the responses for the respondents living in households with less than \$50,000 annual income.

To get this subset, we are going to use the filter( ) function from the dplyr package. Let’s install the package, load it, and run the code:

`install.packages("dplyr") # Installs the package`

`library(dplyr) # Loads the package`

`income50k <- data %>% filter(PPINCIMP <= 4) # Gets the <\$50k income subset`

Once you run the above code, you will see the following:

So, now, we have the main dataframe (data) and the subset that has only lower-income respondents (income50k).

You can check the income categories inside this new dataframe:

`table(income50k\$PPINCIMP)`

The income50k dataframe only has respondents with less than \$50,000 in annual household income.
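A quick way to convince yourself that a filter did what you wanted is to compare row counts before and after, and to look at the range of the filtering variable. Here is a small sketch on a made-up dataframe (toy, toy_low, and the values are invented for illustration; only the PPINCIMP name comes from the survey):

```r
library(dplyr)

# Made-up income codes; in the real data these come from PPINCIMP
toy <- data.frame(id = 1:6, PPINCIMP = c(1, 4, 5, 9, 2, 4))

# Keep only the rows whose income code is 4 or below
toy_low <- toy %>% filter(PPINCIMP <= 4)

nrow(toy)                # rows before filtering
nrow(toy_low)            # rows after filtering
range(toy_low$PPINCIMP)  # the maximum should not exceed 4
```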

# Task 2: Creating a binary (aka dummy) variable

Let’s pretend we are not interested in all the education categories that our dataset provides. Rather, we want to compare people with less than a college degree and people with a college degree or above.

In other words, we are thinking about a binary education variable which would take a value of 1 if a respondent has a college degree or above and a 0 if their educational attainment is less than a college degree.

Let’s check the levels of the PPEDUC variable:

`table(data\$PPEDUC)`

Let’s understand what these numbers refer to based on the information provided in the codebook:

Okay, here is the plan:

1. Name the new variable COLLEGE.
2. COLLEGE takes a value of 1 if PPEDUC >= 4; COLLEGE takes a value of 0 if PPEDUC < 4.

Let’s use the ifelse( ) function to create this new variable:

`data\$COLLEGE<- ifelse(data\$PPEDUC>=4,1,0)`

Let’s see the levels of COLLEGE:

`table(data\$COLLEGE)`

Awesome! 😊
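One habit I find useful after creating a dummy variable is to cross-tabulate it against the original variable, so I can see which old codes ended up as 0 and which as 1. The sketch below uses made-up education codes and deliberately includes a missing value to show that ifelse( ) leaves missing values as NA rather than turning them into 0. Whether the real PPEDUC column has any missing values is something to check in the data itself.

```r
# Made-up education codes, including one missing value
toy <- data.frame(PPEDUC = c(1, 2, 3, 4, 5, NA))

# Same rule as in the article: 1 if the code is 4 or above, otherwise 0
toy$COLLEGE <- ifelse(toy$PPEDUC >= 4, 1, 0)

# Cross-tabulate the original codes against the new dummy;
# useNA = "ifany" adds a row/column for missing values when there are any
check <- table(toy$PPEDUC, toy$COLLEGE, useNA = "ifany")
check
```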

# Task 3: Renaming the levels of a categorical variable

Let’s check the PPINCIMP variable’s coding one more time:

Note that the levels of the PPINCIMP variable are 1, 2, …, 9, which aren’t very informative. Therefore, we would like to rename the levels to make our analyses easier to read.

To rename the levels of this categorical variable, we can use the recode( ) function from the dplyr package.

Here is the plan:

1. Inside the recode( ) function, first, we enter the name of the dataframe (data), then \$, then the name of the column (PPINCIMP). We have the same thing on the left-hand side of the assignment operator (<-), because we want to rename the levels of the PPINCIMP variable, which is a column inside the data dataframe.
2. Next, we follow this pattern: “Old name of a level” = “New name of the level”. We do the same for each level.

`data\$PPINCIMP <- recode(data\$PPINCIMP, "1" = "Less than \$20,000", "2" = "\$20,000 to \$29,999", "3" = "\$30,000 to \$39,999", "4" = "\$40,000 to \$49,999", "5" = "\$50,000 to \$59,999", "6" = "\$60,000 to \$74,999", "7" = "\$75,000 to \$99,999", "8" = "\$100,000 to \$149,999", "9" = "\$150,000 or more")`

`table(data\$PPINCIMP)`

It worked! 😊
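One thing to keep in mind: the recoded column now holds text, and table( ) sorts text labels alphabetically, so a label such as “\$100,000 to \$149,999” can end up listed ahead of “\$20,000 to \$29,999”. If you want the categories to stay in codebook order, a base R alternative is to build a factor with explicit levels and labels. This is just a sketch on made-up codes, not a replacement for the approach above:

```r
# Made-up income codes (the real variable is PPINCIMP, coded 1 to 9)
income_code <- c(1, 9, 2, 8, 2)

# Labels listed in the same order as the codes 1 to 9
income_labels <- c("Less than $20,000",
                   "$20,000 to $29,999",
                   "$30,000 to $39,999",
                   "$40,000 to $49,999",
                   "$50,000 to $59,999",
                   "$60,000 to $74,999",
                   "$75,000 to $99,999",
                   "$100,000 to $149,999",
                   "$150,000 or more")

# factor() pairs each code in `levels` with the label in the same position
income_factor <- factor(income_code, levels = 1:9, labels = income_labels)

table(income_factor)  # categories appear in codebook order, not alphabetically
```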

# Task 4: Creating a new categorical variable

Now let’s pretend we are interested in groups defined by the intersection of gender and generation. Let’s have a quick look at these two variables individually.

`table(data\$generation)`

`table(data\$PPGENDER)`

So, generation has four levels. Let’s check the codebook to learn what these numbers mean.

Again, check the codebook to learn what the levels of PPGENDER mean.

Okay, so the generation variable has 4 levels and the PPGENDER variable has 2 levels. At the intersection of the two variables, there will be 4 x 2 = 8 levels.

However, the dataset doesn’t provide us with a separate variable with these 8 levels. Let’s create it ourselves!

Once again, I am going to use the ifelse( ) function, this time nested several levels deep.

Note that the following code may seem a bit challenging. But nothing to worry about! You don’t have to memorize anything. Just copy and paste the following code, tweak the variable names depending on the dataset you are using, and you will be all good!

`data\$GENERATION.GENDER <- ifelse(data\$PPGENDER==1 & data\$generation==1, 'Male, Pre-Boomer', ifelse(data\$PPGENDER==1 & data\$generation==2, 'Male, Boomer', ifelse(data\$PPGENDER==1 & data\$generation==3, 'Male, Gen X', ifelse(data\$PPGENDER==1 & data\$generation==4, 'Male, Millennial', ifelse(data\$PPGENDER==2 & data\$generation==1, 'Female, Pre-Boomer', ifelse(data\$PPGENDER==2 & data\$generation==2, 'Female, Boomer', ifelse(data\$PPGENDER==2 & data\$generation==3, 'Female, Gen X', 'Female, Millennial')))))))`

Here is a summary of my code shown above:

1. I created a new variable named GENERATION.GENDER (that’s why I have data\$GENERATION.GENDER at the beginning of the code).
2. Because we have 8 categories at the intersection of the two variables (i.e., generation and PPGENDER), we write ifelse( ) conditions 7 times.
3. The final category doesn’t need a condition because if the 7 conditions, referring to 7 categories, do not match for a respondent, they are going to be assigned to the final category (which is ‘Female, Millennial’).

Let’s check the levels of the GENERATION.GENDER variable:

`table(data\$GENERATION.GENDER)`
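If the nested ifelse( ) calls feel hard to read, here is an alternative sketch that builds the same kind of label by looking up each code in a vector of names and gluing the two pieces together with paste( ). It assumes the codes are exactly 1 and 2 for PPGENDER and 1 to 4 for generation, with the same meanings as in the code above. The toy dataframe and the two label vectors are made up for illustration.

```r
# Made-up codes, following the coding used in the nested ifelse() call
toy <- data.frame(PPGENDER = c(1, 2, 2, 1), generation = c(1, 4, 2, 3))

# Position in each vector matches the numeric code (1 = first element, etc.)
gender_labels     <- c("Male", "Female")
generation_labels <- c("Pre-Boomer", "Boomer", "Gen X", "Millennial")

# Look up each code's label, then join the two labels with ", "
toy$GENERATION.GENDER <- paste(gender_labels[toy$PPGENDER],
                               generation_labels[toy$generation],
                               sep = ", ")

table(toy$GENERATION.GENDER)
```

The trade-off is that this version depends on the codes being consecutive positive integers, whereas the ifelse( ) version spells out every condition explicitly.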

# Task 5: Creating a summary statistics table

Now, let’s do some real analysis! 😁

Let’s pretend we want to know how financial well-being, indicated by a variable (FWBscore) on a scale of 0 to 100, differs at the intersection of generation and gender.

Here is the plan:

1. Use the group_by( ) function from the dplyr package. We want to group respondents based on the levels of the GENERATION.GENDER variable we created.
2. We want these statistics for each level of GENERATION.GENDER: the number of respondents, mean FWBscore, median FWBscore, and standard deviation of FWBscore. We use these functions: n( ) → number of people; mean( ) → mean; median( ) → median; sd( ) → standard deviation.
3. We use the summarise( ) function to get the above summary statistics.
4. We use the round( ) function to round numbers to one place after the decimal.
5. We put the table inside a dataframe called table.

Here is the code:

`table <- data %>% group_by(GENERATION.GENDER) %>% summarise(Count = n(), Mean.FWBScore = round(mean(FWBscore), digits = 1), Median.FWBScore = round(median(FWBscore), digits = 1), SD.FWBScore = round(sd(FWBscore), digits = 1))`

The above table is interesting, right?

It looks like the oldest male respondents are financially the happiest, whereas the youngest female respondents are the least happy financially. The intersection matters.
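A word of caution before trusting group means: survey files sometimes store nonresponse as special numeric codes, and those codes would drag the mean down if left in. The sketch below assumes, purely for illustration, that a negative value marks nonresponse and drops such rows before summarising. The toy values and the -4 code are made up; the codebook is the authority on whether FWBscore has codes like this.

```r
library(dplyr)

# Made-up scores; -4 stands in for a hypothetical nonresponse code
toy <- data.frame(
  GENERATION.GENDER = c("Male, Boomer", "Male, Boomer", "Male, Boomer",
                        "Female, Gen X", "Female, Gen X"),
  FWBscore = c(60, 70, -4, 50, 54)
)

toy_summary <- toy %>%
  filter(FWBscore >= 0) %>%            # drop the hypothetical nonresponse code
  group_by(GENERATION.GENDER) %>%      # one group per label
  summarise(Count = n(),
            Mean.FWBScore = round(mean(FWBscore), digits = 1),
            SD.FWBScore = round(sd(FWBscore), digits = 1))

toy_summary
```

Without the filter( ) step, the “Male, Boomer” mean in this toy example would be computed over 60, 70, and -4 instead of just 60 and 70.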

A better way to present the above table would be to show the same in a barplot.

Let’s create a barplot!

# Task 6: Creating a barplot

We are going to use the table we created earlier to make a barplot.

Here is the plan:

1. Use the ggplot( ) function from the ggplot2 package.
2. Inside the ggplot( ) function, first, put the name of the dataframe. The name of the dataframe here is table (as I named it earlier).
3. Inside the aes( ) function, after “x = ”, put the name of the variable that defines your groups. In this case, the variable is GENERATION.GENDER, and therefore, I type x=GENERATION.GENDER. Similarly, after “y = ”, put the name of the outcome variable. In this case, the variable is Mean.FWBScore.
4. Inside the geom_bar( ) function, type stat = "identity".

`install.packages("ggplot2") # Installs the package`

`library(ggplot2) # Loads ggplot2`

`library(dplyr) # Loads dplyr`

`ggplot(table, aes(x = GENERATION.GENDER, y = Mean.FWBScore)) + geom_bar(stat = "identity")`

Okay, here is how my basic barplot looks:

I want to improve it by doing the following:

1. Flip the graph using the coord_flip( ) function.
2. Change the theme using the theme_light( ) function.
3. Change the y-axis label.
4. No need for the x-axis label as the categories are self-explanatory.

I do the above by running the following code:

`ggplot(table, aes(x = GENERATION.GENDER, y = Mean.FWBScore)) + geom_bar(stat = "identity") + coord_flip() + theme_light() + labs(y = "Average Financial Well-Being Score", x = " ")`
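One optional tweak: by default the bars follow the alphabetical order of the group labels, which makes it harder to see the ranking. Wrapping the x variable in reorder( ) sorts the bars by the mean score instead. The sketch below uses a made-up three-row table (toy_table and its numbers are invented, not results from the survey) and geom_col( ), which is shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Made-up group means standing in for the summary table (hypothetical values)
toy_table <- data.frame(
  GENERATION.GENDER = c("Male, Pre-Boomer", "Female, Millennial", "Male, Gen X"),
  Mean.FWBScore = c(62.0, 51.5, 55.3)
)

# reorder() sorts the group labels by their mean score (lowest first)
p <- ggplot(toy_table,
            aes(x = reorder(GENERATION.GENDER, Mean.FWBScore),
                y = Mean.FWBScore)) +
  geom_col() +          # same as geom_bar(stat = "identity")
  coord_flip() +
  theme_light() +
  labs(y = "Average Financial Well-Being Score", x = " ")

p  # printing p draws the plot
```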

Looks fairly decent. And this is all we have for today!

Part 2 of the crash course is available here:

**Unless otherwise noted, all the pictures used in this article are screenshots from RStudio based on the author’s code.**