Visualizing Continuous Data with ggplot2 in R

A Step-by-Step Tutorial

In this article, we will discuss how to visualize the distribution of a continuous variable using the ggplot2 package in R. Specifically, we will learn how to make histograms, density plots, box plots, ridgeline plots, and violin plots in R, all in one 5-minute lesson!

For our purpose, rather than using a fake dataset, we are going to use the Financial Well-Being Survey 2016 data. This survey was conducted by the Consumer Financial Protection Bureau (CFPB), and the data is publicly available here: https://www.consumerfinance.gov/data-research/financial-well-being-survey-data/

Let’s download the data directly from the CFPB website and get started!

data <- read.csv("https://www.consumerfinance.gov/documents/5614/NFWBS_PUF_2016_data.csv")

Data Visualization Plan

We want to investigate how the distribution of the Financial Well-Being (FWB) score varies across different groups of people. In short, the FWB score can range from 0 to 100, and a higher score indicates greater financial well-being.

Before we start visualizing the distribution of the FWB score, we can take a quick look at its summary:

summary(data$FWBscore)

The summary shows some negative FWB scores. These are codes for non-responses in the survey rather than real scores, so we want to get rid of them.

data2<-data[data$FWBscore>=0,]
summary(data2$FWBscore)

Great! Now, in our data, the minimum score is 14 (financially least happy), the maximum score is 95 (financially happiest), and the mean score is 56.08. By eliminating the responses with negative FWB scores, we lost just 5 out of 6394 responses in the original dataset. Not bad!
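The filtering step can be tried out on a small made-up data frame first. The sketch below uses invented values (not the survey data) and assumes that negative numbers are the only non-response codes. It also guards against NA values, which a bare comparison would turn into all-NA rows.

```r
# Toy data: invented scores, with -1 and -4 standing in for non-response codes
toy <- data.frame(id = 1:6, FWBscore = c(55, -1, 62, NA, -4, 48))

# Keep only rows with a valid (non-missing, non-negative) score
toy_clean <- toy[!is.na(toy$FWBscore) & toy$FWBscore >= 0, ]

# Count how many rows the filter dropped
n_dropped <- nrow(toy) - nrow(toy_clean)
n_dropped
```

The same subtraction, nrow(data) - nrow(data2), gives the number of responses lost in the real data.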

The summary function gives us a preliminary idea of the distribution of the data. For example, because the mean and median are close, the distribution is likely to be fairly symmetrical. We can check this by quickly creating a histogram using the base R function hist.

hist(data2$FWBscore)
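To see why the gap between mean and median hints at the shape of a distribution, here is a small sketch on simulated data. The two samples and their parameters are arbitrary assumptions chosen for illustration; they are not drawn from the survey.

```r
set.seed(1)  # make the simulated samples reproducible

symmetric <- rnorm(1000, mean = 56, sd = 14)  # roughly bell-shaped sample
skewed    <- rexp(1000, rate = 1 / 56)        # sample with a long right tail

# Absolute gap between mean and median in each sample
gap_symmetric <- abs(mean(symmetric) - median(symmetric))
gap_skewed    <- abs(mean(skewed) - median(skewed))

c(symmetric = gap_symmetric, skewed = gap_skewed)
```

A small gap is only a hint, not proof of symmetry, which is why looking at the histogram is still worthwhile.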

Although the above histogram conveys much of the information that we need, its appearance can be improved considerably. So, let’s load all the packages required for this lesson:

library(ggplot2) 
library(ggridges) #for creating ridgeline plots
library(dplyr) #for recoding levels of a categorical variable

Basic Histogram

In my cheat sheet, I wrote down the following general code for creating a basic histogram with ggplot2:

ggplot(data_name, aes(x=continuous_variable_name)) + 
geom_histogram()

Next, we can build on this basic code to make the graph more appealing, accessible, and meaningful. For example, inside geom_histogram( ), we can set different values of binwidth, fill, color, and alpha. Also, for a professional presentation, we will label the axes and the legend and add a title to the graph by putting values inside labs( ).

To be honest, when I first started learning data visualization with ggplot2, this lengthy code seemed overwhelming and I wondered, “Why on earth should I use this package rather than other software in which I can just click buttons to produce the same thing?” However, with time, I realized that once we figure out the fundamental grammar of ggplot2, we can mostly copy and paste code and make small changes here and there to create beautiful figures.

Now, for the purpose of this lesson, the code looks like this:

ggplot(data2, aes(x=FWBscore)) + 
geom_histogram()

Let’s improve the histogram by i) choosing a different bin width (the histogram above used the default of bins = 30), ii) changing the fill and outline colors, and iii) adding a title to the graph.

ggplot(data2, aes(x=FWBscore)) + 
geom_histogram(binwidth=1, fill="#FF9999", color="#e9ecef", alpha=0.9) +
labs(title = "Distribution of FWB Score in the CFPB Financial Well-being Survey 2016 Data")
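To get a feel for how binwidth changes a histogram, and how labs( ) can also label the axes, here is a minimal sketch. It runs on simulated scores (a hypothetical stand-in for data2$FWBscore), and the axis label text is my own choice.

```r
library(ggplot2)

set.seed(123)
# Simulated stand-in for data2$FWBscore (hypothetical values, clipped to 0-100)
toy <- data.frame(FWBscore = pmin(pmax(round(rnorm(1000, 56, 14)), 0), 100))

# Narrow bins: one bar per score point
p_narrow <- ggplot(toy, aes(x = FWBscore)) +
  geom_histogram(binwidth = 1, fill = "#FF9999", color = "#e9ecef") +
  labs(title = "binwidth = 1", x = "FWB score", y = "Number of respondents")

# Wide bins: one bar per five score points
p_wide <- ggplot(toy, aes(x = FWBscore)) +
  geom_histogram(binwidth = 5, fill = "#FF9999", color = "#e9ecef") +
  labs(title = "binwidth = 5", x = "FWB score", y = "Number of respondents")

# layer_data() returns the computed bins, one row per bar
n_bars_narrow <- nrow(layer_data(p_narrow))
n_bars_wide   <- nrow(layer_data(p_wide))

p_narrow
p_wide
```

Narrow bins show more detail but also more noise, while wide bins give a smoother and coarser picture. Which is better depends on the story you want to tell.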

Groupwise Histogram

Next, we would like to see how the distribution of FWB score varies based on gender. In this dataset, PPGENDER==1 means male and PPGENDER==2 means female.

The general code for a groupwise histogram:

ggplot(data_name, aes(x=continuous_variable_name, fill=factor(categorical_variable_name))) + 
geom_histogram()

In our current context, we use the following code:

ggplot(data2, aes(x=FWBscore,fill=factor(PPGENDER))) + 
geom_histogram(binwidth=1, alpha=0.5, position = 'identity') +
labs(title = "Distribution of FWB Score for Men and Women in the CFPB Financial Well-being Survey 2016 Data",
fill = "Group")+
scale_fill_discrete(labels = c("Male", "Female"))
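Note that scale_fill_discrete(labels = ...) relies on the labels being listed in the same order as the factor levels. One alternative, sketched below on a made-up vector, is to attach the labels when creating the factor. The variable names gender_code and gender are hypothetical.

```r
# Toy stand-in for the PPGENDER column (1 = male, 2 = female, as coded in the survey)
gender_code <- c(2, 1, 1, 2, 2)

# Attach labels to the codes explicitly, so the mapping cannot be misordered
gender <- factor(gender_code, levels = c(1, 2), labels = c("Male", "Female"))

levels(gender)
table(gender)
```

With a labeled factor like this mapped to fill, the legend shows "Male" and "Female" without the scale_fill_discrete( ) line.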

From the above graph, it seems that the shapes of the FWB score distributions for the male and female groups are very similar, but the former group appears to have a slightly higher median FWB score. Now, we can create a density plot, either separately or right on top of the histogram, to get a better sense of the two distributions.
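A visual impression about medians can also be checked numerically. The sketch below computes the median by group on a tiny invented data frame. The numbers are made up for illustration and say nothing about the survey results.

```r
# Invented scores for two groups (1 = male, 2 = female)
toy <- data.frame(
  FWBscore = c(50, 60, 70, 40, 55, 65),
  PPGENDER = c(1, 1, 1, 2, 2, 2)
)

# Median score within each gender group
group_medians <- tapply(toy$FWBscore, toy$PPGENDER, median)
group_medians
```

On the real data, the equivalent call would be tapply(data2$FWBscore, data2$PPGENDER, median).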

Density Plot

#Density plot only
#after_stat(density) replaces the older ..density.. notation, which is deprecated since ggplot2 3.4.0
ggplot(data2, aes(x=FWBscore,y=after_stat(density),fill=factor(PPGENDER))) +
geom_density(alpha=0.25) +
labs(title = "Distribution of FWB Score for Men and Women in the CFPB Financial Well-being Survey 2016 Data",fill = "Group")+
scale_fill_discrete(labels = c("Male", "Female"))

#Density plot on top of a histogram
ggplot(data2, aes(x=FWBscore,y=after_stat(density),fill=factor(PPGENDER))) +
geom_histogram(position='dodge', binwidth=1) +
geom_density(alpha=0.25) +
labs(title = "Distribution of FWB Score for Men and Women in the CFPB Financial Well-being Survey 2016 Data",
fill = "Group")+
scale_fill_discrete(labels = c("Male", "Female"))
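Mapping y to the density statistic (written after_stat(density) in current ggplot2) is what puts the histogram on the same scale as the density curve: the bar areas then sum to 1 instead of the bar heights summing to the sample size. The sketch below checks this on simulated scores, which are a hypothetical stand-in for the survey data.

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in for the FWB scores (hypothetical values)
toy <- data.frame(score = round(rnorm(500, mean = 56, sd = 14)))

p <- ggplot(toy, aes(x = score, y = after_stat(density))) +
  geom_histogram(binwidth = 1, fill = "#FF9999") +
  geom_density(alpha = 0.25)

# Computed bars of the histogram layer (layer 1), one row per bin
bars <- layer_data(p, 1)

# Sum of bar areas: height (density) times width of each bin
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
total_area
```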

Box Plot

Next, we can create box plots to get a more precise picture of the differences between the two distributions.

ggplot(data2, aes(x=FWBscore, y = factor(PPGENDER), fill = factor(PPGENDER))) +
geom_boxplot() +
labs(title= "Distribution of FWB Score for Men and Women in the CFPB Financial Well-being Survey 2016 Data",
fill = "Group",
y =" ") +
scale_fill_discrete(labels = c("Male", "Female"))
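It may help to see the numbers behind a box plot. The sketch below computes, for a small invented vector of scores, the quartiles that form the box and the 1.5 × IQR fences that geom_boxplot uses by default to decide how far the whiskers reach and which points are drawn as outliers.

```r
# Invented scores for illustration only
scores <- c(20, 30, 40, 50, 60, 70, 80, 90, 100)

# Box edges (25% and 75%) and the middle line (50%)
q <- quantile(scores, probs = c(0.25, 0.5, 0.75))

# Interquartile range: the width of the box
iqr <- IQR(scores)

# Points beyond these fences are drawn individually as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

q
c(lower = lower_fence, upper = upper_fence)
```

The whiskers themselves stop at the most extreme data points that still fall inside the fences.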

We can also introduce another variable into this descriptive analysis by updating the above box plot. For example, we can explore whether the FWB score differs for men and women of different races/ethnicities. In this survey data, race/ethnicity is coded as follows: PPETHM == 1 (White, Non-Hispanic); PPETHM == 2 (Black, Non-Hispanic); PPETHM == 3 (Other, Non-Hispanic); PPETHM == 4 (Hispanic).

#Recoding the PPETHM variable
data2$PPETHM <- recode_factor(data2$PPETHM, "1" = "White",
"2" = "Black",
"3" = "Other",
"4" = "Hispanic")

#Box plot by race/ethnicity and gender
ggplot(data2, aes(x=FWBscore, y = factor(PPETHM), fill = factor(PPGENDER))) +
geom_boxplot() +
labs(title= "Distribution of FWB Score for Men and Women from Different Racial Groups \n in the CFPB Financial Well-being Survey 2016 Data", fill = "Group", y =" ") +
scale_fill_discrete(labels = c("Male", "Female"))

We can also show the shapes of the distributions of FWB scores for these eight groups using a ridgeline plot and a violin plot. Fortunately, we just have to copy the code for the box plot and replace geom_boxplot with geom_density_ridges or geom_violin, and we are done!
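Density-based plots can look deceptively smooth even when a group contains only a handful of respondents, so it can be worth counting the group sizes first. Here is a minimal sketch on an invented data frame with the same two grouping columns. The rows are made up and do not reflect the survey.

```r
# Invented rows with the two grouping variables used in the plots
toy <- data.frame(
  PPETHM = factor(
    c("White", "White", "Black", "Other", "Hispanic", "Black", "White", "Hispanic"),
    levels = c("White", "Black", "Other", "Hispanic")
  ),
  PPGENDER = c(1, 2, 1, 2, 1, 2, 1, 2)
)

# Cross-tabulate: one cell per race/ethnicity and gender combination
group_sizes <- table(toy$PPETHM, toy$PPGENDER)
group_sizes
```

On the real data, table(data2$PPETHM, data2$PPGENDER) gives the size of each of the eight groups.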

Ridgeline Plot

ggplot(data2, aes(x = FWBscore, y = factor(PPETHM), fill = factor(PPGENDER))) +
geom_density_ridges() +
labs(title= "Distribution of FWB Score for Men and Women from Different Racial Groups \n in the CFPB Financial Well-being Survey 2016 Data",
fill = "Group",
y =" ") +
scale_fill_discrete(labels = c("Male", "Female"))
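Because the male and female ridgelines sit on the same baseline, they can hide each other. One possible tweak, sketched below on simulated data, is to make the fills semi-transparent with alpha and to limit the vertical overlap between rows with scale. The data values and the chosen settings are assumptions for illustration.

```r
library(ggplot2)
library(ggridges)

set.seed(7)
# Simulated stand-in for data2 with two race groups and two genders
toy <- data.frame(
  FWBscore = c(rnorm(200, 58, 13), rnorm(200, 54, 13)),
  PPETHM   = factor(rep(c("White", "Black"), each = 100, times = 2)),
  PPGENDER = rep(c(1, 2), each = 200)
)

p_ridges <- ggplot(toy, aes(x = FWBscore, y = PPETHM, fill = factor(PPGENDER))) +
  # alpha makes overlapping ridges visible; scale = 1 keeps each ridge within its own row
  geom_density_ridges(alpha = 0.5, scale = 1) +
  labs(fill = "Group", y = " ") +
  scale_fill_discrete(labels = c("Male", "Female"))

p_ridges
```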

Violin Plot

ggplot(data2, aes(x = FWBscore, y = factor(PPETHM), fill = factor(PPGENDER))) +
geom_violin() +
labs(title= "Distribution of FWB Score for Men and Women from Different Racial Groups \n in the CFPB Financial Well-being Survey 2016 Data",
fill = "Group",
y =" ") +
scale_fill_discrete(labels = c("Male", "Female"))
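A violin plot shows the shape of each distribution but not its quartiles. One common way to combine both, sketched here on simulated data, is to draw a narrow box plot inside each violin. The data values, the widths, and the dodge setting are assumptions chosen for illustration.

```r
library(ggplot2)

set.seed(11)
# Simulated stand-in for data2 with two race groups and two genders
toy <- data.frame(
  FWBscore = c(rnorm(200, 58, 13), rnorm(200, 54, 13)),
  PPETHM   = factor(rep(c("White", "Black"), each = 100, times = 2)),
  PPGENDER = rep(c(1, 2), each = 200)
)

# Use the same dodge for both layers so each box sits inside its violin
dodge <- position_dodge(width = 0.9)

p_violin <- ggplot(toy, aes(x = FWBscore, y = PPETHM, fill = factor(PPGENDER))) +
  geom_violin(position = dodge, alpha = 0.6) +
  # outlier.shape = NA hides the box plot outlier points to reduce clutter
  geom_boxplot(width = 0.15, position = dodge, outlier.shape = NA) +
  labs(fill = "Group", y = " ") +
  scale_fill_discrete(labels = c("Male", "Female"))

p_violin
```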

To conclude, not all visualizations are useful in all contexts. So, as an analyst, you have to figure out which visualization communicates the story you are trying to tell most effectively. Of course, you can make many other modifications to the above plots to make them prettier and more informative. But I hope this article helps as a starting point in your data visualization learning journey!

Vivekananda Das

Sharing ideas on Cause-and-Effect, Data Analysis in R, Stat Literacy, and Happiness | Ph.D. student @UW-Madison | More: https://vivekanandadas.com
