Visualize Gapminder Data in R: A Step-by-Step Tutorial

Learn to visualize the impact of real-world events

Vivekananda Das
Nov 4, 2021
Photo by Adeolu Eletu on Unsplash

After finishing some introductory courses, I went looking for real-world datasets that are fun to work with but still pose some challenges. I could not have found a better source than Gapminder! Here is their goal, quoted directly from their website:

Gapminder fights devastating misconceptions and promotes a fact-based worldview everyone can understand.

In case you are not yet familiar with Gapminder, here is the link to their website: https://www.gapminder.org/data/

In this article, using multiple packages in R, I will explain how to visualize the data freely available on the Gapminder website, and how to tell a story with it.

As an example, we will visualize how per capita GDP, one of the critical indicators of economic growth, evolved in India and China between 1959 and 2019.

Step 1: Download the data

I download the GDP/capita (US$, inflation-adjusted) dataset as a .csv file directly from their website.

Step 2: Load the data in R

Next, I create a folder that holds both my R code (.R file) and the downloaded data file, gdppercapita_us_inflation_adjusted.csv. Then, in RStudio, I click Session → Set Working Directory → To Source File Location. And we are all set to start coding!

# Load the data and name it g (or whatever you wish to name it!)
g <- read.csv("gdppercapita_us_inflation_adjusted.csv")

Also, load all the required packages:

library(dplyr)
library(ggplot2)
library(tidyr)

Now, let’s have a quick look at the dataframe g to determine our next steps:
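The screenshot of g is not reproduced here. As a rough sketch, the base R calls below give the same kind of overview. Note that g_demo and all of its values are made up for illustration; they only mimic the layout described in this article (one row per country, one X-prefixed column per year).

```r
# Hypothetical stand-in for g: the values below are invented for illustration
g_demo <- data.frame(
  country = c("China", "India"),
  X1959 = c("100", "200"),
  X2019 = c("10.2k", "2150"),
  stringsAsFactors = FALSE
)

dim(g_demo)   # number of rows and columns
str(g_demo)   # column names and types
head(g_demo)  # first few rows
```

On the real data, the same three calls would be run on g instead of g_demo.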

Looking at the dataframe, I decide to do the following:

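1. Change the first column’s name to Country. The garbled name (ï..country) most likely comes from a byte-order mark at the start of the CSV file, which read.csv() folds into the first header. The rest of the code refers to this column as Country, so skip this step only if you adapt the later code.

# Change the name of the first column (by position, so it works whatever the garbled name is)
colnames(g)[1] <- "Country"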

Okay, so we learned how to change a column name in R!
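If the odd first column name bothers you, one possible way to avoid it is to tell read.csv() about the byte-order mark. The sketch below assumes the garbling really is caused by a UTF-8 byte-order mark (not verified against the Gapminder file); it writes a tiny invented CSV to a temporary file so that it can run on its own.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (EF BB BF).
# The file contents are invented for illustration.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("country,1959,1960\nIndia,300,310\n"), con)
close(con)

# "UTF-8-BOM" asks R to drop the byte-order mark while reading
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
first_name <- colnames(demo)[1]
print(first_name)

# Renaming by position works whether or not the mark was removed
colnames(demo)[1] <- "Country"
print(colnames(demo))
```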

2. The dataframe is in a wide format: every year has its own column. Ideally, we want it in a long format, with one row per Country and Year. We do that by creating a new column named Year and storing the matching value of per capita GDP in a Per_Capita_GDP column. I use the pivot_longer() function from the tidyr package to convert the data from wide to long format. (read.csv() prefixes the year column names with X because a name that starts with a digit is not a valid R name, which is why the code selects the columns starting with X and then strips that prefix.)

# Wide to long format
g2 <- g %>%
  pivot_longer(
    cols = starts_with("X"),
    names_to = "Year",
    names_prefix = "X",
    values_to = "Per_Capita_GDP",
    values_drop_na = TRUE
  )

And, we learned how to change the shape of a dataframe in R!
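To see what the reshaping does, here is a minimal sketch on a made-up two-country table (the wide data frame and its values are hypothetical). It uses the same pivot_longer() arguments as above.

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year
wide <- data.frame(
  Country = c("China", "India"),
  X1959 = c("100", "200"),
  X1960 = c("110", NA),
  stringsAsFactors = FALSE
)

# Same arguments as in the article: one row per Country and Year,
# with the missing India/1960 value dropped
long <- pivot_longer(
  wide,
  cols = starts_with("X"),
  names_to = "Year",
  names_prefix = "X",
  values_to = "Per_Capita_GDP",
  values_drop_na = TRUE
)

print(long)
```

Note that Year comes out as text, not as a number. The charts later in this article rely on that, because they treat the x-axis as discrete.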

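3. Next, I want to change the k to thousand (1000). Some values are stored as text with a k suffix; replacing the k with e3 turns them into scientific notation, which as.numeric() can read.

# Change the k to thousand (1000)
g2$Per_Capita_GDP <- as.numeric(sub("k", "e3", g2$Per_Capita_GDP, fixed = TRUE))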
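A small sketch of the same conversion on a few made-up values (raw_values is hypothetical) shows why the e3 trick works:

```r
# Hypothetical text values, some with a "k" suffix
raw_values <- c("985", "1.2k", "10.5k")

# "1.2k" becomes "1.2e3", which as.numeric() reads as 1200
converted <- as.numeric(sub("k", "e3", raw_values, fixed = TRUE))
print(converted)

# Any value that still cannot be parsed would show up as NA here
print(anyNA(converted))
```

This only handles a k suffix. If the data contained other suffixes, those values would turn into NA with a warning, so checking anyNA() after the conversion is a cheap safeguard.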

Cool! Our dataframe is ready for visualization!

Step 3: Select India and China

# Select India and China [filter() function from the dplyr package]
g3 <- g2 %>% filter(Country == "India" | Country == "China")
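When more than a couple of countries are involved, chaining == with | gets long. A possible alternative is %in%, sketched below on a made-up table (g2_demo and its values are hypothetical); it returns the same rows as the | version.

```r
library(dplyr)

# Hypothetical long table with one extra country
g2_demo <- data.frame(
  Country = c("China", "India", "Brazil", "India"),
  Year = c("1959", "1959", "1959", "1960"),
  Per_Capita_GDP = c(100, 200, 300, 210),
  stringsAsFactors = FALSE
)

with_or <- g2_demo %>% filter(Country == "India" | Country == "China")
with_in <- g2_demo %>% filter(Country %in% c("India", "China"))

# Both approaches keep exactly the same rows
print(identical(with_or, with_in))

# Quick check of how many rows each country has
print(count(with_in, Country))
```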

Step 4: Create a Line Chart

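# Note: ggplot2 3.4.0 and later prefer linewidth over size for lines; size still works but gives a warning
ggplot(data = g3, aes(x = Year, y = Per_Capita_GDP, group = Country, color = Country)) +
  geom_line(size = 2.5)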

Well, it worked, but we have a lot to improve! Let’s fix things one by one:

1. This dataframe contains data from 1959 to 2019. Showing every one of these years on the x-axis makes the labels overlap, so let’s label only specific years.

(From now on, each block repeats the existing code and adds the new code at the end.)

ggplot(data = g3, aes(x = Year, y = Per_Capita_GDP, group = Country, color = Country)) +
  geom_line(size = 2.5) +
  scale_x_discrete(breaks = c(1960, 1970, 1980, 1990, 2000, 2010, 2019))
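The x-axis is discrete here because Year is stored as text. A possible alternative, not used in the rest of this article, is to convert Year to a number and use a continuous axis. The sketch below uses a made-up table (g3_demo, Year_num, and the values are hypothetical).

```r
library(ggplot2)

# Hypothetical long table for two countries
g3_demo <- data.frame(
  Country = rep(c("China", "India"), each = 3),
  Year = rep(c("1960", "1970", "1980"), times = 2),
  Per_Capita_GDP = c(90, 110, 190, 300, 350, 400),
  stringsAsFactors = FALSE
)

# Convert the text years to integers so the axis becomes continuous
g3_demo$Year_num <- as.integer(g3_demo$Year)

p_num <- ggplot(g3_demo, aes(x = Year_num, y = Per_Capita_GDP, color = Country)) +
  geom_line() +
  scale_x_continuous(breaks = c(1960, 1970, 1980))

print(p_num)
```

With a continuous axis, positions are given in years rather than in category positions, and the group aesthetic is no longer needed because color already separates the two lines.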

2. We should add a title to the graph and add labels to the x and y axes.

ggplot(data = g3, aes(x = Year, y = Per_Capita_GDP, group = Country, color = Country)) +
  geom_line(size = 2.5) +
  scale_x_discrete(breaks = c(1960, 1970, 1980, 1990, 2000, 2010, 2019)) +
  xlab("Year") +
  ylab("GDP/Capita (US$, Inflation-Adjusted)") +
  ggtitle("Comparing Per Capita GDP of India and China over the Years")

3. Let’s center the title (by setting hjust = 0.5 in the element_text() call for plot.title inside theme()). Also, we can change the font sizes of the title and the axis labels. Additionally, we can change the panel background from the default gray to white and show only horizontal grid lines.

ggplot(data = g3, aes(x = Year, y = Per_Capita_GDP, group = Country, color = Country)) +
  geom_line(size = 2.5) +
  scale_x_discrete(breaks = c(1960, 1970, 1980, 1990, 2000, 2010, 2019)) +
  xlab("Year") +
  ylab("GDP/Capita (US$, Inflation-Adjusted)") +
  ggtitle("Comparing Per Capita GDP of India and China over the Years") +
  theme(
    plot.title = element_text(face = "bold", size = 25, hjust = 0.5),
    axis.text.x = element_text(size = 15),
    axis.text.y = element_text(size = 15),
    axis.title.x = element_text(size = 15),
    axis.title.y = element_text(size = 15),
    legend.text = element_text(size = 15),
    legend.title = element_text(size = 15),
    panel.background = element_rect(fill = "white"),
    panel.grid.major.y = element_line(colour = "grey10")
  )
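Since the same theme() call is repeated in every chart from here on, one option is to store it in a variable and add it to any plot. This is only a sketch; the name chart_theme and the toy data are assumptions for illustration.

```r
library(ggplot2)

# Store the styling once (same settings as in the article)
chart_theme <- theme(
  plot.title = element_text(face = "bold", size = 25, hjust = 0.5),
  axis.text.x = element_text(size = 15),
  axis.text.y = element_text(size = 15),
  axis.title.x = element_text(size = 15),
  axis.title.y = element_text(size = 15),
  legend.text = element_text(size = 15),
  legend.title = element_text(size = 15),
  panel.background = element_rect(fill = "white"),
  panel.grid.major.y = element_line(colour = "grey10")
)

# Hypothetical toy data to show the theme being reused
toy <- data.frame(
  Country = rep(c("China", "India"), each = 2),
  Year = rep(c("1960", "1970"), times = 2),
  Per_Capita_GDP = c(90, 110, 300, 350),
  stringsAsFactors = FALSE
)

p_theme <- ggplot(toy, aes(x = Year, y = Per_Capita_GDP, group = Country, color = Country)) +
  geom_line() +
  ggtitle("Toy example") +
  chart_theme

print(p_theme)
```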

4. If we want to change the colors of the lines:

ggplot(data = g3, aes(x = Year, y = Per_Capita_GDP, group = Country, color = Country)) +
  geom_line(size = 2.5) +
  scale_x_discrete(breaks = c(1960, 1970, 1980, 1990, 2000, 2010, 2019)) +
  xlab("Year") +
  ylab("GDP/Capita (US$, Inflation-Adjusted)") +
  ggtitle("Comparing Per Capita GDP of India and China over the Years") +
  theme(
    plot.title = element_text(face = "bold", size = 25, hjust = 0.5),
    axis.text.x = element_text(size = 15),
    axis.text.y = element_text(size = 15),
    axis.title.x = element_text(size = 15),
    axis.title.y = element_text(size = 15),
    legend.text = element_text(size = 15),
    legend.title = element_text(size = 15),
    panel.background = element_rect(fill = "white"),
    panel.grid.major.y = element_line(colour = "grey10")
  ) +
  scale_color_manual(values = c("#CC6666", "#9999CC"))
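With an unnamed vector, scale_color_manual() assigns the colors in the order of the groups (here alphabetical: China first, then India). If you prefer to make the pairing explicit, a named vector is one option. The sketch below uses made-up data (toy_col is hypothetical).

```r
library(ggplot2)

# Hypothetical toy data
toy_col <- data.frame(
  Country = rep(c("China", "India"), each = 2),
  Year = rep(c("1960", "1970"), times = 2),
  Per_Capita_GDP = c(90, 110, 300, 350),
  stringsAsFactors = FALSE
)

# Named values tie each color to a specific country
p_col <- ggplot(toy_col, aes(x = Year, y = Per_Capita_GDP, group = Country, color = Country)) +
  geom_line() +
  scale_color_manual(values = c(China = "#CC6666", India = "#9999CC"))

print(p_col)
```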

Okay! Now, let’s try to understand the story hidden in the data. From 1959 to 1975, the per capita GDP of both countries followed an almost parallel trend; however, something happened in China close to 1980, and their economy has been booming ever since!

After quick research on this, I found that China’s stunning economic success was led by its visionary leader Deng Xiaoping, who rose to preeminence in 1977.

(To learn more: https://www.britannica.com/biography/Deng-Xiaoping.)

Back to our graph! Let’s finish it off by adding a vertical line at 1977 and annotating it with the historical event of Deng Xiaoping’s rise to power. Because Year is a discrete (text) axis, positions are given as category positions rather than years: counting from 1959, and assuming no year is missing, 1977 is the 19th year, hence xintercept = 19.

ggplot(data = g3, aes(x = Year, y = Per_Capita_GDP, group = Country, color = Country)) +
  geom_line(size = 2.5) +
  scale_x_discrete(breaks = c(1960, 1970, 1980, 1990, 2000, 2010, 2019)) +
  xlab("Year") +
  ylab("GDP/Capita (US$, Inflation-Adjusted)") +
  ggtitle("Comparing Per Capita GDP of India and China over the Years") +
  theme(
    plot.title = element_text(face = "bold", size = 25, hjust = 0.5),
    axis.text.x = element_text(size = 15),
    axis.text.y = element_text(size = 15),
    axis.title.x = element_text(size = 15),
    axis.title.y = element_text(size = 15),
    legend.text = element_text(size = 15),
    legend.title = element_text(size = 15),
    panel.background = element_rect(fill = "white"),
    panel.grid.major.y = element_line(colour = "grey10")
  ) +
  scale_color_manual(values = c("#CC6666", "#9999CC")) +
  geom_vline(xintercept = 19, linetype = "dashed", color = "red", size = 1.5) +
  annotate(geom = "text", x = 35, y = 5000, label = "Deng Xiaoping's Rise to Preeminence", color = "red", size = 6)
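The numbers 19 and 35 are positions on the discrete x-axis, not years. Here is a small sketch of how such positions could be looked up instead of counted by hand. It assumes the Year column holds every year from 1959 to 2019 as text; the years vector is a stand-in for g3$Year.

```r
# Stand-in for g3$Year, assuming no year is missing
years <- as.character(1959:2019)

# Discrete axis positions follow the sorted unique values
axis_levels <- sort(unique(years))

# Position of 1977 on the axis
pos_1977 <- match("1977", axis_levels)
print(pos_1977)

# Year that sits at position 35, where the label is centered
year_at_35 <- axis_levels[35]
print(year_at_35)
```

On the real data, pos_1977 could then be passed to geom_vline(xintercept = pos_1977), so the line stays on the right year even if the range of years changes.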

Obviously, there are more things that we can do with this data. But, for now, we end right here.

I hope this article piqued your curiosity to explore, analyze, and visualize more real-world events with data!

Vivekananda Das

Sharing synthesized ideas on Data Analysis in R, Data Literacy, Causal Inference, and Wellbeing | Ph.D. candidate @UW-Madison | More: https://vivekanandadas.com