Data science tools
Tutorial on how to make beautiful plots with R
By Afshine Amidi and Shervine Amidi
Motivation
The Department of Transportation publicly released a dataset that lists flights that occurred in 2015, along with details such as delays, flight time and other information. Our previous post detailed the best practices to manipulate data.
This article aims at showing good practices to visualize data using R's most popular libraries. The following are covered:
- plots using ggplot2, along with customized visualizations with ggrepel
- animated plots using gganimate
- map-based plots with the sf and ggspatial libraries, using data coming from maps and rnaturalearth
library(ggplot2) # Plots
library(ggrepel) # Nice labels
library(gganimate) # Animations
library(ggspatial); library(sf) # Map plots
library(maps); library(rnaturalearth) # Map data
library(dplyr) # Pipe operator and case_when(), used in the snippets below
theme_set(theme_bw()) # Set theme for all plots
A common strategy to make beautiful plots is to have the data ready in a data frame and to use its columns and rows to plot. This tutorial focuses on the plotting part. To get the data into the specified format, please refer to the associated tutorial and study guide.
Temporal plots
Evolution of number of flights
We want to plot the temporal evolution of flights of major US airlines:

In order to do that, we note the following:
- The geom_line() layer takes in a long-format data frame and draws one temporal line per airline.
- The appearance of the axis text and caption is customized using the theme() function, for which an extensive description of the API can be found in the ggplot2 reference documentation.
- The vertical line is drawn using the geom_vline() layer, with the line type indicated by the linetype argument.
- The appearance of the date labels on the $x$ axis is customized with the scale_x_date() layer.
# data (date, airline, nb_flights)
ggplot(data = data) +
geom_line(aes(x = date, y = nb_flights, color = airline)) +
geom_vline(xintercept = as.Date('2015-07-01'),
linetype = 'dashed', alpha = 0.5, color = 'black') +
scale_x_date(date_breaks = '1 month', date_labels = '%b %d') +
labs(x = 'Time', y = 'Number of daily flights', color = 'Airline',
title = 'Number of flights in the US in 2015',
subtitle = 'Top airlines',
caption = 'Source: publicly available data from DoT') +
theme(axis.text.x = element_text(angle = 25, vjust = 0.75),
plot.caption = element_text(vjust = 7)) +
ylim(0, 4000)
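The data frame itself is not shown above. As a minimal sketch, assuming made-up flight counts and two hypothetical airline names, a long-format data frame with one row per date and airline could be built and plotted as follows:

```r
library(ggplot2)

# Hypothetical toy data in long format: one row per (date, airline) pair.
# Airline names and flight counts are invented for illustration.
set.seed(1)
dates <- seq(as.Date('2015-06-01'), as.Date('2015-07-31'), by = 'day')
airlines <- c('Airline A', 'Airline B')
toy <- data.frame(
  date = rep(dates, times = length(airlines)),
  airline = rep(airlines, each = length(dates)),
  nb_flights = round(runif(length(dates) * length(airlines), 1000, 3000))
)

# One line per airline, plus a dashed vertical marker on July 1st.
p <- ggplot(data = toy) +
  geom_line(aes(x = date, y = nb_flights, color = airline)) +
  geom_vline(xintercept = as.Date('2015-07-01'), linetype = 'dashed') +
  scale_x_date(date_breaks = '1 month', date_labels = '%b %d') +
  ylim(0, 4000)
```

Because the airline is stored in a single column rather than in one column per airline, mapping it to the color aesthetic is enough to get one line per airline.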
Evolution of traffic per airport
We want to plot the temporal evolution of traffic in the top Hawaiian airports:

In order to do that, we note the following:
- Time periods have been drawn using the geom_rect() layer, by indicating the start and end date of each time period of interest.
- Colors can be customized through the scale_fill_manual() function.
# data (date, airport, nb_flights), time_periods (xmin, xmax, ymin, ymax, name, color)
ggplot(data = data) +
geom_rect(data = time_periods, alpha = 0.1,
aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = name)) +
geom_line(aes(x = date, y = nb_flights, color = airport)) +
scale_x_date(date_breaks = '1 month', date_labels = '%b %d') +
scale_fill_manual(breaks = time_periods$name, values = time_periods$color) + # $ returns plain vectors for data frames and tibbles alike
labs(x = 'Time', y = 'Number of daily flights', color = 'Airport', fill = 'Time window',
title = 'Number of flights departing from Hawaiian airports in 2015',
subtitle = 'Top 5 Hawaii airports',
caption = 'Source: publicly available data from DoT') +
theme(axis.text.x = element_text(angle = 25, vjust = 0.75), plot.caption = element_text(vjust = 7)) +
ylim(0, 150)
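The time_periods data frame is only described by its column names above. A minimal sketch of how it could look, assuming two hypothetical time windows whose names, dates and colors are invented here:

```r
library(ggplot2)

# Hypothetical time windows; names, dates and colors are made up.
# -Inf / Inf make each rectangle span the full height of the panel.
time_periods <- data.frame(
  xmin = as.Date(c('2015-06-15', '2015-12-15')),
  xmax = as.Date(c('2015-08-31', '2015-12-31')),
  ymin = -Inf,
  ymax = Inf,
  name = c('Summer holidays', 'Winter holidays'),
  color = c('orange', 'steelblue'),
  stringsAsFactors = FALSE
)

# Named vector: each window name is tied to its color explicitly.
fill_values <- setNames(time_periods$color, time_periods$name)

p <- ggplot() +
  geom_rect(data = time_periods, alpha = 0.1,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = name)) +
  scale_fill_manual(values = fill_values)
```

Passing a named vector to values is one possible design choice: the mapping between a window and its color then no longer depends on the row order of the data frame.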
Evolution of delay type
We want to plot the temporal evolution of delays by delay type for the northern part of the US, which is more prone to weather-related delays:

In order to do that, we note the following:
- The geom_bar() layer, along with the position parameter set to 'stack', gives a stacked bar chart that represents the various kinds of delays cumulatively for a given time period.
- The labels and date_labels parameters within the scale_y_continuous() and scale_x_date() functions make it possible to customize the appearance of labels.
# data (week, delay_type, perc)
ggplot(data = data) +
geom_bar(aes(x = week, y = perc, fill = delay_type),
position = 'stack', stat = 'identity') +
scale_x_date(date_breaks = '1 month', date_labels = '%b %d') +
scale_y_continuous(labels = function(x) paste0(x*100, '%')) +
labs(x = 'Time', y = 'Percentage of weekly flights', fill = 'Delay type',
title = 'Breakdown of delay type of flights in the northern part of the US in 2015',
subtitle = 'States that are part of the analysis include AK, IL, IN, MA, ME, MI, MN, NH, NY, VT',
caption = 'Source: publicly available data from DoT') +
theme(axis.text.x = element_text(angle = 25, vjust = 0.75), plot.caption = element_text(vjust = 7))
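The percentage labels above are built by hand with paste0(). As an alternative sketch, the scales package (already used in this tutorial through scales::comma) provides label_percent(), which returns a labelling function. The accuracy value below is an arbitrary choice made for this example.

```r
library(scales)

# label_percent() returns a function that turns proportions into labels.
# accuracy = 1 rounds to whole percents (an arbitrary choice for this sketch).
to_percent <- label_percent(accuracy = 1)
labels <- to_percent(c(0, 0.25, 1))
print(labels)
```

The resulting function can be passed directly to a scale, for instance scale_y_continuous(labels = to_percent).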
Visualizing across multiple dimensions
Delay by time of the week for top airlines
We want to plot the percentage of delayed flights by hour of the day and day of the week for the top airlines:

In order to do that, we note the following:
- The facet_grid() layer has been used to produce a distinct panel for each day of the week. Equivalently, we could use the facet_wrap() function and set the number of columns ncol to 7 to get the same-looking result.
# data (day_of_week, hour, airline, perc_delay), time_periods (xmin, xmax, ymin, ymax, name, color)
ggplot(data = data) +
geom_rect(data = time_periods, alpha = 0.2,
aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = name)) +
geom_line(aes(x = hour, y = perc_delay, color = airline), linewidth = 0.5) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
geom_point(aes(x = hour, y = perc_delay, color = airline), size = 0.7) +
facet_grid(~day_of_week) +
scale_x_continuous(limits = c(0,24), breaks = 8*c(0:3), labels = function(x){
case_when(x == 0 ~ '', x == 8 ~ '8am', x == 16 ~ '4pm', x == 24 ~ '12am')
}) +
labs(x = 'Time', y = 'Percentage of delayed flights', color = 'Airline',
title = 'Percentage of delayed flight by time of the week averaged across 2015',
subtitle = 'Top US airlines', caption = 'Source: publicly available data from DoT') +
theme(plot.caption = element_text(vjust = 7)) +
guides(fill = guide_legend(title = NULL))
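The equivalence between facet_grid() and facet_wrap() mentioned above can be sketched as follows, assuming made-up delay percentages. The day_of_week column is stored as an ordered factor here so that panels appear from Monday to Sunday, which is an assumption about how the original data was prepared.

```r
library(ggplot2)

# Hypothetical toy data: one delay percentage per hour and day of the week.
set.seed(3)
days <- c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun')
toy <- expand.grid(hour = 0:23, day_of_week = days)
# Explicit levels keep the panels in calendar order instead of alphabetical.
toy$day_of_week <- factor(toy$day_of_week, levels = days)
toy$perc_delay <- runif(nrow(toy), 0.1, 0.5)

base <- ggplot(data = toy) +
  geom_line(aes(x = hour, y = perc_delay))

# Both variants lay out the seven panels on a single row.
p_grid <- base + facet_grid(~day_of_week)
p_wrap <- base + facet_wrap(~day_of_week, ncol = 7)
```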
Delay by hour of day for top airlines
We can also plot the same data from a different angle with a heat map:

In order to do that, we note the following:
- The geom_tile() layer plots the heatmap of the percentage of delayed flights by hour of the day for each airline, by taking in the x, y and fill arguments.
- The appearance of the plot is customized within the theme() layer. In particular, the background grid is removed with the element_blank() function.
# data (hour, airline, perc_delay)
ggplot(data = data) +
geom_tile(aes(x = hour, y = airline, fill = perc_delay), color = 'black') +
scale_x_continuous(breaks = c(5,11,17,22), labels = function(x){
case_when(x == 5 ~ '5am', x == 11 ~ '11am', x == 17 ~ '5pm', x == 22 ~ '10pm')
}) +
scale_fill_distiller(palette = 'Spectral', labels = function(x) paste0(x*100, '%')) +
labs(x = 'Hour of day', y = 'Airline', fill = 'Delayed flights',
title = 'Percentage of delayed flights by hour of the day',
subtitle = 'Averaged across 2015 for top US airlines',
caption = 'Source: publicly available data from DoT') +
theme(panel.grid.major = element_blank(), axis.ticks = element_blank(),
plot.caption = element_text(vjust = 7), axis.title.y = element_blank())
Customizing visualization
Scatter plot of airports with delays
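We want to plot the percentage of delayed flights against the likelihood that weather is causing the delay, for the top US airports: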

In order to do that, we note the following:
- The text labels are put in place using the geom_label_repel() layer, which comes from the ggrepel library.
- The size and color options are used to customize the appearance of the scatter plot.
# data (airport_code, region, nb_flights, perc_delay, weather)
ggplot(data = data) +
geom_point(aes(x = weather, y = perc_delay, color = region, size = nb_flights)) +
geom_label_repel(data = data %>% head(5), nudge_x = 0.002, nudge_y = -0.015,
aes(x = weather, y = perc_delay, label = airport_code)) +
scale_x_continuous(labels = function(x) paste0(x*100, '%')) +
scale_y_continuous(labels = function(y) paste0(y*100, '%')) +
labs(x = 'Likelihood weather is causing delay', y = 'Percentage of delayed flights',
color = 'Region', size = 'Volume of traffic',
title = 'Percentage of delayed flights vs likelihood of weather causing the delay',
subtitle = 'Top US airports across 2015', caption = 'Source: publicly available data from DoT') +
theme(plot.caption = element_text(vjust = 7))
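The call to head(5) above labels the first five rows, so which airports get a label depends on how the data frame is sorted. A minimal sketch, assuming made-up airport codes and values, that sorts by traffic first so that only the five busiest airports are labelled:

```r
library(ggplot2)
library(ggrepel)

# Hypothetical toy data; airport codes and all values are invented.
set.seed(4)
toy <- data.frame(
  airport_code = paste0('AP', 1:20),
  nb_flights = sample(1000:50000, 20),
  perc_delay = runif(20, 0.1, 0.4),
  weather = runif(20, 0, 0.05)
)

# Keep the five rows with the largest traffic for labelling.
top5 <- head(toy[order(toy$nb_flights, decreasing = TRUE), ], 5)

p <- ggplot(data = toy) +
  geom_point(aes(x = weather, y = perc_delay, size = nb_flights)) +
  geom_label_repel(data = top5,
                   aes(x = weather, y = perc_delay, label = airport_code))
```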
Animated graphs
We want to have an animated graph to understand the temporal evolution:

In order to do that, we note the following:
- The transition_states() layer, which comes from the gganimate library, produces the animation along a given dimension. In this example, the animation is produced along the month variable.
# data (airport_code, region, month, nb_flights, perc_delay, weather)
ggplot(data = data) +
geom_point(aes(x = weather, y = perc_delay, color = region, size = nb_flights)) +
scale_x_continuous(labels = function(x) paste0(x*100, '%')) +
scale_y_continuous(labels = function(y) paste0(y*100, '%')) +
scale_size(labels = scales::comma) +
labs(x = 'Likelihood weather is causing delay', y = 'Percentage of delayed flights',
color = 'Region', size = 'Number of outbound flights',
title = 'Percentage of delayed flights vs likelihood of weather causing the delay',
subtitle = paste0('Top US airports in ', '{closest_state}', ' 2015'),
caption = 'Source: publicly available data from DoT') +
theme(plot.caption = element_text(vjust = 7)) +
transition_states(month, state_length = 0)
Percentage of delayed flights with volume
We want to plot the number of daily flights together with the percentage of delayed flights on the same graph:

In order to do that, we note the following:
- The second $y$ axis is produced by passing the sec_axis() function to the sec.axis argument of the scale_y_continuous() layer.
- The color of the axis titles is specified within the theme() layer.
- Dates on the $x$ axis are displayed in a customized way using the date_labels argument of the scale_x_date() layer.
# data (date, nb_flights, perc_delay)
ggplot(data = data) +
geom_line(aes(x = date, y = nb_flights), color = 'green4') +
geom_line(aes(x = date, y = perc_delay * 20000), color = 'red') +
scale_x_date(date_breaks = '1 month', date_labels = '%b %d') +
scale_y_continuous(sec.axis = sec_axis(~ . / 20000,
labels = function(y){paste0(y*100, '%')},
name = 'Percentage of delayed flights')) +
labs(x = 'Time', y = 'Number of daily flights', caption = 'Source: publicly available data from DoT',
title = 'Number of flights in the US in 2015') +
theme(axis.text.x = element_text(angle = 25, vjust = 0.75), plot.caption = element_text(vjust = 7),
axis.title.y.left = element_text(color = 'green4'),
axis.title.y.right = element_text(color = 'red'))
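The factor of 20000 used above rescales the percentage so that it can share the primary axis with the number of flights, and sec_axis() applies the inverse transformation to label the secondary axis. A minimal sketch, assuming made-up data and a hypothetical scale_factor derived from the data rather than hard-coded:

```r
library(ggplot2)

# Hypothetical toy data; all values are invented.
set.seed(2)
toy <- data.frame(date = seq(as.Date('2015-01-01'), by = 'day', length.out = 60))
toy$nb_flights <- round(runif(60, 12000, 18000))
toy$perc_delay <- runif(60, 0.1, 0.5)

# Factor that brings the largest percentage up to the largest flight count.
scale_factor <- max(toy$nb_flights) / max(toy$perc_delay)

p <- ggplot(data = toy) +
  geom_line(aes(x = date, y = nb_flights), color = 'green4') +
  # The percentage is multiplied by the factor to be drawn on the primary axis...
  geom_line(aes(x = date, y = perc_delay * scale_factor), color = 'red') +
  # ...and the secondary axis divides by the same factor to recover it.
  scale_y_continuous(sec.axis = sec_axis(~ . / scale_factor,
                                         labels = function(y) paste0(round(y * 100), '%'),
                                         name = 'Percentage of delayed flights'))
```

The two operations must be exact inverses of each other, otherwise the secondary axis labels no longer match the red line.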
Map-based plots
Top 10 routes
We want to plot the top routes departing from Boston airports on a map:

In order to do that, we note the following:
- The map of the US and its neighboring countries has been drawn using geom_sf() from polygon shapes obtained with the ne_countries() function, which comes from the rnaturalearth library.
- Curves between cities have been drawn using the geom_curve() layer, by specifying each end of the curve with the x, y, xend, yend parameters.
- The annotation_north_arrow() and annotation_scale() functions come from the ggspatial library and provide nice add-ons to the plot by indicating the North direction as well as how distances should be interpreted. Other tutorials making use of similar functions can give you an idea of the range of possibilities that such functionalities can offer.
- The coord_sf() layer is used to customize the range of the axes that are shown on the plot.
# world, usa, routes
ggplot() +
geom_sf(data = world) +
geom_sf(data = usa, fill = 'chartreuse1', alpha = 0.05) +
geom_curve(data = routes, aes(x = longitude_o, y = latitude_o, xend = longitude_d, yend = latitude_d)) +
geom_point(data = routes, aes(x = longitude_d, y = latitude_d), size = 3) +
geom_point(data = routes, aes(x = longitude_o, y = latitude_o), size = 3, color = 'red') +
geom_label_repel(data = routes, nudge_x = 0, nudge_y = -0,
aes(x = longitude_d, y = latitude_d, label = destination_airport)) +
geom_label_repel(data = routes %>% select(origin_airport, longitude_o, latitude_o) %>% unique(),
nudge_x = 0, nudge_y = -0, color = 'red',
aes(x = longitude_o, y = latitude_o, label = origin_airport)) +
annotation_north_arrow(location = "bl", which_north = "true", style = north_arrow_fancy_orienteering,
pad_x = unit(0.25, "in"), pad_y = unit(0.25, "in")) +
annotation_scale(location = 'bl', width_hint = 0.5) +
coord_sf(xlim = c(-125, -64), ylim = c(24, 50)) +
labs(title = 'Top 10 routes departing from Boston Logan International Airport in 2015',
caption = 'Source: publicly available data from DoT') +
theme(panel.background = element_rect(fill = 'azure'), axis.title = element_blank())
Most popular airline per state
We want to plot a map that displays the most popular airline by state in the continental US:

In order to do that, we note the following:
- The maps library has been used to retrieve geographical data that is then converted to an sf format.
- It is then fed into the geom_sf() layer from the ggplot2 library to draw polygon shapes.
- The coord_sf() layer is used to customize the range of the axes that are shown on the plot.
# data (state, airline, long, lat, group, order)
ggplot() +
geom_sf(data = data, aes(fill = airline), color = 'black') +
coord_sf(xlim = c(-125, -67), ylim = c(24, 50)) +
labs(title = 'Most popular airline per state', subtitle = 'Data averaged across 2015',
caption = 'Source: publicly available data from DoT', fill = 'Airline') +
theme_void() +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
axis.title = element_blank())
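The conversion from maps data to an sf object is described above but not shown. A minimal sketch, assuming the maps and sf packages are installed; the airline assigned to each state is randomly generated and purely hypothetical:

```r
library(ggplot2)
library(sf)
library(maps)

# Polygons of the US states from the maps package, converted to an sf object.
# The result has one row per state, identified by the ID column.
states <- st_as_sf(maps::map('state', plot = FALSE, fill = TRUE))

# Hypothetical attribute: a made-up "most popular airline" for each state.
set.seed(5)
popular <- data.frame(
  ID = states$ID,
  airline = sample(c('Airline A', 'Airline B', 'Airline C'),
                   nrow(states), replace = TRUE)
)

# Attach the attribute to the polygons; the result is still an sf object.
data_sf <- merge(states, popular, by = 'ID')

p <- ggplot() +
  geom_sf(data = data_sf, aes(fill = airline), color = 'black') +
  coord_sf(xlim = c(-125, -67), ylim = c(24, 50))
```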
Airports with highest traffic
We want to plot a map that shows the top 10 busiest airports in the western US:

In order to do that, we note the following:
- The map itself has been drawn using polygon shapes. The data was retrieved from the states data of the maps library, converted to an sf object using the st_as_sf() function, and then fed into the geom_sf() layer, which comes from the ggplot2 library.
- The text labels have been put in place using the geom_label_repel() layer, which comes from the ggrepel library.
# data (airport_code, airport_name, rank, long, lat, nb_flights), west (ID, longitude, latitude, abbrev)
ggplot() +
geom_sf(data = west, color = 'black', alpha = 0.1, show.legend = FALSE, aes(fill = ID)) +
geom_point(data = data, aes(x = long, y = lat, size = nb_flights)) +
geom_label_repel(data = data, nudge_x = 0.2, nudge_y = -0.015,
aes(x = long, y = lat, label = airport_code, color = rank)) +
geom_text(data = west %>% select(longitude, latitude, abbrev) %>% unique(),
aes(x = longitude, y = latitude, label = abbrev)) +
labs(title = 'Top 10 busiest airports in western US', subtitle = 'Data averaged across 2015',
caption = 'Source: publicly available data from DoT', size = 'Number of outbound flights') +
scale_color_manual(labels = paste0(data$rank, ' - ', data$airport_name, ' (', data$airport_code, ')'),
name = 'Top airports', values = rep('black', 10)) +
scale_size(labels = scales::comma) +
theme_void() +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
guides(color = guide_legend(override.aes = list(size = 0)))
Conclusion
This tutorial presented illustrated examples showing how to produce visually appealing plots. This skill is useful when you want to make a plot that is clear and concise, for instance in a business setting or for high-stakes presentations.