Tutorial on how to make beautiful plots with R

By Afshine Amidi and Shervine Amidi

Motivation

The Department of Transportation publicly released a dataset that lists flights that occurred in 2015, along with details such as delays and flight time. Our previous post detailed best practices for manipulating data.

This article shows good practices for visualizing data using R's most popular libraries. The following libraries are used:

library(ggplot2)                         # Plots
library(ggrepel)                         # Nice labels
library(gganimate)                       # Animations
library(ggspatial);library(sf)           # Map plots
library(maps);library(rnaturalearth)     # Map data
library(dplyr)                           # %>%, case_when, select
theme_set(theme_bw())                    # Set theme for all plots

A common strategy to make beautiful plots is to have the data ready in a data frame and to map its columns to the elements of the plot. This tutorial focuses on the plotting part. To get the data into the specified format, please refer to the associated tutorial and study guide.
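As a minimal sketch of this strategy, the snippet below builds a small data frame by hand and maps its columns to plot aesthetics. The column names mirror the (date, airline, nb_flights) layout used in the first example of this tutorial. The airline names and counts are made up for illustration and are not taken from the DoT dataset.

```r
library(ggplot2)

# Hypothetical toy data: one row per (date, airline) pair, made-up counts
toy <- data.frame(
  date       = rep(as.Date(c('2015-01-01', '2015-01-02', '2015-01-03')), times = 2),
  airline    = rep(c('Airline A', 'Airline B'), each = 3),
  nb_flights = c(100, 120, 110, 80, 95, 90)
)

# Each aesthetic refers to a column of the data frame
p <- ggplot(data = toy) +
  geom_line(aes(x = date, y = nb_flights, color = airline))
```

Printing `p` draws one line per airline. The plots in the rest of this tutorial follow the same pattern with more layers on top.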


Temporal plots

Evolution of number of flights

We want to plot the temporal evolution of flights of major US airlines:

In order to do that, we note the following:

The code used to produce the plot is shown below.

# data (date, airline, nb_flights)
ggplot(data = data) +
  geom_line(aes(x = date, y = nb_flights, color = airline)) +
  geom_vline(xintercept = as.Date('2015-07-01'),
             linetype = 'dashed', alpha = 0.5, color = 'black') +
  scale_x_date(date_breaks = '1 month', date_labels = '%b %d') +
  labs(x = 'Time', y = 'Number of daily flights', color = 'Airline',
       title = 'Number of flights in the US in 2015',
       subtitle = 'Top airlines',
       caption = 'Source: publicly available data from DoT') +
  theme(axis.text.x = element_text(angle = 25, vjust = 0.75),
        plot.caption = element_text(vjust = 7)) +
  ylim(0, 4000)

Evolution of traffic per airport

We want to plot the temporal evolution of traffic in the top Hawaiian airports:

In order to do that, we note the following:

The code used to produce the plot is shown below.

# data (date, airport, nb_flights), time_periods (xmin, xmax, ymin, ymax, name, color)
ggplot(data = data) +
  geom_rect(data = time_periods, alpha = 0.1,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = name)) +
  geom_line(aes(x = date, y = nb_flights, color = airport)) +
  scale_x_date(date_breaks = '1 month', date_labels = '%b %d') +
  scale_fill_manual(breaks = time_periods[,'name'], values = time_periods[,'color']) +
  labs(x = 'Time', y = 'Number of daily flights', color = 'Airport', fill = 'Time window',
       title = 'Number of flights departing from Hawaiian airports in 2015',
       subtitle = 'Top 5 Hawaii airports',
       caption = 'Source: publicly available data from DoT') +
  theme(axis.text.x = element_text(angle = 25, vjust = 0.75), plot.caption = element_text(vjust = 7)) +
  ylim(0, 150)
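The shaded rectangles come from a second data frame, time_periods, whose contents are not shown in this tutorial. The sketch below is one possible way to build it with the columns listed in the comment (xmin, xmax, ymin, ymax, name, color). The dates, names, colors and the 0 to 150 vertical range are placeholders chosen for illustration.

```r
# Hypothetical highlighted windows: dates, names and colors are placeholders
time_periods <- data.frame(
  xmin  = as.Date(c('2015-06-15', '2015-12-15')),
  xmax  = as.Date(c('2015-08-31', '2015-12-31')),
  ymin  = 0,
  ymax  = 150,
  name  = c('Summer', 'Winter holidays'),
  color = c('orange', 'blue'),
  stringsAsFactors = FALSE
)

# On a base data frame, selecting a single column with [, 'name']
# returns a plain character vector, which scale_fill_manual expects
window_names <- time_periods[, 'name']
```

If time_periods were a tibble instead, `time_periods[, 'name']` would return a one-column tibble rather than a vector, so `time_periods$name` would be the safer way to pass the values.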

Evolution of delay type

We want to plot the temporal evolution of delays by delay type for the northern part of the US, which is more prone to weather-related delays:

In order to do that, we note the following:

The code used to produce the plot is shown below.

# data (week, delay_type, perc)
ggplot(data = data) +
  geom_bar(aes(x = week, y = perc, fill = delay_type),
           position = 'stack', stat = 'identity') +
  scale_x_date(date_breaks = '1 month', date_labels = '%b %d') +
  scale_y_continuous(labels = function(x) paste0(x*100, '%')) +
  labs(x = 'Time', y = 'Percentage of weekly flights', fill = 'Delay type',
       title = 'Breakdown of delay type of flights in the northern part of the US in 2015',
       subtitle = 'States that are part of the analysis include AK, IL, IN, MA, ME, MI, MN, NH, NY, VT',
       caption = 'Source: publicly available data from DoT') +
  theme(axis.text.x = element_text(angle = 25, vjust = 0.75), plot.caption = element_text(vjust = 7))
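Several scales in this tutorial store proportions (values between 0 and 1) and display them as percentages through a small labelling function. The sketch below isolates that helper so its behavior can be checked on its own. The name to_percent is introduced here for illustration only.

```r
# Labelling function: turns proportions into percentage strings
to_percent <- function(x) paste0(x * 100, '%')

to_percent(c(0, 0.25, 0.5, 1))
# "0%" "25%" "50%" "100%"
```

Such a function can be passed as `labels = to_percent` to a continuous scale. The scales package also ships a ready-made `scales::percent` formatter that could serve the same purpose.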

Visualizing across multiple dimensions

Delay by time of the week for top airlines

We want to plot the percentage of delayed flights by hour of the day and day of the week for the top airlines:

In order to do that, we note the following:

The code used to produce the plot is shown below.

# data (day_of_week, hour, airline, perc_delay), time_periods (xmin, xmax, ymin, ymax, name, color)
ggplot(data = data) +
  geom_rect(data = time_periods, alpha = 0.2,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = name)) +
  geom_line(aes(x = hour, y = perc_delay, color = airline), size = 0.5) +
  geom_point(aes(x = hour, y = perc_delay, color = airline), size = 0.7) +
  facet_grid(~day_of_week) +
  scale_x_continuous(limits = c(0,24), breaks = 8*c(0:3), labels = function(x){
    case_when(x == 0 ~ '', x == 8 ~ '8am', x == 16 ~ '4pm', x == 24 ~ '12am')
    }) +
  labs(x = 'Time', y = 'Percentage of delayed flights', color = 'Airline',
       title = 'Percentage of delayed flights by time of the week averaged across 2015',
       subtitle = 'Top US airlines', caption = 'Source: publicly available data from DoT') +
  theme(plot.caption = element_text(vjust = 7)) +
  guides(fill = guide_legend(title = NULL))

Delay by hour of day for top airlines

We can also plot the same data from a different angle with a heat map:

In order to do that, we note the following:

The code used to produce the plot is shown below.

ggplot(data = data) +
  geom_tile(aes(x = hour, y = airline, fill = perc_delay), color = 'black') +
  scale_x_continuous(breaks = c(5,11,17,22), labels = function(x){
    case_when(x == 5 ~ '5am', x == 11 ~ '11am', x == 17 ~ '5pm', x == 22 ~ '10pm')
  }) +
  scale_fill_distiller(palette = 'Spectral', labels = function(x) paste0(x*100, '%')) +
  labs(x = 'Hour of day', y = 'Airline', fill = 'Delayed flights',
       title = 'Percentage of delayed flights by hour of the day',
       subtitle = 'Averaged across 2015 for top US airlines',
       caption = 'Source: publicly available data from DoT') +
  theme(panel.grid.major = element_blank(), axis.ticks = element_blank(),
        plot.caption = element_text(vjust = 7), axis.title.y = element_blank())

Customizing visualization

Scatter plot of airports with delays

We want to plot:

In order to do that, we note the following:

The code used to produce the plot is shown below.

# data (airport_code, region, nb_flights, perc_delay, weather)
ggplot(data = data) +
  geom_point(aes(x = weather, y = perc_delay, color = region, size = nb_flights)) +
  geom_label_repel(data = data %>% head(5), nudge_x = 0.002, nudge_y = -0.015,
                   aes(x = weather, y = perc_delay, label = airport_code)) +
  scale_x_continuous(labels = function(x) paste0(x*100, '%')) +
  scale_y_continuous(labels = function(y) paste0(y*100, '%')) +
  labs(x = 'Likelihood weather is causing delay', y = 'Percentage of delayed flights',
       color = 'Region', size = 'Volume of traffic',
       title = 'Percentage of delayed flights vs likelihood of weather causing the delay',
       subtitle = 'Top US airports across 2015', caption = 'Source: publicly available data from DoT') +
  theme(plot.caption = element_text(vjust = 7))

Animated graphs

We want to have an animated graph to understand the temporal evolution:

In order to do that, we note the following:

The code used to produce the plot is shown below.

# data (airport_code, region, month, nb_flights, perc_delay, weather)
ggplot(data = data) +
  geom_point(aes(x = weather, y = perc_delay, color = region, size = nb_flights)) +
  scale_x_continuous(labels = function(x) paste0(x*100, '%')) +
  scale_y_continuous(labels = function(y) paste0(y*100, '%')) +
  scale_size(labels = scales::comma) +
  labs(x = 'Likelihood weather is causing delay', y = 'Percentage of delayed flights',
       color = 'Region', size = 'Number of outbound flights',
       title = 'Percentage of delayed flights vs likelihood of weather causing the delay',
       subtitle = paste0('Top US airports in ', '{closest_state}', ' 2015'),
       caption = 'Source: publicly available data from DoT') +
  theme(plot.caption = element_text(vjust = 7)) +
  transition_states(month, state_length = 0)
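To make the role of transition_states and of the {closest_state} placeholder easier to see, here is a reduced sketch on toy data. The airport codes, month values and percentages are made up, and storing the month as a factor is an assumption of this sketch, since the tutorial does not say how the month column is encoded.

```r
library(ggplot2)
library(gganimate)

# Hypothetical toy data: two airports observed over three months (made-up values)
toy <- data.frame(
  airport_code = rep(c('AAA', 'BBB'), times = 3),
  month        = rep(c('January', 'February', 'March'), each = 2),
  weather      = c(0.05, 0.10, 0.08, 0.12, 0.03, 0.06),
  perc_delay   = c(0.30, 0.40, 0.35, 0.45, 0.25, 0.30)
)

# Assumption: a factor with calendar-ordered levels, so that the states
# are not played back in alphabetical order
toy$month <- factor(toy$month, levels = c('January', 'February', 'March'))

# {closest_state} is replaced by the current value of the state column
anim <- ggplot(data = toy) +
  geom_point(aes(x = weather, y = perc_delay, color = airport_code)) +
  labs(subtitle = 'Month: {closest_state}') +
  transition_states(month, state_length = 0)
```

Rendering is a separate step: `animate(anim, nframes = 24, fps = 4)` produces the frames, and `anim_save('delays.gif')` writes the last rendered animation to disk. The file name and frame settings here are arbitrary choices.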

Percentage of delayed flights with volume

We want to plot:

In order to do that, we note the following:

The code used to produce the plot is shown below.

# data
ggplot(data = data) +
  geom_line(aes(x = date, y = nb_flights), color = 'green4') +
  geom_line(aes(x = date, y = perc_delay * 20000), color = 'red') +
  scale_x_date(date_breaks = '1 month', date_labels = '%b %d') +
  scale_y_continuous(sec.axis = sec_axis(~ . / 20000,
                                         labels = function(y){paste0(y*100, '%')},
                                         name = 'Percentage of delayed flights')) +
  labs(x = 'Time', y = 'Number of daily flights', caption = 'Source: publicly available data from DoT',
       title = 'Number of flights in the US in 2015') +
  theme(axis.text.x = element_text(angle = 25, vjust = 0.75), plot.caption = element_text(vjust = 7),
        axis.title.y.left = element_text(color = 'green4'),
        axis.title.y.right = element_text(color = 'red'))
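The secondary axis relies on a pair of inverse transformations: the percentage is multiplied by a constant so that it can be drawn on the same scale as the number of flights, and sec_axis divides by that same constant to label the right-hand axis. The code above hard-codes 20000. The sketch below shows one possible way to derive such a factor from the data, using made-up values and a hypothetical scale_factor variable.

```r
# Hypothetical daily values, made up for illustration
nb_flights <- c(14000, 16000, 15500, 12000)
perc_delay <- c(0.30, 0.42, 0.35, 0.55)

# One way to pick the factor: bring the largest percentage
# up to the largest flight count
scale_factor <- max(nb_flights) / max(perc_delay)

# Forward transform: the values drawn against the left axis
scaled_delay <- perc_delay * scale_factor

# Inverse transform: what sec_axis(~ . / scale_factor) would apply
# to label the right axis
recovered <- scaled_delay / scale_factor
```

Whatever factor is chosen, the multiplication inside geom_line and the division inside sec_axis have to use the same value, otherwise the right-hand labels no longer describe the red line.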

Map-based plots

Top 10 routes

We want to plot the top routes departing from Boston airports on a map:

In order to do that, we note the following:

The code used to produce the plot is shown below.

# world, usa, routes
ggplot() +
  geom_sf(data = world) +
  geom_sf(data = usa, fill = 'chartreuse1', alpha = 0.05) +
  geom_curve(data = routes, aes(x = longitude_o, y = latitude_o, xend = longitude_d, yend = latitude_d)) +
  geom_point(data = routes, aes(x = longitude_d, y = latitude_d), size = 3) +
  geom_point(data = routes, aes(x = longitude_o, y = latitude_o), size = 3, color = 'red') +
  geom_label_repel(data = routes, nudge_x = 0, nudge_y = -0,
                   aes(x = longitude_d, y = latitude_d, label = destination_airport)) +
  geom_label_repel(data = routes %>% select(origin_airport, longitude_o, latitude_o) %>% unique(),
                   nudge_x = 0, nudge_y = -0, color = 'red',
                   aes(x = longitude_o, y = latitude_o, label = origin_airport)) +
  annotation_north_arrow(location = "bl", which_north = "true", style = north_arrow_fancy_orienteering,
                         pad_x = unit(0.25, "in"), pad_y = unit(0.25, "in")) +
  annotation_scale(location = 'bl', width_hint = 0.5) +
  coord_sf(xlim = c(-125, -64), ylim = c(24, 50)) +
  labs(title = 'Top 10 routes departing from Boston Logan International Airport in 2015',
       caption = 'Source: publicly available data from DoT') +
  theme(panel.background = element_rect(fill = 'azure'), axis.title = element_blank())

Most popular airline per state

We want to plot a map that displays the most popular airline by state in the continental US:

In order to do that, we note the following:

The code used to produce the plot is shown below.

# data (state, airline, long, lat, group, order)
ggplot() +
  geom_sf(data = data, aes(fill = airline), color = 'black') +
  coord_sf(xlim = c(-125, -67), ylim = c(24, 50)) +
  labs(title = 'Most popular airline per state', subtitle = 'Data averaged across 2015',
       caption = 'Source: publicly available data from DoT', fill = 'Airline') +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
        axis.title = element_blank())

Airports with highest traffic

We want to plot a map that shows the top 10 busiest airports in the western US:

In order to do that, we note the following:

The code used to produce the plot is shown below.

# data
ggplot() +
  geom_sf(data = west, color = 'black', alpha = 0.1, show.legend = FALSE, aes(fill = ID)) +
  geom_point(data = data, aes(x = long, y = lat, size = nb_flights)) +
  geom_label_repel(data = data, nudge_x = 0.2, nudge_y = -0.015,
                   aes(x = long, y = lat, label = airport_code, color = rank)) +
  geom_text(data = west %>% select(longitude, latitude, abbrev) %>% unique(),
            aes(x = longitude, y = latitude, label = abbrev)) +
  labs(title = 'Top 10 busiest airports in western US', subtitle = 'Data averaged across 2015',
       caption = 'Source: publicly available data from DoT', size = 'Number of outbound flights') +
  scale_color_manual(labels = paste0(data$rank, ' - ', data$airport_name, ' (', data$airport_code, ')'),
                     name = 'Top airports', values = rep('black', 10)) +
  scale_size(labels = scales::comma) +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
  guides(color = guide_legend(override.aes = list(size = 0)))

Conclusion

This tutorial presented illustrated examples showing how to produce visually appealing plots. This skill is useful for making plots that are clear and concise, for instance in a business setting or for high-stakes presentations.

