Banner

Data science tools

Tutorial on how to manipulate data with R

Star
R dplyr tidyr lubridate

By Afshine Amidi and Shervine Amidi

Motivation

The Department of Transportation publicly released a dataset that lists 5.8M flights that occurred in 2015, along with specificities such as delays, flight time and other information.

This article aims at showing good practices to manipulate data with R's most popular libraries using practical examples on the data above. The following topics are covered:

library(dplyr)      # Data frame manipulation
library(tidyr)      # Data frame transformation
library(lubridate)  # Date operations

In the parts below, we will focus on drawing insights about flights departing from BOS, JFK, SFO and LAX.


Data preprocessing

Retrieving data

We retrieve the data frames from the respective csv files and we take a peek at data types by running the following commands:

# Set path to working directory
setwd('path/to/dir')

# Retrieve data
flights_df_raw << read.csv('data/flights.csv')
airports_df << read.csv('data/airports.csv')
airlines_df << read.csv('data/airlines.csv')

# View data types and first rows of data frame
str(flights_df_raw)
> str(flights_df_raw)
Classes ‘data.table’ and 'data.frame':	5819079 obs. of  9 variables:
 $ destination_airport: chr  "ABE" "ABE" "ABE" "ABE" ...
 $ origin_airport     : chr  "ATL" "ATL" "ATL" "ATL" ...
 $ year               : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ month              : int  10 10 10 10 10 10 10 10 10 10 ...
 $ day                : int  1 1 1 2 2 2 3 3 4 4 ...
 $ airline            : chr  "EV" "EV" "DL" "EV" ...
 $ flight_number      : int  4940 5197 2582 4940 5197 2582 4940 2582 4940 5197 ...
 $ scheduled_departure: int  1027 1444 2106 1027 1444 2106 1027 2106 1027 1444 ...
 $ departure_delay    : int  -3 0 -3 12 -4 -2 -4 -2 0 18 ..

From the output above, we see that the data frame has 5.8M observations and 9 dimensions that inform on when the flight happened, the origin and destination, any potential delays as well as other miscellaneous indicators.


Check for data quality

We want to see if there are any outliers, absurd values or missing values. In order to do that, we run the following command:

# Compute statistics of columns
summary(flights_df_raw)
> summary(flights_df_raw)
 destination_airport origin_airport          year          month             day
 Length:5819079      Length:5819079     Min.   :2015   Min.   : 1.000   Min.   : 1.0
 Class :character    Class :character   1st Qu.:2015   1st Qu.: 4.000   1st Qu.: 8.0
 Mode  :character    Mode  :character   Median :2015   Median : 7.000   Median :16.0
                                        Mean   :2015   Mean   : 6.524   Mean   :15.7
                                        3rd Qu.:2015   3rd Qu.: 9.000   3rd Qu.:23.0
                                        Max.   :2015   Max.   :12.000   Max.   :31.0

   airline          flight_number  scheduled_departure departure_delay
 Length:5819079     Min.   :   1   Min.   :   1        Min.   : -82.00
 Class :character   1st Qu.: 730   1st Qu.: 917        1st Qu.:  -5.00
 Mode  :character   Median :1690   Median :1325        Median :  -2.00
                    Mean   :2173   Mean   :1330        Mean   :   9.37
                    3rd Qu.:3230   3rd Qu.:1730        3rd Qu.:   7.00
                    Max.   :9855   Max.   :2359        Max.   :1988.00
                                                       NA's   :86153

From the output above, we see that there are 86k observations (~1.5%) that have null values coming from the departure_delay column. We perform the following actions:


Fix columns

In this part, we fix existing columns and add new ones that will be useful later on:


Processing summary

The code to perform the abovementioned processing steps is as follows:

flights_df << flights_df_raw %>%
  # All rows should not have any null value
  na.omit() %>%
  # Only flights from set of airports and with reasonable delay amount
  filter((origin_airport %in% c('BOS''SFO''LAX''JFK')) &
           (departure_delay <= 24*60)) %>%
  # Fix columns
  mutate(flight_number = as.character(flight_number),
         scheduled_departure = convert_to_POSIXct(year, month, day, scheduled_departure),
         is_delayed = case_when(departure_delay >= 15 ~ 1,
                                TRUE ~ 0)) %>%
  # Remove redundant columns
  select(-year, -month, -day)
> summary(flights_df)
 destination_airport origin_airport       airline          flight_number
 Length:585905       Length:585905      Length:585905      Length:585905
 Class :character    Class :character   Class :character   Class :character
 Mode  :character    Mode  :character   Mode  :character   Mode  :character

 scheduled_departure           departure_delay     is_delayed
 Min.   :2015-01-01 00:10:00   Min.   : -38.00   Min.   :0.0000
 1st Qu.:2015-04-07 09:30:00   1st Qu.:  -5.00   1st Qu.:0.0000
 Median :2015-07-05 08:25:00   Median :  -1.00   Median :0.0000
 Mean   :2015-07-03 23:28:19   Mean   :  10.43   Mean   :0.1996
 3rd Qu.:2015-09-29 21:10:00   3rd Qu.:   9.00   3rd Qu.:0.0000
 Max.   :2015-12-31 23:59:00   Max.   :1429.00   Max.   :1.0000

From the summary output above, the dataset is now down to 585k observations. Also, we see that the NA problem has been solved, and that the is_delayed statistics gives away that roughly 20% of flights according to our definition of delay.


Joins, aggregations

In this part, we will explore the data from different angles using basic data frame manipulation techniques with the dplyr library.

Get airport name

We want to obtain the airport name corresponding to the airport code attached to flights.

In order to do that, we note the following:

We obtain the following command:

result << flights_df %>%
  merge(airports_df, by.x = 'origin_airport', by.y = 'iata_code')
> head(result)
  origin_airport destination_airport airline flight_number scheduled_departure departure_delay
1            BOS                 ATL      WN          1017 2015-05-19 05:50:00              -3
2            BOS                 ATL      DL          1501 2015-05-19 07:15:00              -1
3            BOS                 LGA      DL          2665 2015-10-02 07:00:00              -3
4            BOS                 LGA      AA          2156 2015-10-01 20:00:00              69
  is_delayed                                          airport
1          0 Gen. Edward Lawrence Logan International Airport
2          0 Gen. Edward Lawrence Logan International Airport
3          0 Gen. Edward Lawrence Logan International Airport
4          1 Gen. Edward Lawrence Logan International Airport

Number of flights per airport

We want to obtain the number of departing flights per airport across the year.

In order to do that, we note the following:

We obtain the following code:

result << flights_df %>%
  group_by(origin_airport) %>%
  summarize(nb_flights = n())
> head(result)
  origin_airport nb_flights
1 BOS                115393
2 JFK                100205
3 LAX                210240
4 SFO                160067

Delayed flights for biggest airlines

Among the biggest airlines, where we define the airline size as the number of yearly flights, we want to know which airline have less delays compared to others.

In order to do that, we note the following:

We obtain the following code:

result << flights_df %>%
  # Compute airline size and delay statistics
  group_by(airline) %>%
  summarize(nb_flights = n(),
            perc_delayed = round(mean(is_delayed), 4)) %>%
  ungroup() %>%
  # Keep biggest airlines
  mutate(rank_size = row_number(-nb_flights)) %>%
  filter(rank_size <= 10%>% select(-rank_size) %>%
  # Display airline name
  merge(airlines_df, by.x = 'airline', by.y = 'iata_code')
> head(result)
  airline nb_flights perc_delayed           airline_name
1      AA      85747       0.1499 American Airlines Inc.
2      AS      16196       0.1208   Alaska Airlines Inc.
3      US      16047       0.1445        US Airways Inc.

Data frame transformations

In this part, we will draw more advanced insights using data frame transformation techniques and window functions from the dplyr and tidyr libraries.

Temporal evolution of delays by airport

We want to obtain the monthly evolution of delays by origin airport.

In order to do that, we note the following:

We obtain the following code:

result << flights_df %>%
  # Create column specific full month name
  mutate(month = factor(format(scheduled_departure, '%B'),
                        levels = month.name)) %>%
  # Compute aggregation
  group_by(origin_airport, month) %>%
  summarize(perc_delayed = round(mean(is_delayed), 4)) %>%
  ungroup() %>%
  # Reshape data frame
  spread(key = 'origin_airport', value = 'perc_delayed'%>%
  # Order by month
  arrange(month)
> head(result)
     month    BOS    JFK    LAX    SFO
1  January 0.1902 0.2257 0.1738 0.2001
2 February 0.3248 0.3174 0.1978 0.2222
3    March 0.1984 0.2736 0.2246 0.1770
4    April 0.1553 0.2020 0.1855 0.1756

Month over month percent change in flights

We want to get the month over month percent change in flights.

In order to do that, we note the following:

We obtain the following code:

result << flights_df %>%
  # Create column specific full month name
  mutate(month = factor(format(scheduled_departure, '%B'),
                        levels = month.name)) %>%
  # Compute number of flights
  group_by(origin_airport, month) %>%
  summarize(nb_flights = n()) %>%
  ungroup() %>%
  # Get lagged values
  group_by(origin_airport) %>%
  mutate(nb_flights_before = lag(nb_flights, order_by = month, n = 1),
         perc_change = round((nb_flights - nb_flights_before) / nb_flights_before, 2)) %>%
  ungroup() %>%
  # Last touches
  arrange(origin_airport, month)
> head(result)
  origin_airport month    nb_flights nb_flights_before perc_change
1 BOS            January        8201                NA       NA
2 BOS            February       7251              8201       -0.12
3 BOS            March          9744              7251        0.34
4 BOS            April          9940              9744        0.02

We note that by definition, there will not be any MoM change for the first month considered.


Most popular routes by origin airport

We are interested in knowing the top 3 routes per origin airport. We define top routes from a given airport as being routes with the highest number of flights departing from that airport.

In order to do that, we note the following:

We obtain the following code:

result << flights_df %>%
  # Compute volume per route
  group_by(origin_airport, destination_airport) %>%
  summarize(nb_flights = n()) %>%
  ungroup() %>%
  # Compute rank within each origin airport and keep top ones
  group_by(origin_airport) %>%
  mutate(rank_popular = row_number(-nb_flights)) %>%
  ungroup() %>%
  filter(rank_popular <= 3%>%
  mutate(rank_popular = paste0('top_', rank_popular)) %>%
  select(-nb_flights) %>%
  # Reformat results
  spread(key = rank_popular, value = destination_airport)
> head(result)
  origin_airport top_1 top_2 top_3
1 BOS            DCA   LGA   ORD
2 JFK            LAX   SFO   MCO
3 LAX            SFO   JFK   LAS
4 SFO            LAX   JFK   LAS  

Conclusion

Knowing how to manipulate data in a concise and effective way is crucial for anyone working with data. This is a necessary skill to have to be able to visualize data and make models off of it.


You may also like...

Data manipulation with R
  • • Filtering
  • • Types of joins
  • • Aggregations, window functions
  • • Data frame transformation
Data visualization with R
  • • Scatterplots, line plots, histograms
  • • Boxplots, maps
  • • Customized legend
Beautiful plots with R
  • Detailed example on how to make beautiful plots with ggplot2, gganimate, sf