Want more content like this? Subscribe here to be notified of new releases!

Data science tools

Tutorial on how to manipulate data with R

Star

R dplyr tidyr lubridate

Motivation

The Department of Transportation publicly released a dataset that lists 5.8M flights that occurred in 2015, along with specificities such as delays, flight time and other information.

This article aims at showing good practices to manipulate data with R's most popular libraries using practical examples on the data above. The following topics are covered:

data cleaning with dplyr
data frame transformation with tidyr
handling of date-related values with lubridate

library(dplyr)      # Data frame manipulation
library(tidyr)      # Data frame transformation
library(lubridate)  # Date operations

In the parts below, we will focus on drawing insights about flights departing from BOS, JFK, SFO and LAX.

Data preprocessing

Retrieving data

We retrieve the data frames from the respective csv files and we take a peek at data types by running the following commands:

# Set path to working directory
setwd('path/to/dir')

# Retrieve data
flights_df_raw << read.csv('data/flights.csv')
airports_df << read.csv('data/airports.csv')
airlines_df << read.csv('data/airlines.csv')

# View data types and first rows of data frame
str(flights_df_raw)

> str(flights_df_raw)
Classes ‘data.table’ and 'data.frame':	5819079 obs. of  9 variables:
 $ destination_airport: chr  "ABE" "ABE" "ABE" "ABE" ...
 $ origin_airport     : chr  "ATL" "ATL" "ATL" "ATL" ...
 $ year               : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ month              : int  10 10 10 10 10 10 10 10 10 10 ...
 $ day                : int  1 1 1 2 2 2 3 3 4 4 ...
 $ airline            : chr  "EV" "EV" "DL" "EV" ...
 $ flight_number      : int  4940 5197 2582 4940 5197 2582 4940 2582 4940 5197 ...
 $ scheduled_departure: int  1027 1444 2106 1027 1444 2106 1027 2106 1027 1444 ...
 $ departure_delay    : int  -3 0 -3 12 -4 -2 -4 -2 0 18 ..

From the output above, we see that the data frame has 5.8M observations and 9 dimensions that inform on when the flight happened, the origin and destination, any potential delays as well as other miscellaneous indicators.

Check for data quality

We want to see if there are any outliers, absurd values or missing values. In order to do that, we run the following command:

# Compute statistics of columns
summary(flights_df_raw)

> summary(flights_df_raw)
 destination_airport origin_airport          year          month             day
 Length:5819079      Length:5819079     Min.   :2015   Min.   : 1.000   Min.   : 1.0
 Class :character    Class :character   1st Qu.:2015   1st Qu.: 4.000   1st Qu.: 8.0
 Mode  :character    Mode  :character   Median :2015   Median : 7.000   Median :16.0
                                        Mean   :2015   Mean   : 6.524   Mean   :15.7
                                        3rd Qu.:2015   3rd Qu.: 9.000   3rd Qu.:23.0
                                        Max.   :2015   Max.   :12.000   Max.   :31.0

   airline          flight_number  scheduled_departure departure_delay
 Length:5819079     Min.   :   1   Min.   :   1        Min.   : -82.00
 Class :character   1st Qu.: 730   1st Qu.: 917        1st Qu.:  -5.00
 Mode  :character   Median :1690   Median :1325        Median :  -2.00
                    Mean   :2173   Mean   :1330        Mean   :   9.37
                    3rd Qu.:3230   3rd Qu.:1730        3rd Qu.:   7.00
                    Max.   :9855   Max.   :2359        Max.   :1988.00
                                                       NA's   :86153

From the output above, we see that there are 86k observations (~1.5%) that have null values coming from the departure_delay column. We perform the following actions:

We remove rows with missing values with the na.omit() function.
We keep flights departing from airports that we want to look at with the filter() function.
Looking at the mean and median of departure_delay, we see that values are heavily right-skewed, and we have a maximum delay of 1988 (~ 33 hours). We choose to filter out flights that have more than 1 day delay.
We also convert flight_number from being integers to being character values with the as.character() function, by noting that these are IDs are have no ordered meaning.

Fix columns

In this part, we fix existing columns and add new ones that will be useful later on:

We add the following columns using the mutate() function, time of flight in POSIXct format by combining existing columns:

  convert_to_POSIXct << function(year, month, day, time){
    result << as.POSIXct(paste0(
      as.character(year), '-', as.character(month), '-', as.character(day), ' ',  # Date
      paste0(strrep('0', 4 - nchar(time)), as.character(time))                    # Time
    ), format = '%Y-%m-%d %H%M')

    return(result)
  }

We also want to have a binary answer to whether a flight was delayed or not. Using the departure_delay column that gives the actual number of delay minutes, we consider that any flight above 15 minutes delay are delayed, and any other flight is not delayed. We add this flag by using the case_when() command.

Processing summary

The code to perform the abovementioned processing steps is as follows:

flights_df << flights_df_raw %>%
  # All rows should not have any null value
  na.omit() %>%
  # Only flights from set of airports and with reasonable delay amount
  filter((origin_airport %in% c('BOS', 'SFO', 'LAX', 'JFK')) &
           (departure_delay <= 24*60)) %>%
  # Fix columns
  mutate(flight_number = as.character(flight_number),
         scheduled_departure = convert_to_POSIXct(year, month, day, scheduled_departure),
         is_delayed = case_when(departure_delay >= 15 ~ 1,
                                TRUE ~ 0)) %>%
  # Remove redundant columns
  select(-year, -month, -day)

> summary(flights_df)
 destination_airport origin_airport       airline          flight_number
 Length:585905       Length:585905      Length:585905      Length:585905
 Class :character    Class :character   Class :character   Class :character
 Mode  :character    Mode  :character   Mode  :character   Mode  :character

 scheduled_departure           departure_delay     is_delayed
 Min.   :2015-01-01 00:10:00   Min.   : -38.00   Min.   :0.0000
 1st Qu.:2015-04-07 09:30:00   1st Qu.:  -5.00   1st Qu.:0.0000
 Median :2015-07-05 08:25:00   Median :  -1.00   Median :0.0000
 Mean   :2015-07-03 23:28:19   Mean   :  10.43   Mean   :0.1996
 3rd Qu.:2015-09-29 21:10:00   3rd Qu.:   9.00   3rd Qu.:0.0000
 Max.   :2015-12-31 23:59:00   Max.   :1429.00   Max.   :1.0000

From the summary output above, the dataset is now down to 585k observations. Also, we see that the NA problem has been solved, and that the is_delayed statistics gives away that roughly 20% of flights according to our definition of delay.

Joins, aggregations

In this part, we will explore the data from different angles using basic data frame manipulation techniques with the dplyr library.

Get airport name

We want to obtain the airport name corresponding to the airport code attached to flights.

In order to do that, we note the following:

The flights_df data frame is joined with airports_df by using the merge() function.
The keys on which the data frames are being joined on are specified in the by.x and by.y arguments. It is good to note that if those keys had the same name, it would have been possible to have the single argument by along with the name of that key.

We obtain the following command:

result << flights_df %>%
  merge(airports_df, by.x = 'origin_airport', by.y = 'iata_code')

> head(result)
  origin_airport destination_airport airline flight_number scheduled_departure departure_delay
1            BOS                 ATL      WN          1017 2015-05-19 05:50:00              -3
2            BOS                 ATL      DL          1501 2015-05-19 07:15:00              -1
3            BOS                 LGA      DL          2665 2015-10-02 07:00:00              -3
4            BOS                 LGA      AA          2156 2015-10-01 20:00:00              69
  is_delayed                                          airport
1          0 Gen. Edward Lawrence Logan International Airport
2          0 Gen. Edward Lawrence Logan International Airport
3          0 Gen. Edward Lawrence Logan International Airport
4          1 Gen. Edward Lawrence Logan International Airport

Number of flights per airport

We want to obtain the number of departing flights per airport across the year.

In order to do that, we note the following:

The group_by() function groups the data by a given set of columns.
The summarize() function computes aggregations over the grouped columns specified in the previous step. One example of aggregation function is the n() function, which computes the number of rows within each group.

We obtain the following code:

result << flights_df %>%
  group_by(origin_airport) %>%
  summarize(nb_flights = n())

> head(result)
  origin_airport nb_flights
1 BOS                115393
2 JFK                100205
3 LAX                210240
4 SFO                160067

Delayed flights for biggest airlines

Among the biggest airlines, where we define the airline size as the number of yearly flights, we want to know which airline have less delays compared to others.

In order to do that, we note the following:

The percentage of delayed flights is first computed with the 2-step aggregation process using the group_by() and summarize() functions.
The ranking of each airline by their number of flights is then computed with the row_number() window function within the mutate() function. It is good to note that the ranking is done across all airlines.
The top 10 airlines with the highest volume of flighs are kept using the filter() function.
Names of airlines associated to their IATA code is then gathered using the merge() function with the airlines_df data frame.

We obtain the following code:

result << flights_df %>%
  # Compute airline size and delay statistics
  group_by(airline) %>%
  summarize(nb_flights = n(),
            perc_delayed = round(mean(is_delayed), 4)) %>%
  ungroup() %>%
  # Keep biggest airlines
  mutate(rank_size = row_number(-nb_flights)) %>%
  filter(rank_size <= 10) %>% select(-rank_size) %>%
  # Display airline name
  merge(airlines_df, by.x = 'airline', by.y = 'iata_code')

> head(result)
  airline nb_flights perc_delayed           airline_name
1      AA      85747       0.1499 American Airlines Inc.
2      AS      16196       0.1208   Alaska Airlines Inc.
3      US      16047       0.1445        US Airways Inc.

Data frame transformations

In this part, we will draw more advanced insights using data frame transformation techniques and window functions from the dplyr and tidyr libraries.

Temporal evolution of delays by airport

We want to obtain the monthly evolution of delays by origin airport.

In order to do that, we note the following:

A month column is created using the format() function along with the %B argument, which enables to retrieve the full month name out of a timestamp. The associated data type is a leveled factor so that it is possible to sort with the arrange() function.
The percentage of delayed flights is computed through the 2-step aggregation process using the group_by() and summarize() functions.
The data frame is reshaped from a long to a wide format using the spread() function that comes from the tidyr library.

We obtain the following code:

result << flights_df %>%
  # Create column specific full month name
  mutate(month = factor(format(scheduled_departure, '%B'),
                        levels = month.name)) %>%
  # Compute aggregation
  group_by(origin_airport, month) %>%
  summarize(perc_delayed = round(mean(is_delayed), 4)) %>%
  ungroup() %>%
  # Reshape data frame
  spread(key = 'origin_airport', value = 'perc_delayed') %>%
  # Order by month
  arrange(month)

> head(result)
     month    BOS    JFK    LAX    SFO
1  January 0.1902 0.2257 0.1738 0.2001
2 February 0.3248 0.3174 0.1978 0.2222
3    March 0.1984 0.2736 0.2246 0.1770
4    April 0.1553 0.2020 0.1855 0.1756

Month over month percent change in flights

We want to get the month over month percent change in flights.

In order to do that, we note the following:

Similar to the previous section, a month column is created using the format() function along with the %B argument, which enables to retrieve the full month name out of a timestamp. The associated data type is a leveled factor so that it is possible to sort with the arrange() function.
The lag() window function is used along with the group_by() function to find the number of flights of a given origin airport in a given month that happened one month before.

We obtain the following code:

result << flights_df %>%
  # Create column specific full month name
  mutate(month = factor(format(scheduled_departure, '%B'),
                        levels = month.name)) %>%
  # Compute number of flights
  group_by(origin_airport, month) %>%
  summarize(nb_flights = n()) %>%
  ungroup() %>%
  # Get lagged values
  group_by(origin_airport) %>%
  mutate(nb_flights_before = lag(nb_flights, order_by = month, n = 1),
         perc_change = round((nb_flights - nb_flights_before) / nb_flights_before, 2)) %>%
  ungroup() %>%
  # Last touches
  arrange(origin_airport, month)

> head(result)
  origin_airport month    nb_flights nb_flights_before perc_change
1 BOS            January        8201                NA       NA
2 BOS            February       7251              8201       -0.12
3 BOS            March          9744              7251        0.34
4 BOS            April          9940              9744        0.02

We note that by definition, there will not be any MoM change for the first month considered.

Most popular routes by origin airport

We are interested in knowing the top 3 routes per origin airport. We define top routes from a given airport as being routes with the highest number of flights departing from that airport.

In order to do that, we note the following:

The number of flights by route is first computed using the 2-step aggregation processing using the group_by() and summarize() functions.
Then, the rank of each route within each group of origin airport is found by first grouping by origin_airport and then using the row_number() window function.
The top routes are then filtered using the filter() function.
Finally, the data frame is reshaped using the spread() function from long to wide format.

We obtain the following code:

result << flights_df %>%
  # Compute volume per route
  group_by(origin_airport, destination_airport) %>%
  summarize(nb_flights = n()) %>%
  ungroup() %>%
  # Compute rank within each origin airport and keep top ones
  group_by(origin_airport) %>%
  mutate(rank_popular = row_number(-nb_flights)) %>%
  ungroup() %>%
  filter(rank_popular <= 3) %>%
  mutate(rank_popular = paste0('top_', rank_popular)) %>%
  select(-nb_flights) %>%
  # Reformat results
  spread(key = rank_popular, value = destination_airport)

> head(result)
  origin_airport top_1 top_2 top_3
1 BOS            DCA   LGA   ORD
2 JFK            LAX   SFO   MCO
3 LAX            SFO   JFK   LAS
4 SFO            LAX   JFK   LAS

Conclusion

Knowing how to manipulate data in a concise and effective way is crucial for anyone working with data. This is a necessary skill to have to be able to visualize data and make models off of it.

Tutorial on how to manipulate data with R

Motivation

Data preprocessing

Retrieving data

Check for data quality

Fix columns

Processing summary

Joins, aggregations

Get airport name

Number of flights per airport

Delayed flights for biggest airlines

Data frame transformations

Temporal evolution of delays by airport

Month over month percent change in flights

Most popular routes by origin airport

Conclusion

You may also like...