Data science tools
Tutorial on how to manipulate data with R
R
dplyr
tidyr
lubridate
By Afshine Amidi and Shervine Amidi
Motivation
The Department of Transportation publicly released a dataset that lists 5.8M flights that occurred in 2015, along with specificities such as delays, flight time and other information.
This article aims at showing good practices to manipulate data with R's most popular libraries using practical examples on the data above. The following topics are covered:
- data cleaning with
dplyr
- data frame transformation with
tidyr
- handling of date-related values with
lubridate
library(dplyr) # Data frame manipulation
library(tidyr) # Data frame transformation
library(lubridate) # Date operations
In the parts below, we will focus on drawing insights about flights departing from BOS, JFK, SFO and LAX.
Data preprocessing
Retrieving data
We retrieve the data frames from the respective csv files and we take a peek at data types by running the following commands:
# Set path to working directory
setwd('path/to/dir')
# Retrieve data
flights_df_raw << read.csv('data/flights.csv')
airports_df << read.csv('data/airports.csv')
airlines_df << read.csv('data/airlines.csv')
# View data types and first rows of data frame
str(flights_df_raw)
> str(flights_df_raw)
Classes ‘data.table’ and 'data.frame': 5819079 obs. of 9 variables:
$ destination_airport: chr "ABE" "ABE" "ABE" "ABE" ...
$ origin_airport : chr "ATL" "ATL" "ATL" "ATL" ...
$ year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
$ month : int 10 10 10 10 10 10 10 10 10 10 ...
$ day : int 1 1 1 2 2 2 3 3 4 4 ...
$ airline : chr "EV" "EV" "DL" "EV" ...
$ flight_number : int 4940 5197 2582 4940 5197 2582 4940 2582 4940 5197 ...
$ scheduled_departure: int 1027 1444 2106 1027 1444 2106 1027 2106 1027 1444 ...
$ departure_delay : int -3 0 -3 12 -4 -2 -4 -2 0 18 ..
From the output above, we see that the data frame has 5.8M observations and 9 dimensions that inform on when the flight happened, the origin and destination, any potential delays as well as other miscellaneous indicators.
Check for data quality
We want to see if there are any outliers, absurd values or missing values. In order to do that, we run the following command:
# Compute statistics of columns
summary(flights_df_raw)
> summary(flights_df_raw)
destination_airport origin_airport year month day
Length:5819079 Length:5819079 Min. :2015 Min. : 1.000 Min. : 1.0
Class :character Class :character 1st Qu.:2015 1st Qu.: 4.000 1st Qu.: 8.0
Mode :character Mode :character Median :2015 Median : 7.000 Median :16.0
Mean :2015 Mean : 6.524 Mean :15.7
3rd Qu.:2015 3rd Qu.: 9.000 3rd Qu.:23.0
Max. :2015 Max. :12.000 Max. :31.0
airline flight_number scheduled_departure departure_delay
Length:5819079 Min. : 1 Min. : 1 Min. : -82.00
Class :character 1st Qu.: 730 1st Qu.: 917 1st Qu.: -5.00
Mode :character Median :1690 Median :1325 Median : -2.00
Mean :2173 Mean :1330 Mean : 9.37
3rd Qu.:3230 3rd Qu.:1730 3rd Qu.: 7.00
Max. :9855 Max. :2359 Max. :1988.00
NA's :86153
From the output above, we see that there are 86k observations (~1.5%) that have null values coming from the departure_delay
column. We perform the following actions:
- We remove rows with missing values with the
na.omit()
function. - We keep flights departing from airports that we want to look at with the
filter()
function. - Looking at the mean and median of
departure_delay
, we see that values are heavily right-skewed, and we have a maximum delay of 1988 (~ 33 hours). We choose to filter out flights that have more than 1 day delay. - We also convert
flight_number
from being integers to being character values with theas.character()
function, by noting that these are IDs are have no ordered meaning.
Fix columns
In this part, we fix existing columns and add new ones that will be useful later on:
- We add the following columns using the
mutate()
function, time of flight inPOSIXct
format by combining existing columns:
convert_to_POSIXct << function(year, month, day, time){ result << as.POSIXct(paste0( as.character(year), '-', as.character(month), '-', as.character(day), ' ', # Date paste0(strrep('0', 4 - nchar(time)), as.character(time)) # Time ), format = '%Y-%m-%d %H%M') return(result) }
- We also want to have a binary answer to whether a flight was delayed or not. Using the
departure_delay
column that gives the actual number of delay minutes, we consider that any flight above 15 minutes delay are delayed, and any other flight is not delayed. We add this flag by using thecase_when()
command.
Processing summary
The code to perform the abovementioned processing steps is as follows:
flights_df << flights_df_raw %>%
# All rows should not have any null value
na.omit() %>%
# Only flights from set of airports and with reasonable delay amount
filter((origin_airport %in% c('BOS', 'SFO', 'LAX', 'JFK')) &
(departure_delay <= 24*60)) %>%
# Fix columns
mutate(flight_number = as.character(flight_number),
scheduled_departure = convert_to_POSIXct(year, month, day, scheduled_departure),
is_delayed = case_when(departure_delay >= 15 ~ 1,
TRUE ~ 0)) %>%
# Remove redundant columns
select(-year, -month, -day)
> summary(flights_df)
destination_airport origin_airport airline flight_number
Length:585905 Length:585905 Length:585905 Length:585905
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
scheduled_departure departure_delay is_delayed
Min. :2015-01-01 00:10:00 Min. : -38.00 Min. :0.0000
1st Qu.:2015-04-07 09:30:00 1st Qu.: -5.00 1st Qu.:0.0000
Median :2015-07-05 08:25:00 Median : -1.00 Median :0.0000
Mean :2015-07-03 23:28:19 Mean : 10.43 Mean :0.1996
3rd Qu.:2015-09-29 21:10:00 3rd Qu.: 9.00 3rd Qu.:0.0000
Max. :2015-12-31 23:59:00 Max. :1429.00 Max. :1.0000
From the summary output above, the dataset is now down to 585k observations. Also, we see that the NA
problem has been solved, and that the is_delayed
statistics gives away that roughly 20% of flights according to our definition of delay.
Joins, aggregations
In this part, we will explore the data from different angles using basic data frame manipulation techniques with the dplyr
library.
Get airport name
We want to obtain the airport name corresponding to the airport code attached to flights.
In order to do that, we note the following:
- The
flights_df
data frame is joined withairports_df
by using themerge()
function. - The keys on which the data frames are being joined on are specified in the
by.x
andby.y
arguments. It is good to note that if those keys had the same name, it would have been possible to have the single argumentby
along with the name of that key.
result << flights_df %>%
merge(airports_df, by.x = 'origin_airport', by.y = 'iata_code')
> head(result)
origin_airport destination_airport airline flight_number scheduled_departure departure_delay
1 BOS ATL WN 1017 2015-05-19 05:50:00 -3
2 BOS ATL DL 1501 2015-05-19 07:15:00 -1
3 BOS LGA DL 2665 2015-10-02 07:00:00 -3
4 BOS LGA AA 2156 2015-10-01 20:00:00 69
is_delayed airport
1 0 Gen. Edward Lawrence Logan International Airport
2 0 Gen. Edward Lawrence Logan International Airport
3 0 Gen. Edward Lawrence Logan International Airport
4 1 Gen. Edward Lawrence Logan International Airport
Number of flights per airport
We want to obtain the number of departing flights per airport across the year.
In order to do that, we note the following:
- The
group_by()
function groups the data by a given set of columns. - The
summarize()
function computes aggregations over the grouped columns specified in the previous step. One example of aggregation function is then()
function, which computes the number of rows within each group.
result << flights_df %>%
group_by(origin_airport) %>%
summarize(nb_flights = n())
> head(result)
origin_airport nb_flights
1 BOS 115393
2 JFK 100205
3 LAX 210240
4 SFO 160067
Delayed flights for biggest airlines
Among the biggest airlines, where we define the airline size as the number of yearly flights, we want to know which airline have less delays compared to others.
In order to do that, we note the following:
- The percentage of delayed flights is first computed with the 2-step aggregation process using the
group_by()
andsummarize()
functions. - The ranking of each airline by their number of flights is then computed with the
row_number()
window function within themutate()
function. It is good to note that the ranking is done across all airlines. - The top 10 airlines with the highest volume of flighs are kept using the
filter()
function. - Names of airlines associated to their IATA code is then gathered using the
merge()
function with theairlines_df
data frame.
result << flights_df %>%
# Compute airline size and delay statistics
group_by(airline) %>%
summarize(nb_flights = n(),
perc_delayed = round(mean(is_delayed), 4)) %>%
ungroup() %>%
# Keep biggest airlines
mutate(rank_size = row_number(-nb_flights)) %>%
filter(rank_size <= 10) %>% select(-rank_size) %>%
# Display airline name
merge(airlines_df, by.x = 'airline', by.y = 'iata_code')
> head(result)
airline nb_flights perc_delayed airline_name
1 AA 85747 0.1499 American Airlines Inc.
2 AS 16196 0.1208 Alaska Airlines Inc.
3 US 16047 0.1445 US Airways Inc.
Data frame transformations
In this part, we will draw more advanced insights using data frame transformation techniques and window functions from the dplyr
and tidyr
libraries.
Temporal evolution of delays by airport
We want to obtain the monthly evolution of delays by origin airport.
In order to do that, we note the following:
- A month column is created using the
format()
function along with the%B
argument, which enables to retrieve the full month name out of a timestamp. The associated data type is a leveled factor so that it is possible to sort with thearrange()
function. - The percentage of delayed flights is computed through the 2-step aggregation process using the
group_by()
andsummarize()
functions. - The data frame is reshaped from a long to a wide format using the
spread()
function that comes from thetidyr
library.
result << flights_df %>%
# Create column specific full month name
mutate(month = factor(format(scheduled_departure, '%B'),
levels = month.name)) %>%
# Compute aggregation
group_by(origin_airport, month) %>%
summarize(perc_delayed = round(mean(is_delayed), 4)) %>%
ungroup() %>%
# Reshape data frame
spread(key = 'origin_airport', value = 'perc_delayed') %>%
# Order by month
arrange(month)
> head(result)
month BOS JFK LAX SFO
1 January 0.1902 0.2257 0.1738 0.2001
2 February 0.3248 0.3174 0.1978 0.2222
3 March 0.1984 0.2736 0.2246 0.1770
4 April 0.1553 0.2020 0.1855 0.1756
Month over month percent change in flights
We want to get the month over month percent change in flights.
In order to do that, we note the following:
- Similar to the previous section, a month column is created using the
format()
function along with the%B
argument, which enables to retrieve the full month name out of a timestamp. The associated data type is a leveled factor so that it is possible to sort with thearrange()
function. - The
lag()
window function is used along with thegroup_by()
function to find the number of flights of a given origin airport in a given month that happened one month before.
result << flights_df %>%
# Create column specific full month name
mutate(month = factor(format(scheduled_departure, '%B'),
levels = month.name)) %>%
# Compute number of flights
group_by(origin_airport, month) %>%
summarize(nb_flights = n()) %>%
ungroup() %>%
# Get lagged values
group_by(origin_airport) %>%
mutate(nb_flights_before = lag(nb_flights, order_by = month, n = 1),
perc_change = round((nb_flights - nb_flights_before) / nb_flights_before, 2)) %>%
ungroup() %>%
# Last touches
arrange(origin_airport, month)
> head(result)
origin_airport month nb_flights nb_flights_before perc_change
1 BOS January 8201 NA NA
2 BOS February 7251 8201 -0.12
3 BOS March 9744 7251 0.34
4 BOS April 9940 9744 0.02
We note that by definition, there will not be any MoM change for the first month considered.
Most popular routes by origin airport
We are interested in knowing the top 3 routes per origin airport. We define top routes from a given airport as being routes with the highest number of flights departing from that airport.
In order to do that, we note the following:
- The number of flights by route is first computed using the 2-step aggregation processing using the
group_by()
andsummarize()
functions. - Then, the rank of each route within each group of origin airport is found by first grouping by
origin_airport
and then using therow_number()
window function. - The top routes are then filtered using the
filter()
function. - Finally, the data frame is reshaped using the
spread()
function from long to wide format.
result << flights_df %>%
# Compute volume per route
group_by(origin_airport, destination_airport) %>%
summarize(nb_flights = n()) %>%
ungroup() %>%
# Compute rank within each origin airport and keep top ones
group_by(origin_airport) %>%
mutate(rank_popular = row_number(-nb_flights)) %>%
ungroup() %>%
filter(rank_popular <= 3) %>%
mutate(rank_popular = paste0('top_', rank_popular)) %>%
select(-nb_flights) %>%
# Reformat results
spread(key = rank_popular, value = destination_airport)
> head(result)
origin_airport top_1 top_2 top_3
1 BOS DCA LGA ORD
2 JFK LAX SFO MCO
3 LAX SFO JFK LAS
4 SFO LAX JFK LAS
Conclusion
Knowing how to manipulate data in a concise and effective way is crucial for anyone working with data. This is a necessary skill to have to be able to visualize data and make models off of it.