Data science tools
Tutorial on how to manipulate data with Python
Tags: python, pandas, numpy, datetime, os
By Afshine Amidi and Shervine Amidi
Motivation
The Department of Transportation publicly released a dataset listing flights that occurred in 2015, along with specifics such as delays, flight time and other information.
This article aims to show good practices for manipulating data using Python's most popular libraries. The following are covered:
- cleaning data with pandas
- making specific changes with numpy
- handling date-related values with datetime
import os # File management
import pandas as pd # Data frame manipulation
import numpy as np # Data frame operations
import datetime as dt # Date operations
In the parts below, we will focus on drawing insights about flights departing from BOS, JFK, SFO and LAX.
Data preprocessing
Retrieving data
We retrieve the data frames from the respective csv files and we take a peek at data types by running the following commands:
# Retrieve data
flights_df_raw = pd.read_csv('flights.csv')
airports_df = pd.read_csv('airports.csv')
airlines_df = pd.read_csv('airlines.csv')
# View data types and first rows of data frame
flights_df_raw.info()
>>> flights_df_raw.info()
RangeIndex: 5819079 entries, 0 to 5819078
Data columns (total 9 columns):
origin_airport object
destination_airport object
year int64
month int64
day int64
airline object
flight_number int64
scheduled_departure int64
departure_delay float64
dtypes: float64(1), int64(5), object(3)
memory usage: 399.6+ MB
From the output above, we see that the data frame has 5.8M observations and 9 columns that describe when the flight happened, its origin and destination, any potential delays, as well as other miscellaneous indicators.
Check for data quality
We want to see if there are any outliers, absurd values or missing values. In order to do that, we run the following command:
# Compute statistics of columns
flights_df_raw.describe()
>>> flights_df_raw.describe()
year month day flight_number scheduled_departure departure_delay
count 5819079.0 5.819079e+06 5.819079e+06 5.819079e+06 5.819079e+06 5.732926e+06
mean 2015.0 6.524085e+00 1.570459e+01 2.173093e+03 1.329602e+03 9.370158e+00
std 0.0 3.405137e+00 8.783425e+00 1.757064e+03 4.837518e+02 3.708094e+01
min 2015.0 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 -8.200000e+01
25% 2015.0 4.000000e+00 8.000000e+00 7.300000e+02 9.170000e+02 -5.000000e+00
50% 2015.0 7.000000e+00 1.600000e+01 1.690000e+03 1.325000e+03 -2.000000e+00
75% 2015.0 9.000000e+00 2.300000e+01 3.230000e+03 1.730000e+03 7.000000e+00
max 2015.0 1.200000e+01 3.100000e+01 9.855000e+03 2.359000e+03 1.988000e+03
From the output above, we see that there are 86k observations (~1.5%) with null values in the departure_delay column. We perform the following actions:
- We remove rows with missing values with the dropna() method.
- We keep flights departing from the airports we want to look at with a [lambda x: ...] filter.
- Looking at the mean and median of departure_delay, we see that values are heavily right-skewed, with a maximum delay of 1988 minutes (~33 hours). We choose to filter out flights that are delayed by more than one day.
- We also convert flight_number from integers to character values with the astype() method, noting that these are IDs and have no ordered meaning.
Fix columns
In this part, we fix existing columns and add new ones that will be useful later on:
- We add the scheduled departure time in datetime format by combining existing columns with the assign() method:
def convert_to_timedate(x, col='scheduled_departure'):
    """Converts year/month/day and HHMM columns to a timestamp"""
    return pd.to_datetime(
        x['year'].astype(str) + '-' +
        x['month'].astype(str) + '-' +
        x['day'].astype(str) + ' ' +
        # Left-pad the HHMM value with zeros up to 4 characters
        x.assign(zero='0')['zero'] * (4 - x[col].astype(str).str.len()) +
        x[col].astype(str),
        format='%Y-%m-%d %H%M',
        errors='coerce'
    )
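As a quick sanity check, the helper can be exercised on a small hypothetical frame (the function is repeated here so the snippet is self-contained):

```python
import pandas as pd

def convert_to_timedate(x, col='scheduled_departure'):
    """Converts year/month/day and HHMM columns to a timestamp"""
    return pd.to_datetime(
        x['year'].astype(str) + '-' +
        x['month'].astype(str) + '-' +
        x['day'].astype(str) + ' ' +
        # Left-pad the HHMM value with zeros up to 4 characters
        x.assign(zero='0')['zero'] * (4 - x[col].astype(str).str.len()) +
        x[col].astype(str),
        format='%Y-%m-%d %H%M',
        errors='coerce'
    )

# Hypothetical two-row sample: an 8:30 AM and a 5:45 PM departure
sample = pd.DataFrame({
    'year': [2015, 2015],
    'month': [1, 10],
    'day': [5, 31],
    'scheduled_departure': [830, 1745],
})
print(convert_to_timedate(sample))
```

Note how the 830 value is padded to 0830 before parsing, so both rows yield valid timestamps.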
- We also want a binary answer to whether a flight was delayed or not. Using the departure_delay column, which gives the actual number of delay minutes, we consider any flight delayed by 15 minutes or more as delayed, and any other flight as not delayed. We add this flag using the np.select() command.
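As a standalone sketch of that flag logic, on hypothetical delay values:

```python
import numpy as np
import pandas as pd

# Hypothetical delay values in minutes
delays = pd.Series([-3, 0, 20, 114])

# np.select evaluates the condition list and picks the matching choice,
# falling back to the default for rows where no condition holds
is_delayed = np.select([delays >= 15], [1], default=0)
print(is_delayed)  # [0 0 1 1]
```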
Processing summary
The code to perform the processing steps described above is as follows:
flights_df = (
flights_df_raw
# All rows should not have any null value
.dropna()
# Only flights from set of airports and with reasonable delay amount
[lambda x:
(x['origin_airport'].isin(['BOS', 'SFO', 'LAX', 'JFK'])) &
(x['departure_delay'] <= 24*60)
]
# Fix columns
.assign(
flight_number=lambda x: x['flight_number'].astype(str),
scheduled_departure=lambda x: convert_to_timedate(x),
is_delayed=lambda x: np.select(
[x['departure_delay'] >= 15], [1],
default=0
)
)
# Remove redundant columns
.drop(['year', 'month', 'day'], axis=1)
)
>>> flights_df.describe()
departure_delay is_delayed
count 585905.000000 585905.000000
mean 10.426021 0.199613
std 37.151542 0.399709
min -38.000000 0.000000
25% -5.000000 0.000000
50% -1.000000 0.000000
75% 9.000000 0.000000
max 1429.000000 1.000000
From the summary output above, the dataset is now down to 585k observations. We also see that the missing value problem has been solved, and that the is_delayed statistics show that roughly 20% of flights are delayed according to our definition.
Joins and aggregations
In this part, we will explore the data from different angles using basic data frame manipulation techniques with the pandas library.
Get airport name
We want to obtain the airport name corresponding to the airport code attached to flights.
In order to do that, we note the following:
- The flights_df data frame is joined with airports_df using the merge() method.
- The keys on which the data frames are joined are specified in the left_on and right_on arguments. It is good to note that if those keys had the same name, it would have been possible to use the single argument on along with the name of that key.
result = (
flights_df
.merge(airports_df, left_on='origin_airport', right_on='iata_code')
)
>>> result.head()
origin_airport destination_airport airline flight_number scheduled_departure
0 JFK ABQ B6 65 2015-10-01 20:00:00
1 JFK ABQ B6 65 2015-10-02 20:00:00
2 JFK ABQ B6 65 2015-10-03 20:00:00
3 JFK ABQ B6 65 2015-10-04 20:00:00
departure_delay is_delayed iata_code airport
0 114.0 1 JFK John F. Kennedy International Airport (New Yor...
1 27.0 1 JFK John F. Kennedy International Airport (New Yor...
2 -2.0 0 JFK John F. Kennedy International Airport (New Yor...
3 6.0 0 JFK John F. Kennedy International Airport (New Yor...
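To illustrate the remark about shared key names, here is a minimal sketch with hypothetical toy frames, where the single on argument replaces left_on/right_on:

```python
import pandas as pd

# Hypothetical toy frames that share the key name 'iata_code'
flights = pd.DataFrame({
    'iata_code': ['JFK', 'BOS'],
    'departure_delay': [114.0, -2.0],
})
airports = pd.DataFrame({
    'iata_code': ['JFK', 'BOS'],
    'airport': ['John F. Kennedy International', 'Logan International'],
})

# Since both frames name the key identically, a single `on` suffices
result = flights.merge(airports, on='iata_code')
print(result)
```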
Number of flights per airport
We want to obtain the number of departing flights per airport across the year.
In order to do that, we note the following:
- The groupby() method groups the data by a given set of columns.
- The apply() method computes aggregations over the groups formed in the previous step. One example of an aggregation function is len(), which computes the number of rows within each group.
result = (
flights_df
.groupby(['origin_airport'])
.apply(lambda x: pd.Series({
'nb_flights': len(x['flight_number'])
}))
.reset_index()
)
>>> result.head()
origin_airport nb_flights
0 BOS 115393
1 JFK 100205
2 LAX 210240
3 SFO 160067
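As a side note, the same per-group row count can also be obtained more directly with size(); a minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with one row per flight
df = pd.DataFrame({'origin_airport': ['BOS', 'BOS', 'JFK']})

# size() counts rows per group without an explicit apply()
counts = df.groupby('origin_airport').size().reset_index(name='nb_flights')
print(counts)
```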
Delayed flights for biggest airlines
Among the biggest airlines, where we define airline size as the number of yearly flights, we want to know which airlines have fewer delays compared to others.
In order to do that, we note the following:
- The percentage of delayed flights is first computed with the 2-step aggregation process using the groupby() and apply() methods.
- The ranking of each airline by its number of flights is then computed with the rank(method='first') window function within the assign() method. It is good to note that the ranking is done across all airlines.
- The top 10 airlines with the highest volume of flights are kept using the [lambda x: ...] filter.
- The names of airlines associated with their IATA codes are then gathered using the merge() method with the airlines_df data frame.
result = (
flights_df
# Compute airline size and delay statistics
.groupby(['airline'])
.apply(lambda x: pd.Series({
'nb_flights': len(x['flight_number']),
'perc_delayed': round(np.average(x['is_delayed']), 4)
}))
.reset_index()
# Keep biggest airlines
.assign(rank_size=lambda x: x['nb_flights'].rank(method='first', ascending=False))
[lambda x: x['rank_size'] <= 10]
.drop('rank_size', axis=1)
# Display airline name
.merge(airlines_df, left_on='airline', right_on='iata_code')
.reset_index(drop=True)
)
>>> result.head()
airline nb_flights perc_delayed iata_code airline_name
0 AA 85747.0 0.1499 AA American Airlines Inc.
1 AS 16196.0 0.1208 AS Alaska Airlines Inc.
2 B6 93854.0 0.2046 B6 JetBlue Airways
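To make the ranking step concrete, rank(method='first') assigns distinct ranks even on ties, breaking them by order of appearance; a small sketch with hypothetical flight counts:

```python
import pandas as pd

# Hypothetical yearly flight counts, including a tie
nb_flights = pd.Series([300, 100, 300, 200])

# method='first' breaks the tie between the two 300s by position
ranks = nb_flights.rank(method='first', ascending=False)
print(ranks.tolist())  # [1.0, 4.0, 2.0, 3.0]
```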
Data frame transformations
In this part, we will draw more advanced insights using data frame transformation techniques and window functions from the pandas library.
Temporal evolution of delays by airport
We want to obtain the monthly evolution of delays by origin airport.
In order to do that, we note the following:
- A month column is created using the strftime() method along with the '%B' and '%m' arguments, which retrieve the full month name and the month number of a timestamp respectively. The numeric month_num column makes it possible to sort chronologically with the sort_values() method.
- The percentage of delayed flights is computed through the 2-step aggregation process using the groupby() and apply() methods.
- The data frame is reshaped from a long to a wide format using the pivot_table() method from the pandas library.
result = (
flights_df
# Create column specific full month name
.assign(
month=lambda x: x['scheduled_departure'].dt.strftime('%B'),
month_num=lambda x: x['scheduled_departure'].dt.strftime('%m').astype(int)
)
# Compute aggregation
.groupby(['origin_airport', 'month', 'month_num'])
.apply(lambda x: pd.Series({
'perc_delayed': round(np.average(x['is_delayed']), 4)
}))
.reset_index()
# Reshape data frame
.pivot_table(
columns='origin_airport',
values='perc_delayed',
index=['month', 'month_num'],
aggfunc=np.sum
)
.rename_axis(None, axis=1)
.reset_index()
# Order by month
.sort_values(['month_num'])
.drop(['month_num'], axis=1)
.reset_index(drop=True)
)
>>> result.head()
month BOS JFK LAX SFO
0 January 0.1902 0.2257 0.1738 0.2001
1 February 0.3248 0.3174 0.1978 0.2222
2 March 0.1984 0.2736 0.2246 0.1770
3 April 0.1553 0.2020 0.1855 0.1756
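The '%B' / '%m' distinction can be checked on a couple of hypothetical timestamps:

```python
import pandas as pd

# Hypothetical timestamps
ts = pd.Series(pd.to_datetime(['2015-02-14', '2015-11-03']))

# '%B' gives the full month name, '%m' the zero-padded month number
print(ts.dt.strftime('%B').tolist())  # ['February', 'November']
print(ts.dt.strftime('%m').tolist())  # ['02', '11']
```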
Month over month percent change in flights
We want to get the month over month percent change in flights.
In order to do that, we note the following:
- Similar to the previous section, a month column is created using the strftime() method along with the '%B' and '%m' arguments, and the numeric month_num column makes it possible to sort chronologically with the sort_values() method.
- The shift() window function is used along with the groupby() method to find, for a given origin airport and month, the number of flights that happened one month before.
result = (
flights_df
# Create column specific full month name
.assign(
month=lambda x: x['scheduled_departure'].dt.strftime('%B'),
month_num=lambda x: x['scheduled_departure'].dt.strftime('%m').astype(int)
)
# Compute aggregation
.groupby(['origin_airport', 'month', 'month_num'])
.apply(lambda x: pd.Series({
'nb_flights': len(x['flight_number'])
}))
.reset_index()
.sort_values(['origin_airport', 'month_num'])
.reset_index(drop=True)
# Get lagged values
.assign(
nb_flights_before=lambda x:
x.groupby(['origin_airport'])['nb_flights'].shift(1),
perc_change=lambda x:
round((x['nb_flights'] - x['nb_flights_before']) / x['nb_flights_before'], 2)
)
# Last touches
.drop(['month_num'], axis=1)
)
>>> result.head()
origin_airport month nb_flights nb_flights_before perc_change
0 BOS January 8201 NaN NaN
1 BOS February 7251 8201.0 -0.12
2 BOS March 9744 7251.0 0.34
3 BOS April 9940 9744.0 0.02
We note that by definition, there will not be any MoM change for the first month considered.
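The grouped shift() behavior, including the NaN on each group's first row, can be seen on a hypothetical frame:

```python
import pandas as pd

# Hypothetical monthly flight counts for two airports
df = pd.DataFrame({
    'origin_airport': ['BOS', 'BOS', 'JFK', 'JFK'],
    'nb_flights': [10, 12, 7, 9],
})

# shift(1) within each group pulls the previous month's value;
# each group's first row has no predecessor, hence NaN
df['nb_flights_before'] = df.groupby('origin_airport')['nb_flights'].shift(1)
print(df)
```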
Most popular routes by origin airport
We are interested in knowing the top 3 routes per origin airport. We define top routes from a given airport as being routes with the highest number of flights departing from that airport.
In order to do that, we note the following:
- The number of flights by route is first computed using the 2-step aggregation process with the groupby() and apply() methods.
- Then, the rank of each route within each group of origin airport is found by first grouping by origin_airport and then using the rank(method='first') window function.
- The top routes are then filtered using the [lambda x: ...] step.
- Finally, the data frame is reshaped from long to wide format using the pivot_table() method.
result = (
flights_df
# Compute volume per route
.groupby(['origin_airport', 'destination_airport'])
.apply(lambda x: pd.Series({
'nb_flights': len(x['flight_number'])
}))
.reset_index()
# Compute rank within each origin airport and keep top ones
.assign(
rank_popular=lambda x:
x.groupby(['origin_airport'])['nb_flights'].rank(method='first', ascending=False)
)
[lambda x: x['rank_popular'] <= 3]
.reset_index(drop=True)
.drop(['nb_flights'], axis=1)
.assign(rank_popular=lambda x: 'top_' + x['rank_popular'].astype(int).map(str))
# Reformat results
.pivot_table(
columns='rank_popular',
values='destination_airport',
index='origin_airport',
aggfunc=np.sum
)
.rename_axis(None, axis=1)
.reset_index()
)
>>> result.head()
origin_airport top_1 top_2 top_3
0 BOS DCA LGA ORD
1 JFK LAX SFO MCO
2 LAX SFO JFK LAS
3 SFO LAX JFK LAS
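The long-to-wide reshape can be sketched in isolation on a hypothetical frame; since each (origin, rank) cell holds exactly one destination, aggfunc='first' is used here in place of a sum:

```python
import pandas as pd

# Hypothetical long-format top routes
routes = pd.DataFrame({
    'origin_airport': ['BOS', 'BOS', 'JFK', 'JFK'],
    'rank_popular': ['top_1', 'top_2', 'top_1', 'top_2'],
    'destination_airport': ['DCA', 'LGA', 'LAX', 'SFO'],
})

# Each (origin, rank) pair is unique, so 'first' simply picks that value
wide = (
    routes
    .pivot_table(
        columns='rank_popular',
        values='destination_airport',
        index='origin_airport',
        aggfunc='first'
    )
    .rename_axis(None, axis=1)
    .reset_index()
)
print(wide)
```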
Conclusion
Knowing how to manipulate data in a concise and effective way is crucial for anyone working with data. It is a necessary skill for visualizing data and building models on top of it.