
Data science tools

Tutorial on how to manipulate data with Python


By Afshine Amidi and Shervine Amidi

Motivation

The Department of Transportation publicly released a dataset that lists flights that occurred in 2015, along with specificities such as delays, flight time and other information.

This article aims to show good practices for manipulating data using Python's most popular libraries. The following libraries are used throughout:

import os               # File management
import pandas as pd     # Data frame manipulation
import numpy as np      # Data frame operations
import datetime as dt   # Date operations

In the parts below, we will focus on drawing insights about flights departing from BOS, JFK, SFO and LAX.


Data preprocessing

Retrieving data

We retrieve the data frames from their respective csv files and take a peek at the data types by running the following commands:

# Retrieve data
flights_df_raw = pd.read_csv('flights.csv')
airports_df = pd.read_csv('airports.csv')
airlines_df = pd.read_csv('airlines.csv')

# View data types and first rows of data frame
flights_df_raw.info()
>>> flights_df_raw.info()
RangeIndex: 5819079 entries, 0 to 5819078
Data columns (total 9 columns):
origin_airport         object
destination_airport    object
year                   int64
month                  int64
day                    int64
airline                object
flight_number          int64
scheduled_departure    int64
departure_delay        float64
dtypes: float64(1), int64(5), object(3)
memory usage: 399.6+ MB

From the output above, we see that the data frame has 5.8M observations and 9 columns that indicate when each flight occurred, its origin and destination airports, its airline and flight number, and its departure delay.
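Note that with 5.8M rows the data frame already uses roughly 400 MB of memory. If memory is a concern, one option is to restrict the columns and dtypes at read time; the snippet below is a minimal sketch, assuming the column names listed above match those of the csv file.

# Optional: lighter read, assuming the csv columns match the names listed above
flights_df_raw = pd.read_csv(
    'flights.csv',
    usecols=['origin_airport', 'destination_airport', 'year', 'month', 'day',
             'airline', 'flight_number', 'scheduled_departure', 'departure_delay'],
    dtype={'origin_airport': 'category', 'destination_airport': 'category',
           'airline': 'category', 'year': 'int16', 'month': 'int8', 'day': 'int8',
           'scheduled_departure': 'int16', 'departure_delay': 'float32'}
)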


Check for data quality

We want to see if there are any outliers, absurd values or missing values. In order to do that, we run the following command:

# Compute statistics of columns
flights_df_raw.describe()
>>> flights_df_raw.describe()
            year         month           day  flight_number  scheduled_departure  departure_delay
count  5819079.0  5.819079e+06  5.819079e+06   5.819079e+06         5.819079e+06     5.732926e+06
mean      2015.0  6.524085e+00  1.570459e+01   2.173093e+03         1.329602e+03     9.370158e+00
std          0.0  3.405137e+00  8.783425e+00   1.757064e+03         4.837518e+02     3.708094e+01
min       2015.0  1.000000e+00  1.000000e+00   1.000000e+00         1.000000e+00    -8.200000e+01
25%       2015.0  4.000000e+00  8.000000e+00   7.300000e+02         9.170000e+02    -5.000000e+00
50%       2015.0  7.000000e+00  1.600000e+01   1.690000e+03         1.325000e+03    -2.000000e+00
75%       2015.0  9.000000e+00  2.300000e+01   3.230000e+03         1.730000e+03     7.000000e+00
max       2015.0  1.200000e+01  3.100000e+01   9.855000e+03         2.359000e+03     1.988000e+03

From the output above, we see that there are about 86k observations (~1.5%) with null values in the departure_delay column, and that the maximum departure delay of 1,988 minutes exceeds 24 hours. In the processing summary further below, we therefore drop rows with missing values and keep only flights with a delay of at most 24 hours.
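To pinpoint where the missing values come from, we can also count nulls per column directly; this quick check (not part of the main pipeline) can be run as follows:

# Count missing values per column
flights_df_raw.isnull().sum()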


Fix columns

In this part, we fix existing columns and add new ones that will be useful later on: the flight_number column is cast to a string, the year, month, day and scheduled_departure columns are combined into a single datetime column, and an is_delayed flag is added, equal to 1 when the departure delay is at least 15 minutes and 0 otherwise.
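The processing summary below relies on a convert_to_timedate helper, whose definition is not shown in this article. A minimal sketch of what it could look like, assuming scheduled_departure stores the departure time as an integer in HHMM format, is the following:

def convert_to_timedate(df):
    # Assemble a datetime column from year/month/day and the
    # HHMM-encoded scheduled departure time (assumed encoding)
    return pd.to_datetime(dict(
        year=df['year'],
        month=df['month'],
        day=df['day'],
        hour=df['scheduled_departure'] // 100,
        minute=df['scheduled_departure'] % 100
    ))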


Processing summary

The code performing the processing steps described above is as follows:

flights_df = (
    flights_df_raw
    # Remove rows containing null values
    .dropna()
    # Only flights from set of airports and with reasonable delay amount
    [lambda x:
         (x['origin_airport'].isin(['BOS', 'SFO', 'LAX', 'JFK'])) &
         (x['departure_delay'] <= 24*60)
    ]
    # Fix columns
    .assign(
        flight_number=lambda x: x['flight_number'].astype(str),
        scheduled_departure=lambda x: convert_to_timedate(x),
        is_delayed=lambda x: np.select(
            [x['departure_delay'] >= 15], [1],
            default=0
        )
    )
    # Remove redundant columns
    .drop(['year', 'month', 'day'], axis=1)
)
>>> flights_df.describe()
       departure_delay     is_delayed
count    585905.000000  585905.000000
mean         10.426021       0.199613
std          37.151542       0.399709
min         -38.000000       0.000000
25%          -5.000000       0.000000
50%          -1.000000       0.000000
75%           9.000000       0.000000
max        1429.000000       1.000000

From the summary output above, the dataset is now down to 585k observations. We also see that the missing value problem has been solved, and the is_delayed statistics show that roughly 20% of flights are delayed according to our definition (a departure delay of at least 15 minutes).
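As a final sanity check, we can quickly verify that the filters and new columns behave as expected (a sketch, not part of the original pipeline):

# Verify that no null values remain, that only the four airports are kept,
# and that the share of delayed flights is around 20%
flights_df.isnull().sum().sum()
flights_df['origin_airport'].unique()
flights_df['is_delayed'].mean()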


Joins, aggregations

In this part, we will explore the data from different angles using basic data frame manipulation techniques with the pandas library.

Get airport name

We want to obtain the airport name corresponding to the airport code attached to flights.

In order to do that, we note that the airports_df data frame maps each IATA airport code (iata_code) to its full name, so we can join flights_df with airports_df on the airport code.

We obtain the following command:

result = (
  flights_df
  .merge(airports_df, left_on='origin_airport', right_on='iata_code')
)
>>> result.head()
  origin_airport destination_airport airline flight_number scheduled_departure
0            JFK                 ABQ      B6            65 2015-10-01 20:00:00
1            JFK                 ABQ      B6            65 2015-10-02 20:00:00
2            JFK                 ABQ      B6            65 2015-10-03 20:00:00
3            JFK                 ABQ      B6            65 2015-10-04 20:00:00

   departure_delay  is_delayed iata_code                                            airport
0            114.0           1       JFK  John F. Kennedy International Airport (New Yor...
1             27.0           1       JFK  John F. Kennedy International Airport (New Yor...
2             -2.0           0       JFK  John F. Kennedy International Airport (New Yor...
3              6.0           0       JFK  John F. Kennedy International Airport (New Yor...
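Note that merge performs an inner join by default, so flights whose origin airport code has no match in airports_df would be dropped. If we wanted to keep them, we could make the join explicit with how='left':

result = (
  flights_df
  .merge(airports_df, how='left', left_on='origin_airport', right_on='iata_code')
)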

Number of flights per airport

We want to obtain the number of departing flights per airport across the year.

In order to do that, we note that counting departing flights amounts to grouping rows by origin_airport and counting the number of flight records in each group.

We obtain the following code:

result = (
  flights_df
  .groupby(['origin_airport'])
  .apply(lambda x: pd.Series({
      'nb_flights': len(x['flight_number'])
  }))
  .reset_index()
)
>>> result.head()
  origin_airport  nb_flights
0            BOS      115393
1            JFK      100205
2            LAX      210240
3            SFO      160067
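The apply/pd.Series pattern above is convenient when computing several custom statistics at once; for a simple count, the same result can also be obtained with a named aggregation, as sketched below:

# Equivalent count using a named aggregation
result = (
  flights_df
  .groupby('origin_airport')
  .agg(nb_flights=('flight_number', 'size'))
  .reset_index()
)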

Delayed flights for biggest airlines

Among the biggest airlines, where we define an airline's size as its number of yearly flights, we want to know which airlines have fewer delays than others.

In order to do that, we note that both the airline size and its delay statistics can be computed in a single aggregation by airline; we then keep the 10 biggest airlines and join with airlines_df to display their full names.

We obtain the following code:

result = (
  flights_df
  # Compute airline size and delay statistics
  .groupby(['airline'])
  .apply(lambda x: pd.Series({
      'nb_flights': len(x['flight_number']),
      'perc_delayed': round(np.average(x['is_delayed']), 4)
  }))
  .reset_index()
  # Keep biggest airlines
  .assign(rank_size=lambda x: x['nb_flights'].rank(method='first', ascending=False))
  [lambda x: x['rank_size'] <= 10]
  .drop('rank_size', axis=1)
  # Display airline name
  .merge(airlines_df, left_on='airline', right_on='iata_code')
  .reset_index(drop=True)
)
>>> result.head()
  airline  nb_flights  perc_delayed iata_code            airline_name
0      AA     85747.0        0.1499        AA  American Airlines Inc.
1      AS     16196.0        0.1208        AS    Alaska Airlines Inc.
2      B6     93854.0        0.2046        B6         JetBlue Airways
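Since the question is which of the biggest airlines have fewer delays, we can then simply order the result by the proportion of delayed flights:

# Order the biggest airlines from least to most delayed
result = result.sort_values('perc_delayed').reset_index(drop=True)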

Data frame transformations

In this part, we will draw more advanced insights using data frame transformation techniques and window functions from the pandas library.

Temporal evolution of delays by airport

We want to obtain the monthly evolution of delays by origin airport.

In order to do that, we note that the month can be extracted from the scheduled_departure column, that the proportion of delayed flights is the average of is_delayed, and that a pivot turns each origin airport into its own column.

We obtain the following code:

result = (
    flights_df
    # Create columns with the full month name and the month number
    .assign(
        month=lambda x: x['scheduled_departure'].dt.strftime('%B'),
        month_num=lambda x: x['scheduled_departure'].dt.strftime('%m').astype(int)
    )
    # Compute aggregation
    .groupby(['origin_airport', 'month', 'month_num'])
    .apply(lambda x: pd.Series({
        'perc_delayed': round(np.average(x['is_delayed']), 4)
    }))
    .reset_index()
    # Reshape data frame
    .pivot_table(
        columns='origin_airport',
        values='perc_delayed',
        index=['month', 'month_num'],
        aggfunc=np.sum
    )
    .rename_axis(None, axis=1)
    .reset_index()
    # Order by month
    .sort_values(['month_num'])
    .drop(['month_num'], axis=1)
    .reset_index(drop=True)
)
>>> result.head()
      month     BOS     JFK     LAX     SFO
0   January  0.1902  0.2257  0.1738  0.2001
1  February  0.3248  0.3174  0.1978  0.2222
2     March  0.1984  0.2736  0.2246  0.1770
3     April  0.1553  0.2020  0.1855  0.1756
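If we prefer not to carry the month_num helper column through the pipeline, an alternative is to turn month into an ordered categorical once the aggregation is done; a minimal sketch, assuming the English month names produced by strftime('%B'):

import calendar

# Sort months chronologically using an ordered categorical instead of month_num
result['month'] = pd.Categorical(
    result['month'], categories=list(calendar.month_name)[1:], ordered=True
)
result = result.sort_values('month').reset_index(drop=True)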

Month over month percent change in flights

We want to get the month over month percent change in flights.

In order to do that, we note that after counting flights per origin airport and month, the previous month's count can be retrieved with the shift window function applied within each airport group.

We obtain the following code:

result = (
    flights_df
    # Create columns with the full month name and the month number
    .assign(
        month=lambda x: x['scheduled_departure'].dt.strftime('%B'),
        month_num=lambda x: x['scheduled_departure'].dt.strftime('%m').astype(int)
    )
    # Compute aggregation
    .groupby(['origin_airport', 'month', 'month_num'])
    .apply(lambda x: pd.Series({
        'nb_flights': len(x['flight_number'])
    }))
    .reset_index()
    .sort_values(['origin_airport', 'month_num'])
    .reset_index(drop=True)
    # Get lagged values
    .assign(
        nb_flights_before=lambda x:
            x.groupby(['origin_airport'])['nb_flights'].shift(1),
        perc_change=lambda x:
            round((x['nb_flights'] - x['nb_flights_before']) / x['nb_flights_before'], 2)
    )
    # Last touches
    .drop(['month_num'], axis=1)
)
>>> result.head()
  origin_airport     month  nb_flights  nb_flights_before  perc_change
0            BOS   January        8201                NaN          NaN
1            BOS  February        7251             8201.0        -0.12
2            BOS     March        9744             7251.0         0.34
3            BOS     April        9940             9744.0         0.02

We note that by definition, there will not be any MoM change for the first month considered.
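Note that pandas also exposes a pct_change window function, which can replace the explicit shift and division; a sketch of the equivalent computation:

# Equivalent month over month change using the built-in pct_change window function
result['perc_change_alt'] = (
    result.groupby('origin_airport')['nb_flights'].pct_change().round(2)
)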


Most popular routes by origin airport

We are interested in knowing the top 3 routes per origin airport. We define top routes from a given airport as being routes with the highest number of flights departing from that airport.

In order to do that, we note that after counting flights per (origin, destination) pair, a rank computed within each origin airport lets us keep the top 3 destinations, which a pivot then spreads across columns.

We obtain the following code:

result = (
    flights_df
    # Compute volume per route
    .groupby(['origin_airport', 'destination_airport'])
    .apply(lambda x: pd.Series({
        'nb_flights': len(x['flight_number'])
    }))
    .reset_index()
    # Compute rank within each origin airport and keep top ones
    .assign(
        rank_popular=lambda x:
            x.groupby(['origin_airport'])['nb_flights'].rank(method='first', ascending=False)
    )
    [lambda x: x['rank_popular'] <= 3]
    .reset_index(drop=True)
    .drop(['nb_flights'], axis=1)
    .assign(rank_popular=lambda x: 'top_' + x['rank_popular'].astype(int).map(str))
    # Reformat results
    .pivot_table(
        columns='rank_popular',
        values='destination_airport',
        index='origin_airport',
        aggfunc=np.sum
    )
    .rename_axis(None, axis=1)
    .reset_index()
)
>>> result.head()
  origin_airport top_1 top_2 top_3
0            BOS   DCA   LGA   ORD
1            JFK   LAX   SFO   MCO
2            LAX   SFO   JFK   LAS
3            SFO   LAX   JFK   LAS
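As a quick sanity check, the top destinations out of a single airport can also be retrieved directly with value_counts; for instance, for flights departing from BOS:

# Cross-check: top 3 destinations for flights departing from BOS
(
    flights_df
    [flights_df['origin_airport'] == 'BOS']
    ['destination_airport']
    .value_counts()
    .head(3)
)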

Conclusion

Knowing how to manipulate data in a concise and effective way is crucial for anyone working with data. It is a necessary skill for visualizing data and building models on top of it.


You may also like...

Data manipulation with Python
  • Filtering
  • Types of joins
  • Aggregations, window functions
  • Data frame transformation
Data visualization with Python
  • Scatterplots, line plots, histograms
  • Boxplots, maps
  • Customized legend
Beautiful plots with Python
  • Detailed example on how to make beautiful plots with matplotlib, seaborn, basemap