Data science tools
Tutorial on how to manipulate data with Python
Tags: python, pandas, numpy, datetime, os
By Afshine Amidi and Shervine Amidi
Motivation
The Department of Transportation publicly released a dataset listing flights that occurred in 2015, along with specifics such as delays, flight time and other information.
This article aims to show good practices for manipulating data using Python's most popular libraries. The following are covered:
- cleaning data with pandas
- making specific changes with numpy
- handling date-related values with datetime
import os # File management
import pandas as pd # Data frame manipulation
import numpy as np # Data frame operations
import datetime as dt # Date operations
In the parts below, we will focus on drawing insights about flights departing from BOS, JFK, SFO and LAX.
Data preprocessing
Retrieving data
We retrieve the data frames from the respective csv files and we take a peek at data types by running the following commands:
# Retrieve data
flights_df_raw = pd.read_csv('flights.csv')
airports_df = pd.read_csv('airports.csv')
airlines_df = pd.read_csv('airlines.csv')
# View data types and first rows of data frame
flights_df_raw.info()
>>> flights_df_raw.info()
RangeIndex: 5819079 entries, 0 to 5819078
Data columns (total 9 columns):
origin_airport object
destination_airport object
year int64
month int64
day int64
airline object
flight_number int64
scheduled_departure int64
departure_delay float64
dtypes: float64(1), int64(5), object(3)
memory usage: 399.6+ MB
From the output above, we see that the data frame has 5.8M observations and 9 columns that describe when the flight happened, its origin and destination, any potential delays, as well as other miscellaneous indicators.
Check for data quality
We want to see if there are any outliers, absurd values or missing values. In order to do that, we run the following command:
# Compute statistics of columns
flights_df_raw.describe()
>>> flights_df_raw.describe()
year month day flight_number scheduled_departure departure_delay
count 5819079.0 5.819079e+06 5.819079e+06 5.819079e+06 5.819079e+06 5.732926e+06
mean 2015.0 6.524085e+00 1.570459e+01 2.173093e+03 1.329602e+03 9.370158e+00
std 0.0 3.405137e+00 8.783425e+00 1.757064e+03 4.837518e+02 3.708094e+01
min 2015.0 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 -8.200000e+01
25% 2015.0 4.000000e+00 8.000000e+00 7.300000e+02 9.170000e+02 -5.000000e+00
50% 2015.0 7.000000e+00 1.600000e+01 1.690000e+03 1.325000e+03 -2.000000e+00
75% 2015.0 9.000000e+00 2.300000e+01 3.230000e+03 1.730000e+03 7.000000e+00
max 2015.0 1.200000e+01 3.100000e+01 9.855000e+03 2.359000e+03 1.988000e+03
From the output above, we see that there are 86k observations (~1.5%) with null values in the departure_delay column. We perform the following actions:
- We remove rows with missing values with the dropna() method.
- We keep flights departing from the airports we want to look at with a [lambda x: ...] filter.
- Looking at the mean and median of departure_delay, we see that values are heavily right-skewed, with a maximum delay of 1988 minutes (~33 hours). We choose to filter out flights that are delayed by more than one day.
- We also convert flight_number from integers to character values with the astype() method, noting that these are IDs and have no ordered meaning.
Fix columns
In this part, we fix existing columns and add new ones that will be useful later on:
- We add the scheduled departure time in datetime format by combining existing columns with the assign() method:
def convert_to_timedate(x, col='scheduled_departure'):
    """Converts year/month/day and HHMM columns to a timestamp"""
    return pd.to_datetime(
        x['year'].astype(str) + '-' +
        x['month'].astype(str) + '-' +
        x['day'].astype(str) + ' ' +
        # Left-pad the HHMM value with zeros up to 4 characters
        x.assign(zero='0')['zero'] * (4 - x[col].astype(str).str.len()) +
        x[col].astype(str),
        format='%Y-%m-%d %H%M',
        errors='coerce'
    )
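As a quick sanity check, the helper can be exercised on a small hypothetical frame (the function is repeated here so the snippet is self-contained):

```python
import pandas as pd

def convert_to_timedate(x, col='scheduled_departure'):
    """Converts year/month/day and HHMM columns to a timestamp"""
    return pd.to_datetime(
        x['year'].astype(str) + '-' +
        x['month'].astype(str) + '-' +
        x['day'].astype(str) + ' ' +
        # Left-pad the HHMM value with zeros up to 4 characters
        x.assign(zero='0')['zero'] * (4 - x[col].astype(str).str.len()) +
        x[col].astype(str),
        format='%Y-%m-%d %H%M',
        errors='coerce'
    )

# Hypothetical two-row sample: an 8:30 AM and a 5:45 PM departure
sample = pd.DataFrame({
    'year': [2015, 2015],
    'month': [1, 10],
    'day': [5, 31],
    'scheduled_departure': [830, 1745],
})
print(convert_to_timedate(sample))
```

Note how the 830 value is padded to 0830 before parsing, so both rows yield valid timestamps.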
- We also want a binary answer to whether a flight was delayed or not. Using the departure_delay column, which gives the actual number of delay minutes, we consider any flight delayed by 15 minutes or more as delayed, and any other flight as not delayed. We add this flag using the np.select() command.
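As a standalone sketch of that flag logic, on hypothetical delay values:

```python
import numpy as np
import pandas as pd

# Hypothetical delay values in minutes
delays = pd.Series([-3, 0, 20, 114])

# np.select evaluates the condition list and picks the matching choice,
# falling back to the default for rows where no condition holds
is_delayed = np.select([delays >= 15], [1], default=0)
print(is_delayed)  # [0 0 1 1]
```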
Processing summary
The code to perform the processing steps described above is as follows:
flights_df = (
flights_df_raw
# All rows should not have any null value
.dropna()
# Only flights from set of airports and with reasonable delay amount
[lambda x:
(x['origin_airport'].isin(['BOS', 'SFO', 'LAX', 'JFK'])) &
(x['departure_delay'] <= 24*60)
]
# Fix columns
.assign(
flight_number=lambda x: x['flight_number'].astype(str),
scheduled_departure=lambda x: convert_to_timedate(x),
is_delayed=lambda x: np.select(
[x['departure_delay'] >= 15], [1],
default=0
)
)
# Remove redundant columns
.drop(['year', 'month', 'day'], axis=1)
)
>>> flights_df.describe()
departure_delay is_delayed
count 585905.000000 585905.000000
mean 10.426021 0.199613
std 37.151542 0.399709
min -38.000000 0.000000
25% -5.000000 0.000000
50% -1.000000 0.000000
75% 9.000000 0.000000
max 1429.000000 1.000000
From the summary output above, the dataset is now down to 585k observations. We also see that the missing value problem has been solved, and that the is_delayed statistics show that roughly 20% of flights are delayed according to our definition.
Joins and aggregations
In this part, we will explore the data from different angles using basic data frame manipulation techniques with the pandas library.
Get airport name
We want to obtain the airport name corresponding to the airport code attached to flights.
In order to do that, we note the following:
- The flights_df data frame is joined with airports_df using the merge() method.
- The keys on which the data frames are joined are specified in the left_on and right_on arguments. It is good to note that if those keys had the same name, it would have been possible to use the single argument on along with the name of that key.
result = (
flights_df
.merge(airports_df, left_on='origin_airport', right_on='iata_code')
)
>>> result.head()
origin_airport destination_airport airline flight_number scheduled_departure
0 JFK ABQ B6 65 2015-10-01 20:00:00
1 JFK ABQ B6 65 2015-10-02 20:00:00
2 JFK ABQ B6 65 2015-10-03 20:00:00
3 JFK ABQ B6 65 2015-10-04 20:00:00
departure_delay is_delayed iata_code airport
0 114.0 1 JFK John F. Kennedy International Airport (New Yor...
1 27.0 1 JFK John F. Kennedy International Airport (New Yor...
2 -2.0 0 JFK John F. Kennedy International Airport (New Yor...
3 6.0 0 JFK John F. Kennedy International Airport (New Yor...
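To illustrate the remark about shared key names, here is a minimal sketch with hypothetical toy frames, where the single on argument replaces left_on/right_on:

```python
import pandas as pd

# Hypothetical toy frames that share the key name 'iata_code'
flights = pd.DataFrame({
    'iata_code': ['JFK', 'BOS'],
    'departure_delay': [114.0, -2.0],
})
airports = pd.DataFrame({
    'iata_code': ['JFK', 'BOS'],
    'airport': ['John F. Kennedy International', 'Logan International'],
})

# Since both frames name the key identically, a single `on` suffices
result = flights.merge(airports, on='iata_code')
print(result)
```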
Number of flights per airport
We want to obtain the number of departing flights per airport across the year.
In order to do that, we note the following:
- The groupby() method groups the data by a given set of columns.
- The apply() method computes aggregations over the groups formed in the previous step. One example of an aggregation function is len(), which computes the number of rows within each group.
result = (
flights_df
.groupby(['origin_airport'])
.apply(lambda x: pd.Series({
'nb_flights': len(x['flight_number'])
}))
.reset_index()
)
>>> result.head()
origin_airport nb_flights
0 BOS 115393
1 JFK 100205
2 LAX 210240
3 SFO 160067
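As a side note, the same per-group row count can also be obtained more directly with size(); a minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with one row per flight
df = pd.DataFrame({'origin_airport': ['BOS', 'BOS', 'JFK']})

# size() counts rows per group without an explicit apply()
counts = df.groupby('origin_airport').size().reset_index(name='nb_flights')
print(counts)
```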
Delayed flights for biggest airlines
Among the biggest airlines, where we define airline size as the number of yearly flights, we want to know which airlines have fewer delays compared to others.
In order to do that, we note the following:
- The percentage of delayed flights is first computed with the 2-step aggregation process using the groupby() and apply() methods.
- The ranking of each airline by its number of flights is then computed with the rank(method='first') window function within the assign() method. It is good to note that the ranking is done across all airlines.
- The top 10 airlines with the highest volume of flights are kept using the [lambda x: ...] filter.
- The names of airlines associated with their IATA codes are then gathered using the merge() method with the airlines_df data frame.
result = (
flights_df
# Compute airline size and delay statistics
.groupby(['airline'])
.apply(lambda x: pd.Series({
'nb_flights': len(x['flight_number']),
'perc_delayed': round(np.average(x['is_delayed']), 4)
}))
.reset_index()
# Keep biggest airlines
.assign(rank_size=lambda x: x['nb_flights'].rank(method='first', ascending=False))
[lambda x: x['rank_size'] <= 10]
.drop('rank_size', axis=1)
# Display airline name
.merge(airlines_df, left_on='airline', right_on='iata_code')
.reset_index(drop=True)
)
>>> result.head()
airline nb_flights perc_delayed iata_code airline_name
0 AA 85747.0 0.1499 AA American Airlines Inc.
1 AS 16196.0 0.1208 AS Alaska Airlines Inc.
2 B6 93854.0 0.2046 B6 JetBlue Airways
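To make the ranking step concrete, rank(method='first') assigns distinct ranks even on ties, breaking them by order of appearance; a small sketch with hypothetical flight counts:

```python
import pandas as pd

# Hypothetical yearly flight counts, including a tie
nb_flights = pd.Series([300, 100, 300, 200])

# method='first' breaks the tie between the two 300s by position
ranks = nb_flights.rank(method='first', ascending=False)
print(ranks.tolist())  # [1.0, 4.0, 2.0, 3.0]
```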
Data frame transformations
In this part, we will draw more advanced insights using data frame transformation techniques and window functions from the pandas library.
Temporal evolution of delays by airport
We want to obtain the monthly evolution of delays by origin airport.
In order to do that, we note the following:
- A month column is created using the strftime() method along with the '%B' and '%m' arguments, which retrieve the full month name and the month number of a timestamp respectively. The numeric month_num column makes it possible to sort chronologically with the sort_values() method.
- The percentage of delayed flights is computed through the 2-step aggregation process using the groupby() and apply() methods.
- The data frame is reshaped from a long to a wide format using the pivot_table() method from the pandas library.
result = (
flights_df
# Create column specific full month name
.assign(
month=lambda x: x['scheduled_departure'].dt.strftime('%B'),
month_num=lambda x: x['scheduled_departure'].dt.strftime('%m').astype(int)
)
# Compute aggregation
.groupby(['origin_airport', 'month', 'month_num'])
.apply(lambda x: pd.Series({
'perc_delayed': round(np.average(x['is_delayed']), 4)
}))
.reset_index()
# Reshape data frame
.pivot_table(
columns='origin_airport',
values='perc_delayed',
index=['month', 'month_num'],
aggfunc=np.sum
)
.rename_axis(None, axis=1)
.reset_index()
# Order by month
.sort_values(['month_num'])
.drop(['month_num'], axis=1)
.reset_index(drop=True)
)
>>> result.head()
month BOS JFK LAX SFO
0 January 0.1902 0.2257 0.1738 0.2001
1 February 0.3248 0.3174 0.1978 0.2222
2 March 0.1984 0.2736 0.2246 0.1770
3 April 0.1553 0.2020 0.1855 0.1756
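The '%B' / '%m' distinction can be checked on a couple of hypothetical timestamps:

```python
import pandas as pd

# Hypothetical timestamps
ts = pd.Series(pd.to_datetime(['2015-02-14', '2015-11-03']))

# '%B' gives the full month name, '%m' the zero-padded month number
print(ts.dt.strftime('%B').tolist())  # ['February', 'November']
print(ts.dt.strftime('%m').tolist())  # ['02', '11']
```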
Month over month percent change in flights
We want to get the month over month percent change in flights.
In order to do that, we note the following:
- Similar to the previous section, a month column is created using the strftime() method along with the '%B' and '%m' arguments, and the numeric month_num column makes it possible to sort chronologically with the sort_values() method.
- The shift() window function is used along with the groupby() method to find, for a given origin airport and month, the number of flights that happened one month before.
result = (
flights_df
# Create column specific full month name
.assign(
month=lambda x: x['scheduled_departure'].dt.strftime('%B'),
month_num=lambda x: x['scheduled_departure'].dt.strftime('%m').astype(int)
)
# Compute aggregation
.groupby(['origin_airport', 'month', 'month_num'])
.apply(lambda x: pd.Series({
'nb_flights': len(x['flight_number'])
}))
.reset_index()
.sort_values(['origin_airport', 'month_num'])
.reset_index(drop=True)
# Get lagged values
.assign(
nb_flights_before=lambda x:
x.groupby(['origin_airport'])['nb_flights'].shift(1),
perc_change=lambda x:
round((x['nb_flights'] - x['nb_flights_before']) / x['nb_flights_before'], 2)
)
# Last touches
.drop(['month_num'], axis=1)
)
>>> result.head()
origin_airport month nb_flights nb_flights_before perc_change
0 BOS January 8201 NaN NaN
1 BOS February 7251 8201.0 -0.12
2 BOS March 9744 7251.0 0.34
3 BOS April 9940 9744.0 0.02
We note that by definition, there will not be any MoM change for the first month considered.
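The grouped shift() behavior, including the NaN on each group's first row, can be seen on a hypothetical frame:

```python
import pandas as pd

# Hypothetical monthly flight counts for two airports
df = pd.DataFrame({
    'origin_airport': ['BOS', 'BOS', 'JFK', 'JFK'],
    'nb_flights': [10, 12, 7, 9],
})

# shift(1) within each group pulls the previous month's value;
# each group's first row has no predecessor, hence NaN
df['nb_flights_before'] = df.groupby('origin_airport')['nb_flights'].shift(1)
print(df)
```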
Most popular routes by origin airport
We are interested in knowing the top 3 routes per origin airport. We define top routes from a given airport as being routes with the highest number of flights departing from that airport.
In order to do that, we note the following:
- The number of flights by route is first computed using the 2-step aggregation process with the groupby() and apply() methods.
- Then, the rank of each route within each group of origin airport is found by first grouping by origin_airport and then using the rank(method='first') window function.
- The top routes are then filtered using the [lambda x: ...] step.
- Finally, the data frame is reshaped from long to wide format using the pivot_table() method.
result = (
flights_df
# Compute volume per route
.groupby(['origin_airport', 'destination_airport'])
.apply(lambda x: pd.Series({
'nb_flights': len(x['flight_number'])
}))
.reset_index()
# Compute rank within each origin airport and keep top ones
.assign(
rank_popular=lambda x:
x.groupby(['origin_airport'])['nb_flights'].rank(method='first', ascending=False)
)
[lambda x: x['rank_popular'] <= 3]
.reset_index(drop=True)
.drop(['nb_flights'], axis=1)
.assign(rank_popular=lambda x: 'top_' + x['rank_popular'].astype(int).map(str))
# Reformat results
.pivot_table(
columns='rank_popular',
values='destination_airport',
index='origin_airport',
aggfunc=np.sum
)
.rename_axis(None, axis=1)
.reset_index()
)
>>> result.head()
origin_airport top_1 top_2 top_3
0 BOS DCA LGA ORD
1 JFK LAX SFO MCO
2 LAX SFO JFK LAS
3 SFO LAX JFK LAS
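The long-to-wide reshape can be sketched in isolation on a hypothetical frame; since each (origin, rank) cell holds exactly one destination, aggfunc='first' is used here in place of a sum:

```python
import pandas as pd

# Hypothetical long-format top routes
routes = pd.DataFrame({
    'origin_airport': ['BOS', 'BOS', 'JFK', 'JFK'],
    'rank_popular': ['top_1', 'top_2', 'top_1', 'top_2'],
    'destination_airport': ['DCA', 'LGA', 'LAX', 'SFO'],
})

# Each (origin, rank) pair is unique, so 'first' simply picks that value
wide = (
    routes
    .pivot_table(
        columns='rank_popular',
        values='destination_airport',
        index='origin_airport',
        aggfunc='first'
    )
    .rename_axis(None, axis=1)
    .reset_index()
)
print(wide)
```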
Conclusion
Knowing how to manipulate data in a concise and effective way is crucial for anyone working with data. It is a necessary skill for visualizing data and building models on top of it.