Want more content like this? Subscribe here to be notified of new releases!

Data science tools

Tutorial on how to retrieve data with SQL

Star

SQL presto hive

Motivation

The Department of Transportation publicly released a dataset that lists 5.8M flights that occurred in 2015, along with specificities such as delays, flight time and other information.

This article aims at showing good practices on how to retrieve data with SQL using practical examples on the data above. The following topics are covered:

operations on columns
most common joins
aggregations and window functions
tips to handle complex queries

Overview of the data

Quick peek

We want to get a sense of what the data looks like by displaying 4 random rows from the table.

In order to do that, we note the following:

Columns are selected from the table of interest using the SELECT ... FROM ... command.
The * sign is a shortcut to select all columns.
The number of results that are displayed are limited using the LIMIT command.

We obtain the following query:

SELECT *
FROM flights
LIMIT 4;

  origin_airport destination_airport airline  flight_number tail_number  scheduled_departure
0            ATL                 ABE      EV           4940      N851AS  2015-10-01 10:27:00
1            ATL                 ABE      EV           5197      N849AS  2015-10-01 14:44:00
2            ATL                 ABE      DL           2582      N993AT  2015-10-01 21:06:00
3            ATL                 ABE      EV           4940      N872AS  2015-10-02 10:27:00

   departure_delay
0             -3.0
1              0.0
2             -3.0
3             12.0

Summary statistics

We would like to know more about the key summary statistics of the data contained in the table by retrieving the number of distinct aircrafts, number of overall flights as well as a few statistics about flights departure delays.

In order to do that, we note the following:

The GROUP BY command is not used since aggregations are not being computed along a specific dimension but across the whole table.
COUNT(), AVG(), MAX(), and MIN() are aggregation functions that are useful here.

We obtain the following query:

SELECT
  COUNT(DISTINCT tail_number) AS nb_distinct_aircrafts,
  COUNT(flight_number) AS nb_flights,
  AVG(departure_delay) AS avg_departure_delay,
  MAX(departure_delay) AS max_departure_delay,
  MIN(departure_delay) AS min_departure_delay
FROM flights;

   nb_distinct_aircrafts  nb_flights  avg_departure_delay  max_departure_delay  min_departure_delay
0                   4769     1448914            10.289928               1496.0                -52.0

We note that the AS keyword was used to customize how column names appear in the query result.

Basic operations

In this part, we will explore the data by combining it with other sources and applying common transformations.

Get airline and airport names

We want to obtain the airport name corresponding to the airport code attached to flights.

In order to do that, we note the following:

The tables are joined using the LEFT JOIN command, which ensures that no row gets dropped from the left table in case the key is not present in the table that it is being joined with. If all keys are present in both tables, this is equivalent to an INNER JOIN.
From a practical standpoint, when manipulating columns coming from different tables, the full path of each column needs to be specified to avoid any ambiguity as to from which table each column is coming from. A given column col from a given table table is noted table.col

We obtain the following query:

SELECT
  f.origin_airport AS origin_airport,
  apt.airport AS origin_airport_name,
  f.airline AS airline,
  air.airline AS airline_name,
  f.flight_number AS flight_number,
  f.scheduled_departure AS scheduled_departure
FROM flights f
LEFT JOIN airports apt
  ON (f.origin_airport = apt.iata_code)
LEFT JOIN airlines air
  ON (f.airline = air.iata_code);

  origin_airport                               origin_airport_name airline
0            ATL  Hartsfield-Jackson Atlanta International Airport      EV
1            ATL  Hartsfield-Jackson Atlanta International Airport      EV
2            ATL  Hartsfield-Jackson Atlanta International Airport      DL
3            ATL  Hartsfield-Jackson Atlanta International Airport      EV

                  airline_name  flight_number  scheduled_departure
0  Atlantic Southeast Airlines           4940  2015-10-01 10:27:00
1  Atlantic Southeast Airlines           5197  2015-10-01 14:44:00
2         Delta Air Lines Inc.           2582  2015-10-01 21:06:00
3  Atlantic Southeast Airlines           4940  2015-10-02 10:27:00
...

We note that f, apt and air have been used as table abbreviations to shorten notations.

Number of flights per airport

We want to retrieve the number of departing flights per airport for the first quarter of 2015 and sort the results in descending order.

In order to do that, we note the following:

Flights are filtered for the time of interest using the WHERE command as well as the CAST(... AS DATE) function.
The data is grouped by columns of interest using the GROUP BY command.
The number of flights is computed using the COUNT() aggregation function.
The result is ordered by descreasing order using the ORDER BY and DESC commands.

We obtain the following query:

SELECT
  origin_airport,
  COUNT(*) AS nb_flights
FROM flights
WHERE CAST(scheduled_departure AS DATE) >= CAST('2015-01-01' AS DATE)
AND CAST(scheduled_departure AS DATE) < CAST('2015-04-01' AS DATE)
GROUP BY 1
ORDER BY 2 DESC;

  origin_airport  nb_flights
0            ATL       89632
1            DFW       67216
2            DEN       51341
3            LAX       51116
...

We have the following remarks:

The GROUP BY 1 and ORDER BY 2 commands are shortcuts to mean that the data has been grouped by the first column and ordered along the second column specified in the query, respectively.
Instead of CAST(), the TRY_CAST() function could have been used to be more robust regarding values that cannot be cast to date.

Temporal evolution of delays by airport

We want to obtain the monthly evolution of delays by airport.

In order to do that, we note the following:

The DATE_TRUNC('month', ...) command is used to extract the month.
The data is grouped by columns of interest using the GROUP BY command.
The AVG(CASE WHEN ... ELSE ... END) command is used to compute proportions across the two categories of interest -- here, whether or not the flight was delayed.

We obtain the following query:

SELECT
  origin_airport,
  DATE_TRUNC('month', CAST(scheduled_departure AS DATE)) AS month,
  AVG(CASE WHEN departure_delay > 15 THEN 1
           ELSE 0 END) AS perc_delayed
FROM flights
WHERE departure_delay IS NOT NULL
GROUP BY 1,2;

  origin_airport       month  perc_delayed
0            ATL  2015-01-01      0.155416
1            ATL  2015-02-01      0.192377
2            ATL  2015-03-01      0.173102
3            ATL  2015-04-01      0.152289
...

We note that the IS NULL / IS NOT NULL command specifies a condition on whether the column has missing values or not.

Advanced insights

In this part, we will be using more complex queries to draw advanced insights from the table.

Month over month percent change in flights

We want to get the month over month percent change in flights.

We note the following:

In order to compute the monthly percent change in flights, the monthly number of flights first need to be computed. In an effort to make the query more readable, common table expressions (CTE) are used with the keywords WITH ... AS ....
The LAG() window function is used to get the previous month value corresponding to each time period.

We obtain the following query:

WITH agg_flights AS (
SELECT
  origin_airport,
  DATE_TRUNC('month', scheduled_departure) AS month,
  COUNT(*) AS nb_flights
FROM flights
GROUP BY 1,2
),

change_flights AS (
SELECT
  origin_airport,
  month,
  nb_flights,
  LAG(nb_flights, 1)
    OVER(PARTITION BY origin_airport
         ORDER BY month ASC) AS nb_flights_before
FROM agg_flights
)

SELECT
  origin_airport,
  month,
  nb_flights,
  nb_flights_before,
  ROUND((1.0 * (nb_flights - nb_flights_before)) / (1.0 * (nb_flights_before)), 2)
              AS perc_change
FROM change_flights;

  origin_airport       month  nb_flights  nb_flights_before  perc_change
0            ATL  2015-01-01       29512                NaN          NaN
1            ATL  2015-02-01       27366            29512.0        -0.07
2            ATL  2015-03-01       32754            27366.0         0.20
3            ATL  2015-04-01       31370            32754.0        -0.04
...

We note that by definition, there will not be any MoM change for the first month considered.

Intensity of delays by airline

We want to determine what airlines intensity of delays are their delayed flights.

We note the following:

Flight delays are put in buckets using the CASE WHEN ... ELSE ... END command.
The SUM() window function is used to get the denominator useful to obtain the proportion of each bucket.

We obtain the following query:

WITH raw_count_delay AS (
SELECT
  airline,
  CASE WHEN departure_delay > 60*2 THEN 'big_delay'
       WHEN departure_delay > 60 THEN 'medium_delay'
       WHEN departure_delay > 15 THEN 'small_delay'
       ELSE 'no_delay' END AS delay_category,
  COUNT(*) AS nb_flights
FROM flights
WHERE departure_delay IS NOT NULL
GROUP BY 1,2
)

SELECT
  airline,
  delay_category,
  (1.0 * nb_flights) /
      (1.0 * SUM(nb_flights) OVER(PARTITION BY airline)) AS perc_category
FROM raw_count_delay;

  airline delay_category  perc_category
0      AA      big_delay       0.019080
1      AA   medium_delay       0.038262
2      AA       no_delay       0.818347
3      AA    small_delay       0.124311
...

We note that when computing a ratio, it is important to have it in the (1.0 * numerator) / (1.0 * denominator) format to ensure that the result is displayed as a float number.

Most popular routes by origin airport

We are interested in determining the top 3 routes departing from each airport.

We note the following:

The number of flights per route is computed with the COUNT() aggregate function on data that is grouped by route using the GROUP BY command.
The ROW_NUMBER() window function is used to get the ranking of each route with respect to the number of flights across each origin airport. Given that window functions can only be inside the SELECT clause, the use of CTEs to decompose the query is mandatory here.
The result is sorted by increasing order of rank using the ORDER BY command.

We obtain the following query:

WITH popular_routes AS (
SELECT
  origin_airport,
  destination_airport,
  COUNT(*) AS nb_flights
FROM flights
GROUP BY 1,2
),

ranked_routes AS (
SELECT
  origin_airport,
  destination_airport,
  ROW_NUMBER() OVER(PARTITION BY origin_airport ORDER BY nb_flights DESC) AS rank
FROM popular_routes
)

SELECT
  origin_airport,
  destination_airport,
  rank
FROM ranked_routes
WHERE rank <= 3
ORDER BY 1,3;

  origin_airport destination_airport  rank
0            ATL                 MCO     1
1            ATL                 LGA     2
2            ATL                 FLL     3
3            BOS                 DCA     1
4            BOS                 LGA     2
5            BOS                 ORD     3
...

Conclusion

SQL is an important language widely used in the industry to retrieve data from databases. One challenging part of wanting to get better at SQL but not having a database at disposal is that the practical aspect. Luckily, it is possible to use the pandasql package on Python and work with csv files with SQL statements.

Data science tools

Tutorial on how to retrieve data with SQL

Motivation

Overview of the data

Quick peek

Summary statistics

Basic operations

Get airline and airport names

Number of flights per airport

Temporal evolution of delays by airport

Advanced insights

Month over month percent change in flights

Intensity of delays by airline

Most popular routes by origin airport

Conclusion

You may also like...