Data science tools
Tutorial on how to retrieve data with SQL
SQL
presto
hive
By Afshine Amidi and Shervine Amidi
Motivation
The Department of Transportation publicly released a dataset that lists 5.8M flights that occurred in 2015, along with specificities such as delays, flight time and other information.
This article aims at showing good practices on how to retrieve data with SQL using practical examples on the data above. The following topics are covered:
- operations on columns
- most common joins
- aggregations and window functions
- tips to handle complex queries
Overview of the data
Quick peek
We want to get a sense of what the data looks like by displaying 4 random rows from the table.
In order to do that, we note the following:
- Columns are selected from the table of interest using the
SELECT ... FROM ...
command. - The
*
sign is a shortcut to select all columns. - The number of results that are displayed are limited using the
LIMIT
command.
SELECT *
FROM flights
LIMIT 4;
origin_airport destination_airport airline flight_number tail_number scheduled_departure
0 ATL ABE EV 4940 N851AS 2015-10-01 10:27:00
1 ATL ABE EV 5197 N849AS 2015-10-01 14:44:00
2 ATL ABE DL 2582 N993AT 2015-10-01 21:06:00
3 ATL ABE EV 4940 N872AS 2015-10-02 10:27:00
departure_delay
0 -3.0
1 0.0
2 -3.0
3 12.0
Summary statistics
We would like to know more about the key summary statistics of the data contained in the table by retrieving the number of distinct aircrafts, number of overall flights as well as a few statistics about flights departure delays.
In order to do that, we note the following:
- The
GROUP BY
command is not used since aggregations are not being computed along a specific dimension but across the whole table. COUNT()
,AVG()
,MAX()
, andMIN()
are aggregation functions that are useful here.
SELECT
COUNT(DISTINCT tail_number) AS nb_distinct_aircrafts,
COUNT(flight_number) AS nb_flights,
AVG(departure_delay) AS avg_departure_delay,
MAX(departure_delay) AS max_departure_delay,
MIN(departure_delay) AS min_departure_delay
FROM flights;
nb_distinct_aircrafts nb_flights avg_departure_delay max_departure_delay min_departure_delay
0 4769 1448914 10.289928 1496.0 -52.0
We note that the AS
keyword was used to customize how column names appear in the query result.
Basic operations
In this part, we will explore the data by combining it with other sources and applying common transformations.
Get airline and airport names
We want to obtain the airport name corresponding to the airport code attached to flights.
In order to do that, we note the following:
- The tables are joined using the
LEFT JOIN
command, which ensures that no row gets dropped from the left table in case the key is not present in the table that it is being joined with. If all keys are present in both tables, this is equivalent to anINNER JOIN
. - From a practical standpoint, when manipulating columns coming from different tables, the full path of each column needs to be specified to avoid any ambiguity as to from which table each column is coming from. A given column
col
from a given tabletable
is notedtable.col
SELECT
f.origin_airport AS origin_airport,
apt.airport AS origin_airport_name,
f.airline AS airline,
air.airline AS airline_name,
f.flight_number AS flight_number,
f.scheduled_departure AS scheduled_departure
FROM flights f
LEFT JOIN airports apt
ON (f.origin_airport = apt.iata_code)
LEFT JOIN airlines air
ON (f.airline = air.iata_code);
origin_airport origin_airport_name airline
0 ATL Hartsfield-Jackson Atlanta International Airport EV
1 ATL Hartsfield-Jackson Atlanta International Airport EV
2 ATL Hartsfield-Jackson Atlanta International Airport DL
3 ATL Hartsfield-Jackson Atlanta International Airport EV
airline_name flight_number scheduled_departure
0 Atlantic Southeast Airlines 4940 2015-10-01 10:27:00
1 Atlantic Southeast Airlines 5197 2015-10-01 14:44:00
2 Delta Air Lines Inc. 2582 2015-10-01 21:06:00
3 Atlantic Southeast Airlines 4940 2015-10-02 10:27:00
...
We note that f
, apt
and air
have been used as table abbreviations to shorten notations.
Number of flights per airport
We want to retrieve the number of departing flights per airport for the first quarter of 2015 and sort the results in descending order.
In order to do that, we note the following:
- Flights are filtered for the time of interest using the
WHERE
command as well as theCAST(... AS DATE)
function. - The data is grouped by columns of interest using the
GROUP BY
command. - The number of flights is computed using the
COUNT()
aggregation function. - The result is ordered by descreasing order using the
ORDER BY
andDESC
commands.
SELECT
origin_airport,
COUNT(*) AS nb_flights
FROM flights
WHERE CAST(scheduled_departure AS DATE) >= CAST('2015-01-01' AS DATE)
AND CAST(scheduled_departure AS DATE) < CAST('2015-04-01' AS DATE)
GROUP BY 1
ORDER BY 2 DESC;
origin_airport nb_flights
0 ATL 89632
1 DFW 67216
2 DEN 51341
3 LAX 51116
...
We have the following remarks:
- The
GROUP BY 1
andORDER BY 2
commands are shortcuts to mean that the data has been grouped by the first column and ordered along the second column specified in the query, respectively. - Instead of
CAST()
, theTRY_CAST()
function could have been used to be more robust regarding values that cannot be cast to date.
Temporal evolution of delays by airport
We want to obtain the monthly evolution of delays by airport.
In order to do that, we note the following:
- The
DATE_TRUNC('month', ...)
command is used to extract the month. - The data is grouped by columns of interest using the
GROUP BY
command. - The
AVG(CASE WHEN ... ELSE ... END)
command is used to compute proportions across the two categories of interest -- here, whether or not the flight was delayed.
SELECT
origin_airport,
DATE_TRUNC('month', CAST(scheduled_departure AS DATE)) AS month,
AVG(CASE WHEN departure_delay > 15 THEN 1
ELSE 0 END) AS perc_delayed
FROM flights
WHERE departure_delay IS NOT NULL
GROUP BY 1,2;
origin_airport month perc_delayed
0 ATL 2015-01-01 0.155416
1 ATL 2015-02-01 0.192377
2 ATL 2015-03-01 0.173102
3 ATL 2015-04-01 0.152289
...
We note that the IS NULL
/ IS NOT NULL
command specifies a condition on whether the column has missing values or not.
Advanced insights
In this part, we will be using more complex queries to draw advanced insights from the table.
Month over month percent change in flights
We want to get the month over month percent change in flights.
We note the following:
- In order to compute the monthly percent change in flights, the monthly number of flights first need to be computed. In an effort to make the query more readable, common table expressions (CTE) are used with the keywords
WITH ... AS ...
. - The
LAG()
window function is used to get the previous month value corresponding to each time period.
WITH agg_flights AS (
SELECT
origin_airport,
DATE_TRUNC('month', scheduled_departure) AS month,
COUNT(*) AS nb_flights
FROM flights
GROUP BY 1,2
),
change_flights AS (
SELECT
origin_airport,
month,
nb_flights,
LAG(nb_flights, 1)
OVER(PARTITION BY origin_airport
ORDER BY month ASC) AS nb_flights_before
FROM agg_flights
)
SELECT
origin_airport,
month,
nb_flights,
nb_flights_before,
ROUND((1.0 * (nb_flights - nb_flights_before)) / (1.0 * (nb_flights_before)), 2)
AS perc_change
FROM change_flights;
origin_airport month nb_flights nb_flights_before perc_change
0 ATL 2015-01-01 29512 NaN NaN
1 ATL 2015-02-01 27366 29512.0 -0.07
2 ATL 2015-03-01 32754 27366.0 0.20
3 ATL 2015-04-01 31370 32754.0 -0.04
...
We note that by definition, there will not be any MoM change for the first month considered.
Intensity of delays by airline
We want to determine what airlines intensity of delays are their delayed flights.
We note the following:
- Flight delays are put in buckets using the
CASE WHEN ... ELSE ... END
command. - The
SUM()
window function is used to get the denominator useful to obtain the proportion of each bucket.
WITH raw_count_delay AS (
SELECT
airline,
CASE WHEN departure_delay > 60*2 THEN 'big_delay'
WHEN departure_delay > 60 THEN 'medium_delay'
WHEN departure_delay > 15 THEN 'small_delay'
ELSE 'no_delay' END AS delay_category,
COUNT(*) AS nb_flights
FROM flights
WHERE departure_delay IS NOT NULL
GROUP BY 1,2
)
SELECT
airline,
delay_category,
(1.0 * nb_flights) /
(1.0 * SUM(nb_flights) OVER(PARTITION BY airline)) AS perc_category
FROM raw_count_delay;
airline delay_category perc_category
0 AA big_delay 0.019080
1 AA medium_delay 0.038262
2 AA no_delay 0.818347
3 AA small_delay 0.124311
...
We note that when computing a ratio, it is important to have it in the (1.0 * numerator) / (1.0 * denominator)
format to ensure that the result is displayed as a float number.
Most popular routes by origin airport
We are interested in determining the top 3 routes departing from each airport.
We note the following:
- The number of flights per route is computed with the
COUNT()
aggregate function on data that is grouped by route using theGROUP BY
command. - The
ROW_NUMBER()
window function is used to get the ranking of each route with respect to the number of flights across each origin airport. Given that window functions can only be inside theSELECT
clause, the use of CTEs to decompose the query is mandatory here. - The result is sorted by increasing order of rank using the
ORDER BY
command.
WITH popular_routes AS (
SELECT
origin_airport,
destination_airport,
COUNT(*) AS nb_flights
FROM flights
GROUP BY 1,2
),
ranked_routes AS (
SELECT
origin_airport,
destination_airport,
ROW_NUMBER() OVER(PARTITION BY origin_airport ORDER BY nb_flights DESC) AS rank
FROM popular_routes
)
SELECT
origin_airport,
destination_airport,
rank
FROM ranked_routes
WHERE rank <= 3
ORDER BY 1,3;
origin_airport destination_airport rank
0 ATL MCO 1
1 ATL LGA 2
2 ATL FLL 3
3 BOS DCA 1
4 BOS LGA 2
5 BOS ORD 3
...
Conclusion
SQL is an important language widely used in the industry to retrieve data from databases. One challenging part of wanting to get better at SQL but not having a database at disposal is that the practical aspect. Luckily, it is possible to use the pandasql
package on Python and work with csv files with SQL statements.