Banner

Data science tools

Tutorial on how to retrieve data with SQL

Star
SQL presto hive

By Afshine Amidi and Shervine Amidi

Motivation

The Department of Transportation publicly released a dataset that lists 5.8M flights that occurred in 2015, along with specificities such as delays, flight time and other information.

This article aims at showing good practices on how to retrieve data with SQL using practical examples on the data above. The following topics are covered:


Overview of the data

Quick peek

We want to get a sense of what the data looks like by displaying 4 random rows from the table.

In order to do that, we note the following:

We obtain the following query:

SELECT *
FROM flights
LIMIT 4;
  origin_airport destination_airport airline  flight_number tail_number  scheduled_departure
0            ATL                 ABE      EV           4940      N851AS  2015-10-01 10:27:00
1            ATL                 ABE      EV           5197      N849AS  2015-10-01 14:44:00
2            ATL                 ABE      DL           2582      N993AT  2015-10-01 21:06:00
3            ATL                 ABE      EV           4940      N872AS  2015-10-02 10:27:00

   departure_delay
0             -3.0
1              0.0
2             -3.0
3             12.0

Summary statistics

We would like to know more about the key summary statistics of the data contained in the table by retrieving the number of distinct aircrafts, number of overall flights as well as a few statistics about flights departure delays.

In order to do that, we note the following:

We obtain the following query:

SELECT
  COUNT(DISTINCT tail_number) AS nb_distinct_aircrafts,
  COUNT(flight_number) AS nb_flights,
  AVG(departure_delay) AS avg_departure_delay,
  MAX(departure_delay) AS max_departure_delay,
  MIN(departure_delay) AS min_departure_delay
FROM flights;
   nb_distinct_aircrafts  nb_flights  avg_departure_delay  max_departure_delay  min_departure_delay
0                   4769     1448914            10.289928               1496.0                -52.0

We note that the AS keyword was used to customize how column names appear in the query result.


Basic operations

In this part, we will explore the data by combining it with other sources and applying common transformations.

Get airline and airport names

We want to obtain the airport name corresponding to the airport code attached to flights.

In order to do that, we note the following:

We obtain the following query:

SELECT
  f.origin_airport AS origin_airport,
  apt.airport AS origin_airport_name,
  f.airline AS airline,
  air.airline AS airline_name,
  f.flight_number AS flight_number,
  f.scheduled_departure AS scheduled_departure
FROM flights f
LEFT JOIN airports apt
  ON (f.origin_airport = apt.iata_code)
LEFT JOIN airlines air
  ON (f.airline = air.iata_code);
  origin_airport                               origin_airport_name airline
0            ATL  Hartsfield-Jackson Atlanta International Airport      EV
1            ATL  Hartsfield-Jackson Atlanta International Airport      EV
2            ATL  Hartsfield-Jackson Atlanta International Airport      DL
3            ATL  Hartsfield-Jackson Atlanta International Airport      EV

                  airline_name  flight_number  scheduled_departure
0  Atlantic Southeast Airlines           4940  2015-10-01 10:27:00
1  Atlantic Southeast Airlines           5197  2015-10-01 14:44:00
2         Delta Air Lines Inc.           2582  2015-10-01 21:06:00
3  Atlantic Southeast Airlines           4940  2015-10-02 10:27:00
...

We note that f, apt and air have been used as table abbreviations to shorten notations.


Number of flights per airport

We want to retrieve the number of departing flights per airport for the first quarter of 2015 and sort the results in descending order.

In order to do that, we note the following:

We obtain the following query:

SELECT
  origin_airport,
  COUNT(*AS nb_flights
FROM flights
WHERE CAST(scheduled_departure AS DATE>= CAST('2015-01-01' AS DATE)
AND CAST(scheduled_departure AS DATE< CAST('2015-04-01' AS DATE)
GROUP BY 1
ORDER BY 2 DESC;
  origin_airport  nb_flights
0            ATL       89632
1            DFW       67216
2            DEN       51341
3            LAX       51116
...

We have the following remarks:


Temporal evolution of delays by airport

We want to obtain the monthly evolution of delays by airport.

In order to do that, we note the following:

We obtain the following query:

SELECT
  origin_airport,
  DATE_TRUNC('month'CAST(scheduled_departure AS DATE)) AS month,
  AVG(CASE WHEN departure_delay > 15 THEN 1
           ELSE 0 ENDAS perc_delayed
FROM flights
WHERE departure_delay IS NOT NULL
GROUP BY 1,2;
  origin_airport       month  perc_delayed
0            ATL  2015-01-01      0.155416
1            ATL  2015-02-01      0.192377
2            ATL  2015-03-01      0.173102
3            ATL  2015-04-01      0.152289
...

We note that the IS NULL / IS NOT NULL command specifies a condition on whether the column has missing values or not.


Advanced insights

In this part, we will be using more complex queries to draw advanced insights from the table.

Month over month percent change in flights

We want to get the month over month percent change in flights.

We note the following:

We obtain the following query:

WITH agg_flights AS (
SELECT
  origin_airport,
  DATE_TRUNC('month', scheduled_departure) AS month,
  COUNT(*AS nb_flights
FROM flights
GROUP BY 1,2
),

change_flights AS (
SELECT
  origin_airport,
  month,
  nb_flights,
  LAG(nb_flights, 1)
    OVER(PARTITION BY origin_airport
         ORDER BY month ASCAS nb_flights_before
FROM agg_flights
)

SELECT
  origin_airport,
  month,
  nb_flights,
  nb_flights_before,
  ROUND((1.0 * (nb_flights - nb_flights_before)) / (1.0 * (nb_flights_before)), 2)
              AS perc_change
FROM change_flights;
  origin_airport       month  nb_flights  nb_flights_before  perc_change
0            ATL  2015-01-01       29512                NaN          NaN
1            ATL  2015-02-01       27366            29512.0        -0.07
2            ATL  2015-03-01       32754            27366.0         0.20
3            ATL  2015-04-01       31370            32754.0        -0.04
...

We note that by definition, there will not be any MoM change for the first month considered.


Intensity of delays by airline

We want to determine what airlines intensity of delays are their delayed flights.

We note the following:

We obtain the following query:

WITH raw_count_delay AS (
SELECT
  airline,
  CASE WHEN departure_delay > 60*2 THEN 'big_delay'
       WHEN departure_delay > 60 THEN 'medium_delay'
       WHEN departure_delay > 15 THEN 'small_delay'
       ELSE 'no_delay' END AS delay_category,
  COUNT(*AS nb_flights
FROM flights
WHERE departure_delay IS NOT NULL
GROUP BY 1,2
)

SELECT
  airline,
  delay_category,
  (1.0 * nb_flights) /
      (1.0 * SUM(nb_flights) OVER(PARTITION BY airline)) AS perc_category
FROM raw_count_delay;
  airline delay_category  perc_category
0      AA      big_delay       0.019080
1      AA   medium_delay       0.038262
2      AA       no_delay       0.818347
3      AA    small_delay       0.124311
...

We note that when computing a ratio, it is important to have it in the (1.0 * numerator) / (1.0 * denominator) format to ensure that the result is displayed as a float number.


Most popular routes by origin airport

We are interested in determining the top 3 routes departing from each airport.

We note the following:

We obtain the following query:

WITH popular_routes AS (
SELECT
  origin_airport,
  destination_airport,
  COUNT(*AS nb_flights
FROM flights
GROUP BY 1,2
),

ranked_routes AS (
SELECT
  origin_airport,
  destination_airport,
  ROW_NUMBER() OVER(PARTITION BY origin_airport ORDER BY nb_flights DESCAS rank
FROM popular_routes
)

SELECT
  origin_airport,
  destination_airport,
  rank
FROM ranked_routes
WHERE rank <= 3
ORDER BY 1,3;
  origin_airport destination_airport  rank
0            ATL                 MCO     1
1            ATL                 LGA     2
2            ATL                 FLL     3
3            BOS                 DCA     1
4            BOS                 LGA     2
5            BOS                 ORD     3
...

Conclusion

SQL is an important language widely used in the industry to retrieve data from databases. One challenging part of wanting to get better at SQL but not having a database at disposal is that the practical aspect. Luckily, it is possible to use the pandasql package on Python and work with csv files with SQL statements.


You may also like...

Data retrieval with SQL
  • • Conditions and data types
  • • Types of joins
  • • Aggregations, window functions
  • • Table manipulation
Data manipulation with R
  • • Filtering
  • • Types of joins
  • • Aggregations, window functions
  • • Data frame transformation
Data manipulation with Python
  • • Filtering
  • • Types of joins
  • • Aggregations, window functions
  • • Data frame transformation