Search code examples
mysqlsqldatetimemysql-pythonkaggle

When applying 'WHERE' for timestamp directly and applying 'EXTRACT' showed different results (Codes Attached)


According to this Kaggle exercise on Task 5 (Write the query): https://www.kaggle.com/code/setthawutkulsrisuwan/exercise-as-with I answered 2 ways:

  1. Query with WHERE EXTRACT() to get year and month and the answer is INCORRECT.:
           WITH RelevantRides AS
           (
               SELECT EXTRACT(HOUR from trip_start_timestamp) as hour_of_day, trip_seconds, trip_miles
               FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
               WHERE EXTRACT(YEAR from trip_start_timestamp) = 2017 AND
                     EXTRACT(MONTH from trip_start_timestamp) BETWEEN 1 and 6 AND
                     trip_seconds > 0 AND
                     trip_miles > 0
           )
           SELECT hour_of_day,
                  COUNT(1) as num_trips,
                  3600 * SUM(trip_miles) / SUM(trip_seconds) as avg_mph
           FROM RelevantRides
           GROUP BY hour_of_day
           ORDER BY hour_of_day
  1. Query with the direct column name to get year and month and the answer is CORRECT.:
               WITH RelevantRides AS
           (
               SELECT EXTRACT(HOUR from trip_start_timestamp) AS hour_of_day, trip_seconds, trip_miles
               FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
               WHERE trip_start_timestamp > '2017-01-01' AND
                     trip_start_timestamp < '2017-07-01' AND
                     trip_seconds > 0 AND
                     trip_miles > 0
           )
           SELECT hour_of_day,
                  COUNT(1) as num_trips,
                  3600 * SUM(trip_miles) / SUM(trip_seconds) as avg_mph
           FROM RelevantRides
           GROUP BY hour_of_day
           ORDER BY hour_of_day

The key differences are that first one is

WHERE EXTRACT(YEAR from trip_start_timestamp) = 2017 
AND EXTRACT(MONTH from trip_start_timestamp) BETWEEN 1 and 6

, and the second is

WHERE trip_start_timestamp > '2017-01-01' AND
trip_start_timestamp < '2017-07-01' 

.

In my opinion, they should result the same as querying with EXTRACT() shows year of 2017 and month of 1 to 6 as same as querying with the direct column name; however, results aren't the same.

Please explain the reasons behind those. Thank you.


Solution

  • You're comparing constant dates to timestamps. Constant dates are actually timestamps looking like 2022-04-07 00:00:00.

    So when you want to get all records in a date range January to June you need:

    WHERE trip_start_timestamp >= '2017-01-01' 
      AND trip_start_timestamp <  '2017-07-01'  
    

    In other words you want everything on or after midnight on the first day of the range and everything up to but not including midnight on the day after the last day. In mathematical notation you want the dates in the range [2017-01-01, 2017-07-01). The beginning of the range is closed and the end is open.

    Your code like this gives a correct result.

    WHERE EXTRACT(YEAR from trip_start_timestamp) = 2017 
    AND EXTRACT(MONTH from trip_start_timestamp) BETWEEN 1 and 6
    

    But it can't exploit an index on your trip_start_timestamp column, so it won't be efficient in production.