Search code examples
sqlprestotrino

The Average Number of Rides Completed in 4 Hours


I have a dataset with each ride having its own ride_id and its completion time. I want to know how many rides happen every 4 hours, on average.

Sample Dataset:

dropoff_datetime    ride_id
2022-08-27 11:42:02 1715
2022-08-24 05:59:26 1713
2022-08-23 17:40:05 1716
2022-08-28 23:06:01 1715
2022-08-27 03:21:29 1714

For example, I would like to find out between 2022-8-27 12 PM to 2022-8-27 4 PM how many rides happened that time? And then again from 2022-8-27 4 PM to 2022-8-27 8 PM how many rides happened in that 4 hour period?

What I've tried:

  1. I first truncate my dropoff_datetime into the hour. (DATE_TRUNC)
  2. I then group by that hour to get the count of rides per hour.

Example Query: Note: calling the above table - final.

SELECT  DATE_TRUNC('hour', dropoff_datetime) as by_hour 
,count(ride_id) as total_rides 
FROM final 
WHERE 1=1 
GROUP BY 1 

Result:

by_hour total_rides
2022-08-27 4:00:00  3756
2022-08-27 5:00:00  6710

My question is: How can I make it so it's grouping every 4 hours instead?


Solution

  • The question actually consists of two parts - how to generate date range and how to calculate the data. One possible approach is to use minimum and maximum dates in the data to generate range and then join with data again:

    -- sample data
    with dataset (dropoff_datetime, ride_id) AS
             (VALUES (timestamp  '2022-08-24 11:42:02', 1715),
                     (timestamp  '2022-08-24 05:59:26', 1713),
                     (timestamp  '2022-08-24 05:29:26', 1712),
                     (timestamp  '2022-08-23 17:40:05', 1716)),
    -- query part
    min_max as (
        select min(date_trunc('hour', dropoff_datetime)) d_min, max(date_trunc('hour', dropoff_datetime)) d_max
        from dataset
    ),
    date_ranges as (
        select h
        from min_max,
         unnest (sequence(d_min, d_max, interval '4' hour)) t(h)
    )
    
    select h, count_if(ride_id is not null)
    from date_ranges
    left join dataset on dropoff_datetime between h and h + interval '4' hour
    group by h
    order by h;
    

    Which will produce the next output:

    h _col1
    2022-08-23 17:00:00 1
    2022-08-23 21:00:00 0
    2022-08-24 01:00:00 0
    2022-08-24 05:00:00 2
    2022-08-24 09:00:00 1

    Note that this can be quite performance intensive for big amount of data.

    Another approach is to get some "reference point" and start counting from it. For example using minimum data in the dataset:

    -- sample data
    with dataset (dropoff_datetime, ride_id) AS
             (VALUES (timestamp '2022-08-27 11:42:02', 1715),
                     (timestamp '2022-08-24 05:59:26', 1713),
                     (timestamp '2022-08-24 05:29:26', 1712),
                     (timestamp '2022-08-23 17:40:05', 1716),
                     (timestamp '2022-08-28 23:06:01', 1715),
                     (timestamp '2022-08-27 03:21:29', 1714)),
    -- query part
    base_with_curr AS (
        select (select min(date_trunc('hour', dropoff_datetime)) from dataset) base,
            date_trunc('hour', dropoff_datetime) dropoff_datetime
        from dataset)
    
    select date_add('hour', (date_diff('hour', base, dropoff_datetime) / 4)*4, base) as four_hour,
           count(*)
    from base_with_curr
    group by 1;
    

    Output:

    four_hour _col1
    2022-08-23 17:00:00 1
    2022-08-28 21:00:00 1
    2022-08-24 05:00:00 2
    2022-08-27 09:00:00 1
    2022-08-27 01:00:00 1

    Then you can use sequence approach to generate missing dates if needed.