I have a dataset with each ride having its own ride_id and its completion time. I want to know how many rides happen every 4 hours, on average.
Sample Dataset:
dropoff_datetime ride_id
2022-08-27 11:42:02 1715
2022-08-24 05:59:26 1713
2022-08-23 17:40:05 1716
2022-08-28 23:06:01 1715
2022-08-27 03:21:29 1714
For example, I would like to find out between 2022-8-27 12 PM to 2022-8-27 4 PM how many rides happened that time? And then again from 2022-8-27 4 PM to 2022-8-27 8 PM how many rides happened in that 4 hour period?
What I've tried:
Example Query: Note: calling the above table - final.
SELECT DATE_TRUNC('hour', dropoff_datetime) as by_hour
,count(ride_id) as total_rides
FROM final
WHERE 1=1
GROUP BY 1
Result:
by_hour total_rides
2022-08-27 4:00:00 3756
2022-08-27 5:00:00 6710
My question is: How can I make it so it's grouping every 4 hours instead?
The question actually consists of two parts - how to generate date range and how to calculate the data. One possible approach is to use minimum and maximum dates in the data to generate range and then join with data again:
-- sample data
with dataset (dropoff_datetime, ride_id) AS
(VALUES (timestamp '2022-08-24 11:42:02', 1715),
(timestamp '2022-08-24 05:59:26', 1713),
(timestamp '2022-08-24 05:29:26', 1712),
(timestamp '2022-08-23 17:40:05', 1716)),
-- query part
min_max as (
select min(date_trunc('hour', dropoff_datetime)) d_min, max(date_trunc('hour', dropoff_datetime)) d_max
from dataset
),
date_ranges as (
select h
from min_max,
unnest (sequence(d_min, d_max, interval '4' hour)) t(h)
)
select h, count_if(ride_id is not null)
from date_ranges
left join dataset on dropoff_datetime between h and h + interval '4' hour
group by h
order by h;
Which will produce the next output:
h | _col1 |
---|---|
2022-08-23 17:00:00 | 1 |
2022-08-23 21:00:00 | 0 |
2022-08-24 01:00:00 | 0 |
2022-08-24 05:00:00 | 2 |
2022-08-24 09:00:00 | 1 |
Note that this can be quite performance intensive for big amount of data.
Another approach is to get some "reference point" and start counting from it. For example using minimum data in the dataset:
-- sample data
with dataset (dropoff_datetime, ride_id) AS
(VALUES (timestamp '2022-08-27 11:42:02', 1715),
(timestamp '2022-08-24 05:59:26', 1713),
(timestamp '2022-08-24 05:29:26', 1712),
(timestamp '2022-08-23 17:40:05', 1716),
(timestamp '2022-08-28 23:06:01', 1715),
(timestamp '2022-08-27 03:21:29', 1714)),
-- query part
base_with_curr AS (
select (select min(date_trunc('hour', dropoff_datetime)) from dataset) base,
date_trunc('hour', dropoff_datetime) dropoff_datetime
from dataset)
select date_add('hour', (date_diff('hour', base, dropoff_datetime) / 4)*4, base) as four_hour,
count(*)
from base_with_curr
group by 1;
Output:
four_hour | _col1 |
---|---|
2022-08-23 17:00:00 | 1 |
2022-08-28 21:00:00 | 1 |
2022-08-24 05:00:00 | 2 |
2022-08-27 09:00:00 | 1 |
2022-08-27 01:00:00 | 1 |
Then you can use sequence approach to generate missing dates if needed.