I'm trying to calculate velocity aggregations over different time windows for ids in a sql table. I have a table:
TABLE transactions (
transaction_id int,
customer_id int,
date Timestamp,
should_aggregate boolean,
)
and i'm trying to create a table with velocity features at the time of the transaction
TABLE transaction_velocity (
transaction_id int,
customer_id int,
date Timestamp,
previous_1hour int,
previous_1day int,
previous_1week int
)
For example if we had:
transaction_id | customer_id | date | should_aggregate |
---|---|---|---|
1 | 1 | 2023-01-10 01:00:00 +0000 | true |
2 | 1 | 2023-01-10 00:55:00 +0000 | false |
3 | 1 | 2023-01-09 00:57:00 +0000 | true |
4 | 1 | 2023-01-07 00:57:00 +0000 | false |
5 | 2 | 2023-01-10 00:57:00 +0000 | true |
The resulting table would be:
transaction_id | customer_id | date | previous_1hour | previous_1day | previous_1week |
---|---|---|---|---|---|
1 | 1 | 2023-01-10 01:00:00 +0000 | 1 | 1 | 3 |
3 | 1 | 2023-01-09 00:57:00 +0000 | 0 | 0 | 1 |
5 | 2 | 2023-01-10 00:57:00 +0000 | 0 | 0 | 0 |
Essentially each row is aggregated on past transactions per customer over 3 time windows if it's marked for aggregation. I've considered using partition windowing, but can't figure out how to do that over timestamp.
As you indicated, you might be able to do this with some window functions, however I think I might just recommend a self join:
SELECT
t1.transaction_id
, t1.customer_id
, t1.date
, SUM(CASE WHEN t2.date > DATE_SUB(t1.date, INTERVAL 1 HOUR) THEN 1 ELSE 0 END) previous_1hour
, SUM(CASE WHEN t2.date > DATE_SUB(t1.date, INTERVAL 1 DAY) THEN 1 ELSE 0 END) previous_1day
, SUM(CASE WHEN t2.date > DATE_SUB(t1.date, INTERVAL 1 WEEK) THEN 1 ELSE 0 END) previous_1week
FROM transactions t1 LEFT OUTER JOIN transactions t2
ON t1.customer_id = t2.customer_id AND t2.date < t1.date
WHERE t1.should_aggregate
GROUP BY t1.transaction_id, t1.customer_id, t1.date
This simply joins the records with the same customer_id
and a smaller date
, and the sum
s in the select
count them if the date
is large enough to be in the interval.
If someone has a simple way of doing this with window functions, sure, use that, but I think the above is pretty enough and easy to follow. (I just can't think of a good way of doing this with window functions, I'm not recommending against that.)
An alternative might be correlated subqueries in the select
:
SELECT
transaction_id
, customer_id
, date
, (SELECT count(*) FROM transactions t2 WHERE t2.customer_id = t1.customer_id AND t2.date > DATE_SUB(t1.date, INTERVAL 1 HOUR) AND t2.date < t1.date) previous_1hour
, (SELECT count(*) FROM transactions t2 WHERE t2.customer_id = t1.customer_id AND t2.date > DATE_SUB(t1.date, INTERVAL 1 DAY) AND t2.date < t1.date) previous_1day
, (SELECT count(*) FROM transactions t2 WHERE t2.customer_id = t1.customer_id AND t2.date > DATE_SUB(t1.date, INTERVAL 1 WEEK) AND t2.date < t1.date) previous_1week
FROM transactions t1
WHERE should_aggregate
You can see both of these options generating your expected output in this Fiddle.