Search code examples
mysqlsqltrino

Aggregations of transactions between two timestamps by an ID in SQL


I'm trying to calculate velocity aggregations over different time windows for ids in a sql table. I have a table:

TABLE transactions (
  transaction_id int,
  customer_id int,
  date Timestamp,
  should_aggregate boolean,
)

and i'm trying to create a table with velocity features at the time of the transaction

TABLE transaction_velocity (
  transaction_id int,
  customer_id int,
  date Timestamp,
  previous_1hour int,
  previous_1day  int,
  previous_1week int
)

For example if we had:

transaction_id customer_id date should_aggregate
1 1 2023-01-10 01:00:00 +0000 true
2 1 2023-01-10 00:55:00 +0000 false
3 1 2023-01-09 00:57:00 +0000 true
4 1 2023-01-07 00:57:00 +0000 false
5 2 2023-01-10 00:57:00 +0000 true

The resulting table would be:

transaction_id customer_id date previous_1hour previous_1day previous_1week
1 1 2023-01-10 01:00:00 +0000 1 1 3
3 1 2023-01-09 00:57:00 +0000 0 0 1
5 2 2023-01-10 00:57:00 +0000 0 0 0

Essentially each row is aggregated on past transactions per customer over 3 time windows if it's marked for aggregation. I've considered using partition windowing, but can't figure out how to do that over timestamp.


Solution

  • As you indicated, you might be able to do this with some window functions, however I think I might just recommend a self join:

    SELECT
     t1.transaction_id
     , t1.customer_id
     , t1.date
     , SUM(CASE WHEN t2.date > DATE_SUB(t1.date, INTERVAL 1 HOUR) THEN 1 ELSE 0 END) previous_1hour
     , SUM(CASE WHEN t2.date > DATE_SUB(t1.date, INTERVAL 1 DAY) THEN 1 ELSE 0 END) previous_1day
     , SUM(CASE WHEN t2.date > DATE_SUB(t1.date, INTERVAL 1 WEEK) THEN 1 ELSE 0 END) previous_1week
    FROM transactions t1 LEFT OUTER JOIN transactions t2
      ON t1.customer_id = t2.customer_id AND t2.date < t1.date
    WHERE t1.should_aggregate
    GROUP BY t1.transaction_id, t1.customer_id, t1.date
    

    This simply joins the records with the same customer_id and a smaller date, and the sums in the select count them if the date is large enough to be in the interval.

    If someone has a simple way of doing this with window functions, sure, use that, but I think the above is pretty enough and easy to follow. (I just can't think of a good way of doing this with window functions, I'm not recommending against that.)

    An alternative might be correlated subqueries in the select:

    SELECT
      transaction_id
      , customer_id
      , date
      , (SELECT count(*) FROM transactions t2 WHERE t2.customer_id = t1.customer_id AND t2.date > DATE_SUB(t1.date, INTERVAL 1 HOUR) AND t2.date < t1.date) previous_1hour
      , (SELECT count(*) FROM transactions t2 WHERE t2.customer_id = t1.customer_id AND t2.date > DATE_SUB(t1.date, INTERVAL 1 DAY) AND t2.date < t1.date) previous_1day
      , (SELECT count(*) FROM transactions t2 WHERE t2.customer_id = t1.customer_id AND t2.date > DATE_SUB(t1.date, INTERVAL 1 WEEK) AND t2.date < t1.date) previous_1week
    FROM transactions t1
    WHERE should_aggregate
    

    You can see both of these options generating your expected output in this Fiddle.