Imagine I have a table with events in it:
id user date amount
1 1 29.10.2019 16:35:01 10
2 1 29.10.2019 16:35:29 15
3 2 29.10.2019 16:48:29 12
4 2 29.10.2019 16:55:44 14
I would like to see all events like 1 and 2 (there's less than 2 minutes between them; they're from the same user) ordered by user and date.
What I've already tried:
SELECT *
FROM (
SELECT id, user, amount,
datediff(MINUTE, lag(date) OVER (ORDER BY user, d_date), date)
AS since_past_one
FROM events
) e
where since_past_one <> 0
and since_past_one <= 2
order by user, date
Problem is that lag takes value even if it is already a different user's event.
What I would like to see as a result is this:
id user date amount
1 1 29.10.2019 16:35:01 10
2 1 29.10.2019 16:35:29 15
Then let's have a few more rows and such patterns - I put an empty line every time we have a gap of over 2 seconds :
WITH input(id,usr,ts,amount) AS (
SELECT 1,1,TIMESTAMP '2019-10-29 16:35:00.0',10.00
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:01.5',10.26
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:03.0',10.52
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:04.5',10.78
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:06.0',11.03
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:07.5',11.29
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:09.0',11.55
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:10.5',11.81
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:18.0',13.10
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:19.5',13.36
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:21.0',13.62
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:27.0',14.66
UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:28.5',14.91
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:00.0',12.00
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:01.5',12.10
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:03.0',12.21
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:04.5',12.31
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:12.0',12.83
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:13.5',12.93
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:15.0',13.03
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:16.5',13.14
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:24.0',13.66
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:25.5',13.76
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:27.0',13.86
UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:28.5',13.97
)
SELECT * FROM input;
Gives me:
id | usr | ts | amount
----+-----+-----------------------+--------
1 | 1 | 2019-10-29 16:35:00 | 10.00
1 | 1 | 2019-10-29 16:35:01.5 | 10.26
1 | 1 | 2019-10-29 16:35:03 | 10.52
1 | 1 | 2019-10-29 16:35:04.5 | 10.78
1 | 1 | 2019-10-29 16:35:06 | 11.03
1 | 1 | 2019-10-29 16:35:07.5 | 11.29
1 | 1 | 2019-10-29 16:35:09 | 11.55
1 | 1 | 2019-10-29 16:35:10.5 | 11.81
1 | 1 | 2019-10-29 16:35:18 | 13.10
1 | 1 | 2019-10-29 16:35:19.5 | 13.36
1 | 1 | 2019-10-29 16:35:21 | 13.62
1 | 1 | 2019-10-29 16:35:27 | 14.66
1 | 1 | 2019-10-29 16:35:28.5 | 14.91
3 | 2 | 2019-10-29 16:35:00 | 12.00
3 | 2 | 2019-10-29 16:35:01.5 | 12.10
3 | 2 | 2019-10-29 16:35:03 | 12.21
3 | 2 | 2019-10-29 16:35:04.5 | 12.31
3 | 2 | 2019-10-29 16:35:12 | 12.83
3 | 2 | 2019-10-29 16:35:13.5 | 12.93
3 | 2 | 2019-10-29 16:35:15 | 13.03
3 | 2 | 2019-10-29 16:35:16.5 | 13.14
3 | 2 | 2019-10-29 16:35:24 | 13.66
3 | 2 | 2019-10-29 16:35:25.5 | 13.76
3 | 2 | 2019-10-29 16:35:27 | 13.86
3 | 2 | 2019-10-29 16:35:28.5 | 13.97
So we want the begin row of each group of rows that are not further apart in time than 2 seconds, for the same user. Vertica can identify such groups. The process is generally called "sessionization". We have the Vertica OLAP function CONDITIONAL_TRUE_EVENT() that does that for us: it starts with 0 for each PARTITION and increments by 1 every time the Boolean expression in the parentheses is true.
SELECT
CONDITIONAL_TRUE_EVENT(ts - LAG(ts) > INTERVAL '2000 msec') OVER(
PARTITION BY usr ORDER BY ts
) AS session_id
, *
FROM input
gives us:
session_id | id | usr | ts | amount
------------+----+-----+-----------------------+--------
0 | 1 | 1 | 2019-10-29 16:35:00 | 10.00
0 | 1 | 1 | 2019-10-29 16:35:01.5 | 10.26
0 | 1 | 1 | 2019-10-29 16:35:03 | 10.52
0 | 1 | 1 | 2019-10-29 16:35:04.5 | 10.78
0 | 1 | 1 | 2019-10-29 16:35:06 | 11.03
0 | 1 | 1 | 2019-10-29 16:35:07.5 | 11.29
0 | 1 | 1 | 2019-10-29 16:35:09 | 11.55
0 | 1 | 1 | 2019-10-29 16:35:10.5 | 11.81
1 | 1 | 1 | 2019-10-29 16:35:18 | 13.10
1 | 1 | 1 | 2019-10-29 16:35:19.5 | 13.36
1 | 1 | 1 | 2019-10-29 16:35:21 | 13.62
2 | 1 | 1 | 2019-10-29 16:35:27 | 14.66
2 | 1 | 1 | 2019-10-29 16:35:28.5 | 14.91
0 | 3 | 2 | 2019-10-29 16:35:00 | 12.00
0 | 3 | 2 | 2019-10-29 16:35:01.5 | 12.10
0 | 3 | 2 | 2019-10-29 16:35:03 | 12.21
0 | 3 | 2 | 2019-10-29 16:35:04.5 | 12.31
1 | 3 | 2 | 2019-10-29 16:35:12 | 12.83
1 | 3 | 2 | 2019-10-29 16:35:13.5 | 12.93
1 | 3 | 2 | 2019-10-29 16:35:15 | 13.03
1 | 3 | 2 | 2019-10-29 16:35:16.5 | 13.14
2 | 3 | 2 | 2019-10-29 16:35:24 | 13.66
2 | 3 | 2 | 2019-10-29 16:35:25.5 | 13.76
2 | 3 | 2 | 2019-10-29 16:35:27 | 13.86
2 | 3 | 2 | 2019-10-29 16:35:28.5 | 13.97
And, to get the first row of each group, we use the Vertica specific analytic LIMIT clause :
WITH
with_sess_id AS (
SELECT
CONDITIONAL_TRUE_EVENT(ts - LAG(ts) > INTERVAL '2000 msec') OVER(
PARTITION BY usr ORDER BY ts
) AS session_id
, *
FROM input
)
SELECT
id
, usr
, ts
, amount
FROM with_sess_id
LIMIT 1 OVER(PARTITION BY usr,session_id ORDER BY ts);
You get:
id | usr | ts | amount
----+-----+---------------------+--------
1 | 1 | 2019-10-29 16:35:00 | 10.00
1 | 1 | 2019-10-29 16:35:18 | 13.10
1 | 1 | 2019-10-29 16:35:27 | 14.66
3 | 2 | 2019-10-29 16:35:00 | 12.00
3 | 2 | 2019-10-29 16:35:12 | 12.83
3 | 2 | 2019-10-29 16:35:24 | 13.66
If I understood well your last question, you would like to have the average number of rows per session as we have determined above, and the average of the average amount per session as we have determined above. That would then be, if I understood your question right:
WITH
with_sess_id AS (
SELECT
CONDITIONAL_TRUE_EVENT(ts - LAG(ts) > INTERVAL '2000 msec') OVER(
PARTITION BY usr ORDER BY ts
) AS session_id
, *
FROM input
)
,
session_summary AS (
SELECT
usr
, session_id
, COUNT(*) AS rows_per_session
, AVG(amount) AS avg_amt_per_session
FROM with_sess_id
GROUP BY 1,2
-- this returns:
-- usr | session_id | rows_per_session | avg_amt_per_session
-- -----+------------+------------------+---------------------
-- 1 | 0 | 8 | 10.905
-- 1 | 1 | 3 | 13.36
-- 1 | 2 | 2 | 14.785
-- 2 | 0 | 4 | 12.155
-- 2 | 1 | 4 | 12.9825
-- 2 | 2 | 4 | 13.8125
)
SELECT
AVG(rows_per_session) AS avg_rows_per_session
, AVG(avg_amt_per_session) AS avg_avg_amount_per_session
FROM session_summary;
avg_rows_per_session | avg_avg_amount_per_session
----------------------+----------------------------
4.16666666666667 | 13