Search code examples
sqlvertica

How to show events with 2 or less minutes between them from the same user?


Imagine I have a table with events in it:

id user date                 amount
1  1    29.10.2019 16:35:01  10
2  1    29.10.2019 16:35:29  15
3  2    29.10.2019 16:48:29  12
4  2    29.10.2019 16:55:44  14

I would like to see all events like 1 and 2 (there's less than 2 minutes between them; they're from the same user) ordered by user and date.

What I've already tried:

SELECT *
FROM (
SELECT id, user, amount, 
datediff(MINUTE, lag(date) OVER (ORDER BY user, d_date), date) 
AS since_past_one
FROM events
) e
where since_past_one <> 0
and since_past_one <= 2
order by user, date

Problem is that lag takes value even if it is already a different user's event.

What I would like to see as a result is this:

id user date                 amount
1  1    29.10.2019 16:35:01  10
2  1    29.10.2019 16:35:29  15

Solution

  • Then let's have a few more rows and such patterns - I put an empty line every time we have a gap of over 2 seconds :

    WITH input(id,usr,ts,amount) AS ( 
              SELECT 1,1,TIMESTAMP '2019-10-29 16:35:00.0',10.00 
    UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:01.5',10.26 
    UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:03.0',10.52 
    UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:04.5',10.78 
    UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:06.0',11.03 
    UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:07.5',11.29 
    
    UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:09.0',11.55 
    UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:10.5',11.81 
    UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:18.0',13.10 
    UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:19.5',13.36 
    UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:21.0',13.62 
    
    UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:27.0',14.66 
    UNION ALL SELECT 1,1,TIMESTAMP '2019-10-29 16:35:28.5',14.91 
    
    UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:00.0',12.00 
    UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:01.5',12.10 
    UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:03.0',12.21 
    UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:04.5',12.31 
    
    UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:12.0',12.83 
    UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:13.5',12.93 
    UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:15.0',13.03 
    UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:16.5',13.14 
    
    UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:24.0',13.66 
    UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:25.5',13.76 
    UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:27.0',13.86 
    UNION ALL SELECT 3,2,TIMESTAMP '2019-10-29 16:35:28.5',13.97 
    ) 
    SELECT * FROM input; 
    

    Gives me:

     id | usr |          ts           | amount                                  
    ----+-----+-----------------------+--------
      1 |   1 | 2019-10-29 16:35:00   |  10.00
      1 |   1 | 2019-10-29 16:35:01.5 |  10.26
      1 |   1 | 2019-10-29 16:35:03   |  10.52
      1 |   1 | 2019-10-29 16:35:04.5 |  10.78
      1 |   1 | 2019-10-29 16:35:06   |  11.03
      1 |   1 | 2019-10-29 16:35:07.5 |  11.29
      1 |   1 | 2019-10-29 16:35:09   |  11.55
      1 |   1 | 2019-10-29 16:35:10.5 |  11.81
      1 |   1 | 2019-10-29 16:35:18   |  13.10
      1 |   1 | 2019-10-29 16:35:19.5 |  13.36
      1 |   1 | 2019-10-29 16:35:21   |  13.62
      1 |   1 | 2019-10-29 16:35:27   |  14.66
      1 |   1 | 2019-10-29 16:35:28.5 |  14.91
      3 |   2 | 2019-10-29 16:35:00   |  12.00
      3 |   2 | 2019-10-29 16:35:01.5 |  12.10
      3 |   2 | 2019-10-29 16:35:03   |  12.21
      3 |   2 | 2019-10-29 16:35:04.5 |  12.31
      3 |   2 | 2019-10-29 16:35:12   |  12.83
      3 |   2 | 2019-10-29 16:35:13.5 |  12.93
      3 |   2 | 2019-10-29 16:35:15   |  13.03
      3 |   2 | 2019-10-29 16:35:16.5 |  13.14
      3 |   2 | 2019-10-29 16:35:24   |  13.66
      3 |   2 | 2019-10-29 16:35:25.5 |  13.76
      3 |   2 | 2019-10-29 16:35:27   |  13.86
      3 |   2 | 2019-10-29 16:35:28.5 |  13.97
    

    So we want the begin row of each group of rows that are not further apart in time than 2 seconds, for the same user. Vertica can identify such groups. The process is generally called "sessionization". We have the Vertica OLAP function CONDITIONAL_TRUE_EVENT() that does that for us: it starts with 0 for each PARTITION and increments by 1 every time the Boolean expression in the parentheses is true.

    SELECT                                                                      
      CONDITIONAL_TRUE_EVENT(ts - LAG(ts) > INTERVAL '2000 msec') OVER(
        PARTITION BY usr ORDER BY ts
      ) AS session_id
    , *
    FROM input
    

    gives us:

     session_id | id | usr |          ts           | amount 
    ------------+----+-----+-----------------------+--------
              0 |  1 |   1 | 2019-10-29 16:35:00   |  10.00
              0 |  1 |   1 | 2019-10-29 16:35:01.5 |  10.26
              0 |  1 |   1 | 2019-10-29 16:35:03   |  10.52
              0 |  1 |   1 | 2019-10-29 16:35:04.5 |  10.78
              0 |  1 |   1 | 2019-10-29 16:35:06   |  11.03
              0 |  1 |   1 | 2019-10-29 16:35:07.5 |  11.29
              0 |  1 |   1 | 2019-10-29 16:35:09   |  11.55
              0 |  1 |   1 | 2019-10-29 16:35:10.5 |  11.81
              1 |  1 |   1 | 2019-10-29 16:35:18   |  13.10
              1 |  1 |   1 | 2019-10-29 16:35:19.5 |  13.36
              1 |  1 |   1 | 2019-10-29 16:35:21   |  13.62
              2 |  1 |   1 | 2019-10-29 16:35:27   |  14.66
              2 |  1 |   1 | 2019-10-29 16:35:28.5 |  14.91
              0 |  3 |   2 | 2019-10-29 16:35:00   |  12.00
              0 |  3 |   2 | 2019-10-29 16:35:01.5 |  12.10
              0 |  3 |   2 | 2019-10-29 16:35:03   |  12.21
              0 |  3 |   2 | 2019-10-29 16:35:04.5 |  12.31
              1 |  3 |   2 | 2019-10-29 16:35:12   |  12.83
              1 |  3 |   2 | 2019-10-29 16:35:13.5 |  12.93
              1 |  3 |   2 | 2019-10-29 16:35:15   |  13.03
              1 |  3 |   2 | 2019-10-29 16:35:16.5 |  13.14
              2 |  3 |   2 | 2019-10-29 16:35:24   |  13.66
              2 |  3 |   2 | 2019-10-29 16:35:25.5 |  13.76
              2 |  3 |   2 | 2019-10-29 16:35:27   |  13.86
              2 |  3 |   2 | 2019-10-29 16:35:28.5 |  13.97                     
    

    And, to get the first row of each group, we use the Vertica specific analytic LIMIT clause :

    WITH
    with_sess_id AS (
      SELECT
        CONDITIONAL_TRUE_EVENT(ts - LAG(ts) > INTERVAL '2000 msec') OVER(
          PARTITION BY usr ORDER BY ts
        ) AS session_id
      , *
      FROM input
    )
    SELECT 
      id
    , usr
    , ts
    , amount
    FROM with_sess_id
    LIMIT 1 OVER(PARTITION BY usr,session_id ORDER BY ts);
    

    You get:

     id | usr |         ts          | amount 
    ----+-----+---------------------+--------
      1 |   1 | 2019-10-29 16:35:00 |  10.00
      1 |   1 | 2019-10-29 16:35:18 |  13.10
      1 |   1 | 2019-10-29 16:35:27 |  14.66
      3 |   2 | 2019-10-29 16:35:00 |  12.00
      3 |   2 | 2019-10-29 16:35:12 |  12.83
      3 |   2 | 2019-10-29 16:35:24 |  13.66
    

    If I understood well your last question, you would like to have the average number of rows per session as we have determined above, and the average of the average amount per session as we have determined above. That would then be, if I understood your question right:

    WITH
    with_sess_id AS (
      SELECT
        CONDITIONAL_TRUE_EVENT(ts - LAG(ts) > INTERVAL '2000 msec') OVER(
          PARTITION BY usr ORDER BY ts
        ) AS session_id
      , *
      FROM input
    )
    ,
    session_summary AS (
      SELECT
        usr
      , session_id
      , COUNT(*) AS rows_per_session
      , AVG(amount) AS avg_amt_per_session
      FROM with_sess_id
      GROUP BY 1,2
      -- this returns:
      --  usr | session_id | rows_per_session | avg_amt_per_session 
      -- -----+------------+------------------+---------------------
      --    1 |          0 |                8 |              10.905
      --    1 |          1 |                3 |               13.36
      --    1 |          2 |                2 |              14.785
      --    2 |          0 |                4 |              12.155
      --    2 |          1 |                4 |             12.9825
      --    2 |          2 |                4 |             13.8125
    )
    SELECT
      AVG(rows_per_session) AS avg_rows_per_session
    , AVG(avg_amt_per_session) AS avg_avg_amount_per_session
    FROM session_summary;
    avg_rows_per_session | avg_avg_amount_per_session
     ----------------------+----------------------------
          4.16666666666667 |                         13