Search code examples
sqlpostgresqlaggregate-functionswindow-functionspostgresql-performance

Counting preceding occurences of an event within a given interval for each event row with a window function


I have table storing events occurring to users as shown in http://sqlfiddle.com/#!15/2b559/2/0

event_id(integer)
user_id(integer)
event_type(integer)
timestamp(timestamp)

A sample of the data looks as follows:

+-----------+----------+-------------+----------------------------+
| event_id  | user_id  | event_type  |         timestamp          |
+-----------+----------+-------------+----------------------------+
|        1  |       1  |          1  | January, 01 2015 00:00:00  |
|        2  |       1  |          1  | January, 10 2015 00:00:00  |
|        3  |       1  |          1  | January, 20 2015 00:00:00  |
|        4  |       1  |          1  | January, 30 2015 00:00:00  |
|        5  |       1  |          1  | February, 10 2015 00:00:00 |
|        6  |       1  |          1  | February, 21 2015 00:00:00 |
|        7  |       1  |          1  | February, 22 2015 00:00:00 |
+-----------+----------+-------------+----------------------------+

I would like to get, for each event, the number of events of the same user and the same event_type that occurred within 30 days before the event.

It should look like the following:

+-----------+----------+-------------+-----------------------------+-------+
| event_id  | user_id  | event_type  |         timestamp           | count |
+-----------+----------+-------------+-----------------------------+-------+
|        1  |       1  |          1  | January, 01 2015 00:00:00   |     1 |
|        2  |       1  |          1  | January, 10 2015 00:00:00   |     2 |
|        3  |       1  |          1  | January, 20 2015 00:00:00   |     3 |
|        4  |       1  |          1  | January, 30 2015 00:00:00   |     4 |
|        5  |       1  |          1  | February, 10 2015 00:00:00  |     3 |
|        6  |       1  |          1  | February, 21 2015 00:00:00  |     3 |
|        7  |       1  |          1  | February, 22 2015 00:00:00  |     4 |
+-----------+----------+-------------+-----------------------------+-------+

The table contains millions of rows so I cannot go with a correlated subquery as suggested by @jpw in the answers below.

So far I managed to get the total number of events that occurred before with the same user_id and same event_id by using the following query:

SELECT event_id, user_id,event_type,"timestamp",
COUNT(event_type) OVER w
FROM events
WINDOW w AS (PARTITION BY user_id,event_type ORDER BY timestamp
ROWS UNBOUNDED PRECEDING);

With the following result:

+-----------+----------+-------------+-----------------------------+-------+
| event_id  | user_id  | event_type  |         timestamp           | count |
+-----------+----------+-------------+-----------------------------+-------+
|        1  |       1  |          1  | January, 01 2015 00:00:00   |     1 |
|        2  |       1  |          1  | January, 10 2015 00:00:00   |     2 |
|        3  |       1  |          1  | January, 20 2015 00:00:00   |     3 |
|        4  |       1  |          1  | January, 30 2015 00:00:00   |     4 |
|        5  |       1  |          1  | February, 10 2015 00:00:00  |     5 |
|        6  |       1  |          1  | February, 21 2015 00:00:00  |     6 |
|        7  |       1  |          1  | February, 22 2015 00:00:00  |     7 |
+-----------+----------+-------------+-----------------------------+-------+

Do you know if there a way to change the window frame specification or the COUNT function so only the number of events which occurred within x days is returned?

In a second time, I would like to exclude duplicate events, i.e. same event_type and same timestamp.


Solution

  • I provided a more detailed answer plus fiddle under the duplicate question on dba.SE.

    Basically:

    CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);
    

    And either:

    SELECT *
    FROM   events e
        ,  LATERAL (
       SELECT count(*) AS ct
       FROM   events 
       WHERE  user_id    = e.user_id 
       AND    event_type = e.event_type
       AND    ts >= e.ts - interval '30 days'
       AND    ts <= e.ts
       ) ct
    ORDER  BY event_id;
    

    Or:

    SELECT e.*, count(*) AS ct
    FROM   events e
    JOIN   events x USING (user_id, event_type)
    WHERE  x.ts >= e.ts - interval '30 days'
    AND    x.ts <= e.ts
    GROUP  BY e.event_id
    ORDER  BY e.event_id;