sql postgresql aggregate-functions window-functions postgresql-performance

Counting preceding occurences of an event within a given interval for each event row with a window function

I have table storing events occurring to users as shown in http://sqlfiddle.com/#!15/2b559/2/0

event_id(integer)
user_id(integer)
event_type(integer)
timestamp(timestamp)

A sample of the data looks as follows:

+-----------+----------+-------------+----------------------------+
| event_id  | user_id  | event_type  |         timestamp          |
+-----------+----------+-------------+----------------------------+
|        1  |       1  |          1  | January, 01 2015 00:00:00  |
|        2  |       1  |          1  | January, 10 2015 00:00:00  |
|        3  |       1  |          1  | January, 20 2015 00:00:00  |
|        4  |       1  |          1  | January, 30 2015 00:00:00  |
|        5  |       1  |          1  | February, 10 2015 00:00:00 |
|        6  |       1  |          1  | February, 21 2015 00:00:00 |
|        7  |       1  |          1  | February, 22 2015 00:00:00 |
+-----------+----------+-------------+----------------------------+

I would like to get, for each event, the number of events of the same user and the same event_type that occurred within 30 days before the event.

It should look like the following:

+-----------+----------+-------------+-----------------------------+-------+
| event_id  | user_id  | event_type  |         timestamp           | count |
+-----------+----------+-------------+-----------------------------+-------+
|        1  |       1  |          1  | January, 01 2015 00:00:00   |     1 |
|        2  |       1  |          1  | January, 10 2015 00:00:00   |     2 |
|        3  |       1  |          1  | January, 20 2015 00:00:00   |     3 |
|        4  |       1  |          1  | January, 30 2015 00:00:00   |     4 |
|        5  |       1  |          1  | February, 10 2015 00:00:00  |     3 |
|        6  |       1  |          1  | February, 21 2015 00:00:00  |     3 |
|        7  |       1  |          1  | February, 22 2015 00:00:00  |     4 |
+-----------+----------+-------------+-----------------------------+-------+

The table contains millions of rows so I cannot go with a correlated subquery as suggested by @jpw in the answers below.

So far I managed to get the total number of events that occurred before with the same user_id and same event_id by using the following query:

SELECT event_id, user_id,event_type,"timestamp",
COUNT(event_type) OVER w
FROM events
WINDOW w AS (PARTITION BY user_id,event_type ORDER BY timestamp
ROWS UNBOUNDED PRECEDING);

With the following result:

+-----------+----------+-------------+-----------------------------+-------+
| event_id  | user_id  | event_type  |         timestamp           | count |
+-----------+----------+-------------+-----------------------------+-------+
|        1  |       1  |          1  | January, 01 2015 00:00:00   |     1 |
|        2  |       1  |          1  | January, 10 2015 00:00:00   |     2 |
|        3  |       1  |          1  | January, 20 2015 00:00:00   |     3 |
|        4  |       1  |          1  | January, 30 2015 00:00:00   |     4 |
|        5  |       1  |          1  | February, 10 2015 00:00:00  |     5 |
|        6  |       1  |          1  | February, 21 2015 00:00:00  |     6 |
|        7  |       1  |          1  | February, 22 2015 00:00:00  |     7 |
+-----------+----------+-------------+-----------------------------+-------+

Do you know if there a way to change the window frame specification or the COUNT function so only the number of events which occurred within x days is returned?

In a second time, I would like to exclude duplicate events, i.e. same event_type and same timestamp.

Solution

I provided a more detailed answer plus fiddle under the duplicate question on dba.SE.

Basically:

CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);

And either:

SELECT *
FROM   events e
    ,  LATERAL (
   SELECT count(*) AS ct
   FROM   events 
   WHERE  user_id    = e.user_id 
   AND    event_type = e.event_type
   AND    ts >= e.ts - interval '30 days'
   AND    ts <= e.ts
   ) ct
ORDER  BY event_id;

Or:

SELECT e.*, count(*) AS ct
FROM   events e
JOIN   events x USING (user_id, event_type)
WHERE  x.ts >= e.ts - interval '30 days'
AND    x.ts <= e.ts
GROUP  BY e.event_id
ORDER  BY e.event_id;