I have table storing events occurring to users as shown in http://sqlfiddle.com/#!15/2b559/2/0
event_id(integer)
user_id(integer)
event_type(integer)
timestamp(timestamp)
A sample of the data looks as follows:
+-----------+----------+-------------+----------------------------+
| event_id | user_id | event_type | timestamp |
+-----------+----------+-------------+----------------------------+
| 1 | 1 | 1 | January, 01 2015 00:00:00 |
| 2 | 1 | 1 | January, 10 2015 00:00:00 |
| 3 | 1 | 1 | January, 20 2015 00:00:00 |
| 4 | 1 | 1 | January, 30 2015 00:00:00 |
| 5 | 1 | 1 | February, 10 2015 00:00:00 |
| 6 | 1 | 1 | February, 21 2015 00:00:00 |
| 7 | 1 | 1 | February, 22 2015 00:00:00 |
+-----------+----------+-------------+----------------------------+
I would like to get, for each event, the number of events of the same user and the same event_type that occurred within 30 days before the event.
It should look like the following:
+-----------+----------+-------------+-----------------------------+-------+
| event_id | user_id | event_type | timestamp | count |
+-----------+----------+-------------+-----------------------------+-------+
| 1 | 1 | 1 | January, 01 2015 00:00:00 | 1 |
| 2 | 1 | 1 | January, 10 2015 00:00:00 | 2 |
| 3 | 1 | 1 | January, 20 2015 00:00:00 | 3 |
| 4 | 1 | 1 | January, 30 2015 00:00:00 | 4 |
| 5 | 1 | 1 | February, 10 2015 00:00:00 | 3 |
| 6 | 1 | 1 | February, 21 2015 00:00:00 | 3 |
| 7 | 1 | 1 | February, 22 2015 00:00:00 | 4 |
+-----------+----------+-------------+-----------------------------+-------+
The table contains millions of rows so I cannot go with a correlated subquery as suggested by @jpw in the answers below.
So far I managed to get the total number of events that occurred before with the same user_id and same event_id by using the following query:
SELECT event_id, user_id,event_type,"timestamp",
COUNT(event_type) OVER w
FROM events
WINDOW w AS (PARTITION BY user_id,event_type ORDER BY timestamp
ROWS UNBOUNDED PRECEDING);
With the following result:
+-----------+----------+-------------+-----------------------------+-------+
| event_id | user_id | event_type | timestamp | count |
+-----------+----------+-------------+-----------------------------+-------+
| 1 | 1 | 1 | January, 01 2015 00:00:00 | 1 |
| 2 | 1 | 1 | January, 10 2015 00:00:00 | 2 |
| 3 | 1 | 1 | January, 20 2015 00:00:00 | 3 |
| 4 | 1 | 1 | January, 30 2015 00:00:00 | 4 |
| 5 | 1 | 1 | February, 10 2015 00:00:00 | 5 |
| 6 | 1 | 1 | February, 21 2015 00:00:00 | 6 |
| 7 | 1 | 1 | February, 22 2015 00:00:00 | 7 |
+-----------+----------+-------------+-----------------------------+-------+
Do you know if there a way to change the window frame specification or the COUNT function so only the number of events which occurred within x days is returned?
In a second time, I would like to exclude duplicate events, i.e. same event_type and same timestamp.
I provided a more detailed answer plus fiddle under the duplicate question on dba.SE.
Basically:
CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);
And either:
SELECT *
FROM events e
, LATERAL (
SELECT count(*) AS ct
FROM events
WHERE user_id = e.user_id
AND event_type = e.event_type
AND ts >= e.ts - interval '30 days'
AND ts <= e.ts
) ct
ORDER BY event_id;
Or:
SELECT e.*, count(*) AS ct
FROM events e
JOIN events x USING (user_id, event_type)
WHERE x.ts >= e.ts - interval '30 days'
AND x.ts <= e.ts
GROUP BY e.event_id
ORDER BY e.event_id;