Given a database table events
with columns: event_id
, correlation_id
, username
, create_timestamp
. It contains more than 1M records.
There is a problem I am trying to solve: for each event of a particular user display its latest sibling event. Sibling events are the events which has same correlation_id
. The query I use for that is the following:
SELECT
"events"."event_id" AS "event_id",
"latest"."event_id" AS "latest_event_id"
FROM
events "events"
JOIN (
SELECT
"latest"."correlation_id" AS "correlation_id",
"latest"."event_id" AS "event_id",
ROW_NUMBER () OVER (
PARTITION BY "latest"."correlation_id"
ORDER BY
"latest"."create_timestamp" ASC
) AS "rn"
FROM
events "latest"
) "latest" ON (
"latest"."correlation_id" = "events"."correlation_id"
AND "latest"."rn" = 1
)
WHERE
"events"."username" = 'user1'
It gets correct list of results but causes performance problems which must be fixed. Here is an execution plan of the query:
Hash Right Join (cost=13538.03..15522.72 rows=1612 width=64)
Hash Cond: (("latest".correlation_id)::text = ("events".correlation_id)::text)
-> Subquery Scan on "latest" (cost=12031.35..13981.87 rows=300 width=70)
Filter: ("latest".rn = 1)
-> WindowAgg (cost=12031.35..13231.67 rows=60016 width=86)
-> Sort (cost=12031.35..12181.39 rows=60016 width=78)
Sort Key: "latest_1".correlation_id, "latest_1".create_timestamp
-> Seq Scan on events "latest_1" (cost=0.00..7268.16 rows=60016 width=78)
-> Hash (cost=1486.53..1486.53 rows=1612 width=70)
-> Index Scan using events_username on events "events" (cost=0.41..1486.53 rows=1612 width=70)
Index Cond: ((username)::text = 'user1'::text)
From the plan, I can conclude that the performance problem mainly caused by calculation of latest events for ALL events in the table which takes ~80% of cost. Also it performs the calculations even if there are no events for a user at all. Ideally, I would like the query to do these steps which seem more efficient for me:
To simplify the discussion, let's consider all required indexes as already created for needed columns. It doesn't seem to me that the problem can be solved purely by index creation.
Any ideas what can be done to improve the performance? Probably, there are options to rewrite the query or adjust a configuration of the table.
Note that this question is significantly obfuscated in terms of business meaning to clearly demonstrate the technical problem I face.
The window function has to scan the whole table. It has no idea that really you are only interested in the first value. A lateral join could perform better and is more readable anyway:
SELECT
e.event_id,
latest.latest_event_id
FROM
events AS e
CROSS JOIN LATERAL
(SELECT
l.event_id AS latest_event_id
FROM
events AS l
WHERE
l.correlation_id = e.correlation_id
ORDER BY l.create_timestamp
FETCH FIRST 1 ROWS ONLY
) AS latest
WHERE e.username = 'user1';
The perfect index to support that would be
CREATE INDEX ON event (correlation_id, create_timestamp);