Search code examples
sqlpostgresqlgreatest-n-per-groupaudit

How do I efficiently query versioned rows/entities in PostgreSQL?


Background

I have situation where I store all versions of a given entity in my PostgreSQL database. This is implemented with two tables; one table storing the primary key and immutable properties of the entity and a second table storing the mutable properties of the entity. Both tables are insert-only (enforced by a trigger).

Example

The concept can easily be illustrated with an entity User, stored in the user and user_details tables:

Table user:

id  timestamp
1   2018-04-10T12:00:00
2   2018-04-10T12:00:00

Table user_details:

id  user_id   username  first_name   last_name     timestamp
1   1         bob       Bob          Socks         2018-04-10T12:00:01
2   1         bob       Bobby        Socks         2018-04-10T12:00:02
3   2         alice     Alice        Jones         2018-04-10T12:00:03
4   1         bob       Bobbers      Socks         2018-04-10T12:00:04
5   2         alice     Alicia       Jones         2018-04-10T12:00:05

Both 'id' columns are defined as serial primary keys (strictly incrementing) and I have created an index on user_details (user_id, id DESC).

1 - How do I efficiently query the most recent version of an entity?

Given a user id I need a quick way to fetch the immutable data in user and the most recent entry from user_details. What kind of query would be best suited for this join?

2 - How do I efficiently query versions n and n-1 of an entity?

I am generating audit logs for time intervals by first fetching all rows with timestamp between X and Y and then I fetch the inserted row and its predecessor (same user_id, closest lower id) and produce a diff from these. The number of rows inserted between X and Y is often high, so I need to efficiently fetch the current + previous pairs, i.e. given input user_details(5), I need to select the join of user(2) + user_details(5) and user(2) + user_details(3). What kind of query would be best suited for this join?

Futile attempts

My best results so far has been with these queries:

Query for question 1:

SELECT *
FROM "user" u
JOIN LATERAL (SELECT *
              FROM "user_details" ud
              WHERE u.id = ud.user_id
              ORDER BY id DESC
              LIMIT 1
       ) detail ON TRUE
WHERE u.id IN
      (...);

Query for question 2:

SELECT *
FROM "user" u
JOIN LATERAL (SELECT *
              FROM "user_details" ud
              WHERE u.id = ud.user_id
              AND ud.id IN (...)
              ORDER BY id DESC
              LIMIT 2) ud ON TRUE;

However, both queries end up using nested loops (seen from EXPLAIN ANALYZE) and take a long time to finish when run with a large number of ids (5000+).

Ideas

Can I use the user_details (user_id, id DESC) index in a smart way to first create a CTE of the user_details ids I need and then join user + user_details based on this? Can I create a functional index of some sort? Do I need to maintain a predecessor column in user_details (or another table) do be able to look up relations of this type efficiently?

Thanks!

SQL Fiddle: http://www.sqlfiddle.com/#!17/5f5f0

Solution

Thanks to X and Y for pushing me in the right direction! I ended up using the solution @MichelMilezzi suggested for my first problem and an adaption of the @RadimBača solution for my second problem:

WITH
cte_1 AS (SELECT id, user_id FROM "user_details" WHERE id IN (8999, 9999)),
cte_2 as (SELECT cte_1.id, cte_1.user_id, prev.id AS prev_id, row_number() OVER (PARTITION BY cte_1.id, cte_1.user_id ORDER BY prev.id DESC) AS rownum FROM "user_details" prev, cte_1 WHERE prev.user_id = cte_1.user_id AND prev.id < cte_1.id)
SELECT main.*, detail.*, cte_2.id AS __id, (detail.id <> cte_2.id) AS __is_predecessor FROM "user" main, "user_details" detail, cte_2
WHERE main.id = cte_2.user_id AND cte_2.rownum = 1 AND (detail.id = cte_2.id OR detail.id = cte_2.prev_id);

Solution

  • Consider using window functions

    SELECT *
    FROM "user" u
    JOIN
    (
        SELECT row_number() over(partition by user_id order by id) rn,
               *
        FROM "user_details" ud
    ) t ON t.user_id = u.id
    WHERE t.rn = 1
    

    DEMO

    This solution allows you to query also all N rows per group or N-th row per group.