Search code examples
sqlpostgresqlindexinggreatest-n-per-grouppostgresql-performance

Optimize GROUP BY query to retrieve latest row per user


I have the following log table for user messages (simplified form) in Postgres 9.2:

CREATE TABLE log (
    log_date DATE,
    user_id  INTEGER,
    payload  INTEGER
);

It contains up to one record per user and per day. There will be approximately 500K records per day for 300 days. payload is ever increasing for each user (if that matters).

I want to efficiently retrieve the latest record for each user before a specific date. My query is:

SELECT user_id, max(log_date), max(payload) 
FROM log 
WHERE log_date <= :mydate 
GROUP BY user_id

which is extremely slow. I have also tried:

SELECT DISTINCT ON(user_id), log_date, payload
FROM log
WHERE log_date <= :mydate
ORDER BY user_id, log_date DESC;

which has the same plan and is equally slow.

So far I have a single index on log(log_date), but doesn't help much.

And I have a users table with all users included. I also want to retrieve the result for some some users (those with payload > :value).

Is there any other index I should use to speed this up, or any other way to achieve what I want?


Solution

  • For best read performance you need a multicolumn index:

    CREATE INDEX log_combo_idx
    ON log (user_id, log_date DESC NULLS LAST);
    

    To make index only scans possible, add the otherwise not needed column payload in a covering index with the INCLUDE clause (Postgres 11 or later):

    CREATE INDEX log_combo_covering_idx
    ON log (user_id, log_date DESC NULLS LAST) INCLUDE (payload);
    

    See:

    Fallback for older versions:

    CREATE INDEX log_combo_covering_idx
    ON log (user_id, log_date DESC NULLS LAST, payload);
    

    Why DESC NULLS LAST?

    For few rows per user_id or small tables DISTINCT ON is typically fastest and simplest:

    For many rows per user_id an index skip scan (or loose index scan) is (much) more efficient. That's not implemented up to Postgres 15 (work is ongoing). But there are ways to emulate it efficiently.

    Common Table Expressions require Postgres 8.4+.
    LATERAL requires Postgres 9.3+.
    The following solutions go beyond what's covered in the Postgres Wiki.

    1. No separate table with unique users

    With a separate users table, solutions in 2. below are typically simpler and faster. Skip ahead.

    1a. Recursive CTE with LATERAL join

    WITH RECURSIVE cte AS (
       (                                -- parentheses required
       SELECT user_id, log_date, payload
       FROM   log
       WHERE  log_date <= :mydate
       ORDER  BY user_id, log_date DESC NULLS LAST
       LIMIT  1
       )
       UNION ALL
       SELECT l.*
       FROM   cte c
       CROSS  JOIN LATERAL (
          SELECT l.user_id, l.log_date, l.payload
          FROM   log l
          WHERE  l.user_id > c.user_id  -- lateral reference
          AND    log_date <= :mydate    -- repeat condition
          ORDER  BY l.user_id, l.log_date DESC NULLS LAST
          LIMIT  1
          ) l
       )
    TABLE  cte
    ORDER  BY user_id;
    

    This is simple to retrieve arbitrary columns and probably best in current Postgres. More explanation in chapter 2a. below.

    1b. Recursive CTE with correlated subquery

    WITH RECURSIVE cte AS (
       (                                           -- parentheses required
       SELECT l AS my_row                          -- whole row
       FROM   log l
       WHERE  log_date <= :mydate
       ORDER  BY user_id, log_date DESC NULLS LAST
       LIMIT  1
       )
       UNION ALL
       SELECT (SELECT l                            -- whole row
               FROM   log l
               WHERE  l.user_id > (c.my_row).user_id
               AND    l.log_date <= :mydate        -- repeat condition
               ORDER  BY l.user_id, l.log_date DESC NULLS LAST
               LIMIT  1)
       FROM   cte c
       WHERE  (c.my_row).user_id IS NOT NULL       -- note parentheses
       )
    SELECT (my_row).*                              -- decompose row
    FROM   cte
    WHERE  (my_row).user_id IS NOT NULL
    ORDER  BY (my_row).user_id;
    

    Convenient to retrieve a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.

    To assert a row was found in the previous iteration, test a single NOT NULL column (like the primary key).

    More explanation for this query in chapter 2b. below.

    Related:

    2. With separate users table

    Table layout hardly matters as long as exactly one row per relevant user_id is guaranteed. Example:

    CREATE TABLE users (
       user_id  serial PRIMARY KEY
     , username text NOT NULL
    );
    

    Ideally, the table is physically sorted in sync with the log table. See:

    Or it's small enough (low cardinality) that it hardly matters. Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition. If the physical sort order of the users table happens to match the index on log, this may be irrelevant.

    2a. LATERAL join

    SELECT u.user_id, l.log_date, l.payload
    FROM   users u
    CROSS  JOIN LATERAL (
       SELECT l.log_date, l.payload
       FROM   log l
       WHERE  l.user_id = u.user_id         -- lateral reference
       AND    l.log_date <= :mydate
       ORDER  BY l.log_date DESC NULLS LAST
       LIMIT  1
       ) l;
    

    JOIN LATERAL allows to reference preceding FROM items on the same query level. See:

    Results in one index (-only) look-up per user.

    Returns no row for users missing in the users table. Typically, a foreign key constraint enforcing referential integrity would rule that out.

    Also, no row for users without matching entry in log - conforming to the original question. To keep those users in the result use LEFT JOIN LATERAL ... ON true instead of CROSS JOIN LATERAL:

    Use LIMIT n instead of LIMIT 1 to retrieve more than one rows (but not all) per user.

    Effectively, all of these do the same:

    JOIN LATERAL ... ON true
    CROSS JOIN LATERAL ...
    , LATERAL ...
    

    The last one has lower priority, though. Explicit JOIN binds before comma. That subtle difference can matters with more join tables. See:

    2b. Correlated subquery

    Good choice to retrieve a single column from a single row. Code example:

    The same is possible for multiple columns, but you need more smarts:

    CREATE TEMP TABLE combo (log_date date, payload int);
    
    SELECT user_id, (combo1).*              -- note parentheses
    FROM (
       SELECT u.user_id
            , (SELECT (l.log_date, l.payload)::combo
               FROM   log l
               WHERE  l.user_id = u.user_id
               AND    l.log_date <= :mydate
               ORDER  BY l.log_date DESC NULLS LAST
               LIMIT  1) AS combo1
       FROM   users u
       ) sub;
    

    Like LEFT JOIN LATERAL above, this variant includes all users, even without entries in log. You get NULL for combo1, which you can easily filter with a WHERE clause in the outer query if need be.
    Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all column values happen to be NULL - same result. You need a NOT NULL column in the subquery to avoid this ambiguity.

    A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.
    Use a registered type like the row type of an existing table. Or register a composite type explicitly (and permanently) with CREATE TYPE. Or create a temporary table (dropped automatically at end of session) to register its row type temporarily. Cast syntax: (log_date, payload)::combo

    Finally, we do not want to decompose combo1 on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (still true in Postgres 12). Instead, make it a subquery and decompose in the outer query.

    Related:

    Demonstrating all 4 queries with 100k log entries and 1k users:
    db<>fiddle here - pg 11
    Old sqlfiddle