Search code examples
sqlpostgresqlgreatest-n-per-grouppostgresql-performance

How to LIMIT the number of parent values, but return all children for each?


There are transactions table and logs table. logs are linked to transactions by transaction_id. I need to query logs by address, join it with transactions, aggregate logs to array, LIMIT transactions (example is LIMIT 2) and FETCH ALL LOGS that were in that transaction (but query only by one address field).

create table transactions
(id int,
 hash varchar);
 
create table logs
(transaction_id int,
 address varchar,
 value varchar
);

create index on logs(address);

insert into transactions values 
(1, 'h1'),
(2, 'h2'),
(3, 'h3'),
(4, 'h4'),
(5, 'h5')
;
 
insert into logs values 
(1, 'a1', 'h1.a1.1'),
(1, 'a1', 'h1.a1.2'),
(1, 'a3', 'h1.a3.1'),
(2, 'a1', 'h2.a1.1'),
(2, 'a2', 'h2.a2.1'),
(2, 'a2', 'h2.a2.2'),
(2, 'a3', 'h2.a3.1'),
(3, 'a2', 'h3.a2.1'),
(4, 'a1', 'h4.a1.1'),
(5, 'a2', 'h5.a2.1'),
(5, 'a3', 'h5.a3.1')
;

Result must be with query WHERE log.address='a2' LIMIT 2:

id  logs_array
2   [{"address":"a1","value":"h2.a1.1"},{"address":"a2","value":"h2.a2.1"},{"address":"a2","value":"h2.a2.2"},{"address":"a3","value":"h2.a3.1"}]
3   [{"address":"a2","value":"h3.a2.1"}]

Problem: sql query below works correct, but on very high amount of logs (100k+ logs for 1 address) it can take many minutes for search. The solution would be set LIMIT in MATERIALIZED, but in that case I can get transactions with not fully correct list of logs. How to fix? Either rewrite query without MATERIALIZED and use multiple SELECT inside each other, but I don't know how, or fix with MATERIALIZED.

So problem is that Postgres does not understand correct in MATERIALIZED that I need limited number of transactions, it first searches all logs, only then append them to transactions with limit (as I guess). Index on logs(address) is set.

WITH b AS MATERIALIZED (
        SELECT lg.transaction_id
        FROM logs lg
        WHERE lg.address='a2'
      
        -- this must be commented, otherwise not correct results, although fast execution
        -- LIMIT 2
    )
SELECT 
    id,
    (SELECT array_agg(JSON_BUILD_OBJECT('address',address,'value',value)) FROM logs WHERE transaction_id = t.id) logs_array
FROM transactions t 
WHERE t.id IN 
    (SELECT transaction_id FROM b)
LIMIT 2

Real-world example, query was executing ~30 seconds:

EXPLAIN WITH 
    b AS MATERIALIZED (
        SELECT lg.transaction_id
        FROM _logs lg
        WHERE lg.address in ('0xca530408c3e552b020a2300debc7bd18820fb42f', '0x68e78497a7b0db7718ccc833c164a18d8e626816')
    )
SELECT 
    (SELECT array_agg(JSON_BUILD_OBJECT('address',address)) FROM _logs WHERE transaction_id = t.id) logs_array
FROM _transactions t 
WHERE t.id IN 
    (SELECT transaction_id FROM b)
LIMIT 5000;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=87540.62..3180266.26 rows=5000 width=32)
   CTE b
     ->  Index Scan using _logs_address_idx on _logs lg  (cost=0.70..85820.98 rows=76403 width=8)
           Index Cond: ((address)::text = ANY ('{0xca530408c3e552b020a2300debc7bd18820fb42f,0x68e78497a7b0db7718ccc833c164a18d8e626816}'::text[]))
   ->  Nested Loop  (cost=1719.64..47260423.09 rows=76403 width=32)
         ->  HashAggregate  (cost=1719.07..1721.07 rows=200 width=8)
               Group Key: b.transaction_id
               ->  CTE Scan on b  (cost=0.00..1528.06 rows=76403 width=8)
         ->  Index Only Scan using _transactions_pkey on _transactions t  (cost=0.57..2.79 rows=1 width=8)
               Index Cond: (id = b.transaction_id)
         SubPlan 2
           ->  Aggregate  (cost=618.53..618.54 rows=1 width=32)
                 ->  Index Scan using _logs_transaction_id_idx on _logs  (cost=0.57..584.99 rows=6707 width=43)
                       Index Cond: (transaction_id = t.id)
 JIT:
   Functions: 17
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(17 rows)

upd: I did not point correct, I need also get results from transactions table (hash field). So result will be:

id   hash  logs_array
2    h2    [{"address":"a1","value":"h2.a1.1"},{"address":"a2","value":"h2.a2.1"},{"address":"a2","value":"h2.a2.2"},{"address":"a3","value":"h2.a3.1"}]
3    h3    [{"address":"a2","value":"h3.a2.1"}]

Solution

  • Simple solution

    SELECT a.transaction_id AS id
         , ARRAY(SELECT json_build_object('address',address,'value',value)
                 FROM logs l WHERE l.transaction_id = a.transaction_id) AS logs_array
    FROM  (
       SELECT DISTINCT l.transaction_id
       FROM   logs l
       WHERE  l.address = 'a2'  -- your address here
    -- ORDER  BY l.transaction_id  -- choose which?
       LIMIT  2  -- your LIMIT here
       ) a
    

    No need for a MATERIALIZED CTE.
    Using a cheaper ARRAY constructor for the simple task. See:

    This returns a maximum of LIMIT 2 transactions with all logs, where the given address shows up at least once. Add ORDER BY before LIMIT to get a deterministic selection.

    But that's only fast for few rows per address. Here comes your performance problem:

    but on very high amount of logs (100k+ logs for 1 address) it can take many minutes for search

    FAST solution for many rows per address

    Be sure to have an index on logs (address, transaction_id)!
    If your example is real, and there is only one small additional column to satisfy your query, add that as INCLUDE column to the index to get all index-only scans - if your table is vacuum'ed enough.

    CREATE INDEX logs_address_transaction_id ON logs (address, transaction_id) INCLUDE (value);
    

    (The rCTE will use index-only scans anyway, so the INCLUDE clause is only a minor improvement for the outer SELECT, after most of the hard work has already been done.)

    Then emulate an index-skip scan with a recursive CTE (rCTE). See:

    WITH RECURSIVE adr_trans AS (
       (
       SELECT l.transaction_id
       FROM   logs l
       WHERE  l.address = 'a2'  -- your address here
       ORDER  BY l.transaction_id
       LIMIT  1
       )
       UNION ALL
       SELECT (SELECT l.transaction_id
               FROM   logs l
               WHERE  l.address = 'a2'  -- your address here
               AND    l.transaction_id > a.transaction_id
               ORDER  BY l.transaction_id
               LIMIT  1
               )
       FROM   adr_trans a
       WHERE  a.transaction_id IS NOT NULL
       )
    SELECT a.transaction_id AS id
         , (SELECT json_agg(l.*) FROM (
               SELECT l.address, l.value
               FROM   logs l
               WHERE  l.transaction_id = a.transaction_id
           --  ORDER  BY l.address  -- sort?
               ) l) AS logs
    FROM   adr_trans a
    WHERE  a.transaction_id IS NOT NULL  -- eliminate possible dangling null row
    LIMIT  2;  -- your LIMIT here
    

    fiddle - also showing prepared query with params & query plan
    Works in Postgres 12, too.

    Now we are talking milliseconds instead of seconds or minutes.

    Plus, the rCTE stops executing as soon as the outer LIMIT is satisfied. Won't get faster than this.

    I would build a JSON array of objects (type json) instead of an array of JSON values (type json[]). Just a suggestion.

    And I would add ORDER BY to the final subquery on logs to get sorted array elements. (With index-only scans you'll typically get logs sorted by the index sort order, but that's not guaranteed.)