Search code examples
postgresqlwindow-functions

How to pass through window function barier for PostgreSQL database?


I use PostgreSQL v15.4. I have the next query:

explain (analyze, costs, verbose, buffers) 
with vimages as (
    SELECT
        vim.view_id,
        array_to_json(array_agg(row_to_json( row(            
            vim.created_at,
            igim.id, 
            igim.name, 
            vim.sequence_no
        )) order by vim.sequence_no desc)) as json
    FROM view_images vim
    JOIN xx_images igim ON vim.xx_image_id = igim.id
    GROUP BY vim.view_id
)

select vimages.*
from xxs ig
left join vimages on vimages.view_id = ig.id
WHERE ig.alias = '257_belmont_cir_brunswick_ga'

This is the plan for this query. The query takes 27sec to execute. As result there is only one row.

If I hardcode the id for the result row, eg. and ig.id = 682430783638437250 then query takes only 6ms.

Could it be possible to reformat my query in a such way, so that them vim table is immediately filtered and then aggregated, when I query ig table by alias column? Just like this happens when I hardcode the value for ID.


Solution

  • Example dataset (10 times smaller):

    CREATE UNLOGGED TABLE views( view_id INTEGER NOT NULL, alias INTEGER NOT NULL );
    INSERT INTO views SELECT n,n/3 FROM generate_series(1,10000) n;
    ALTER TABLE views ADD PRIMARY KEY (view_id);
    CREATE INDEX ON views(alias);
    
    CREATE UNLOGGED TABLE images( image_id INTEGER NOT NULL, image_foo INT NOT NULL );
    INSERT INTO images SELECT n,n FROM generate_series(1,400000) n;
    ALTER TABLE images ADD PRIMARY KEY (image_id);
    
    CREATE UNLOGGED TABLE view_images( view_id INTEGER NOT NULL, image_id INTEGER NOT NULL, vim_foo INT NOT NULL );
    INSERT INTO view_images SELECT (1+random()*10000)::INTEGER view_id, (1+random()*400000)::INTEGER image_id, n FROM generate_series(1,1500000) n;
    CREATE INDEX ON view_images( view_id );
    CREATE INDEX ON view_images( image_id );
    
    VACUUM ANALYZE;
    
    EXPLAIN ANALYZE 
    with vimages as (
        SELECT
            vim.view_id,
            array_to_json(array_agg(row_to_json( row(            
                vim.vim_foo,
                im.image_id, 
                im.image_foo
            )) order by vim.vim_foo desc)) as json
        FROM view_images vim
        JOIN images im USING (image_id)
        GROUP BY vim.view_id
    )
    
    select vimages.*
    from views 
    left join vimages USING (view_id)
    WHERE views.alias = 1234;
    

    It reproduces your slow query plan exactly.

    Moving CTE into query: no change.

    Simplify query (below): the plan no longer involves a sort since I removed ORDER BY in the aggregate. I'm not pasting the plan, the problem still remains, it's still doing seq scan and hash join on the images and images_views:

    EXPLAIN SELECT * FROM views LEFT JOIN (
        SELECT vim.view_id, array_agg( im.image_id ) as agg
        FROM view_images vim
        JOIN images im USING (image_id)
        GROUP BY vim.view_id
    ) vimages USING (view_id)
    WHERE views.alias = 1234;
    

    In my example data, alias=1234 corredponds to WHERE view_id IN (3702,3703,3704). If I put that at the end of the query, no change. If I put it inside the subquery, I get the fast plan.

    The problem thus seems to be that it's not propagating the join condition on view_id inside the subquery.

    Solution #1: move the GROUP BY

    EXPLAIN SELECT views.view_id, array_agg( im.image_id ) as agg
    FROM views 
    LEFT JOIN view_images vim USING (view_id)
    LEFT JOIN images im USING (image_id)
    WHERE views.alias = 1234
    GROUP BY views.view_id;
    

    Solution #2: use LATERAL to be able to explicitly move the problematic join predicate

    EXPLAIN SELECT * FROM views LEFT JOIN LATERAL (
        SELECT vim.view_id, array_agg( im.image_id ) as agg
        FROM view_images vim
        JOIN images im USING (image_id)
        WHERE vim.view_id=views.view_id
        GROUP BY vim.view_id
    ) vimages USING (view_id)
    WHERE views.alias = 1234;
    

    Both get fast plans with index scan on the images and view_images hitting only the rows they should.

    If the condition on views.alias is dropped and the whole table actually needs to be scanned, #1 will revert to seq scans which is better in this case, but #2 will keep running nested loops so it will be slower. So #1 would be my preferred option unless there are some other bits of query you cut off before pasting it in your question and would get in the way of the GROUP BY.

    Solution #3:

    If you want several CTEs like "vimages" but on other tables, then you could first make a materialized CTE with only the view_id's you want. Then manually put it into the problematic CTEs as a join. This gets the fast plan:

    with vids AS MATERIALIZED (SELECT view_id FROM views v WHERE v.alias = 1234),
    vimages as (
        SELECT
            vim.view_id,
            array_to_json(array_agg(row_to_json( row(            
                vim.vim_foo,
                im.image_id, 
                im.image_foo
            )) order by vim.vim_foo desc)) as json
        FROM 
        vids JOIN view_images vim USING (view_id)
        JOIN images im USING (image_id)
        GROUP BY vim.view_id
    )
    select vimages.*
    from vids v
    left join vimages USING (view_id);