Search code examples
arrayspostgresqlanypostgres-fdw

PostgreSQL fdw table performance drops exponentially after exactly 5 identical queries


I have a postgres 13.3 database running on a CentOS machine. Nothing else is running on this machine as these tests are being performed, and nothing else accesses the database while tests are performed.

The table jammerdal contains about 500.000 rows. I have repeated this experiment with other tables containing around 50.000 rows. The result is the same, but the slowdown seems to correlate with the number of rows in the fdw table used - and the number of fake id's generated.

Running this:

CREATE OR REPLACE FUNCTION jegfatterintet() RETURNS TABLE (c BIGINT)
AS $$
DECLARE
    jammerdal_ids VARCHAR[];
    area_row RECORD;
    start TIMESTAMP;
BEGIN
    SELECT INTO jammerdal_ids ARRAY_AGG('id-'||x::VARCHAR) FROM generate_series(0,25) x;

    FOR area_row IN SELECT * FROM generate_series(1,10)
        LOOP
        SELECT INTO start clock_timestamp();
        --RAISE NOTICE '%: Start %.', clock_timestamp(), area_row;
        RETURN QUERY
              SELECT COUNT(*) FROM jammerdal WHERE id = ANY(jammerdal_ids);
        --RAISE NOTICE '%: End %.', clock_timestamp(), area_row;
        RAISE NOTICE '%: Duration is %.', area_row, clock_timestamp()-start;
    END LOOP;
    RAISE NOTICE '%: All done.', clock_timestamp();
END
$$ LANGUAGE plpgsql;    

SELECT * FROM jegfatterintet();

Yields this output:

CREATE FUNCTION
NOTICE:  (1): Duration is 00:00:00.019555.
NOTICE:  (2): Duration is 00:00:00.001271.
NOTICE:  (3): Duration is 00:00:00.001089.
NOTICE:  (4): Duration is 00:00:00.00118.
NOTICE:  (5): Duration is 00:00:00.001035.
NOTICE:  (6): Duration is 00:00:02.954527.
NOTICE:  (7): Duration is 00:00:02.871185.
NOTICE:  (8): Duration is 00:00:02.812426.
NOTICE:  (9): Duration is 00:00:02.777037.
NOTICE:  (10): Duration is 00:00:02.90708.
NOTICE:  2021-09-07 11:21:53.577115+00: All done.
 c 
---
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
(10 rows)

Notice how the duration suddenly climbs from below 0.01s to almost 3s from step 6 forward.

This only happens when jammerdal is a foreign (fdw) table, not when it's local. And it only happens when using the array of ids.

If I change the function to:

CREATE OR REPLACE FUNCTION jegfatterintet() RETURNS TABLE (c BIGINT)
AS $$
DECLARE
    jammerdal_ids VARCHAR[];
    area_row RECORD;
    start TIMESTAMP;
BEGIN
    SELECT INTO jammerdal_ids ARRAY_AGG('id-'||x::VARCHAR) FROM generate_series(0,25) x;

    FOR area_row IN SELECT * FROM generate_series(1,10)
        LOOP
        SELECT INTO start clock_timestamp();
        --RAISE NOTICE '%: Start %.', clock_timestamp(), area_row;
        RETURN QUERY
              SELECT COUNT(*) FROM jammerdal WHERE id IN ('id-0', 'id-1', 'id-2', 'id-3', 'id-4', 'id-5', 'id-6', 'id-7', 'id-8', 'id-9', 'id-10', 'id-11', 'id-12', 'id-13', 'id-14', 'id-15', 'id-16', 'id-17', 'id-18', 'id-19', 'id-20', 'id-21', 'id-22', 'id-23', 'id-24', 'id-25'); --id = ANY(jammerdal_ids);
        --RAISE NOTICE '%: End %.', clock_timestamp(), area_row;
        RAISE NOTICE '%: Duration is %.', area_row, clock_timestamp()-start;
    END LOOP;
    RAISE NOTICE '%: All done.', clock_timestamp();
END
$$ LANGUAGE plpgsql;

SELECT * FROM jegfatterintet();

output becomes:

CREATE FUNCTION
NOTICE:  (1): Duration is 00:00:00.028254.
NOTICE:  (2): Duration is 00:00:00.001768.
NOTICE:  (3): Duration is 00:00:00.001512.
NOTICE:  (4): Duration is 00:00:00.001426.
NOTICE:  (5): Duration is 00:00:00.001523.
NOTICE:  (6): Duration is 00:00:00.001389.
NOTICE:  (7): Duration is 00:00:00.001363.
NOTICE:  (8): Duration is 00:00:00.001364.
NOTICE:  (9): Duration is 00:00:00.001466.
NOTICE:  (10): Duration is 00:00:00.001454.
NOTICE:  2021-09-07 11:25:46.635762+00: All done.
 c 
---
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
(10 rows)

Can anyone please explain this to me?

EDIT:

explain (ANALYZE, BUFFERS) select id from jammerdal where id = any('{id-0,id-1,id-2,id-3,id-4,id-5,id-6,id-7,id-8,id-9,id-10,id-11,id-12,id-13,id-14,id-15,id-16,id-17,id-18,id-19,id-20,id-21,id-22,id-23,id-24,id-25}');
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Foreign Scan on jammerdal  (cost=100.00..62578.95 rows=26 width=37) (actual time=0.718..0.719 rows=0 loops=1)
 Planning Time: 0.153 ms
 Execution Time: 1.101 ms
(3 rows)

Also an EXPLAIN run in the "foreign" database:

explain (ANALYZE, BUFFERS) select id from jammerdal where id = any('{id-0,id-1,id-2,id-3,id-4,id-5,id-6,id-7,id-8,id-9,id-10,id-11,id-12,id-13,id-14,id-15,id-16,id-17,id-18,id-19,id-20,id-21,id-22,id-23,id-24,id-25}');
                                                                                        QUERY PLAN                                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using jammerdal_pkey on jammerdal  (cost=0.42..219.41 rows=26 width=37) (actual time=0.286..0.290 rows=0 loops=1)
   Index Cond: (id = ANY ('{id-0,id-1,id-2,id-3,id-4,id-5,id-6,id-7,id-8,id-9,id-10,id-11,id-12,id-13,id-14,id-15,id-16,id-17,id-18,id-19,id-20,id-21,id-22,id-23,id-24,id-25}'::text[]))
   Heap Fetches: 0
   Buffers: shared hit=81
 Planning:
   Buffers: shared hit=205
 Planning Time: 2.111 ms
 Execution Time: 0.423 ms
(8 rows)

Btw

ANALYZE jammerdal;

has no effect.

EDIT 2: Problem clearly is that the fdw table does not use the index on id...

ALTER SERVER testdb  OPTIONS (ADD use_remote_estimate 'true');

did the trick!


Solution

  • Since the query is in a function, PostgreSQL caches execution plans. That works according to a special heuristics:

    • for the first five executions, PostgreSQL generates a “custom plan” that uses the actual parameter values (jammerdal_ids)

    • at the sixth execution, PostgreSQL checks if a “generic plan” (that ignores the parameter values) is estimated to perform as well

    • if yes, then the generic plan is used from the sixth execution on to save planning time

    In your case that generic plan is obviously bad.

    Since you don't show EXPLAIN (ANALYZE, BUFFERS) output, we can only guess why. But a good guess is that you forgot to ANALYZE the foreign table and have bad statistics. So do that with

    ANALYZE jammerdal;
    

    and you should notice an improvement.

    (Note that PostgreSQL does not automatically collect statistics for foreign tables.)