Postgres query of a partitioned table pointing to foreign data workers many times slower than querying fdw directly

Thanks in advance to anyone who might be able to help with this.

I have a Postgres 15.2 database which uses a partition table to access foreign data worker tables by a Zone id. Each zone further breaks down data by period in each zone database.

The DB I'm developing is approaching 1 billion records but will grow much larger. If I query the fdw table directly from the lead database, it's very fast at around 12ms. If I query the lead databases partition table that points to the same exact location it always takes about a minute. If I specify the zone in the where clause, which is the check constraint for the partitioning, it's still slow.

I've spent a few days here and elsewhere searching for a solution but have been striking out so far. I've also tried simplifying the schema of the tables in question to see if it was related to foreign keys or constraints but that also doesn't seem to be the case.

It seems to happen the same on newly created databases or the long running test system I've been using.

Analyzing the partition table didn't seem to have any effect.

Here's a linux bash script I used for a minimal test setup that reproduces the behavior but at a much smaller scale and on one machine. It creates a leader DB and any number of zone DBs and wires up all the foreign data workers. The lead DB is at 127.0.0.100 and each zone DB is at 127.0.0.100 + zone number. I have to do this from a root level directory that the postgres user has access to. Note: it starts all databases and I wouldn't recommend running this unless in a Virtual Machine with solid state drives and in an empty folder. I show the results below.

cat BUILD_LEAD_AND_ZONES.sh 
#!/usr/bin/env bash
#ARG 1 DBNAME
#ARG 2 a directory that postgres user has permissions to so DBs can be started there
#ARG 3 NUMBER OF ZONES TO CREATE  ALL DBs will be created in the ARG 2 directory with a 127.0.0.X addres where Leader is at 100 and each zone is at 100 + zone number. Can move them somewhere else afterwards

set -e #exit on any non-zero return-code
set -x #echo commands

set -euo pipefail
IFS=$'\n\t'


DBNAME=$1
ROOT_TRGT_DIR=$2
NUMBER_OF_ZONES=$3
BASE_IP4="127.0.0." #with a trailing dot x.x.x.
SCHEMA_NAME="wha"
DBFOLDER="DB_SCHEMA"
FDWPASSWORD="Wh00p33"

createDbAndBaseTables () {
    echo "Creating Database $1 $2"
    psql -h $1 -U postgres -c "CREATE DATABASE \"$2\";"
    psql -h $1 -U postgres -d $2 -c "CREATE SCHEMA $SCHEMA_NAME;"
    if [[ $3 = "LEAD" ]]
    then
        echo "LEAD Testing.sql"
        PGOPTIONS="--search_path=$SCHEMA_NAME" psql -h $1 -U postgres -d $2 <<SCRIPT
CREATE TABLE "Testing" (
    "TestingId" BIGSERIAL
    ,"Period" INT
    ,"Zone" INT
    ,"UTCStart" timestamp NOT NULL
    ,"UTCEnd" timestamp NOT NULL
    ,"Stuff" TEXT
) PARTITION BY LIST ("Zone");
SCRIPT
    else
        echo "ZONE Testing.sql"
        PGOPTIONS="--search_path=$SCHEMA_NAME" psql -h $1 -U postgres -d $2 <<SCRIPT
CREATE TABLE "Testing" (
    "TestingId" BIGSERIAL
    ,"Period" INT
    ,"Zone" INT
    ,"UTCStart" timestamp NOT NULL
    ,"UTCEnd" timestamp NOT NULL
    ,"Stuff" text
    ,PRIMARY KEY ("TestingId","Period","Zone")
);
SCRIPT
    fi
}


#Create each Lead and Zone and customize their pg_hba.conf and postgresql.conf to run and allow access from the correct addresses
for ((i=0; i <= $NUMBER_OF_ZONES; i++ )) 
do
    ((ip=$i+100))
    if [[ $i = 0 ]]; then
        echo "Creating LEADER DB $BASE_IP4$ip"
        DIR="$ROOT_TRGT_DIR/PSQL_LEAD"
    else
        echo "Creating ZONE DB $BASE_IP4$ip"
        DIR="$ROOT_TRGT_DIR/PSQL_$i"
    fi
    mkdir -p "$DIR"
    sudo chown postgres:postgres "$DIR"
    sudo -u postgres initdb -D "$DIR"
    sudo -u postgres mkdir -p "$DIR/sockets"
    if [[ $i = 0 ]]; then
        #BUILD UP pg_hba.conf access list
        ACCESS_LIST="host    all             fdwuser         $BASE_IP4$ip\/32        password"
        ACCESS_LIST="$ACCESS_LIST\nhost    all             postgres        $BASE_IP4$ip\/32        trust"
        ACCESS_LIST="$ACCESS_LIST\nhost    all             postgres        ${BASE_IP4}1\/32        trust"
        ACCESS_LIST="$ACCESS_LIST\nhost    all             postgres        127.0.0.1\/32        trust"
        ACCESS_LIST="$ACCESS_LIST\nhost    all             fdwuser         127.0.0.1\/32        password"
        ACCESS_LIST="$ACCESS_LIST\nhost    all             fdwuser        ${BASE_IP4}1\/32      password"
        for ((k=1; k<= $NUMBER_OF_ZONES; k++ ))
        do
            ((zip=$k+100))
            ACCESS_LIST="$ACCESS_LIST\nhost    all             fdwuser         $BASE_IP4$zip\/32        password"
        done
        sed -i "s/host    all             all             127.0.0.1\/32            trust/$ACCESS_LIST\n/" "$DIR/pg_hba.conf"
        echo "sudo -u postgres pg_ctl -D \"$DIR\" -l \"$DIR/psql.log\" start" > "start_PSQL_LEAD"
        chmod +x "start_PSQL_LEAD"
    else
        ACCESS_LIST="host    all            fdwuser          $BASE_IP4$ip\/32        password"
        ACCESS_LIST="$ACCESS_LIST\nhost    all            postgres         $BASE_IP4$ip\/32        trust"
        ACCESS_LIST="$ACCESS_LIST\nhost    all            postgres         127.0.0.1\/32        trust"
        ACCESS_LIST="$ACCESS_LIST\nhost    all            fdwuser          ${BASE_IP4}1\/32        password"
        ACCESS_LIST="$ACCESS_LIST\nhost    all            fdwuser          127.0.0.1\/32        password"
        ACCESS_LIST="$ACCESS_LIST\nhost    all            postgres         ${BASE_IP4}1\/32        trust"
        sed -i "s/host    all             all             127.0.0.1\/32            trust/$ACCESS_LIST\n/" "$DIR/pg_hba.conf"
        echo "sudo -u postgres pg_ctl -D \"$DIR\" -l \"$DIR/psql.log\" start" > "start_PSQL_$i"
        chmod +x "start_PSQL_$i"
    fi
    sed -i "s/#listen_addresses = 'localhost'/listen_addresses = '$BASE_IP4$ip'/" "$DIR/postgresql.conf"
    sed -i "s/#wal_buffers = -1/wal_buffers = 16MB/" "$DIR/postgresql.conf"
    #wacky ${DIR....} is escaping the front slashes so sed doesn't blow up
    sed -i "s/#unix_socket_directories = '\/run\/postgresql'/unix_socket_directories = '${DIR//\//\\/}\/sockets'/" "$DIR/postgresql.conf"
done

#start servers and build db instances
for ((i=0; i <= $NUMBER_OF_ZONES; i++))
do
    ((ip=$i+100))
    if [[ $i = 0 ]]
    then
        ./start_PSQL_LEAD
        createDbAndBaseTables "$BASE_IP4$ip" "$DBNAME" "LEAD"
    else
        ./start_PSQL_$i
        createDbAndBaseTables "$BASE_IP4$ip" "$DBNAME" "ZONE"
    fi
done

#set up user for lead to access each zone with
for ((i=0; i <= $NUMBER_OF_ZONES; i++))
do
    ((ip=$i+100))
psql -h ${BASE_IP4}$ip -U postgres -d $DBNAME <<SCRIPT
 CREATE EXTENSION postgres_fdw;
 CREATE USER fdwuser WITH ENCRYPTED PASSWORD '$FDWPASSWORD';
 GRANT ALL PRIVILEGES ON SCHEMA $SCHEMA_NAME to fdwuser;
 GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA $SCHEMA_NAME to fdwuser;
 ALTER TABLE "Testing" ALTER COLUMN "Zone" SET DEFAULT $k;
SCRIPT
done

for ((i=0; i <= $NUMBER_OF_ZONES; i++))
do
    ((ip=$i+100))
    if [[ $i = 0 ]] #LEAD LEAD LEAD LEAD LEAD
    then
#       PGOPTIONS="--search_path=$SCHEMA_NAME" psql -U postgres -d $DBNAME -h $BASE_IP4$ip -f "$DBFOLDER/TBL/" ... ...Create lead schema here
#       PGOPTIONS="--search_path=$SCHEMA_NAME" psql -U postgres -d $DBNAME -h $BASE_IP4$ip -f "$DBFOLDER/PROC/" ... ...Create lead schema here
#       PGOPTIONS="--search_path=$SCHEMA_NAME" psql -U postgres -d $DBNAME -h $BASE_IP4$ip -f "$DBFOLDER/VIEW/" ... ...Create lead schema here
        
        for ((k=1; k<= $NUMBER_OF_ZONES; k++ ))
        do
            ((zip=$k+100))
            psql -h ${BASE_IP4}100 -U postgres -d $DBNAME <<SCRIPT
 CREATE SERVER "Zone$k" FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname '$DBNAME', host '$BASE_IP4$zip', port '5432',use_remote_estimate 'true', fetch_size '50000');
 GRANT USAGE ON FOREIGN SERVER "Zone$k" TO fdwuser;
 CREATE USER MAPPING for fdwuser SERVER "Zone$k" OPTIONS (user 'fdwuser', password '$FDWPASSWORD');
 CREATE USER MAPPING for postgres SERVER "Zone$k" OPTIONS (user 'fdwuser', password '$FDWPASSWORD');
 CREATE SCHEMA "Zone$k";
 GRANT ALL PRIVILEGES ON SCHEMA "Zone$k" to fdwuser;
 GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA "Zone$k" to fdwuser;
 GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA $SCHEMA_NAME TO fdwuser;
SCRIPT
            PGOPTIONS="--search_path=$SCHEMA_NAME" PGPASSWORD=$FDWPASSWORD psql -h "${BASE_IP4}100" -U postgres -d $DBNAME -c "IMPORT FOREIGN SCHEMA $SCHEMA_NAME FROM SERVER \"Zone$k\" INTO \"Zone$k\";"
            PGOPTIONS="--search_path=$SCHEMA_NAME" PGPASSWORD=$FDWPASSWORD psql -h "${BASE_IP4}100" -U postgres -d $DBNAME -c "ALTER TABLE \"Testing\" ATTACH PARTITION \"Zone$k\".\"Testing\" FOR VALUES IN ($k);"

        done

    else #ZONE ZONE ZONE ZONE ZONE

    #create a mockup of how zones are setup
    for ((k=1; k<= 12; k++ ))
    do
        printf -v MM "%02d" $k
        PGOPTIONS="--search_path=$SCHEMA_NAME" psql -h $BASE_IP4$ip -U postgres -d $DBNAME <<SCRIPT
CREATE TABLE "Testing2022$k" (CHECK ("Period"='2022$k')) INHERITS("Testing");
ALTER TABLE "Testing2022$k" ALTER COLUMN "Zone" SET DEFAULT $i;

INSERT INTO "Testing2022$k"("Period","UTCStart","UTCEnd","Stuff")
SELECT '2022$k','2022-${MM}-01 00:00:00'::timestamp + (seq || ' Minute')::INTERVAL, '2022-${MM}-01 00:00:00'::timestamp + (seq || ' Minute')::INTERVAL + ('50 Second')::INTERVAL, seq::text from generate_series(0,100000) seq;

CREATE INDEX idx_t2022${k}_utce_utcs ON "Testing2022${k}"("UTCEnd","UTCStart");
CREATE INDEX idx_t2022${k}_utcs_utce ON "Testing2022${k}"("UTCStart","UTCEnd");
CREATE INDEX idx_t2022${k}_a ON "Testing2022${k}"("Stuff");
SCRIPT


    done
#

    fi
done

And here's the resulting size from the lead database with 8 zones of 12 periods each PGOPTIONS="--search-path=wha" psql -U postgres -d TestDB -h 127.0.0.100

select count(*) from "Testing";
  count  
---------
 9600096
(1 row)

Querying directly is pretty fast at ~14ms

select * from "Zone6"."Testing" where "Zone" = 6 AND "UTCStart" > '2022-03-01' ORDER BY "UTCStart" DESC limit 100;
...
   1199913 | 202212 |    6 | 2023-02-08 09:01:00 | 2023-02-08 09:01:50 | 99901
(100 rows)

Time: 13.564 ms

Querying the partition is much slower at ~87ms (but still fast because the dataset is small)

select * from wha."Testing" where "Zone" = 6 AND "UTCStart" > '2022-03-01' ORDER BY "UTCStart" DESC limit 100;

...
   1199913 | 202212 |    6 | 2023-02-08 09:01:00 | 2023-02-08 09:01:50 | 99901
(100 rows)

Time: 87.156 ms

Here are the explains:

From lead partition table

explain (ANALYZE, TIMING, COSTS, BUFFERS, VERBOSE) select * from wha."Testing" where "Zone" = 6 AND "UTCStart" > '2022-03-01' ORDER BY "UTCStart" DESC limit 100;
                                                                                                   QUERY PLAN                                                                                                    
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=105.26..115.26 rows=100 width=64) (actual time=75.312..75.321 rows=100 loops=1)
   Output: "Testing"."TestingId", "Testing"."Period", "Testing"."Zone", "Testing"."UTCStart", "Testing"."UTCEnd", "Testing"."Stuff"
   ->  Foreign Scan on "Zone6"."Testing"  (cost=105.26..107582.12 rows=1074326 width=37) (actual time=75.309..75.314 rows=100 loops=1)
         Output: "Testing"."TestingId", "Testing"."Period", "Testing"."Zone", "Testing"."UTCStart", "Testing"."UTCEnd", "Testing"."Stuff"
         Remote SQL: SELECT "TestingId", "Period", "Zone", "UTCStart", "UTCEnd", "Stuff" FROM wha."Testing" WHERE (("UTCStart" > '2022-03-01 00:00:00')) AND (("Zone" = 6)) ORDER BY "UTCStart" DESC NULLS FIRST
 Planning Time: 5.638 ms
 Execution Time: 76.320 ms
(7 rows)

Time: 82.503 ms

and the zone directly

explain (ANALYZE, TIMING, COSTS, BUFFERS, VERBOSE) select * from "Zone6"."Testing" where "Zone" = 6 AND "UTCStart" > '2022-03-01' ORDER BY "UTCStart" DESC limit 100;
                                                                                                         QUERY PLAN                                                                                                          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Foreign Scan on "Zone6"."Testing"  (cost=105.26..115.26 rows=100 width=64) (actual time=3.935..3.952 rows=100 loops=1)
   Output: "TestingId", "Period", "Zone", "UTCStart", "UTCEnd", "Stuff"
   Remote SQL: SELECT "TestingId", "Period", "Zone", "UTCStart", "UTCEnd", "Stuff" FROM wha."Testing" WHERE (("UTCStart" > '2022-03-01 00:00:00')) AND (("Zone" = 6)) ORDER BY "UTCStart" DESC NULLS FIRST LIMIT 100::bigint
 Planning Time: 8.193 ms
 Execution Time: 4.277 ms
(5 rows)

Time: 13.373 ms

I'm not very good at parsing explain output but why are the foreign scans on "Zone6" so different on the above 2 explains?

I know that I can write a function to take zones into account and avoid the partition table but the logic will become complicated enough that I am trying to avoid it. Is there anything I'm missing as to why the partition selects are so much slower? Is there anything else I can try speed up querying the partition?

Thanks!

Solution

The difference between the two queries is that in the second query, LIMIT 100 is pushed down to the remote database. That makes the remote database choose an execution plan that delivers the first 100 results quickly. Otherwise, PostgreSQL optimizes for the fastest plan to return the complete result set. Run EXPLAIN for the remote query on the remote database and see if you can detect the difference.

It seems that PostgreSQL does not push down the LIMIT if you query the partitioned table. I didn't read up in the code to figure out why, but it should be possible in principle. Perhaps it is just not implemented.