Search code examples
sqlpostgresqlindexingouter-joinpostgresql-performance

FULL JOIN with =any doesn't use indexes


Using Postgres 9.3.5, I can't seem to get a full outer join with an =any where-clause to use the relevant indexes.

A minimal example:

create table t1(i int primary key, j int);
create table t2(i int primary key, j int);

insert into t1 (select x,x from generate_series(1,1000000) x);
insert into t2 (select x,x from generate_series(1,1000000) x);

vacuum analyze;

explain analyze
    select * 
        from t1 full join t2 using(i) 
        where i =any (array[1,2]);

(In my real query, the array is a parameter and has variable length)

I get the following seq-scanning query plan:

 Hash Full Join  (cost=26925.00..66350.00 rows=10000 width=16) (actual time=178.308..1251.221 rows=2 loops=1)
   Hash Cond: (t1.i = t2.i)
   Filter: (COALESCE(t1.i, t2.i) = ANY ('{1,2}'::integer[]))
   Rows Removed by Filter: 999998
   ->  Seq Scan on t1  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.011..59.463 rows=1000000 loops=1)
   ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8) (actual time=178.212..178.212 rows=1000000 loops=1)
         Buckets: 131072  Batches: 1  Memory Usage: 39063kB
         ->  Seq Scan on t2  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.012..57.751 rows=1000000 loops=1)
 Total runtime: 1255.734 ms

Unsuccessful things I tried:

  • Use i in (1,2) or i=1 or i=2 instead of =any
  • set enable_seqscan to f

Simulating the full join with a left join and an anti-join works:

explain analyze 
    select * from
        (select i, t1.j, t2.j from t1 left join t2 using(i) 
         union all
         select i, null, j from t2 
             where not exists (select 1 from t1 where t1.i = t2.i)) sub
    where i =any (array[1,2]);


 Append  (cost=0.85..51.61 rows=3 width=12) (actual time=0.007..0.018 rows=2 loops=1)
   ->  Nested Loop Left Join  (cost=0.85..29.79 rows=2 width=12) (actual time=0.007..0.010 rows=2 loops=1)
         ->  Index Scan using t1_pkey on t1  (cost=0.42..12.88 rows=2 width=8) (actual time=0.003..0.005 rows=2 loops=1)
               Index Cond: (i = ANY ('{1,2}'::integer[]))
         ->  Index Scan using t2_pkey on t2  (cost=0.42..8.44 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=2)
               Index Cond: (t1.i = i)
   ->  Nested Loop Anti Join  (cost=0.85..21.79 rows=1 width=8) (actual time=0.008..0.008 rows=0 loops=1)
         ->  Index Scan using t2_pkey on t2 t2_1  (cost=0.42..12.88 rows=2 width=8) (actual time=0.001..0.002 rows=2 loops=1)
               Index Cond: (i = ANY ('{1,2}'::integer[]))
         ->  Index Only Scan using t1_pkey on t1 t1_1  (cost=0.42..4.44 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=2)
               Index Cond: (i = t2_1.i)
               Heap Fetches: 0
 Total runtime: 0.065 ms

This approach would strongly complicate and add duplication to my real query, though. Is there any better way to get Postgres to use the indexes?


Solution

  • Pushing down the predicate into subqueries does the trick:

    EXPLAIN ANALYZE
    SELECT * 
    FROM      (SELECT * FROM t1 WHERE i = ANY ('{1,2}')) t1
    FULL JOIN (SELECT * FROM t2 WHERE i = ANY ('{1,2}')) t2 USING (i);
    
    QUERY PLAN
    Merge Full Join (cost=0.58..25.26 rows=2 width=16) (actual time=0.084..0.100 rows=2 loops=1)
      Merge Cond: (t1.i = t2.i)
        -> Index Scan using t1_pkey on t1 (cost=0.29..12.62 rows=2 width=8) (actual time=0.044..0.048 rows=2 loops=1)
             Index Cond: (i = ANY ('{1,2}'::integer[]))
        -> Index Scan using t2_pkey on t2 (cost=0.29..12.62 rows=2 width=8) (actual time=0.028..0.033 rows=2 loops=1)
             Index Cond: (i = ANY ('{1,2}'::integer[]))
    Total runtime: 0.256 ms
    

    SQL Fiddle (with 100k rows).

    Obviously, the query planner is not smart enough to conclude that indexes on the underlying tables can be used from a predicate on the column after the full join. This could be improved.

    Can't test pg 9.4 right now. Maybe it has been improved.

    BTW, most clients can't deal with multiple columns in the result using the same name (even though Postgres can do this). Your two instances of j would be a problem and you'd have to use at least one column alias, forcing you to list columns explicitly.