Search code examples
postgresqlgreenplum

PostgreSQL/GreenPlum partition elimination and left join


Is it possible to use partition elimination with a left outer join to the partitioned table?

My understanding is that partition elimination only works if the partition key is in the where clause, so where right_table.date_key = '2016-02-01' would do the partition elimination, but that is incompatible with a left join as it would eliminate any rows that are not present in the right_table.

If I put where (right_table.date_key = '2016-02-02' or right_table.date_key is null) then it doesn't do any partition elimination.

I've been asked to post the full query, so here's a cut down version (the real thing is huge with dozens of columns, a couple more tables, some big case statements, and confidential client business logic):

select voyage.std -- timestamp
     , person.name
     , fact1.score score_1
     , fact2.score score_2
from fact1
     join voyage on voyage.voyage_sk = fact1.voyage_sk
     join person on person.person_sk = fact1.person_sk
left join fact2  on fact2.person_sk  = person.person_sk
where voyage.std = '2016-02-02 14:33:00'

So fact1 always exists, but fact2 is optional. None of the tables are partitioned.

Now for partitioning I am adding a new column, voyage_sdd, which is the date portion of voyage.std. I partition the fact tables and the voyage table on the new date column. Query then becomes this:

select voyage.std -- timestamp
     , person.name
     , fact1.score score_1
     , fact2.score score_2
from fact1
     join voyage on voyage.voyage_sk = fact1.voyage_sk
     join person on person.person_sk = fact1.person_sk
left join fact2  on fact2.person_sk  = person.person_sk
where voyage.std = '2016-02-02 14:33:00'
and voyage.voyage_sdd = '2016-02-02'
and fact1.voyage_sdd = '2016-02-02'
and fact2.voyage_sdd = '2016-02-02'

This last line makes fact2 an inner join. If I leave the last line out, then the query still works and returns the correct data, but it is less efficient than the non-partitioned query because it has to scan all the partitions. If I leave fact2 unpartitioned then I am getting a slight performance improvement in our test environment which only has a small data set, I am hoping that this will improve when we get some more disk space and a representative amount of data in test.

So to reiterate my question, how can I partition fact2 and still have a left join to it?

Update This works:

select voyage.std -- timestamp
     , person.name
     , fact1.score score_1
     , fact2.score score_2
from voyage 
     join person on person.person_sk = fact1.person_sk
     join fact1  on fact1.voyage_sk  = voyage.voyage_sk and fact1.voyage_sdd = voyage.voyage_sdd
left join fact2  on fact2.person_sk  = person.person_sk and fact2.voyage_sdd = voyage.voyage_sdd
where voyage.std = '2016-02-02 14:33:00'
and voyage.voyage_sdd = '2016-02-02'

The optimizer knows that the fact2 (and fact1) tables are partitoned on the join key, and that since the voyage table has a constraint on the join key, the fact table partitions can be eliminated.


Solution

  • First, where (right_table.date_key = '2016-02-02' or right_table.date_key is null) the or condition on the NULL might be the issue blocked partition elimination.

    Second, for the question of "how to partition f2". Most time, I always partition on 'date', since most of DW queries will have a predicate to narrow down the 'date'. Like you did at the last line fact2.voyage_sdd = '2016-02-02'.

    Furthermore, I would include ALL the partition columns in the 'join' column if that's following your business logic. In that case, if optimizer support dynamic partition elimination through joins, like GPORCA (http://pivotal.io/big-data/white-paper/optimizing-queries-over-partitioned-tables-in-mpp-systems), then you can benefit from it.

    Hope that answered your question.