Is it possible to use partition elimination with a left outer join to the partitioned table?
My understanding is that partition elimination only works if the partition key is in the where clause, so where right_table.date_key = '2016-02-01'
would do the partition elimination, but that is incompatible with a left join as it would eliminate any rows that are not present in the right_table.
If I put where (right_table.date_key = '2016-02-02' or right_table.date_key is null)
then it doesn't do any partition elimination.
I've been asked to post the full query, so here's a cut down version (the real thing is huge with dozens of columns, a couple more tables, some big case statements, and confidential client business logic):
select voyage.std -- timestamp
, person.name
, fact1.score score_1
, fact2.score score_2
from fact1
join voyage on voyage.voyage_sk = fact1.voyage_sk
join person on person.person_sk = fact1.person_sk
left join fact2 on fact2.person_sk = person.person_sk
where voyage.std = '2016-02-02 14:33:00'
So fact1
always exists, but fact2
is optional. None of the tables are partitioned.
Now for partitioning I am adding a new column, voyage_sdd
, which is the date portion of voyage.std
. I partition the fact tables and the voyage table on the new date column. Query then becomes this:
select voyage.std -- timestamp
, person.name
, fact1.score score_1
, fact2.score score_2
from fact1
join voyage on voyage.voyage_sk = fact1.voyage_sk
join person on person.person_sk = fact1.person_sk
left join fact2 on fact2.person_sk = person.person_sk
where voyage.std = '2016-02-02 14:33:00'
and voyage.voyage_sdd = '2016-02-02'
and fact1.voyage_sdd = '2016-02-02'
and fact2.voyage_sdd = '2016-02-02'
This last line makes fact2
an inner join. If I leave the last line out, then the query still works and returns the correct data, but it is less efficient than the non-partitioned query because it has to scan all the partitions. If I leave fact2
unpartitioned then I am getting a slight performance improvement in our test environment which only has a small data set, I am hoping that this will improve when we get some more disk space and a representative amount of data in test.
So to reiterate my question, how can I partition fact2 and still have a left join to it?
Update This works:
select voyage.std -- timestamp
, person.name
, fact1.score score_1
, fact2.score score_2
from voyage
join person on person.person_sk = fact1.person_sk
join fact1 on fact1.voyage_sk = voyage.voyage_sk and fact1.voyage_sdd = voyage.voyage_sdd
left join fact2 on fact2.person_sk = person.person_sk and fact2.voyage_sdd = voyage.voyage_sdd
where voyage.std = '2016-02-02 14:33:00'
and voyage.voyage_sdd = '2016-02-02'
The optimizer knows that the fact2 (and fact1) tables are partitoned on the join key, and that since the voyage table has a constraint on the join key, the fact table partitions can be eliminated.
First, where (right_table.date_key = '2016-02-02' or right_table.date_key is null)
the or
condition on the NULL might be the issue blocked partition elimination.
Second, for the question of "how to partition f2". Most time, I always partition on 'date', since most of DW queries will have a predicate to narrow down the 'date'. Like you did at the last line fact2.voyage_sdd = '2016-02-02'
.
Furthermore, I would include ALL the partition columns in the 'join' column if that's following your business logic. In that case, if optimizer support dynamic partition elimination through joins, like GPORCA (http://pivotal.io/big-data/white-paper/optimizing-queries-over-partitioned-tables-in-mpp-systems), then you can benefit from it.
Hope that answered your question.