Search code examples
optimizationhivehiveqlhadoop-partitioninghint

Hive SQL force shuffle


I have a simple query:

Select * 
from A
left join b on A.b = b.b
left join c on A.c = c.c
left join d on A.d = d.d
left join e on A.e = e.e
...
~20 tables

All tables b,c,d,e etc are small and therefore all joins are broadcast joins

The problem is that table A is big (like 300m records) and ill-partitioned. It has ~80 partitions and 15 of them contain 90% of data.

So, when broadcast join happens, only 15 cores do all join math and it takes 2 hours to finish.

My solution was to force shuffling of A by adding "order by"

Select * 
from 
    (select * from A order by A.a) A
left join b on A.b = b.b
left join c on A.c = c.c
left join d on A.d = d.d
left join e on A.e = e.e
...

Here A.a is unique field combination.

Now A is shuffled to 200(by default) semi-equal partitions and resources are used in optimal way, it takes 35 mins to finish.

Is there more elegant way to force shuffle of table A, like some hint magic?


Solution

  • You can try distribute by <some evenly distributed key> in the subquery instead of order by.