Tags: hadoop, hive, mapreduce, query-optimization, window-functions

Hive window function ROW_NUMBER without a PARTITION BY clause on a large (50 GB) dataset is very slow. Is there a better way to optimize?


I have an HDFS file with 50 million records; the raw file size is 50 GB.

I am trying to load it into a Hive table and, while loading, generate a unique ID for every row using the expression below. I am using Hive 1.1.0-cdh5.16.1.

row_number() over(order by event_id, user_id, timestamp) as id
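
For context, a minimal sketch of the kind of load statement being described (the table names events_raw and events_with_id are assumptions, not from the original question):

    -- A global row_number() forces a total ordering, so every row
    -- must pass through one reducer to receive its number
    INSERT OVERWRITE TABLE events_with_id
    SELECT
        row_number() over (order by event_id, user_id, timestamp) as id,
        e.*
    FROM events_raw e;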

While executing, I see that 40 reducers are assigned in the reduce step. Thirty-nine of them average about 2 minutes each, whereas the last reducer takes about 25 minutes, which leads me to believe that most of the data is processed in a single reducer.

I suspected the ORDER BY clause to be the cause of this behavior and tried the following:

row_number() over() as id

Yet, I see the same behavior.

Thinking about the MapReduce paradigm, it seems that without a PARTITION BY clause the data has to be processed by a single reducer (undistributed) so that it can see all rows and assign the correct row numbers. The same should hold for any window function with no PARTITION BY clause, or with PARTITION BY on a skewed column.

Now my question is: how do we circumvent this problem and optimize window functions when we have to avoid a PARTITION BY clause?


Solution

  • You can use UUID:

    select java_method('java.util.UUID','randomUUID')
    

    A UUID generated in your system/workflow will also be unique in any other system, because UUIDs are globally unique. UUID generation works fully distributed and fast, since each task generates IDs independently (a usage sketch follows after this list).

    Also, Hive 3.x has a SURROGATE_KEY function which you can use in the table DDL (see the second sketch below).
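
As a usage sketch of the UUID approach, reusing the hypothetical table names from the sketch in the question: because UUID generation needs no global ordering, it runs independently in each task with no single-reducer bottleneck. Note that the generated ID is a 36-character string, so the id column should be STRING rather than BIGINT.

    -- Each row gets a globally unique ID with no ORDER BY
    -- and therefore no single-reducer bottleneck
    INSERT OVERWRITE TABLE events_with_id
    SELECT
        java_method('java.util.UUID', 'randomUUID') as id,
        e.*
    FROM events_raw e;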
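
And a minimal sketch of the Hive 3.x SURROGATE_KEY approach, assuming an ACID (transactional) ORC table, since DEFAULT constraints such as SURROGATE_KEY() require one; the table and column names are again assumptions:

    CREATE TABLE events_with_id (
        id BIGINT DEFAULT SURROGATE_KEY(),
        event_id STRING,
        user_id STRING,
        `timestamp` STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- Omit the id column; Hive fills it from the DEFAULT constraint
    INSERT INTO events_with_id (event_id, user_id, `timestamp`)
    SELECT event_id, user_id, `timestamp` FROM events_raw;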