Search code examples
hadoophivecloudera

Hive job taking too long to read data and insert in partitioned sorted bucketed table


We have a job that reads from a hive table with around 3billion rows and inserts in a sorted bucketed table.

Files in both source and destination tables are having parquet format.

This job is taking too long to finish. We have had to stop the job after 3 days.

We recently migrated to a new cluster. The older cluster was 5.12 and the latest cluster is 6.3.1. This job used to run fine and finish within 6 hours in the 5.12 cluster. However, it's taking too long in the new cluster.

We have tried the following things to solve this without any results:-

  1. Removed the cap on reducers. Removed set hive.exec.reducers.max=200;
  2. set mapreduce.job.running.reduce.limit=100;
  3. Merged files at the source to make sure we are not reading small files. File size in the source table was increased to 1G each.
  4. Reduce the no. of rows in the source table to reduce the data mappers are reading.
  5. Reduce the max split size to 64MB to increase the no. of mappers.
  6. Insert in a new table.
  7. Insert in a new table that is not sorted or bucketed.

The query we are trying to run :-

set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.created.files=900000;

set mapreduce.input.fileinputformat.split.maxsize=64000000;
set mapreduce.job.running.reduce.limit=100;

set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

INSERT OVERWRITE TABLE dbname.features_archive_new PARTITION (feature, ingestmonth)
Select mpn,mfr,partnum,source,ingestdate,max(value) as value,feature,ingestmonth
from dbname.features_archive_tmp
where feature = 'price'
and ingestmonth like '20%'
group by mpn,mfr,partnum,source,ingestdate,feature,ingestmonth;

Solution

  • We found out that hive version 2.x in Cloudera 6.3 is using vectorization while hive 1.x in old Cloudera 5.12 is not using it.

    So setting the below property fixed the issue for us. I have no explanations for this. Vectorization should speed up the query and not make it slow.

    hive.vectorized.execution.enabled=false;