We have a job that reads from a Hive table with around 3 billion rows and inserts into a sorted, bucketed table.
Files in both the source and destination tables are in Parquet format.
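The destination table is defined roughly like the sketch below (a simplified illustration based on the columns in the query further down; the column types, the bucketing column and the bucket count here are placeholders, not our real DDL):

-- simplified sketch only: types, bucket column and bucket count are placeholders
CREATE TABLE dbname.features_archive_new (
  mpn        string,
  mfr        string,
  partnum    string,
  source     string,
  ingestdate string,
  value      string
)
PARTITIONED BY (feature string, ingestmonth string)
CLUSTERED BY (mpn) SORTED BY (mpn) INTO 64 BUCKETS
STORED AS PARQUET;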
This job is taking too long to finish; we had to stop it after 3 days.
We recently migrated to a new cluster: the old cluster was CDH 5.12 and the new one is CDH 6.3.1. This job used to run fine and finish within 6 hours on the 5.12 cluster, but it is taking far too long on the new one.
We have tried the following settings to solve this, without any results.
The query we are trying to run:
-- dynamic partition settings for the (feature, ingestmonth) partitions
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.created.files=900000;
-- smaller input splits (~64 MB) and a cap on concurrently running reducers
set mapreduce.input.fileinputformat.split.maxsize=64000000;
set mapreduce.job.running.reduce.limit=100;
-- make inserts honor the table's bucketing and sort order
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
INSERT OVERWRITE TABLE dbname.features_archive_new PARTITION (feature, ingestmonth)
SELECT mpn, mfr, partnum, source, ingestdate, max(value) AS value, feature, ingestmonth
FROM dbname.features_archive_tmp
WHERE feature = 'price'
  AND ingestmonth LIKE '20%'
GROUP BY mpn, mfr, partnum, source, ingestdate, feature, ingestmonth;
We found out that Hive 2.x in Cloudera 6.3 uses vectorized query execution, while Hive 1.x in the old Cloudera 5.12 cluster did not.
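You can confirm which mode a session is running in by printing the property; set with no value echoes the current setting:

set hive.vectorized.execution.enabled;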
So setting the property below fixed the issue for us. I have no explanation for this; vectorization should speed up the query, not slow it down.
set hive.vectorized.execution.enabled=false;
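We set this only in the script for this particular job (just before the INSERT) rather than cluster-wide, since vectorization presumably still helps other queries. There is also a reduce-side-only switch that might be a narrower workaround, though I have not tested whether disabling only that is enough for this job:

-- untested alternative: keep map-side vectorization and disable it only on the reduce side
set hive.vectorized.execution.reduce.enabled=false;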