I have a Hive insert overwrite query:

set mapred.map.tasks=1;
set mapred.reduce.tasks=1;
insert overwrite table staging.table1 partition(dt) select * from testing.table1;
When I inspect the HDFS directory for staging.table1, I see that two part files were created:
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000000_0
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000001_0
Why is it that 2 files are created?
I am using the Beeline client and Hive 2.1.1-cdh6.3.1.
The insert query you executed is map-only, which means there is no reduce task, so there is no point in setting mapred.reduce.tasks.

Also, the number of mappers is determined by the number of input splits, so setting mapred.map.tasks won't change the parallelism of the mappers.
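If you want to verify this, running explain on the statement is a quick optional check; for a plain select * insert you should see only map-side operators followed by move/stats tasks, with no reduce stage:

explain insert overwrite table staging.table1 partition(dt) select * from testing.table1;
-- expect a Map Operator Tree (TableScan -> Select -> File Output Operator) plus move/stats tasks,
-- and no Reduce Operator Tree, which confirms the job is map-only.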
There are at least two feasible ways to force the resulting number of files to be 1 (see the sketch after this list):

1. Merge the output files with a post job:
   - Keep hive.merge.mapfiles set to true (the default is already true).
   - Raise hive.merge.smallfiles.avgsize so the merge is actually triggered; the merge job runs only when the average output file size is below this threshold.
   - Set hive.merge.size.per.task large enough to serve as the target size after merging.

2. Combine the input splits so only one mapper runs:
   - Make sure hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is also the default.
   - Raise mapreduce.input.fileinputformat.split.maxsize to allow a larger split size.
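As a minimal sketch, the two options look like this. The 256 MB values (268435456 bytes) are illustrative assumptions, not recommendations; tune them to your data and file sizes.

Option 1, merge the map outputs after the insert finishes:

set hive.merge.mapfiles=true;                  -- already the default
set hive.merge.smallfiles.avgsize=268435456;   -- ~256 MB: run the merge when the average output file is smaller than this
set hive.merge.size.per.task=268435456;        -- ~256 MB: target size of each merged file
insert overwrite table staging.table1 partition(dt) select * from testing.table1;

Option 2, combine the input splits so a single mapper (and hence a single output file) is used:

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;   -- already the default
set mapreduce.input.fileinputformat.split.maxsize=268435456;                 -- ~256 MB: large enough to pack the small source files into one split
insert overwrite table staging.table1 partition(dt) select * from testing.table1;

Option 1 adds an extra merge task after the insert, while option 2 avoids the extra task but also reduces the parallelism of the insert itself.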