Tags: hadoop, hive, mapreduce, hdfs

Why is Hive writing 2 part files to HDFS even though the number of mappers and reducers is set to 1?


I have a Hive insert overwrite query:

set mapred.map.tasks=1;
set mapred.reduce.tasks=1;
insert overwrite table staging.table1 partition(dt) select * from testing.table1;

When I inspect the HDFS directory for staging.table1, I see that there are 2 part files created.

2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000000_0
2019-12-25 02:25 /data/staging/table1/dt=2019-12-24/000001_0

Why is it that 2 files are created?

I am using the Beeline client and Hive 2.1.1-cdh6.3.1.


Solution

  • The insert query you executed is map-only, meaning there is no reduce stage, so setting mapred.reduce.tasks has no effect.

    Also, the number of mappers is determined by the number of input splits, so setting mapred.map.tasks won't change the parallelism of the mappers either.

    There are at least two feasible ways to force the resulting number of files down to 1:

    1. Enforce a post job for file merging.
      Set hive.merge.mapfiles to true (the default is already true).
      Increase hive.merge.smallfiles.avgsize so the average output file size falls below it, since that is what actually triggers the merge job.
      Increase hive.merge.size.per.task so the target size after merging is big enough to hold all the output in one file.
    2. Configure the split-combining behavior of mappers to cut down the number of mappers.
      Make sure that hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is also the default.
      Then increase mapreduce.input.fileinputformat.split.maxsize to allow a larger split size, so both input files fit into a single split handled by one mapper.
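    The two options above can be sketched as session-level settings before the insert. The numeric values are illustrative assumptions, not Hive defaults; tune them to your actual file sizes, and apply whichever option block fits your case:

    ```sql
    -- Option 1: run a post job that merges small output files.
    -- The merge job triggers when the average output file size is
    -- below hive.merge.smallfiles.avgsize.
    set hive.merge.mapfiles=true;               -- default is already true
    set hive.merge.smallfiles.avgsize=256000000;
    set hive.merge.size.per.task=256000000;     -- target size after merging

    -- Option 2: combine input splits so one mapper reads all the input.
    set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    set mapreduce.input.fileinputformat.split.maxsize=1073741824;  -- 1 GB

    insert overwrite table staging.table1 partition(dt)
    select * from testing.table1;
    ```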