
Creating input splits (HADOOP)


I have a file of size 39 MB, and I set the block size to 36 MB. When the file is uploaded to HDFS, it is successfully stored in two blocks. Now when I run a Map-Reduce job (a simple read job) on this file, the job counters show: "INFO mapreduce.JobSubmitter: number of splits:1"

That is, it is treating the two blocks as a single split. So I looked around and found the formula for calculating the split size, which is as follows:

split size = max(minsize,min(maxsize,blocksize))

where minsize = mapreduce.input.fileinputformat.split.minsize and maxsize = mapreduce.input.fileinputformat.split.maxsize.
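
For reference, this is essentially what FileInputFormat's computeSplitSize() does; a minimal sketch of the computation (paraphrased, not the exact Hadoop source):

long computeSplitSize(long blockSize, long minSize, long maxSize) {
    // split size = max(minsize, min(maxsize, blocksize))
    return Math.max(minSize, Math.min(maxSize, blockSize));
}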

Now in my MR code I set the following properties:

Configuration conf = new Configuration();
conf.set("mapreduce.input.fileinputformat.split.minsize", "1");         // 1 byte
conf.set("mapreduce.input.fileinputformat.split.maxsize", "134217728"); // 128 MB

That is, minsize = 1 byte and maxsize = 128 MB, so according to the formula the split size should be 36 MB and there should be two splits, but I still get the same counter output:

"INFO mapreduce.JobSubmitter: number of splits:1"

Can anyone explain why?


Solution

  • The last split of a file can overflow by up to 10%. This is called SPLIT_SLOP and it is set to 1.1.

    In this scenario,

    39 MB (remaining bytes) / 36 MB (input split size) ≈ 1.08, which is less than 1.1 (SPLIT_SLOP)
    

    Thus the entire file is treated as a single split.

    A snippet showing how splits are divided:

    long bytesRemaining = length; // total size of the file in bytes
    // Carve out full-size splits only while the remainder is more than
    // SPLIT_SLOP (1.1) times the split size.
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
          length - bytesRemaining, splitSize, clusterMap);
      splits.add(makeSplit(path, length - bytesRemaining, splitSize,
          splitHosts[0], splitHosts[1]));
      bytesRemaining -= splitSize;
    }
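
    After this loop, whatever remains becomes one final split. That is why the whole 39 MB file ends up as a single split here: the loop body never runs, since 1.08 is not greater than 1.1. Paraphrased from the same getSplits() method:

    if (bytesRemaining != 0) {
      String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
          length - bytesRemaining, bytesRemaining, clusterMap);
      splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
          splitHosts[0], splitHosts[1]));
    }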
    

    Refer to the getSplits() method of FileInputFormat to see in full how splits are computed for each file.
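
    To actually get two splits for this file, the computed split size must satisfy 39 MB / splitSize > 1.1, i.e. splitSize < 39 MB / 1.1 ≈ 35.45 MB. A minimal sketch that caps maxsize at 32 MB (the 32 MB value is an illustrative choice, not from the question):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // max(1, min(32 MB, 36 MB)) = 32 MB; 39 MB / 32 MB ≈ 1.22 > 1.1,
    // so a full 32 MB split is carved out and the remaining 7 MB
    // becomes the final split: two splits in total.
    conf.set("mapreduce.input.fileinputformat.split.maxsize",
             String.valueOf(32 * 1024 * 1024));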