I have a file of size 39 MB, and I set the block size to 36 MB. When the file is uploaded to HDFS, it is successfully stored in two blocks. But when I run a MapReduce job (a simple read job) on this file, the job counters show: "INFO mapreduce.JobSubmitter: number of splits:1"
That is, the two blocks are being treated as a single split. So I looked around and found the formula for calculating the split size, which is as follows:
split size = max(minsize,min(maxsize,blocksize))
where minsize = mapreduce.input.fileinputformat.split.minsize and maxsize = mapreduce.input.fileinputformat.split.maxsize.
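The formula can be checked with a minimal sketch in plain Java (the class and method names here are illustrative, not Hadoop's own):

```java
public class SplitSizeFormula {
    // split size = max(minsize, min(maxsize, blocksize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 36 * mb;  // block size from the question
        long minSize = 1L;         // split.minsize = 1 byte
        long maxSize = 128 * mb;   // split.maxsize = 128 MB
        // min(128 MB, 36 MB) = 36 MB, then max(1, 36 MB) = 36 MB
        System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 37748736
    }
}
```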
Now in my MR code I set the following properties:
Configuration conf = new Configuration();
conf.set("mapreduce.input.fileinputformat.split.minsize", "1");
conf.set("mapreduce.input.fileinputformat.split.maxsize", "134217728");
That is, minsize = 1 byte and maxsize = 128 MB, so according to the formula the split size should be 36 MB and hence there should be two splits. But I still get the same counter output:
"INFO mapreduce.JobSubmitter: number of splits:1"
Can anyone explain why?
The last split of a file is allowed to overflow by up to 10%. This factor is called SPLIT_SLOP, and it is set to 1.1.
In this scenario:
39 MB (remaining bytes) / 36 MB (input split size) = 1.08, which is less than 1.1 (SPLIT_SLOP)
Since the ratio does not exceed the slop factor, the entire file is kept as one split.
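This arithmetic can be verified with a small standalone sketch that mirrors the loop condition (a simplified model, not Hadoop's actual code):

```java
public class SplitSlopDemo {
    private static final double SPLIT_SLOP = 1.1; // 10% slack for the last split

    // Count how many splits a file of fileSize bytes produces at a given split size.
    static int countSplits(long fileSize, long splitSize) {
        int splits = 0;
        long bytesRemaining = fileSize;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++; // the remainder (at most 1.1 x splitSize) becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // 39 MB / 36 MB = 1.083 <= 1.1, so the whole file stays in one split
        System.out.println(countSplits(39 * mb, 36 * mb)); // 1
    }
}
```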
Snippet showing how splits are divided:
long bytesRemaining = length;
while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
  String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
      length - bytesRemaining, splitSize, clusterMap);
  splits.add(makeSplit(path, length - bytesRemaining, splitSize,
      splitHosts[0], splitHosts[1]));
  bytesRemaining -= splitSize;
}
// after the loop, whatever remains (at most 1.1 * splitSize) is added as the final split
Refer to the getSplits() method of FileInputFormat to see how splits are computed for each file.
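If you actually want two splits for this 39 MB file, the split size must satisfy 39 MB / splitSize > 1.1, i.e. splitSize below roughly 35.4 MB. One way is to cap maxsize below that threshold; a sketch (32 MB is an illustrative choice, any value under the threshold works):

```java
// Combining the formula and the SPLIT_SLOP loop to show the effect of lowering maxsize.
public class ForceTwoSplits {
    private static final double SPLIT_SLOP = 1.1;

    static int countSplits(long fileSize, long minSize, long maxSize, long blockSize) {
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        int splits = 0;
        long remaining = fileSize;
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) splits++; // remainder becomes the last split
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // maxsize = 128 MB (as in the question): split size 36 MB, one split
        System.out.println(countSplits(39 * mb, 1, 128 * mb, 36 * mb)); // 1
        // maxsize = 32 MB: split size 32 MB, 39/32 = 1.21 > 1.1, two splits
        System.out.println(countSplits(39 * mb, 1, 32 * mb, 36 * mb)); // 2
    }
}
```

In the real job this corresponds to conf.set("mapreduce.input.fileinputformat.split.maxsize", "33554432").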