hadoop | mapreduce | input-split

Does the Hadoop job submitter take record boundaries into account while calculating splits?


This question is NOT a duplicate of: How does Hadoop process records split across block boundaries?

I have a question regarding the input split calculation. As per the Hadoop guide:

1) the InputSplits respect record boundaries

2) At the same time it says that splits are calculated by the job submitter, which I assume runs on the client side. [Anatomy of a MapReduce Job Run - Classic MRv1]

Does this mean that :

(a) Does the job submitter read the blocks to calculate input splits? If this is the case, won't it be very inefficient and defeat the very purpose of Hadoop?

Or

(b) Does the job submitter just calculate splits that are merely an estimate based on block sizes and locations, with it then becoming the responsibility of the InputFormat and RecordReader, running under the mapper, to fetch records that cross host boundaries?

Thanks


Solution

    (a) Does the job submitter read the blocks to calculate input splits? If this is the case, won't it be very inefficient and defeat the very purpose of Hadoop?

I don't think so. The job submitter only reads block information (offsets, lengths, and host locations) from the NameNode and then merely does the calculation, which does not require much computing power.
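To make this concrete, here is a minimal sketch (not the actual JobSubmitter or FileInputFormat source; the class name and command-line path argument are just for illustration) showing the kind of metadata-only call that split planning is based on - the NameNode answers it from its in-memory metadata, without any reads from DataNodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: list the block offsets, lengths and hosts that split planning works from.
    public class SplitPlanningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path(args[0]));

            // Metadata-only call: the NameNode returns each block's offset,
            // length and the hosts holding its replicas; no file data is read.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        b.getOffset(), b.getLength(),
                        String.join(",", b.getHosts()));
            }
        }
    }

Splits are then computed from these offsets, lengths and hosts, so the submitter never has to stream the file contents itself.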

    (b) Does the job submitter just calculate splits that are merely an estimate based on block sizes and locations, with it then becoming the responsibility of the InputFormat and RecordReader, running under the mapper, to fetch records that cross host boundaries?

I am not sure how accurate the submitter's calculation is, but the split size is calculated from the configured minimum and maximum split sizes as well as the block size, using this equation:

    max(minimumSplitSize, min(maximumSplitSize, blockSize))

All of these values can be set by the user. For example, the minimum split size can be 1 and the maximum can be the maximum long value (9223372036854775807).
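As a small sketch of that rule (the equation above mirrors FileInputFormat's split-size computation; the class name here is made up, and the mapreduce.input.fileinputformat.split.minsize / .maxsize property names are the Hadoop 2.x+ ones):

    // Sketch of the split-size rule: max(minSize, min(maxSize, blockSize)).
    public class SplitSizeSketch {
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024; // a typical 128 MB HDFS block
            long minSize = 1L;                   // default minimum split size
            long maxSize = Long.MAX_VALUE;       // default maximum (9223372036854775807)

            // With the defaults, the split size simply equals the block size.
            System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 134217728
        }
    }

With the default minimum (1) and maximum (Long.MAX_VALUE), the formula collapses to the block size, which is why splits usually line up with HDFS blocks.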

Correct - records in an InputFormat are a logical concept. This means that, as developers writing MapReduce code, we do not need to handle the case where a record is split across two different input splits. The RecordReader is in charge of fetching the missing part of such a record via a remote read. This may cause some overhead, but it is usually slight.
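For intuition, here is a heavily simplified sketch of that idea (it is not the real LineRecordReader; the class, fields and method are invented for illustration): the reader keeps consuming bytes past its split's end offset until the record delimiter appears, while the reader of the following split skips the partial record at its own start.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Simplified illustration of a record reader that finishes a record
    // crossing the end of its split by reading past the boundary.
    public class BoundaryAwareReaderSketch {
        private final InputStream in; // stream positioned at the split start
        private long pos;             // current byte offset within the file
        private final long end;       // end offset of this split

        public BoundaryAwareReaderSketch(InputStream in, long start, long end) {
            this.in = in;
            this.pos = start;
            this.end = end;
        }

        /** Returns the next newline-delimited record, reading beyond the split end if needed. */
        public String nextRecord() throws IOException {
            if (pos >= end) {
                return null; // records starting here belong to the next split's reader
            }
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    break; // record complete, possibly past the split boundary
                }
                buf.write(b);
            }
            return buf.toString("UTF-8");
        }
    }

The bytes read past the boundary are the remote-read overhead mentioned above: at most the tail of one record per split, which is small compared to the block itself.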