From the Apache doc on the Hadoop MapReduce InputFormat Interface:
"[L]ogical splits based on input-size is insufficient for many applications since record boundaries are to be respected. In such cases, the application has to also implement a RecordReader on whom lies the responsibilty to respect record-boundaries and present a record-oriented view of the logical InputSplit to the individual task."
Is the WordCount example application one in which logical splits based on input size are insufficient? If so, where in the source code is the implementation of a RecordReader found?
Input splits are logical references to data. If you look at the InputSplit API, you can see that it does not know anything about record boundaries. A mapper is launched for every input split, and the mapper's map() is run once for every record (in the WordCount program, once for every line of the file).
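For concreteness, here is a minimal sketch of such a mapper, assuming the newer org.apache.hadoop.mapreduce API; the class name is just illustrative, not the one in the shipped WordCount example:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a WordCount-style mapper: map() is invoked once per record, and
// with the default TextInputFormat a record is one line of the input file.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // 'offset' is the byte offset of the line; 'line' is the record that
        // the RecordReader handed to this task (one line for LineRecordReader).
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```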
But how does a mapper know where the record boundaries are?
This is where your quote from the Hadoop MapReduce InputFormat Interface comes in:
the application has to also implement a RecordReader on whom lies the responsibility to respect record-boundaries and present a record-oriented view of the logical InputSplit to the individual task
Every mapper is associated with an InputFormat, and that InputFormat carries the information about which RecordReader to use. Look at the InputFormat API and you will find that it knows about the input splits and which RecordReader to use.
If you want to know more about input splits and RecordReader, you should read this answer.
A RecordReader defines what the record boundaries are; the InputFormat defines which RecordReader is used.
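To illustrate that relationship, an InputFormat's createRecordReader() method is exactly where this decision is made. The following is a minimal sketch (the class name is hypothetical) that simply delegates record boundaries to LineRecordReader:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Sketch of a custom InputFormat: its only job here is to say which
// RecordReader turns a logical InputSplit into records.
public class MyLineInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        // Delegate record boundaries to LineRecordReader: one line per record.
        return new LineRecordReader();
    }
}
```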
The WordCount program does not specify any InputFormat, so it defaults to TextInputFormat, which uses LineRecordReader and hands out every line as a separate record. The LineRecordReader source is therefore the RecordReader implementation you asked about.
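To make that default visible, here is a sketch of a driver that sets TextInputFormat explicitly; the shipped WordCount example never makes this call because TextInputFormat is already the default. The class names are placeholders (WordCountMapper is the mapper sketched earlier), not the ones in the official example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // the mapper sketched above
        job.setReducerClass(IntSumReducer.class);    // library reducer that sums the 1s
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // This call is redundant: TextInputFormat is the default InputFormat,
        // and it is TextInputFormat that wires in LineRecordReader.
        job.setInputFormatClass(TextInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```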
[L]ogical splits based on input-size is insufficient for many applications since record boundaries are to be respected.
What this means is that, for an example file such as
a b c d e
f g h i j
k l m n o
in which we want every line to be a record, splitting on input size alone could produce two splits such as:
a b c d e
f g
and
h i j
k l m n o
If it wasn't for the RecordReader, f g and h i j would be treated as two different records rather than parts of the single record f g h i j; clearly, this isn't what most applications want.
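A line-oriented RecordReader avoids this by following a simple convention: every split except the first throws away its partial first line (the previous split's reader will have read it), and every split reads one line past its own end so that no line is lost. Below is a simplified sketch of that idea; it is not Hadoop's actual LineRecordReader, and it ignores compression, maximum line lengths and other details:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Simplified sketch of a line-oriented RecordReader that respects record
// boundaries even when the logical split cuts a line in half.
public class SimpleLineRecordReader extends RecordReader<LongWritable, Text> {

    private LineReader in;
    private long start, pos, end;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration conf = context.getConfiguration();
        start = split.getStart();
        end = start + split.getLength();
        Path file = split.getPath();
        FSDataInputStream stream = file.getFileSystem(conf).open(file);
        stream.seek(start);
        in = new LineReader(stream, conf);
        pos = start;
        if (start != 0) {
            // Not the first split: skip the partial first line. The reader of
            // the previous split reads past its end and picks that line up.
            pos += in.readLine(new Text());
        }
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (pos > end) {
            return false;                    // already read past the split boundary
        }
        key.set(pos);
        int bytesRead = in.readLine(value);  // may read beyond 'end' to finish a line
        pos += bytesRead;
        return bytesRead != 0;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() {
        return end == start ? 1.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
    }
    @Override public void close() throws IOException { if (in != null) in.close(); }
}
```

With this convention, the reader for the first split above reads past "f g" to the end of that line and produces the full record "f g h i j", while the reader for the second split skips the partial "h i j" and starts with "k l m n o".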
Answering your question: in the WordCount program it does not really matter where the record boundaries are, but there is a possibility that the same word gets split across two logical splits. Therefore, logical splits based on size alone are not sufficient for the WordCount program.
Every MapReduce program 'respects' record boundaries; otherwise it would not be of much use.