Tags: hadoop, mapreduce, recordreader

How does Hadoop's RecordReader identify records?


When processing a text file, how does Hadoop identify records? Is it based on newline characters or full stops?

Suppose I have a text file containing a list of 5000 words, all on a single line and separated by spaces, with no newline characters, commas or full stops. How will the RecordReader behave?

e.g. abc pqr xyz lmn qwe rew poio kjkh ascd lkyg ......


Solution

  • You can set the record delimiter in the job configuration with the textinputformat.record.delimiter property.

    If it isn't supplied, the reader falls back to splitting the input into lines on any of the following: '\n' (LF), '\r' (CR), or '\r\n' (CR+LF). Since your file contains none of these, your example line will be read as a single record.

    You can read through the source code of LineReader, TextInputFormat and LineRecordReader for more details.
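
To make the difference concrete, here is a small self-contained sketch of the splitting rule described above. It is not Hadoop's LineReader, only a simplified in-memory model; the class name RecordSplitDemo and the method splitRecords are made up for illustration. Passing null stands for "no delimiter configured", and passing " " stands for a custom delimiter of a single space.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordSplitDemo {

    // Simplified model of the behaviour described in the answer (hypothetical
    // helper, not part of Hadoop). A null or empty delimiter means "use the
    // default", i.e. end a record at LF, CR, or CR+LF.
    static List<String> splitRecords(String data, String delimiter) {
        List<String> records = new ArrayList<>();
        if (delimiter == null || delimiter.isEmpty()) {
            StringBuilder current = new StringBuilder();
            int i = 0;
            while (i < data.length()) {
                char c = data.charAt(i);
                if (c == '\n' || c == '\r') {
                    records.add(current.toString());
                    current.setLength(0);
                    // Treat CR immediately followed by LF as one terminator.
                    if (c == '\r' && i + 1 < data.length() && data.charAt(i + 1) == '\n') {
                        i++;
                    }
                } else {
                    current.append(c);
                }
                i++;
            }
            // Emit a trailing record that has no terminator.
            if (current.length() > 0) {
                records.add(current.toString());
            }
        } else {
            // Custom delimiter: cut the input at every occurrence.
            int start = 0;
            int idx;
            while ((idx = data.indexOf(delimiter, start)) >= 0) {
                records.add(data.substring(start, idx));
                start = idx + delimiter.length();
            }
            if (start < data.length()) {
                records.add(data.substring(start));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String words = "abc pqr xyz lmn qwe";
        // No newline in the input, so the default rule yields one record.
        System.out.println(splitRecords(words, null).size());
        // With a single space as the delimiter, each word is its own record.
        System.out.println(splitRecords(words, " ").size());
    }
}
```

In an actual job, the equivalent would presumably be to call conf.set("textinputformat.record.delimiter", " ") on the job's Configuration, so that each word in the question's file arrives at the mapper as a separate record.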