My data format uses \0 instead of a newline as the record separator, so the default Hadoop text line reader doesn't work. How can I configure it to read lines separated by a special character?
If it is impossible to configure LineReader, maybe it is possible to apply a specific stream processor (tr "\0" "\n"), but I'm not sure how to do this.
You can write your own InputFormat class that splits data on \0 instead of \n. For a walkthrough on how to do that, check here: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
The gist of it is that you need to subclass the default InputFormat class, or any of its subclasses, and define your own RecordReader with custom rules. For more on that, you can refer to the InputFormat documentation.
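To make the "custom rules" part more concrete, here is a minimal sketch of the delimiter-scanning logic such a RecordReader would wrap. It uses only the Java standard library, so it is not a Hadoop RecordReader itself: the class name NullDelimitedRecords and its methods are hypothetical, UTF-8 is an assumed encoding, and a real implementation would also have to deal with input split boundaries, which this sketch ignores.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hadoop): reads records separated by the
// NUL byte (\0) instead of a newline.
class NullDelimitedRecords {
    // The record separator byte; 0 is the NUL character.
    static final int DELIMITER = 0;

    // Returns the next record, or null once the stream is exhausted.
    // A trailing record without a final delimiter is still returned.
    static String nextRecord(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean readAny = false;
        int b;
        while ((b = in.read()) != -1) {
            readAny = true;
            if (b == DELIMITER) {
                // Delimiter reached: the bytes collected so far form one record.
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
            buf.write(b);
        }
        // End of stream: emit whatever was collected, or null if nothing was read.
        return readAny ? new String(buf.toByteArray(), StandardCharsets.UTF_8) : null;
    }

    // Convenience method: collects every record in the stream into a list.
    static List<String> readAll(InputStream in) throws IOException {
        List<String> records = new ArrayList<>();
        String record;
        while ((record = nextRecord(in)) != null) {
            records.add(record);
        }
        return records;
    }
}
```

In a custom RecordReader, the method that advances to the next key/value pair would do roughly what nextRecord does here, using the record text as the value and, as the default line reader does, the byte offset as the key.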