Cascading Hadoop file loading - method to deal with records crossing a newline?


I'm working with Hadoop to process some files distributed across a cluster of JVM instances.

I'm using the Cascading library to interface to Hadoop.

I want to parse a text file where the records cross newlines and are terminated by a period ('.').

(I'm aware this is so small the benefits of Hadoop are not realised - I'm working on a demo).

From what I can see - I'd need to write a custom InputFormat to handle this.

My question is - is it better# to:

(a) have a pre-processing step on my input data that strips out the newlines and then inserts a newline after the end of each record (roughly sketched below)?

(b) write a custom InputFormat?

# By 'better' - I mean less work and more idiomatic.
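
For concreteness, the pre-processing step in (a) would be something along these lines - a rough, untested sketch, where the input/output paths and the single-pass character scan are just my own illustration:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Collapses the newlines inside each period-terminated record and
    // writes one record per output line, ready for plain TextInputFormat.
    public class RecordReflow {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
                 BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
                StringBuilder record = new StringBuilder();
                int c;
                while ((c = in.read()) != -1) {
                    if (c == '\n' || c == '\r') {
                        record.append(' ');       // newline inside a record becomes a space
                    } else {
                        record.append((char) c);
                        if (c == '.') {           // period terminates the record
                            out.write(record.toString().trim());
                            out.newLine();
                            record.setLength(0);
                        }
                    }
                }
            }
        }
    }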


Solution

  • It is really for you to decide, weighing the pros and cons of each approach against your requirements. Personally, though, I would advise writing a custom InputFormat and RecordReader to read your input data if the only reason for a pre-processing application is to transform this one type of text file (records that cross newlines and are terminated by a period). A pre-processor would be the better fit if you expect more unorthodox text file formats in the future, so that it can transform all the different formats into a common intermediate format before they are sent to Map/Reduce.

    Read this tutorial to learn how to write a custom InputFormat and RecordReader.
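
    If you do go the custom InputFormat route, a minimal sketch of the idea (using the newer org.apache.hadoop.mapreduce API, and assuming a Hadoop release whose LineRecordReader accepts an arbitrary record delimiter - the class name PeriodDelimitedInputFormat is just illustrative) is to delegate to the stock LineRecordReader with '.' as the delimiter:

        import java.nio.charset.StandardCharsets;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.InputSplit;
        import org.apache.hadoop.mapreduce.RecordReader;
        import org.apache.hadoop.mapreduce.TaskAttemptContext;
        import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

        // Treats '.' as the record delimiter instead of '\n', so a record that
        // crosses several physical lines reaches the mapper as one Text value
        // (with its embedded newlines still in place).
        public class PeriodDelimitedInputFormat extends TextInputFormat {

            @Override
            public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                                       TaskAttemptContext context) {
                // Delegate to the stock reader, but with a custom delimiter.
                return new LineRecordReader(".".getBytes(StandardCharsets.UTF_8));
            }
        }

    Note that Cascading's Hfs taps typically expect the older org.apache.hadoop.mapred API, so you may need an equivalent built on that API (or a custom Scheme) before Cascading can use it, and any newlines left inside each record can be stripped in a later Cascading Function.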