I am working with Hadoop MapReduce. I have to process data from an .xml
file, parse it, and store the output in a database.
While working on this, when I needed to pass my XML to the mapper, I found that an XmlInputFormat class
is not provided by Hadoop by default and we have to use Mahout's XmlInputFormat for it.
I wonder, since XML is so widely used, why hasn't Hadoop provided an XmlInputFormat
out of the box, rather than requiring a custom XmlInputFormat built by extending TextInputFormat?
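
For context, this is roughly how I am wiring Mahout's XmlInputFormat in the driver. It is only a sketch: the `<record>` tag, the mapper, and the input/output paths are placeholders, and the package of XmlInputFormat differs between Mahout versions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Package path depends on the Mahout version; recent releases ship it as
// org.apache.mahout.text.wikipedia.XmlInputFormat.
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class XmlJobDriver {

    /** Minimal mapper: each value is one complete <record>...</record> fragment. */
    public static class RawXmlMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Parse the fragment here (DOM/StAX/JAXB) before writing it out.
            context.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the record reader which tags delimit one logical record.
        // "<record>"/"</record>" are placeholders for the real tag names.
        conf.set(XmlInputFormat.START_TAG_KEY, "<record>");
        conf.set(XmlInputFormat.END_TAG_KEY, "</record>");

        Job job = Job.getInstance(conf, "xml-parse");
        job.setJarByClass(XmlJobDriver.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(RawXmlMapper.class);
        job.setNumReduceTasks(0); // map-only in this sketch
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```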
Well, even though XML is widely used, shipping the framework with special support for one particular technology might not be a good idea; it could look like an endorsement. At a high level, MapReduce is designed to accept arbitrary input formats. In fact, these days JSON is used heavily because it is more compact than XML. I ran into a similar issue myself.
It is up to the user to decide the input format of the MapReduce job: you can use different parsers (Jackson or Gson for JSON, JAXB for XML) when each record fits on a single line (see the sketch below), or, as in your case, an XmlInputFormat built on a custom RecordReader implementation.
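
For example, if each XML record sits on its own line, the plain TextInputFormat is enough and JAXB can do the parsing inside the mapper. A rough sketch, assuming records shaped like `<employee><id>1</id><name>Alice</name></employee>`; the Employee type and its fields are placeholders:

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JaxbLineMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Placeholder record type; replace with whatever your XML actually holds.
    @XmlRootElement(name = "employee")
    public static class Employee {
        public String id;
        public String name;
    }

    private Unmarshaller unmarshaller;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // JAXBContext creation is expensive, so build it once per task.
            unmarshaller = JAXBContext.newInstance(Employee.class).createUnmarshaller();
        } catch (JAXBException e) {
            throw new IOException("Could not initialise JAXB", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            Employee emp = (Employee) unmarshaller.unmarshal(new StringReader(value.toString()));
            // Emit id -> name; the database write would happen in the reducer
            // or via an output format instead.
            context.write(new Text(emp.id), new Text(emp.name));
        } catch (JAXBException e) {
            // Count and skip malformed lines rather than failing the task.
            context.getCounter("xml", "malformed_records").increment(1);
        }
    }
}
```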