xml hadoop mapreduce gzip mahout

Does Mahout's XmlInputFormat handle gzip compressed files without rewriting?


Can Mahout's XmlInputFormat handle gzipped data without overriding any of its methods? I've been trying to parse gzipped Wikipedia XML data, and so far I have been unsuccessful.

I've heard that Hadoop is able to handle gzipped files automatically, but I assume now that this is contained within the TextInputFormat class or is specific to other input formats, and is not built into Mahout's input format. But maybe I've missed something.

Note: I've since been able to parse the XML, but I never found a clear answer to this and was surprised at how hard it was to look for one. Hopefully somebody smarter can enlighten me and others.


Solution

  • According to this {code}, XmlInputFormat does not handle any compression codec, so I don't think it's possible without overriding.

    In the case of LineRecordReader, it looks something like this {code}: it applies a codec based on the file extension.

    You can still give WikipediaPageInputFormat from Cloud9 a try {here}.

    They have this {codec} handled, so check whether it works for you.
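
To make the extension-based codec idea concrete, here is a minimal stand-alone sketch in plain Java. It is not Mahout's or Hadoop's code: the class name GzipAwareXmlScan and the methods open and extractRecords are hypothetical, and java.util.zip stands in for Hadoop's codec classes. It only illustrates the two steps discussed above, choosing a decompressor from the file extension and then scanning the decompressed stream for start and end tags.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of Mahout, Hadoop, or Cloud9.
class GzipAwareXmlScan {

    // Choose a decompressor from the file extension. Only ".gz" is handled
    // in this sketch; any other file is read as-is.
    public static InputStream open(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        if (file.getFileName().toString().endsWith(".gz")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Collect every span that begins with startTag and ends with endTag.
    // The whole stream is read into memory, which is fine for a demo but
    // not for a full Wikipedia dump.
    public static List<String> extractRecords(InputStream in, String startTag, String endTag)
            throws IOException {
        String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(startTag, from);
            if (start < 0) break;
            int end = text.indexOf(endTag, start + startTag.length());
            if (end < 0) break;
            end += endTag.length();
            records.add(text.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<mediawiki><page><title>A</title></page>"
                + "<page><title>B</title></page></mediawiki>";
        // Write a small gzipped sample so the demo needs no external file.
        Path gz = Files.createTempFile("sample", ".xml.gz");
        try (GZIPOutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = open(gz)) {
            for (String page : extractRecords(in, "<page>", "</page>")) {
                System.out.println(page);
            }
        } finally {
            Files.delete(gz);
        }
    }
}
```

In a real job the same decision would have to live inside the record reader. Keep in mind that a gzip file cannot be split, so each .gz input would be processed by a single mapper.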