
XmlInputFormat for Apache Flink


Is there something similar to Mahout's XmlInputFormat but for Flink?

I have a large XML file and I want to extract specific elements. In my case it's a Wikipedia dump and I need to get all <page> elements.

For example, given a file like this:

<mediawiki>
  <siteinfo>...</siteinfo>
  <page>...</page>
  <page>...</page>
  <page>...</page>
</mediawiki>

I want to get all three <page>...</page> records for use in mappers. Ideally each record should be valid XML, like what the XPath query /mediawiki/page would return.


Solution

  • Mahout's XmlInputFormat extends Hadoop's TextInputFormat. Flink provides generic wrappers for Hadoop InputFormats, so XmlInputFormat should work as well.

    To read data with a Hadoop InputFormat, you can do:

    // Shown with Hadoop's TextInputFormat; textPath is the path of the input file.
    DataSet<Tuple2<LongWritable, Text>> input =
      env.readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, textPath);
    

    See the documentation for details.
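
To make the expected records concrete, here is a minimal, dependency-free sketch of the tag-delimited splitting that an XmlInputFormat-style reader performs on the sample file from the question. It is not Mahout's or Flink's code: the class `PageExtractor` and its `extract` method are hypothetical names used only for illustration, and the sketch assumes literal start and end tags (no attributes on the start tag, no nested elements of the same name) and an input small enough to hold in a string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout or Flink: illustrates tag-delimited record splitting.
class PageExtractor {

    // Returns every substring that starts with startTag and ends with the next endTag.
    // Assumes literal tags: no attributes on the start tag, no nested elements of the same name.
    static List<String> extract(String xml, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(startTag, from);
            if (start < 0) {
                break; // no further records
            }
            int end = xml.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record: drop it
            }
            end += endTag.length();
            records.add(xml.substring(start, end));
            from = end; // continue scanning after this record
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki>\n"
                + "  <siteinfo>...</siteinfo>\n"
                + "  <page>first</page>\n"
                + "  <page>second</page>\n"
                + "  <page>third</page>\n"
                + "</mediawiki>";
        // Prints the three <page>...</page> records, one per line.
        for (String page : extract(xml, "<page>", "</page>")) {
            System.out.println(page);
        }
    }
}
```

Each returned string is a self-contained <page>...</page> fragment, which is the shape of record a mapper would receive. As far as I know, Mahout's XmlInputFormat takes its start and end tags from the Hadoop configuration (keys `xmlinput.start` and `xmlinput.end`), so check that against the Mahout version you use before relying on it.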