Is there something similar to Mahout's XmlInputFormat but for Flink?
I have a large XML file and I want to extract specific elements. In my case it's a Wikipedia dump and I need to get all <page> elements.
For example, if I have a file
<mediawiki>
<siteinfo>...</siteinfo>
<page>...</page>
<page>...</page>
<page>...</page>
</mediawiki>
I want to get all 3 <page>...</page> records to be used in mappers. Ideally each record should be valid XML, i.e. what the XPath query /mediawiki/page would return.
Mahout's XmlInputFormat extends Hadoop's TextInputFormat. Flink provides generic wrappers for Hadoop InputFormats, so XmlInputFormat should be usable as well.
To read data with Hadoop InputFormats you can do:
DataSet<Tuple2<LongWritable, Text>> input =
    env.readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, textPath);
See the documentation for details.
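To make clear what kind of records you would get back, here is a minimal plain-Java sketch of the general idea behind a tag-delimited input format: scan the stream and emit everything between a start tag and its matching end tag as one record. This is not Mahout's or Flink's code. `PageExtractor` and `extract` are hypothetical names used only for illustration, and the sketch assumes the start tag has no attributes and that the element is not nested inside itself, which holds for the `<page>` example above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Flink, Hadoop or Mahout. */
final class PageExtractor {

    /** Returns every chunk of the input that starts with startTag and ends with endTag. */
    static List<String> extract(Reader in, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        boolean inside = false; // true while we are between startTag and endTag
        try {
            int c;
            while ((c = in.read()) != -1) {
                buf.append((char) c);
                if (!inside) {
                    if (endsWith(buf, startTag)) {
                        // A record begins: keep only the start tag itself.
                        buf.setLength(0);
                        buf.append(startTag);
                        inside = true;
                    } else if (buf.length() >= startTag.length()) {
                        // Outside a record, keep only a small sliding window.
                        buf.deleteCharAt(0);
                    }
                } else if (endsWith(buf, endTag)) {
                    // The record is complete, including both tags.
                    records.add(buf.toString());
                    buf.setLength(0);
                    inside = false;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    private static boolean endsWith(StringBuilder sb, String suffix) {
        int off = sb.length() - suffix.length();
        return off >= 0 && sb.indexOf(suffix, off) == off;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><siteinfo>...</siteinfo>"
                + "<page>...</page><page>...</page><page>...</page></mediawiki>";
        for (String record : extract(new StringReader(xml), "<page>", "</page>")) {
            System.out.println(record);
        }
    }
}
```

Each returned string is a self-contained <page>...</page> fragment, so it can be handed to an XML parser inside a map function. A real input format additionally has to deal with file splits and records that cross split boundaries, which this sketch ignores.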