Tags: java, xml-parsing, spark-streaming, apache-storm

XML parsing with Storm and Spark Streaming


How can I parse XML data in Storm and Spark Streaming? For example, in Spark Streaming:

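import java.util.List;
import org.apache.spark.api.java.function.Function;

// Define the Spark Streaming map function.
// XML_DOCUMENT_TYPE and MY_JAVA_CLASS are placeholders for my own types.
private static final Function<XML_DOCUMENT_TYPE, MY_JAVA_CLASS> parsingXMLFunc = doc -> {
    // create my java object
    MY_JAVA_CLASS mjc = new MY_JAVA_CLASS();

    // classic XML parsing (doc.parse() stands in for the real parser call)
    List<String> parsed_doc = doc.parse(); // etc
    mjc.temperature = parsed_doc.get(0);
    mjc.accelerometer = parsed_doc.get(1);

    return mjc;
};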

In this example, can Spark parse the XML in parallel?

Or, as a Storm example:

@Override
public void execute(Tuple tuple) {
    // create my java object
    MY_JAVA_CLASS mjc = new MY_JAVA_CLASS();

    // classic XML parsing; Tuple.getValue returns Object, so cast it
    Document doc = (Document) tuple.getValue(0);
    List<String> parsed_doc = doc.parse(); // etc
    mjc.temperature = parsed_doc.get(0);
    mjc.accelerometer = parsed_doc.get(1);

    _collector.emit(new Values(mjc));
}

In the above examples, is the XML parsing done in parallel? Or is there a better approach?


Solution

  • I haven't worked with Spark. In Storm, you can write a function that does the XML parsing (using whichever common Java XML parser you prefer) and call it from the bolt's "execute" method. It will run in parallel, depending on the number of workers and executors you configure for your application.
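
The parsing function described above could be sketched roughly as follows, using only the JDK's built-in DOM parser. The class name SensorXmlParser, the SensorReading value object (standing in for MY_JAVA_CLASS), and the element names "temperature" and "accelerometer" are assumptions made for illustration; the question does not show the real XML layout.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SensorXmlParser {

    /** Hypothetical value object, standing in for MY_JAVA_CLASS. */
    public static class SensorReading {
        public String temperature;
        public String accelerometer;
    }

    /**
     * Parses one XML message into a SensorReading.
     * A new DocumentBuilder is created on every call because
     * DocumentBuilder instances are not guaranteed to be thread-safe.
     * Failures are rethrown unchecked so the method can be called
     * from a bolt's execute method, which declares no checked exceptions.
     */
    public static SensorReading parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            SensorReading reading = new SensorReading();
            reading.temperature = textOf(doc, "temperature");     // assumed element name
            reading.accelerometer = textOf(doc, "accelerometer"); // assumed element name
            return reading;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML message", e);
        }
    }

    /** Returns the trimmed text of the first element with the given tag name. */
    private static String textOf(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        if (nodes.getLength() == 0) {
            throw new IllegalArgumentException("Missing element: " + tag);
        }
        return nodes.item(0).getTextContent().trim();
    }
}
```

Assuming the tuple's first field carries the XML as a string, the bolt's "execute" method would then call something like SensorXmlParser.parse(tuple.getString(0)) and emit the result. Because the method is static and keeps no shared state, each executor can call it independently, and the same helper could be called from the Spark map function in the question.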