Tags: java, sax, stax

How to parse large XML file with Java, chunk by chunk


I'm trying to parse a large XML file with Java, a chunk at a time, so that the server doesn't have to store the whole file in memory.

My JavaScript code slices the file using the File API slice function and sends about 2 MB at a time to the server. I'm using App Engine, so I can't save to disk.

For example, chunk one:

<message:DataSet>
   <series>...</series>
   <series>...</series>
   <series>...</series> (and so on, thousands)

Chunks two, three, and so on, until EOF:

   <series>...</series>
   <series>...</series>
   <series>...</series> (more)

Is there a parser of some type where a context/state/cursor could be saved so that parsing could be resumed with the additional chunks of data?

Or, otherwise, is there a solution that can parse large XML files without loading the whole file into memory?

// Pseudocode for the kind of API I'm after (Parser and previousState are hypothetical)
parser = new Parser(previousState);
parser.parse(moreData);

Solution

  • For anyone with similar requirements, I came across the Aalto XML processor, which is almost exactly what I was after. It supports non-blocking (asynchronous) XML parsing by adding a special event to StAX, EVENT_INCOMPLETE, which signals that the parser has run out of input and that more can be fed in later.

    For example, the following document can be parsed in two parts:

    <root>value</root>
    
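    import java.nio.charset.StandardCharsets;
    import com.fasterxml.aalto.AsyncByteArrayFeeder;
    import com.fasterxml.aalto.AsyncXMLInputFactory;
    import com.fasterxml.aalto.AsyncXMLStreamReader;
    import com.fasterxml.aalto.stax.InputFactoryImpl;
    
    AsyncXMLInputFactory inputF = new InputFactoryImpl();
    
    // Parse part 1
    byte[] input_part1 = "<root>val".getBytes(StandardCharsets.UTF_8);
    AsyncXMLStreamReader<AsyncByteArrayFeeder> parser = inputF.createAsyncFor(input_part1);
    
    // Process events here: call parser.next() until it returns
    // AsyncXMLStreamReader.EVENT_INCOMPLETE, meaning more input is needed
    
    // Parse part 2 (feedInput takes the buffer, an offset and a length)
    byte[] input_part2 = "ue</root>".getBytes(StandardCharsets.UTF_8);
    parser.getInputFeeder().feedInput(input_part2, 0, input_part2.length);
    
    // Process more events here. After the last chunk has been fed, call
    // parser.getInputFeeder().endOfInput() so the parser can reach END_DOCUMENT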
    

    Larger example here

    Aalto XML project page on GitHub here

    Update: There is also Woodstox, which has even more features, including P_INPUT_PARSING_MODE, which allows more lenient parsing (e.g. multiple root elements). Both libraries are from FasterXML.
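
On the second part of the question (parsing without loading the whole file into memory), the JDK's own StAX API (javax.xml.stream) is also a streaming pull parser. Below is a minimal sketch, assuming all of the data can be read from a single InputStream. That is not the case when chunks arrive in separate requests, where Aalto's feeder is the better fit. The class name SeriesCounter, the method countSeries and the sample document are made up for illustration, and the sample uses a plain DataSet root rather than the namespaced one from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: counts <series> elements with the JDK's StAX pull parser.
final class SeriesCounter {

    static int countSeries(InputStream in) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // DTDs are not needed here, so switch them off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        int count = 0;
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                // The reader pulls bytes from the stream as it goes, so only
                // the current event is held in memory, not the whole document.
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "series".equals(reader.getLocalName())) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("Malformed XML", e);
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<DataSet><series>a</series><series>b</series></DataSet>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countSeries(in));
    }
}
```

In a real application the InputStream would typically come from the request body or a file rather than an in-memory string, and the loop body would do the per-series processing instead of counting.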