I'm trying to parse a large XML file with Java, a chunk at a time, so that the server doesn't have to hold the whole file in memory.
My JavaScript code slices the file using the File API slice function and sends about 2 MB at a time to the server. I'm using AppEngine, so I can't save to disk.
For example, chunk one:
<message:DataSet>
<series>...</series>
<series>...</series>
<series>...</series> (and so on, thousands)
Chunk two, three etc until eof:
<series>...</series>
<series>...</series>
<series>...</series> (more)
Is there a parser of some type where a context/state/cursor could be saved so that parsing could be resumed with the additional chunks of data?
Or, failing that, is there any other solution that can parse large XML files without loading the whole file into memory?
parser = new Parser(previousState);
parser.parse(moreData);
For anyone with similar requirements: I came across the Aalto XML processor, which is almost exactly what I was after. It offers so-called non-blocking (asynchronous) XML parsing. It adds a special event to StAX, EVENT_INCOMPLETE, which signals that the parser has run out of data and more input can be fed in later.
For example:
<root>value</root>
AsyncXMLInputFactory inputF = new InputFactoryImpl();
//Parse part 1
byte[] input_part1 = "<root>val".getBytes(StandardCharsets.UTF_8);
AsyncXMLStreamReader<AsyncByteArrayFeeder> parser = inputF.createAsyncFor(input_part1);
//Process events here until parser.next() returns EVENT_INCOMPLETE
//Parse part 2 (feedInput takes the array plus an offset and length)
byte[] input_part2 = "ue</root>".getBytes(StandardCharsets.UTF_8);
parser.getInputFeeder().feedInput(input_part2, 0, input_part2.length);
//Process more events here
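To show the feed/drain cycle end to end, here is a fuller sketch, assuming Aalto is on the classpath; the parseChunks helper and AsyncParseDemo class name are mine, but the AsyncXMLInputFactory, AsyncByteArrayFeeder and EVENT_INCOMPLETE pieces are Aalto's API as described above:

```java
import com.fasterxml.aalto.AsyncByteArrayFeeder;
import com.fasterxml.aalto.AsyncXMLInputFactory;
import com.fasterxml.aalto.AsyncXMLStreamReader;
import com.fasterxml.aalto.stax.InputFactoryImpl;

import javax.xml.stream.XMLStreamConstants;
import java.nio.charset.StandardCharsets;

public class AsyncParseDemo {

    // Feeds each chunk to the async parser and collects all character data.
    static String parseChunks(byte[][] chunks) throws Exception {
        AsyncXMLInputFactory factory = new InputFactoryImpl();
        AsyncXMLStreamReader<AsyncByteArrayFeeder> parser =
                factory.createAsyncForByteArray();
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < chunks.length; i++) {
            parser.getInputFeeder().feedInput(chunks[i], 0, chunks[i].length);
            if (i == chunks.length - 1) {
                parser.getInputFeeder().endOfInput(); // no more data coming
            }
            int event;
            // Drain events until the parser runs out of buffered input.
            while ((event = parser.next()) != AsyncXMLStreamReader.EVENT_INCOMPLETE) {
                if (event == XMLStreamConstants.CHARACTERS) {
                    text.append(parser.getText());
                }
                if (event == XMLStreamConstants.END_DOCUMENT) {
                    parser.close();
                    return text.toString();
                }
            }
        }
        parser.close();
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // The document arrives in two pieces, split mid-value.
        byte[][] chunks = {
                "<root>val".getBytes(StandardCharsets.UTF_8),
                "ue</root>".getBytes(StandardCharsets.UTF_8)
        };
        System.out.println(parseChunks(chunks)); // prints "value"
    }
}
```

In a servlet, the same pattern applies: keep the parser (its state is the "cursor") across requests and call feedInput with each uploaded chunk as it arrives.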
Larger example here
Aalto XML project page on GitHub here
Update: There is also Woodstox, which has even more features, including P_INPUT_PARSING_MODE, which allows for more lenient parsing (eg multiple root elements). Both libraries are from FasterXML.
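For the multiple-root-elements case, a minimal sketch of Woodstox's fragment mode, assuming Woodstox is on the classpath; the countSeries helper and FragmentParseDemo class name are mine, but WstxInputProperties.P_INPUT_PARSING_MODE and PARSING_MODE_FRAGMENT are Woodstox's own constants:

```java
import com.ctc.wstx.api.WstxInputProperties;
import com.ctc.wstx.stax.WstxInputFactory;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class FragmentParseDemo {

    // Counts <series> elements in input that has no single root element.
    static int countSeries(String fragment) throws Exception {
        XMLInputFactory factory = new WstxInputFactory();
        // Fragment mode accepts content with multiple root-level elements.
        factory.setProperty(WstxInputProperties.P_INPUT_PARSING_MODE,
                WstxInputProperties.PARSING_MODE_FRAGMENT);
        XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(fragment));
        int count = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "series".equals(reader.getLocalName())) {
                count++;
            }
        }
        reader.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Two root-level elements; a standard StAX parser would reject this.
        System.out.println(countSeries("<series>a</series><series>b</series>"));
    }
}
```

Note that Woodstox here is a blocking StAX parser: it is lenient about document shape, but it still pulls from a single InputStream/Reader rather than accepting chunks fed in over time.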