java xml streaming xml-parsing stax

Can I have a less validating StAX parser in Java?


I have the following invalid XML file:

<?xml version="1.0" encoding="utf-8" ?>
<Page num="1" crop_box="0, 0, 595, 842" media_box="0, 0, 595, 842" rotate="0">
    <Flow id="1">
        <Para id="1">
            <Line box="90, 754.639, 120.038, 12">
                <Word box="90, 754.639, 22.6704, 12">This</Word>
            </Line>
        </Para>
    </Flow>
</Page>
<?xml version="1.0" encoding="utf-8" ?>
<Page num="1" crop_box="0, 0, 595, 842" media_box="0, 0, 595, 842" rotate="0">
    <Flow id="1">
        <Para id="1">
            <Line box="90, 754.639, 120.038, 12">
                <Word box="90, 754.639, 22.6704, 12">This</Word>
            </Line>
        </Para>
    </Flow>
</Page>

While it is structurally invalid (it has two root elements, and the XML declaration appears twice), it could still be parsed correctly (i.e. the tags and the content are both correct).

So the question is: is there a StAX (or any other streaming) XML parser in Java that would let me do that? I have checked all the options in XMLInputFactory, but none of them seems to make the parser accept this kind of malformed XML.


Solution

  • I seriously doubt you will be able to get any standard Java tool to parse this input as is. However, you could find the document boundaries yourself and parse each document individually: just look for occurrences of "<?xml".
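
    A minimal sketch of that approach, assuming the whole input fits in memory as a String and that the text "<?xml " only ever appears at the start of a document (not inside a comment or CDATA section). The class name ConcatenatedXmlSplitter and the methods splitDocuments and collectWords are made up for illustration; only the javax.xml.stream calls are standard API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ConcatenatedXmlSplitter {

    // Trailing space so that "<?xml-stylesheet ..." is not mistaken for a declaration.
    private static final String DECL = "<?xml ";

    /** Cuts the input at every XML declaration; each chunk should be one document. */
    public static List<String> splitDocuments(String input) {
        List<String> docs = new ArrayList<>();
        int start = input.indexOf(DECL);
        while (start >= 0) {
            int next = input.indexOf(DECL, start + DECL.length());
            String chunk = (next < 0) ? input.substring(start) : input.substring(start, next);
            docs.add(chunk.trim());
            start = next;
        }
        return docs;
    }

    /** Parses each chunk with its own StAX reader and collects the text of every <Word>. */
    public static List<String> collectWords(String input) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        List<String> words = new ArrayList<>();
        for (String doc : splitDocuments(input)) {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(doc));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "Word".equals(reader.getLocalName())) {
                        // Assumes <Word> contains text only, as in the sample above.
                        words.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        }
        return words;
    }

    public static void main(String[] args) throws XMLStreamException {
        String page = "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n"
                + "<Page num=\"1\"><Flow id=\"1\"><Word>This</Word></Flow></Page>\n";
        // Two documents glued together, like the file in the question.
        System.out.println(collectWords(page + page));
    }
}
```

    For large files, the same idea would probably work better by scanning the underlying stream for the marker and handing each slice to createXMLStreamReader, rather than loading everything into one String.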