I have a log file with the following structure:
unstructured raw text
unstructured raw text
..
..
..
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<message>
...
...
</message>
unstructured raw text
..
..
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<message>
...
...
</message>
unstructured raw text
..
..
As you can see, there are multiple XML documents embedded in a single log file. Is there a generic utility or library that I can reuse here before I start writing something of my own? I need it in Java.
Thanks.
I would favour one of the StAX-based parsers; the Woodstox ones are particularly performant. If you then need a different type of XML parser, you can shunt the events from the StAX parser to a generator and feed the resulting XML into, e.g., a DOM-based or a SAX-based parser (if you are a masochist, since SAX is a pain of a parser to use).
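As a rough illustration of shunting events to a generator, the sketch below copies every event from an XMLEventReader into an XMLEventWriter and then hands the generated XML to a DOM parser. The class name StaxToDom and the method toDom are made up for this example, and buffering the XML in a String is a simplification that assumes each embedded document is small enough to hold in memory.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLOutputFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
public class StaxToDom {

    /** Copies all remaining events from the reader into a DOM Document. */
    public static Document toDom(XMLEventReader events) throws Exception {
        StringWriter buffer = new StringWriter();
        XMLEventWriter writer =
                XMLOutputFactory.newInstance().createXMLEventWriter(buffer);
        writer.add(events); // pulls events until the reader has no more
        writer.close();

        // Feed the regenerated XML text to a DOM parser.
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(buffer.toString())));
    }
}
```

To copy only part of a stream, call writer.add(event) for individual events inside your own loop instead of passing the whole reader.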
You will have pseudo-code that looks a little like this:
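import java.io.BufferedReader;
import java.util.regex.Pattern;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

BufferedReader br = ...
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
Pattern startOfXml = Pattern.compile("<\\?xml.*\\?>");
String line;
while (null != (line = br.readLine())) {
    if (startOfXml.matcher(line).matches()) {
        // the declaration line has already been consumed, so the parser starts at the root element
        XMLEventReader xr = inputFactory.createXMLEventReader(br);
        XMLEvent event;
        while (!(event = xr.nextEvent()).isEndDocument()) {
            // do whatever you want with the event
        }
    } else {
        // do whatever you want with the plain-text
    }
}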
Some StAX parsers, in certain modes, may object to the isEndDocument() check. In that case you will have to track the element depth while parsing the document and break out once you reach the root-level end element. Also, some parsers may buffer a few characters past the end of the document. Worst case, you just need to catch the exception for a "malformed" document when the parser notices text after the end element.
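The depth-counting fallback could be sketched roughly as follows. This is a simplified variant and not the streaming loop above: it assumes the whole log fits in memory as a String and that every embedded document starts with an <?xml declaration. The class name EmbeddedXmlScanner and the method rootNames are invented for illustration. Each document is parsed from its own StringReader, so parser read-ahead cannot swallow log text, and parsing stops at the root end element, so the trailing plain text is never requested from the parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of any library.
public class EmbeddedXmlScanner {

    /** Returns the root element name of each XML document embedded in the log text. */
    public static List<String> rootNames(String log) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        List<String> roots = new ArrayList<>();
        int start = log.indexOf("<?xml");
        while (start >= 0) {
            // Bound the chunk at the next declaration (or the end of the log).
            int next = log.indexOf("<?xml", start + 5);
            String chunk = next < 0 ? log.substring(start) : log.substring(start, next);

            XMLEventReader xr = factory.createXMLEventReader(new StringReader(chunk));
            int depth = 0;
            while (xr.hasNext()) {
                XMLEvent event = xr.nextEvent();
                if (event.isStartElement()) {
                    if (depth == 0) {
                        roots.add(event.asStartElement().getName().getLocalPart());
                    }
                    depth++;
                } else if (event.isEndElement() && --depth == 0) {
                    break; // root element closed: stop before the trailing log text
                }
                // handle other events (characters, comments, ...) here
            }
            xr.close();
            start = next;
        }
        return roots;
    }
}
```

The same depth counter can be dropped into the inner loop of a streaming version; there, the read-ahead caveat still applies to whatever reader is shared with the parser.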