Search code examples
javaxmlparsingsax

What to do when a huge XML document is not well formed (Java)


I am using Java SAX parser to parse XML data sent from a third party source that is around 3 GB. I am getting an error resulting from the XML document not being well formed: The processing instruction target matching "[xX][mM][lL]" is not allowed.

As far as I understand, this is normally due to a character being somewhere it should not be.

Main problem: Cannot manually edit these files due to their very large size.

I was wondering if there was a workaround for files that are very large in size that cannot be opened and edited manually (due to their large size) and if there is a way to code it so that it would remove any problematic characters automatically.


Solution

  • I would think the most likely explanation is that the file contains a concatenation of several XML documents, or perhaps an embedded XML document: either way, an XML declaration that isn't at the start of the file.

    A lot now depends on your relationship with the supplier of the bad data. If they sent you faulty equipment or buggy software, you would presumably complain and ask them to fix it. But if you don't have a service relationship with the third party, you either have to change supplier or do the best you can with the faulty input, which means repairing the fault yourself. In general, you can't repair faulty XML unless you know what kind of fault you are looking for, and that can be very difficult to determine if the files are large (or if the failures are very rare).

    The data isn't XML, so don't try to use XML tools to process it. Use text processing tools such as sed or awk. The first step is to search the file for occurrences of <?xml and see if that gives any hints.