c# xml .net-4.0 xml-parsing xmlreader

Parsing almost well-formed XML fragments: how to skip over multiple XML headers


I’m required to write a tool that can handle the XML fragment below, which is not well formed because it contains XML declarations in the middle of the stream.

The company has had these kinds of files in use for a long time, so changing the format is not an option.

There is no source code available that does the parsing, and the platform of choice for new tooling is .NET 4 or newer preferably with C#.

This is what the fragments look like:

<Header>
  <Version>1</Version>
</Header>
<Entry><?xml version="1.0"?><Detail>...snip...</Detail></Entry>
<Entry><?xml version="1.0"?><Detail>...snip...</Detail></Entry>
<Entry><?xml version="1.0"?><Detail>...snip...</Detail></Entry>
<Entry><?xml version="1.0"?><Detail>...snip...</Detail></Entry>

Using an XmlReader with XmlReaderSettings.ConformanceLevel set to ConformanceLevel.Fragment, I can read the complete <Header> element fine. Even the start of the <Entry> element is OK. However, while reading the <Detail> content, the XmlReader throws an XmlException, because it encounters the <?xml...?> XML declaration, which it does not expect at that position.
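For reference, the setup described above can be reproduced with a sketch along these lines. The class and method names (FragmentRepro, TryRead) are made up for illustration; only XmlReader, XmlReaderSettings and ConformanceLevel come from the framework.

```csharp
using System.IO;
using System.Xml;

// Hypothetical repro helper, not part of any existing tool.
public static class FragmentRepro
{
    // Reads the whole input in Fragment mode.
    // Returns null on success, or the XmlException message on failure.
    public static string TryRead(string fragment)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment
        };

        try
        {
            using (var xml = XmlReader.Create(new StringReader(fragment), settings))
            {
                while (xml.Read())
                {
                    // Nothing to do; we only care whether reading succeeds.
                }
            }
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }
}
```

Fragment mode allows several top-level elements, which is why <Header> reads fine, but it does not relax the rule that an XML declaration may only appear at the very start of the input.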

What options do I have to skip over those XML declarations, besides heavy string manipulations?

Since the fragments can easily exceed 100 megabytes apiece, I'd rather not load everything into memory at once. But if that is what it takes, I am open to it.

Example of the exceptions I get:

System.Xml.XmlException: Unexpected XML declaration.
The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it.
Line ##, position ##.

Solution

  • I don't think the built-in classes will help; you'll probably have to do some pre-parsing and remove the extra headers. If your sample is accurate, you can just do badXml.Replace("<?xml version=\"1.0\"?>", "") and be on your way.
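Since the files can exceed 100 MB, the same replacement can also be done while streaming instead of on one big string. The following is only a sketch of that idea, under the assumption that every embedded declaration is exactly the text <?xml version="1.0"?> as in the sample (a different spelling, such as an added encoding attribute, would not be matched). DeclarationStrippingReader and FragmentDemo are hypothetical names, not framework classes.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper: wraps a TextReader and drops every exact occurrence
// of the declaration text while the data streams through.
public sealed class DeclarationStrippingReader : TextReader
{
    private const string Token = "<?xml version=\"1.0\"?>"; // assumed exact spelling
    private const int None = -2; // "nothing buffered" (-1 already means end of input)

    private readonly TextReader _inner;
    private readonly Queue<char> _pending = new Queue<char>(); // partial match to re-emit
    private int _pushedBack = None; // raw char that ended a partial match
    private int _lookahead = None;  // filtered char handed out by Peek()

    public DeclarationStrippingReader(TextReader inner) { _inner = inner; }

    private int NextRaw()
    {
        if (_pushedBack == None) return _inner.Read();
        int c = _pushedBack;
        _pushedBack = None;
        return c;
    }

    private int NextFiltered()
    {
        if (_pending.Count > 0) return _pending.Dequeue();
        while (true)
        {
            int c = NextRaw();
            if (c != Token[0]) return c; // ordinary char or end of input
            int matched = 1;
            while (matched < Token.Length)
            {
                int next = NextRaw();
                if (next != Token[matched]) { _pushedBack = next; break; }
                matched++;
            }
            if (matched == Token.Length) continue; // full declaration: drop it
            // Only a prefix matched: emit it unchanged.
            for (int i = 1; i < matched; i++) _pending.Enqueue(Token[i]);
            return Token[0];
        }
    }

    public override int Peek()
    {
        if (_lookahead == None) _lookahead = NextFiltered();
        return _lookahead;
    }

    public override int Read()
    {
        if (_lookahead == None) return NextFiltered();
        int c = _lookahead;
        _lookahead = None;
        return c;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class FragmentDemo
{
    // Counts elements with the given name, reading through the filter.
    public static int CountElements(TextReader source, string elementName)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        int count = 0;
        using (var filtered = new DeclarationStrippingReader(source))
        using (var xml = XmlReader.Create(filtered, settings))
        {
            while (xml.Read())
                if (xml.NodeType == XmlNodeType.Element && xml.Name == elementName) count++;
        }
        return count;
    }
}
```

For a file on disk, pass a StreamReader as the source, for example FragmentDemo.CountElements(new StreamReader(path), "Entry"), where path is whatever file you are processing. The filter works one character at a time, which keeps memory use flat but is not tuned for speed. It is also purely textual, so it would strip the same text inside a comment or CDATA section.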