Tags: c#, xml, parsing, dtd, xmlreader

Most efficient way in C# to parse a large XML string (expand DTD references, add new lines, etc.)


I have an interface that provides large XML strings that are valid XML but may not be in standard form (say, missing the prefix for a default namespace), may lack line endings, or may need entities from an inline DTD expanded. Basically, I need to parse these strings with a standard XML parser that can handle inline DTD definitions. The string data can range from a few characters to gigabytes.

At present I am using the following code (and such simple parsing seems to fix the issues mentioned above):

    XDocument doc = XDocument.Parse(LargeXmlString);

    var settings = new XmlWriterSettings();
    settings.Indent = true;
    settings.Encoding = Encoding.Unicode;
    //more settings

    StringBuilder parsedOutput = new StringBuilder();
    using (XmlWriter xmlWriter = XmlWriter.Create(parsedOutput, settings))
    {
        doc.WriteTo(xmlWriter);
    }

While this is easy to use, I am not sure how good or bad it is compared to other .NET XML parsing classes such as XmlReader/XmlTextReader or XmlDocument.

What is the best/most efficient way of doing this with classes supported by .NET/C# (ideally without writing a lot of new code)?

Thanks for your help.

`<?xml version="1.0" encoding="UTF-8"?><Catalogue    xmlns="http://www.somewhere.org/BookCatalogue" xmlns:cat="http://www.somewhere.org/BookCatalogue" xmlns:html="http://www.somewhere.org/HTMLCatalogue" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.somewhere.org/BookCatalogue                         txjsgen14.txt"><cat:Magazine><Title>Natural Health</Title><Author>October</Author><Date>December, 1999</Date><Volume>12</Volume>.....`

gets converted to

`<?xml version="1.0" encoding="utf-8"?>
<cat:Catalogue xmlns="http://www.somewhere.org/BookCatalogue" xmlns:cat="http://www.somewhere.org/BookCatalogue" xmlns:html="http://www.somewhere.org/HTMLCatalogue" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.somewhere.org/BookCatalogue                         txjsgen14.txt">
  <cat:Magazine>
    <cat:Title>Natural Health</cat:Title>
    <cat:Author>October</cat:Author>
    <cat:Date>December, 1999</cat:Date>
    <cat:Volume>12</cat:Volume>
    <cat:htmlTable>.....`

Note the addition of the cat prefix to Title and other elements, based on the namespace declarations.

Thank you all for your responses.

@Enigmativity Sorry for the confusion I created. Actually, I only need a string-to-string conversion, where the first string holds not-so-proper XML: it is not properly formatted, its DTD entities are not expanded, it has no line delimiters, and it may be missing prefixes. The second string should have all of these things fixed.
Now, if some component (say XmlReader) could take the first string as an argument, make it canonical/properly formatted/expanded XML, and return it as a string, then one component is all I need. In the example above, the parsing is done by XDocument and the formatting by XmlWriter, and I am not even sure which one expands the entities, the parser or the XmlWriter. Probably the writer.

For the time being I will try a combination of XmlReader and XmlWriter, where the XmlReader reads the first string and the XmlWriter writes the formatted one (as specified by the XmlWriterSettings used for the XmlWriter). Let me know if there is a better approach.


Solution

  • You can do essentially what you have in your example, but with XmlReader:

    XmlReader xmlReader = ...;
    
    using (XmlWriter xmlWriter = ...)
    {
        xmlWriter.WriteNode(xmlReader, true);
    }
    

    This is the most memory-efficient approach: it streams the document node by node instead of reading the entire thing into memory before writing it out.
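
To make the elided parts concrete, here is a minimal sketch of how the reader and writer might be set up. The class and method names (`XmlReformatter`, `Reformat`, `ReformatString`) are hypothetical, and the settings are assumptions rather than anything stated in the question: DTD processing is switched on so that entities declared in the inline DTD get expanded, external resolution is switched off, and the cap on expanded entity text is an arbitrary value. As far as I know, `WriteNode` copies element prefixes as the reader reports them, so it probably will not add the `cat:` prefixes the way the `XDocument` round-trip in the question does.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper; the names are illustrative, not part of any library.
public static class XmlReformatter
{
    // Streams XML from input to output, expanding entities declared in an
    // inline DTD and indenting the result, without building a full DOM.
    public static void Reformat(TextReader input, TextWriter output)
    {
        var readerSettings = new XmlReaderSettings
        {
            // XmlReader.Create prohibits DTDs by default; Parse lets the
            // reader process the inline DTD and expand its entities.
            DtdProcessing = DtdProcessing.Parse,
            // Assumption: external DTDs and entities should not be fetched.
            XmlResolver = null,
            // Assumption: arbitrary cap to guard against entity-expansion bombs.
            MaxCharactersFromEntities = 10000000
        };

        var writerSettings = new XmlWriterSettings
        {
            Indent = true
        };

        using (XmlReader reader = XmlReader.Create(input, readerSettings))
        using (XmlWriter writer = XmlWriter.Create(output, writerSettings))
        {
            // From the initial state, WriteNode copies the whole document.
            writer.WriteNode(reader, true);
        }
    }

    // Convenience wrapper for inputs small enough to hold as strings.
    public static string ReformatString(string xml)
    {
        using (var input = new StringReader(xml))
        using (var output = new StringWriter())
        {
            Reformat(input, output);
            return output.ToString();
        }
    }
}
```

For small inputs, `XmlReformatter.ReformatString(LargeXmlString)` would replace the code in the question. For data approaching gigabytes, it is probably better to pass a `StreamReader` and a `StreamWriter` to `Reformat`, so that neither the input nor the output has to exist as a single in-memory string.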