c#, xml, well-formed

What is the fastest way to programmatically check the well-formedness of XML files in C#?


I have large batches of XHTML files that are manually updated. During the review phase of the updates I would like to programmatically check the well-formedness of the files. I am currently using an XmlReader, but the time required on an average CPU is much longer than I expected.

The XHTML files range in size from 4KB to 40KB and verifying takes several seconds per file. Checking is essential but I would like to keep the time as short as possible as the check is performed while files are being read into the next process step.

Is there a faster way of doing a simple XML well-formedness check? Maybe using external XML libraries?


I can confirm that checking "regular" XML-based content with the XmlReader is lightning fast and, as suggested, the problem seems to be that the XHTML DTD is read each time a file is checked.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Note that in addition to the DTD, corresponding .ent files (xhtml-lat1.ent, xhtml-symbol.ent, xhtml-special.ent) are also downloaded.

Ignoring the DTD completely is not really an option for XHTML, because well-formedness is closely linked to the allowed HTML entities (e.g., an &nbsp; will promptly cause parse errors when the DTD is ignored).
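To illustrate the entity issue, here is a minimal sketch. The class and method names (EntityCheck, IsWellFormed) are made up for this example. It parses markup with DTD processing enabled but without any resolver, so only entities declared in the document itself are known.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the original post.
public static class EntityCheck
{
    // Returns true if the markup can be read to the end without an XmlException.
    public static bool IsWellFormed(string markup)
    {
        var settings = new XmlReaderSettings
        {
            // Parse DTDs so that declared entities are recognized.
            DtdProcessing = DtdProcessing.Parse,
            // No resolver: external DTDs and entity files are never fetched.
            XmlResolver = null
        };
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(markup), settings))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            // Raised for any well-formedness error, including undeclared entities.
            return false;
        }
    }
}
```

With this sketch, a fragment such as `<p>a&nbsp;b</p>` is rejected because nbsp is never declared, while the same fragment preceded by `<!DOCTYPE p [<!ENTITY nbsp "&#160;">]>` is accepted.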


The problem was solved by using a custom XmlResolver as suggested, in combination with local (embedded) copies of both the DTD and entity files.

I will post the solution here once I have cleaned up the code.
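A rough sketch of how such a resolver might look is below. This is not the code referred to above. The names LocalDtdResolver and XhtmlChecker are hypothetical, and it assumes the DTD and .ent files have been saved locally and registered under the absolute URIs the reader will request.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Xml;

// Hypothetical resolver: serves registered URIs from local files, never the network.
public sealed class LocalDtdResolver : XmlResolver
{
    private readonly IDictionary<string, string> _localFiles;

    // localFiles maps an absolute URI (e.g. the DTD's system identifier) to a file path.
    public LocalDtdResolver(IDictionary<string, string> localFiles)
    {
        _localFiles = localFiles;
    }

    // No credentials are needed for local files.
    public override ICredentials Credentials { set { } }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        string path;
        if (_localFiles.TryGetValue(absoluteUri.AbsoluteUri, out path))
        {
            return File.OpenRead(path); // the reader disposes this stream
        }
        throw new XmlException("No local copy registered for " + absoluteUri);
    }
}

public static class XhtmlChecker
{
    // Returns true if the input is well-formed, resolving DTDs through the given resolver.
    public static bool IsWellFormed(TextReader input, XmlResolver resolver)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            XmlResolver = resolver
        };
        try
        {
            using (XmlReader reader = XmlReader.Create(input, settings))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

For the XHTML 1.0 Transitional DOCTYPE, the dictionary would presumably need entries for the DTD URI itself and for the three .ent files, which the DTD references relative to its own location. Creating the resolver once and reusing it for the whole batch avoids rebuilding the mapping for every file.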


Solution

  • I would expect that XmlReader with while (reader.Read()) {} would be the fastest managed approach. It certainly shouldn't take seconds to read 40KB... what is the input approach you are using?

    Do you perhaps have some external (schema etc) entities to resolve? If so, you might be able to write a custom XmlResolver (set via XmlReaderSettings) that uses locally cached schemas rather than a remote fetch...

    The following does ~300KB virtually instantly:

        using System;
        using System.Diagnostics;
        using System.IO;
        using System.Xml;

        class Program
        {
            static void Main()
            {
                using (MemoryStream ms = new MemoryStream())
                {
                    // Build a ~300KB test document in memory.
                    XmlWriterSettings settings = new XmlWriterSettings();
                    settings.CloseOutput = false; // keep ms open after the writer is disposed
                    using (XmlWriter writer = XmlWriter.Create(ms, settings))
                    {
                        writer.WriteStartElement("xml");
                        for (int i = 0; i < 15000; i++)
                        {
                            writer.WriteElementString("value", i.ToString());
                        }
                        writer.WriteEndElement();
                    }
                    Console.WriteLine(ms.Length + " bytes");

                    // Rewind and time a full read; Read() throws XmlException on malformed input.
                    ms.Position = 0;
                    int nodes = 0;
                    Stopwatch watch = Stopwatch.StartNew();
                    using (XmlReader reader = XmlReader.Create(ms))
                    {
                        while (reader.Read()) { nodes++; }
                    }
                    watch.Stop();
                    Console.WriteLine("{0} nodes in {1}ms", nodes,
                        watch.ElapsedMilliseconds);
                }
            }
        }
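As one possible way to get locally cached DTDs without writing a resolver by hand, .NET Framework 4.0 and later include System.Xml.Resolvers.XmlPreloadedResolver, which can serve the XHTML 1.0 DTDs and entity files from copies built into the framework. The sketch below assumes that class is available on the target runtime. PreloadedXhtmlCheck and FindError are made-up names.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Resolvers;

// Hypothetical helper, not part of the original answer.
public static class PreloadedXhtmlCheck
{
    // Returns null when the input is well-formed, otherwise a message with the error location.
    public static string FindError(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            // Resolves the known XHTML 1.0 DTDs and entity sets from built-in copies.
            XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10)
        };
        try
        {
            using (XmlReader reader = XmlReader.Create(input, settings))
            {
                while (reader.Read()) { }
            }
            return null;
        }
        catch (XmlException ex)
        {
            return string.Format("Line {0}, position {1}: {2}",
                ex.LineNumber, ex.LinePosition, ex.Message);
        }
    }
}
```

Returning the line and position rather than a plain bool may be useful during a review phase, since it tells the person who edited the file where to look.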