c# xml sgmlreader

Weird Exception from SgmlReader


I'm using SgmlReader to parse HTML files in C#, based on the sample code provided on the project's website:

using (reader = File.OpenText(fileName))
{
    try
    {
        xmlDoc = fromHTML(reader);
    }
    catch (Exception ex)
    {
        return ReturnedCode.ErrorOpeningHTMLFile;
    }
}

private XmlDocument fromHTML(TextReader reader)
{
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = reader;
    // create document
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.Load(sgmlReader);
    return doc;
}

The code ran for a long time without any issue. Recently, however, it started throwing the following exception at the doc.Load(sgmlReader) line:

A valid UTF32 value is between 0x000000 and 0x10ffff, inclusive, and should not include surrogate codepoint values (0x00d800 ~ 0x00dfff).\r\nParameter name: utf32

I was able to narrow the problem down to the following content in the HTML file. Parsing any file that contains it throws the exception.

<html>
<br>&#121669935008
</html>

If I remove the ampersand in the second line, the code works normally.

Any idea what's happening here and how I can fix it? I cannot simply remove all the ampersands from the files.


Solution

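  • The & character starts an entity or character reference in XML and HTML, so the parser reads &#121669935008 as a numeric character reference. That number is far above 0x10FFFF, the largest valid code point, which is what the exception message is reporting. To make a literal ampersand survive parsing, replace it with its own character reference, &#038; (or the equivalent &amp;), before loading the document. The simplest way is to replace every & with &#038;, but be aware that this also escapes legitimate references such as &nbsp;, so it is safer to escape only the ampersands that are not meant to start a reference.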
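
As a narrower alternative to replacing every ampersand, here is a minimal sketch that escapes only the ampersands that begin an out-of-range numeric character reference. The AmpersandSanitizer class and its method name are hypothetical helpers, not part of SgmlReader, and the sketch assumes the file is small enough to read into a string first.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of SgmlReader): pre-processes HTML text
// before it is handed to the parser.
public static class AmpersandSanitizer
{
    // Matches decimal (&#123) and hexadecimal (&#x7B) numeric character
    // references. The trailing semicolon is not required by the pattern.
    private static readonly Regex NumericReference = new Regex(
        @"&#(?:[xX](?<hex>[0-9A-Fa-f]+)|(?<dec>[0-9]+))",
        RegexOptions.Compiled);

    public static string EscapeInvalidNumericReferences(string html)
    {
        return NumericReference.Replace(html, match =>
        {
            bool isHex = match.Groups["hex"].Success;
            string digits = isHex
                ? match.Groups["hex"].Value
                : match.Groups["dec"].Value;

            // TryParse returns false when the digits overflow a long,
            // which also means the reference is out of range.
            long value;
            bool parsed = long.TryParse(
                digits,
                isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.None,
                CultureInfo.InvariantCulture,
                out value);

            // Valid code points are 0..0x10FFFF, excluding the surrogate
            // range 0xD800..0xDFFF named in the exception message.
            bool valid = parsed
                && value >= 0
                && value <= 0x10FFFF
                && (value < 0xD800 || value > 0xDFFF);

            // Leave valid references alone; otherwise turn the leading
            // '&' into '&amp;' so the rest is treated as plain text.
            return valid ? match.Value : "&amp;" + match.Value.Substring(1);
        });
    }
}
```

With this in place, the cleaned text could be wrapped in a StringReader and passed to fromHTML, for example fromHTML(new StringReader(AmpersandSanitizer.EscapeInvalidNumericReferences(File.ReadAllText(fileName)))). References such as &nbsp; and &#169; pass through unchanged, while &#121669935008 becomes &amp;#121669935008.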