I'm using SgmlReader to parse HTML files in C#. I'm using the sample code provided on their website:
    using (reader = File.OpenText(fileName))
    {
        try
        {
            xmlDoc = fromHTML(reader);
        }
        catch (Exception ex)
        {
            return ReturnedCode.ErrorOpeningHTMLFile;
        }
    }

    private XmlDocument fromHTML(TextReader reader)
    {
        Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = reader;
        // create document
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.Load(sgmlReader);
        return doc;
    }
The code has been running for a long time without any issue. However, it recently started throwing the following exception at the doc.Load(sgmlReader) line:
A valid UTF32 value is between 0x000000 and 0x10ffff, inclusive, and should not include surrogate codepoint values (0x00d800 ~ 0x00dfff).\r\nParameter name: utf32
I was able to narrow down the problem to the below content of the HTML file. If I try to parse a file containing the below code, the exception will be thrown.
<html>
<br>�
</html>
If I remove the ampersand in the second line, the code will work normally.
Any idea what's happening here and how I can fix it? I cannot simply remove all the ampersands in the files.
The & character is an escape character in XML, so every literal & in your data has to be written as its entity form to avoid XML parsing errors. You can do this by replacing every & with &amp; before the content is parsed.
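A blanket replace of & with &amp; would also double-escape entities that are already valid (for example &nbsp; or &#65;), so a narrower variant may be safer. The sketch below is only an illustration, not part of SgmlReader: it assumes the offending ampersand starts a numeric character reference whose value is a surrogate or lies outside 0x000000 to 0x10FFFF, which is what the exception message suggests. HtmlPreprocessor and EscapeInvalidNumericRefs are hypothetical names made up for this example.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of SgmlReader): escapes only the ampersand of
// numeric character references that do not name a valid Unicode code point.
static class HtmlPreprocessor
{
    // Matches decimal (&#55357;) and hexadecimal (&#xD83D;) numeric references.
    private static readonly Regex NumericRef =
        new Regex(@"&#(?:[xX](?<hex>[0-9a-fA-F]+)|(?<dec>[0-9]+));");

    public static string EscapeInvalidNumericRefs(string html)
    {
        return NumericRef.Replace(html, m =>
        {
            int codePoint;
            bool parsed = m.Groups["hex"].Success
                ? int.TryParse(m.Groups["hex"].Value, NumberStyles.AllowHexSpecifier,
                               CultureInfo.InvariantCulture, out codePoint)
                : int.TryParse(m.Groups["dec"].Value, NumberStyles.None,
                               CultureInfo.InvariantCulture, out codePoint);

            // Valid code points are 0..0x10FFFF, excluding surrogates 0xD800..0xDFFF.
            bool valid = parsed
                && codePoint >= 0 && codePoint <= 0x10FFFF
                && !(codePoint >= 0xD800 && codePoint <= 0xDFFF);

            // Leave valid references alone; otherwise turn "&" into "&amp;"
            // so the reference is treated as plain text.
            return valid ? m.Value : "&amp;" + m.Value.Substring(1);
        });
    }
}
```

To use it with the code from the question, one option is to read the file with File.ReadAllText(fileName), pass the text through HtmlPreprocessor.EscapeInvalidNumericRefs, wrap the result in a StringReader, and hand that reader to fromHTML. The invalid reference then ends up in the document as literal text instead of being decoded.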