c#  html-agility-pack  non-well-formed  xml-declaration  end-tag

OptionWriteEmptyNodes breaks the XML declaration in HtmlAgilityPack


Here is the very simple code I have:

using System.Text;
using HtmlAgilityPack;

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionWriteEmptyNodes = true; // write empty elements as <tag />
htmlDoc.Load("sourcefilepath");
htmlDoc.Save("destfilepath", Encoding.UTF8);

Input:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
    <link rel="stylesheet" href="main.css" type="text/css"/>
  </head>
  <body>lots of text here, obviously not relevant to this problem</body>
</html>

Output:

<?xml version="1.0" encoding="UTF-8" />
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
    <link rel="stylesheet" href="main.css" type="text/css" />
  </head>
  <body>lots of text here, obviously not relevant to this problem</body>
</html>

The first line of the output is wrong: the XML declaration ends with /> instead of ?>. This happens when I set OptionWriteEmptyNodes to true. I set it to true because otherwise the meta and link tags (and some others in the document body) are not closed.

Does anyone know how to solve this?


Solution

  • This seems like a bug. You should report it at http://htmlagilitypack.codeplex.com.

    Still, you can work around the bug like this:

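    // Drop the built-in flags for meta and link so they are written with end tags.
    HtmlNode.ElementsFlags.Remove("meta");
    HtmlNode.ElementsFlags.Remove("link");
    HtmlDocument htmlDoc = new HtmlDocument(); // OptionWriteEmptyNodes stays false
    htmlDoc.Load("sourcefilepath");
    htmlDoc.Save("destfilepath", Encoding.UTF8);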
    

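    Remove the flags on the meta and link tags that tell the Html Agility Pack not to close them automatically, and don't set OptionWriteEmptyNodes to true. Note that HtmlNode.ElementsFlags is static, so removing an entry affects every HtmlDocument used afterwards in the same process.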

    It produces the output below. Note that it differs slightly from what you asked for: meta and link get explicit end tags instead of being self-closed.

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"></meta>
        <link rel="stylesheet" href="main.css" type="text/css"></link>
      </head>
      <body>lots of text here, obviously not relevant to this problem</body>
    </html>
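
If the self-closed form (<meta ... />) has to be kept, another option might be to leave OptionWriteEmptyNodes set to true and repair the declaration after saving. This is not the workaround from the answer above, only a minimal sketch of a different technique: it uses nothing but the .NET standard library, and the names XmlDeclarationFixer and Fix are made up for illustration. It assumes the broken declaration sits at the very start of the saved text, as in the output shown in the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class XmlDeclarationFixer
{
    // Matches a leading XML declaration that was terminated with "/>"
    // instead of "?>". Group 1 captures everything before the bad terminator.
    private static readonly Regex BrokenDeclaration =
        new Regex(@"^(\s*<\?xml\s[^>]*?)\s*/>");

    // Returns the markup with the declaration terminator repaired.
    // Input that already has a correct declaration (or none) is returned unchanged.
    public static string Fix(string markup)
    {
        if (markup == null) throw new ArgumentNullException(nameof(markup));
        return BrokenDeclaration.Replace(markup, "$1?>", 1);
    }
}
```

One way to use it would be to read the saved file back with File.ReadAllText, pass the text through XmlDeclarationFixer.Fix, and write the result out again with File.WriteAllText. The pattern is anchored to the start of the text, so self-closed elements further down, such as meta and link, are left alone.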