c# html .net html-agility-pack

Does HtmlAgilityPack implement the standard HTML5 parsing algorithm?


When I parse an HTML5 document such as:

<p>Content</p>

using HtmlAgilityPack with default options, it parses it successfully, but the constructed HtmlDocument does not include the <html> and <body> elements that the standard HTML5 parsing algorithm would construct.
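For reference, a minimal sketch of the behaviour described above, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (HapDefaultParseSketch, HasBodyElement) are hypothetical and exist only for this illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class HapDefaultParseSketch
{
    // Returns true if the parsed document contains a <body> element.
    // (Hypothetical helper, not part of HtmlAgilityPack.)
    public static bool HasBodyElement(string markup)
    {
        var doc = new HtmlDocument();   // default options
        doc.LoadHtml(markup);
        // SelectSingleNode returns null when nothing matches the XPath.
        return doc.DocumentNode.SelectSingleNode("//body") != null;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<p>Content</p>");

        // The tree contains only what was in the input; no wrapper elements are added.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
        Console.WriteLine(HasBodyElement("<p>Content</p>"));
    }
}
```

With the fragment above, the serialized tree is just the original `<p>` element, and the XPath lookup for `<body>` finds nothing.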

Are there options I am missing that would do this?

Or is there some other library (.NET 6) that I should be using instead?


Solution

  • As far as I can tell, HtmlAgilityPack does not offer this capability, unless the functionality is very well hidden.

    I discovered the package AngleSharp, which seems to meet my requirement.

    Well, almost. Parsing <p>Content</p>, I get

    <?xml version="1.0" encoding="UTF-8"?>
    <HTML xmlns="http://www.w3.org/1999/xhtml"><HEAD/>
    <BODY><P>Content</P></BODY></HTML>
    

    I need to do a bit of further work to get the element names in lower case, but we're close.
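A minimal sketch of the AngleSharp approach, assuming the AngleSharp NuGet package, version 0.10 or later, where HtmlParser lives in the AngleSharp.Html.Parser namespace. The class and method names (Html5ParseSketch, ParseAndSerialize) are hypothetical. It relies on AngleSharp's own HTML serialization through OuterHtml rather than a separate XML writer, which may avoid the case problem.

```csharp
using System;
using AngleSharp.Html.Parser;

public static class Html5ParseSketch
{
    // Parses markup with AngleSharp's HTML5 parser and serializes the
    // root element back to HTML. (Hypothetical helper for illustration.)
    public static string ParseAndSerialize(string markup)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(markup);

        // OuterHtml uses the HTML serializer, which writes lower-case tag names.
        return document.DocumentElement.OuterHtml;
    }

    public static void Main()
    {
        // The parser inserts the implied <html>, <head> and <body> elements.
        Console.WriteLine(ParseAndSerialize("<p>Content</p>"));
        // prints: <html><head></head><body><p>Content</p></body></html>

        var document = new HtmlParser().ParseDocument("<p>Content</p>");
        // NodeName follows the DOM convention of upper case for HTML elements;
        // LocalName holds the lower-case name.
        Console.WriteLine(document.Body.NodeName);   // BODY
        Console.WriteLine(document.Body.LocalName);  // body
    }
}
```

If the upper-case names in the output above come from a custom serializer that reads NodeName, switching to LocalName is one possible way to get lower-case element names.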