c#, xhtml, html-agility-pack

HTML Agility Pack (C#) malforms my code


I'm currently writing a desktop application in C# that also has to manipulate XHTML documents. For that purpose I'm using the Html Agility Pack, which seemed fine so far. After carefully checking its output, however, I found that the resulting code is no longer well-formed XHTML.

It removes the trailing slash from self-closing tags and rewrites proprietary code fragments embedded in the markup.

E.g. input HTML code:

<input autocapitalize="off" id="username" name="username" placeholder="Benutzername" type="text" value="$(username)" />

E.g. output HTML code:

<input autocapitalize="off" id="username" name="username" placeholder="Benutzername" type="text" value="$(username)">

(removed the trailing slash...)

Another example is with proprietary code elements (for Mikrotik hotspot devices):

E.g. input HTML code:

<form action="$(link-login-only)" method="post" name="login" $(if chap-id) onSubmit="return doLogin()"$(endif)>

The $(if chap-id), $(endif) and $(link-login-only) parts are custom code fragments interpreted by the Mikrotik device.

E.g. output HTML code after Html Agility Pack (which turns it into unusable code):

<form action="$(link-login-only)" method="post" name="login" $(if="" chap-id)="" onsubmit="return doLogin()" $(endif)="">

Does anyone have an idea how to "instruct" Html Agility Pack to output well-formed XHTML and to ignore "custom code" fragments (is this possible via Regex)?

Thanks in advance! :-)


Solution

  • In your first example, HTML Agility Pack is actually normalizing your markup according to HTML rules. The input element is a void element: since it has no content, HTML does not require a closing tag or a trailing slash.

    HTML Agility Pack is made for parsing valid HTML markup, not markup with custom code embedded in it. In your first example, the custom markup is inside a quoted attribute value and therefore isn't an issue. In your second example, the fragments $(if chap-id) and $(endif) sit outside the quotes, in attribute position.

    HTML Agility Pack tries to parse them as regular (but malformed) attributes of the element. There's no way to fix that. You'll have to find another way to parse your markup if you need support for custom code inside the markup.
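One workaround along the lines the question hints at (Regex) is to hide the `$(...)` fragments from the parser and put them back afterwards. The following is only a minimal sketch under stated assumptions: `TemplateTokenMasker`, `Mask`, `Unmask`, and the `tpltokenNx` placeholder format are hypothetical names and not part of Html Agility Pack, the fragments are assumed to contain no nested parentheses, and the placeholder text is assumed not to occur in the document otherwise.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Html Agility Pack.
public static class TemplateTokenMasker
{
    // Matches $(...) fragments without nested parentheses, e.g. $(if chap-id).
    private static readonly Regex Token = new Regex(@"\$\([^)]*\)");

    // Matches a placeholder, plus the empty value ="" that a parser may
    // append when the placeholder stood in attribute position. Matching is
    // case-insensitive because a parser may change the case of attribute names.
    private static readonly Regex Placeholder =
        new Regex(@"tpltoken(\d+)x(?:="""")?", RegexOptions.IgnoreCase);

    // Replaces every fragment with a numbered placeholder and records the
    // original text in 'tokens' at the matching index.
    public static string Mask(string markup, List<string> tokens)
    {
        return Token.Replace(markup, m =>
        {
            tokens.Add(m.Value);
            return "tpltoken" + (tokens.Count - 1) + "x";
        });
    }

    // Puts the recorded fragments back in place of their placeholders.
    public static string Unmask(string markup, IReadOnlyList<string> tokens)
    {
        return Placeholder.Replace(markup, m => tokens[int.Parse(m.Groups[1].Value)]);
    }
}
```

The intended flow is `Mask`, then parse and modify the masked string with Html Agility Pack (for example via `HtmlDocument.LoadHtml` and `DocumentNode.OuterHtml`), then `Unmask` the serialized result. Whitespace around the restored fragments and the case of attribute names such as `onSubmit` may still differ from the input. For the trailing slash, `HtmlDocument` has an `OptionWriteEmptyNodes` setting that should make empty elements be written in closed form; its exact behavior is worth checking against the library version in use.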