HTML Purifier: disable syntax repair

Consider the following setup of HTML Purifier:

require_once 'library/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('Core.EscapeInvalidTags', true);
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);

If you run the following case:

$dirty_html = "<p>lorem <script>ipsum</script></p>";

//output
<p>lorem &lt;script&gt;ipsum&lt;/script&gt;</p>

As expected, instead of removing the invalid tags, it just escaped them all.

However, consider these other test cases:

case 1

$dirty_html = "<p>lorem <b>ipsum</p>";

//output
<p>lorem <b>ipsum</b></p>

//desired output
<p>lorem &lt;b&gt;ipsum</p>

case 2

$dirty_html = "<p>lorem ipsum</b></p>";

//output
<p>lorem ipsum</p>

//desired output
<p>lorem ipsum&lt;/b&gt;</p>

case 3

$dirty_html = "<p>lorem ipsum<script></script></p>";

//output
<p>lorem ipsum&lt;script /&gt;</p>

//desired output
<p>lorem ipsum&lt;script&gt;&lt;/script&gt;</p>

Instead of just escaping the invalid tags, first it repairs them and then escapes them. This way things can get very strange, for example:

case 4

$dirty_html = "<p><a href='...'><div>Text</div></a></p>";

//output
<p><a href="..."></a></p><div><a href="...">Text</a></div><a href="..."></a>&lt;/p&gt;

Question
Therefore, is it possible to disable the syntax repair and just escape the invalid tags?

Solution

The reason you're seeing a syntax repair is because of the fundamental way that HTML Purifier approaches the topic of HTML sanitation: It first parses the HTML to understand it, then decides which of the elements to keep in the parsed representation, then renders the HTML.

You might be familiar with one of stackoverflow's most famous answers, which is an amused and exasperated observation that true regular expressions can't parse HTML - you need additional logic, since HTML is a context-free language, not a regular language. (Modern 'regular' expressions are not formal regular expressions, but that's another matter.) In other words, if you actually want to know what's going on in your HTML - so that you correctly apply your white- or blacklisting - you need to parse it, which means the text ends up in a totally different representation.

An example of how parsing causes changes between input and output is that HTML Purifier strips extraneous whitespace from between attributes, which may not bother you in your case, but still stems from that the parsed representation of HTML is quite different from the text representation. It's not trying to preserve the form of your input - it's trying to preserve the function.

This gets tricky when there is no clear function and it has to start guessing. To pick an example, imagine while going through the HTML input, you come across what looks like an opening <td> tag in the middle of nowhere - you can consider it valid if there was an unclosed <td> tag a while back as long as you add a closing tag, but if you had escaped the first tag as <td>, you would need to discard the text data that would have been in the <td> since - depending on browser rendering - it may put data into parts of the page visually outside the fragment, i.e. places that are not clearly user-submitted.

In brief: You can't easily disable all syntax repair and/or tidying without having to rummage through the parsing guts of HTML Purifier and ensuring no information you find valuable is lost.

That said, you can try switching the underlying parsing engine with Core.LexerImpl and see if it gets you better results! :) DOMLex definitely adds missing ending nodes right from the get-go, but from a cursory glance, DirectLex may not. There is a large chunk of autoclosing logic in HTMLPurifier's MakeWellFormed strategy class which might also pose a problem for you.

Depending on why you want to preserve this data, though (to allow analysis?), saving the original input separately (while leaving HTML Purifier itself be) may provide you with a better solution.