Tags: php, htmlpurifier, html-sanitizing

HTML Purifier - Escape disallowed tags instead of stripping


I'm using HTML Purifier to sanitize user input. I have a list of allowed elements configured, which means that any tag not in the allowed list is stripped. Code below:

require_once "HTMLPurifier.standalone.php";
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.AllowedElements', array('strong','b','em','i'));
$purifier = new HTMLPurifier($config);
$safe_html = $purifier->purify($dirty_html);

Rather than only retaining their contents, I would like the elements that are not included in the list to be escaped and sent back as text.


To illustrate, given the white list shown above, the following input string:

<a href="javascript:alert('XSS')"><strong>CLAIM YOUR PRIZE</strong></a>

turns into "<strong>CLAIM YOUR PRIZE</strong>", because the a element is not whitelisted. Similarly,

<b>Check the article <a href="http://example.com/">here</a></b>

becomes "<b>Check the article here</b>".

Is there a way to turn the above two examples into the following:

&lt;a href="javascript:alert('XSS')"&gt;<strong>CLAIM YOUR PRIZE</strong>&lt;/a&gt;
<b>Check the article &lt;a href="http://example.com/"&gt;here&lt;/a&gt;</b>

purely by adjusting HTML Purifier's configuration without resorting to regular expression-based "hacks"? If there is, then I'd like to know how it's done.


Solution

  • The setting Core.EscapeInvalidTags should be what you're looking for:

    require_once(__DIR__ . '/library/HTMLPurifier.auto.php');
    
    $dirty_html = '<a href="javascript:alert(\'XSS\')"><strong>CLAIM YOUR PRIZE<div></div></strong></a>';
    
    $config = HTMLPurifier_Config::createDefault();
    $config->set('HTML.AllowedElements', array('strong','b','em','i'));
    $config->set('Core.EscapeInvalidTags', true);
    $purifier = new HTMLPurifier($config);
    $safe_html = $purifier->purify($dirty_html);
    
    echo $safe_html . PHP_EOL;
    

    ...gives:

    &lt;a href="javascript:alert('XSS')"&gt;<strong>CLAIM YOUR PRIZE&lt;div /&gt;</strong>&lt;/a&gt;
    

    I added an invalid child element, <div></div>, so you can see what happens: because HTML Purifier parses the input, it still alters the original markup (<div></div> becomes <div />), but the information is preserved (escaped as &lt;div /&gt;).
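
If you need this in more than one place, the configuration above could be wrapped in a small helper. The following is only a sketch: the function name purify_escaping_invalid_tags is hypothetical (it is not part of HTML Purifier), and the include path assumes the library is unpacked next to the script, as in the snippet above. It runs the two inputs from the question through the same settings so you can inspect the result yourself.

```php
<?php
// Assumption: HTML Purifier lives in ./library, as in the answer's snippet.
require_once __DIR__ . '/library/HTMLPurifier.auto.php';

/**
 * Hypothetical helper (not part of HTML Purifier): keep only the
 * elements in $allowed and escape every other tag as plain text.
 */
function purify_escaping_invalid_tags(string $dirty_html, array $allowed): string
{
    $config = HTMLPurifier_Config::createDefault();
    $config->set('HTML.AllowedElements', $allowed);
    // Escape disallowed tags instead of silently dropping them.
    $config->set('Core.EscapeInvalidTags', true);

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirty_html);
}

// The whitelist and the two inputs from the question.
$allowed = array('strong', 'b', 'em', 'i');
$inputs = array(
    '<a href="javascript:alert(\'XSS\')"><strong>CLAIM YOUR PRIZE</strong></a>',
    '<b>Check the article <a href="http://example.com/">here</a></b>',
);

foreach ($inputs as $dirty_html) {
    echo purify_escaping_invalid_tags($dirty_html, $allowed) . PHP_EOL;
}
```

The helper builds a new HTMLPurifier object on every call to keep the sketch short. If you purify many strings, it is probably better to create the purifier once and reuse it.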