I'm working on an existing web app where user-generated content is allowed to have HTML tags. To mitigate XSS attacks and other risks, I'm using the HTML Purifier library that parses the content as HTML, and removes any tags that are not on the allow list before rendering it server-side.
I'm looking to improve performance by only invoking the expensive purifier library for strings that a cheaper test deems risky, that is, by checking for the < character, which would indicate there could be an HTML tag in the string.
So, does the absence of the < character in a user-generated string guarantee that the string does not contain anything that will be rendered as an HTML tag when included in an HTML document from the server? In other words, is there a substring an attacker could include that would result in, say, a <script></script> tag being rendered by the browser without the original string containing a < character?
The kind of thing I'm thinking of is a character encoding trick that makes the browser ultimately process a < character as the opening of a tag without the original string including that literal character.
I am okay with interpreting HTML entities that the user inputs and rendering them as the associated characters, such as &amp; being rendered as &. I just want to make sure there are no HTML tags rendered outside the allow list.
The usual approach of encoding all HTML special characters (e.g. <, >, &) as entities, which would always render the string exactly as the user entered it, is not an option for this app: allowing HTML tags that are on the allow list is required.
The HTML standard requires the < character to open a tag.
From https://www.w3.org/TR/2011/WD-html5-20110405/syntax.html#syntax-start-tag:
The first character of a start tag must be a U+003C LESS-THAN SIGN character (<).
So a string without that character cannot contain HTML markup, and you can check for its presence before passing the string to the purifier.
That said, existing or future browser bugs might cause characters other than U+003C to be accepted as the start of a tag.
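The pre-check described above can be sketched as follows. This is only an illustration, not HTML Purifier's own API: `sanitize_if_needed` and the `purify` callable are hypothetical names standing in for whatever wraps the real purifier call in your app. The sketch also assumes that the string is inserted into ordinary element content (not into an attribute value, a script, or a style block) and that the page explicitly declares its character encoding, so the browser has no reason to reinterpret the bytes.

```python
from typing import Callable


def sanitize_if_needed(text: str, purify: Callable[[str], str]) -> str:
    """Run the expensive purifier only when the string could contain a tag.

    `purify` is a hypothetical stand-in for the real purifier call.
    """
    # Every start tag, end tag, comment, and doctype begins with U+003C,
    # so a string without it has nothing for the purifier to strip.
    if "<" not in text:
        return text
    return purify(text)


if __name__ == "__main__":
    # Stand-in purifier used only to show which inputs reach it.
    def fake_purify(s: str) -> str:
        return "[purified] " + s

    print(sanitize_if_needed("Tom &amp; Jerry", fake_purify))
    print(sanitize_if_needed("<b>bold</b>", fake_purify))
```

Strings that skip the purifier are returned unchanged, so any entities in them are still decoded by the browser. That matches the requirement in the question: a decoded character reference such as &lt; is rendered as text and does not open a tag.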