
How is "<" in HTML handled by browsers?


In the following snippet, the < is rendered as expected in Firefox 37.0.2, and I have seen the same in many other modern browsers. Is this textarea markup valid HTML5? Ideally, shouldn't the "<" be escaped as &lt;?

<html>
<textarea>
Hello World <
</textarea>
</html>

How do HTML parsers distinguish between a tag open delimiter and a literal "<"? Most browsers do a lot to handle errors automatically by guessing; is this one such case?

The reason I am interested in this is that when we use WYSIWYG editors in web apps, we mostly save the HTML from the editor's source. When we template it back into the frontend, this behaviour means it is not mandatory to HTML-escape content coming from the backend. It works without escaping, but it can cause undesired effects such as freezing or infinite loops, at least with TinyMCE 3.5.8.


Solution

  • This is indeed just guessing. The proper way to use literal < in HTML is to use &lt; (and &gt; for >).

    That said, textarea is a bit special in that it can never contain other HTML elements, so the parser can be sure you meant a literal < and not the start of a tag. Of course, this breaks down for </textarea> :)

    From HTML 4 specification:

    Section 5.3.2:

    Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

    So it's not strictly necessary in HTML 4, but it's still good practice. And of course, XHTML and/or HTML5 may be a bit stricter.

    The HTML specification is actually quite non-specific about a lot of things, which goes a long way towards ensuring that browsers are incompatible with each other in (more or less) subtle ways. Your best bet is not to rely on everything HTML allows, but only on the parts that are explicit and specific. The reason is quite simple: two browsers can both be 100% compliant with the HTML specification and still process the same HTML in ways that make it completely useless.
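
To make the escaping advice concrete for the templating case in the question, here is a minimal sketch in Python. The question does not say which backend language is in use, so Python is an assumption, and `render_textarea` is a hypothetical helper name, not part of any library. It relies only on the standard library's `html.escape` and shows both the lone `<` from the question and the `</textarea>` case where unescaped content ends the element early.

```python
import html


def render_textarea(content: str) -> str:
    """Hypothetical helper: wrap content in a <textarea>, HTML-escaped."""
    # html.escape replaces &, < and > (and, by default, quotes)
    # with character references such as &lt; and &gt;.
    return "<textarea>" + html.escape(content) + "</textarea>"


# The snippet from the question: the lone "<" becomes "&lt;".
simple = render_textarea("Hello World <")

# Content that contains the element's own end tag.
tricky = "</textarea><b>oops</b>"
unsafe = "<textarea>" + tricky + "</textarea>"  # no escaping
safe = render_textarea(tricky)

print(simple)
print(unsafe.count("</textarea>"))  # end tags seen without escaping
print(safe.count("</textarea>"))    # end tags seen with escaping
```

Without escaping, the markup contains two `</textarea>` end tags, so the element closes at the first one and the rest of the content is parsed as ordinary HTML. With escaping there is exactly one, and the browser shows the original text inside the textarea.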