Is it valid to write '<' and '>' in HTML5 with spaces surrounding them or must they always be written as HTML entities?


Is the following valid HTML5?

<p>1 < 2</p>
<p>2 > 1</p>

Or must this always be written using HTML5 entities like this?

<p>1 &lt; 2</p>
<p>2 &gt; 1</p>

Can someone help me answer this question with references to the HTML5 specification that clearly spells out whether or not it is valid to write < and > (spaces around the symbols) in HTML?


Solution

  • > in intended text content is, and always has been, safe and valid in HTML, even without surrounding spaces.

    < is technically invalid when it does not start a tag in a context where tags are expected. Slightly simplified: when the parser encounters it in the "Data state", it switches to the "Tag open state", which expects either a valid tag name or another markup-related character (/ for a closing tag, or ! for a doctype or comment).

    A valid HTML tag name must start with an ASCII letter ([a-z], case-insensitive), so encountering a space character there instead is a parse error, "invalid-first-character-of-tag-name", which instructs the parser to handle it so that

    such code point and a preceding U+003C (<) is treated as text content, and all content that follows is treated as markup.

    So, like all other similar syntax errors in HTML, it has a clear canonical recovery that conforming parsers have to follow. It is therefore "invalid" and yet has a predictable, standardized outcome, so one might consider it 'safe' to exploit. In this case the sequence "< " (what the validator reports as "Bad character after <") makes the parser emit the < as text, return to the "Data state", and handle the "bad character" (the space) there as ordinary text. In the end it is displayed the same way as if it had been encoded as &lt;.
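
The decision described above can be sketched in a few lines of Python. This is only a simplified model of the "Tag open state", not a real HTML tokenizer: the function names after_less_than and stray_less_than_columns are made up for illustration, and the scan ignores attribute values, comments and raw text elements.

```python
import string


def after_less_than(next_char):
    """Simplified "Tag open state": what follows a '<' seen in the "Data state"."""
    if next_char == "!":
        return "markup declaration"  # doctype or comment
    if next_char == "/":
        return "end tag"
    if len(next_char) == 1 and next_char in string.ascii_letters:
        return "start tag"
    if next_char == "?":
        return "bogus comment"  # a different parse error, not covered above
    # Anything else (space, digit, end of input): the '<' is emitted as text.
    return "text"


def stray_less_than_columns(line):
    """1-based columns of the 'bad character' after each '<' that ends up as text."""
    columns = []
    for index, char in enumerate(line):
        if char == "<":
            next_char = line[index + 1] if index + 1 < len(line) else ""
            if after_less_than(next_char) == "text":
                columns.append(index + 2)
    return columns


print(stray_less_than_columns("<p>a > b < c</p>"))  # [11]
```

Only the < followed by a space is flagged; the < of <p> and </p> are followed by a letter and a slash, so they open tags as usual.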

    You can verify this by validating the following sample document

    <!doctype html><html lang="en">
    
    <title>a > b < c</title>
    
    <p>a > b < c</p>
    
    <textarea>a > b < c</textarea>
    

    in validator.w3.org/nu/. It yields:

    Error: Bad character  after <.
    Probable cause: Unescaped <.
    Try escaping it as `&lt;`. At line 5, column 11
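
If the markup is generated by a program, the escaping the validator asks for is usually a library call. As a minimal sketch, assuming the text is produced in Python (the question does not say so), the standard library's html.escape does it:

```python
from html import escape, unescape

raw = "a > b < c"

# quote=False leaves " and ' alone; &, < and > are still escaped.
escaped = escape(raw, quote=False)
print(escaped)  # a &gt; b &lt; c

# Decoding the entities gives back the original text.
assert unescape(escaped) == raw
```

Escaping > as well is harmless, even though only < is required here.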
    

    N.B. in title and textarea an unescaped < is OK: these are escapable raw text elements, so they cannot contain any nested non-text nodes (not even comments).