Tags: html, html-entities, html-encode, html-escape-characters

Which characters need to be escaped in HTML?


Are they the same as XML, perhaps plus the space one (&nbsp;)?

I've found some huge lists of HTML escape characters but I don't think they must be escaped. I want to know what needs to be escaped.


Solution

  • Short answer

    If you're putting the text in a safe location in a document that uses a fully-Unicode-compatible text encoding like UTF-8, HTML only requires the same five characters to be escaped as XML: the ampersand & as &amp;, the less-than sign < as &lt;, the greater-than sign > as &gt;, the double-quote " as &quot;, and the single-quote ' as &#39;. Safe locations are directly in the contents of most tags (<p>username: HERE</p>), and inside quoted attribute values (<a href="/user/HERE">). The contents of <script> and <style> tags are not safe locations. Other unsafe locations include unquoted attribute values, tag names, attribute names, doctype declarations, XML declarations, XML processing instructions, and CDATA sections.

    function htmlEscape(text) {
      return String(text)
        .replaceAll("&", "&amp;")
        .replaceAll("<", "&lt;")
        .replaceAll(">", "&gt;")
        .replaceAll('"', "&quot;")
        .replaceAll("'", "&#39;");
    }
    

    Nuance

    Document text encoding

    These days almost every document is encoded using the fully-Unicode-compatible UTF-8 text encoding, which may be indicated with a meta tag <meta charset="utf-8" /> or HTTP header Content-Type: text/html; charset=utf-8. In that case, no other characters require escaping. However, if the document is using an older encoding (such as US-ASCII), you will also need to escape any characters that aren't supported by that encoding. For example, you may need to encode "😅" as &#x1F605;. If the document's encoding isn't specified, the safest option is to treat it as US-ASCII, but in most cases it should be specified and you shouldn't need to handle this.

    function asciiHtmlEscape(text) {
      return htmlEscape(text).replace(/[^\x00-\x7F]/gu, char =>
        `&#x${char.codePointAt(0).toString(16).toUpperCase()};`
      );
    }
    

    Non-breaking spaces

    In general, you should not escape spaces as &nbsp;. &nbsp; is not a normal space; it's a non-breaking space. You can use these instead of normal spaces to prevent a line break from being inserted between two words, or to insert          extra        space       without it being automatically collapsed, but this is an infrequent case. Don't do this unless you have a specific requirement that demands it.

    Contexts with narrower requirements

    The five characters listed above are sufficient to encode text in any safe location for both HTML and XML documents. For simplicity, compatibility, and to reduce the chance of mistakes, it's common to escape all of them in all cases. However, the actual requirements are narrower, and context-aware escaping logic may choose to escape fewer characters.

    • & needs to be escaped in all cases. (The spec says it only needs to be escaped when it's an "ambiguous ampersand", i.e. followed by one or more ASCII alphanumeric characters and then a semicolon, but for simplicity and compatibility with less-compliant parsers it's practically always escaped.)
    • < needs to be escaped when it appears in tag contents (<b>x &lt; y</b>), but not when it appears in quoted attribute values (<b title="x < y">).
    • " and ' need to be escaped when they appear in attribute values that are quoted using the same character (<b title="he said &quot;that's hers&quot;">, <b title='he said "that&#39;s hers"'>), but not when they appear in tag contents (<b>he said "that's hers"</b>).
    • > doesn't actually need to be escaped in modern HTML syntax, but it is still common to do so for maximum compatibility. It does need to be escaped in tag contents for XML (including the XML Syntax for HTML, a.k.a. XHTML) and older versions of HTML (before HTML 5) when it appears in the sequence ]]>. It also needed to be escaped in quoted attribute values due to bugs in some browsers' HTML parsers in the 1990s.
    • The contents of <textarea> and <title> tags are a special case where escaping is supported but optional except for & and for the closing tag (</textarea> or </title>). However, because the normal escaping rules still work, these are usually just treated the same as any other safe tag.
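
    As a rough illustration of the narrower rules listed above, a context-aware escaper might look something like the sketch below. The function names are made up for this example, and it assumes modern HTML syntax (not XML, where < and the sequence ]]> have stricter rules).

```javascript
// Hypothetical helper names; each one escapes only what its context requires.

// Text between tags, e.g. <b>HERE</b>: only & and < are significant.
function escapeTextContent(text) {
  return String(text).replaceAll("&", "&amp;").replaceAll("<", "&lt;");
}

// Inside a double-quoted attribute, e.g. <b title="HERE">: only & and ".
function escapeDoubleQuotedAttribute(text) {
  return String(text).replaceAll("&", "&amp;").replaceAll('"', "&quot;");
}

// Inside a single-quoted attribute, e.g. <b title='HERE'>: only & and '.
function escapeSingleQuotedAttribute(text) {
  return String(text).replaceAll("&", "&amp;").replaceAll("'", "&#39;");
}

const quote = `he said "that's hers" & x < y`;
console.log(`<b>${escapeTextContent(quote)}</b>`);
// <b>he said "that's hers" &amp; x &lt; y</b>
console.log(`<b title="${escapeDoubleQuotedAttribute(quote)}">`);
// <b title="he said &quot;that's hers&quot; &amp; x < y">
```

    The catch is that each function is only correct in its own context, which is exactly why escaping all five characters everywhere is the more common choice.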

    Unsafe attributes

    This answer only considers safety and correctness from a text encoding perspective. However, some attributes, such as onclick and the others used for binding event handlers, have values with special meaning of their own, and inserting arbitrary content into them can create an XSS security vulnerability. Similarly, if you're encoding a value to include as a parameter in a URL's query string in a link's href attribute, you need to make sure it's percent-encoded first (such as with JavaScript's encodeURIComponent). Considerations like that are out of scope for this answer.
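
    To make the query-string case concrete, here is a minimal sketch of the two layers of encoding. The searchLink helper, the /search path and the q and lang parameter names are invented for illustration.

```javascript
// Same five-character escape as in the short answer, repeated so this runs on its own.
function htmlEscape(text) {
  return String(text)
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&#39;");
}

// Hypothetical helper: builds a search link for an untrusted query string.
function searchLink(query) {
  // Step 1: percent-encode the value for the URL's query string.
  const url = `/search?q=${encodeURIComponent(query)}&lang=en`;
  // Step 2: HTML-escape the whole URL for the quoted attribute value,
  // and the visible label for the tag contents.
  return `<a href="${htmlEscape(url)}">${htmlEscape(query)}</a>`;
}

console.log(searchLink('tom & "jerry"'));
// <a href="/search?q=tom%20%26%20%22jerry%22&amp;lang=en">tom &amp; &quot;jerry&quot;</a>
```

    Note that the & separating the two parameters is part of the URL's own syntax, so it is only HTML-escaped, while the & inside the value is percent-encoded first.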

    Unsafe locations

    As mentioned, this answer only considers escaping text that will be included in "safe" locations: tag contents (excluding <script> and <style> tags) and quoted attribute values. You should almost never need to include dynamic untrusted text in other locations. If you really do need to, please exercise extreme caution, prefer validating specified allowed values instead of escaping arbitrary values, and read the Open Web Application Security Project's XSS Prevention Cheat Sheet. Some considerations for escaping text for unsafe locations are described below, but these descriptions are just meant as a starting point, and may not exhaustively cover everything necessary for correctness/safety.

    Tag and attribute names

    There is no mechanism for escaping tag and attribute names. If you really need to use a dynamic value in these names, the best you can do is to check for any characters that aren't allowed and either filter them out or throw an error.
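
    A minimal sketch of the "throw an error" approach might look like this. The pattern is a deliberately conservative assumption (stricter than what the HTML spec permits), and assertSafeName is a hypothetical helper.

```javascript
// Conservative allow-list (an assumption, stricter than the HTML spec):
// an ASCII letter followed by ASCII letters, digits, or hyphens.
const SAFE_NAME = /^[A-Za-z][A-Za-z0-9-]*$/;

// Hypothetical helper: returns the name unchanged, or throws if it is unsafe.
function assertSafeName(name) {
  if (typeof name !== "string" || !SAFE_NAME.test(name)) {
    throw new Error(`Unsafe tag or attribute name: ${JSON.stringify(name)}`);
  }
  return name;
}

console.log(`<${assertSafeName("section")} ${assertSafeName("data-id")}="1">`);
// <section data-id="1">
```

    This only checks the characters. It still lets through names like script or onclick, so checking against a fixed list of names you expect is safer where that's possible.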

    script tag contents

    Sometimes people try to escape values to include in the contents of <script> tags (which do not support normal entity escaping) by encoding them with JSON, and then escaping < as \u003C when it appears in the sequence </script>. This is safe for primitive values, but can run into problems when stringifying objects due to the special behaviour of the "__proto__" key name when it appears in an object literal. Instead of trying to include these values in the script directly, you should include them in a quoted attribute value and read that attribute from the script instead. This will also avoid the difficulties that come with dynamic script contents if your document uses a Content Security Policy.
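
    The attribute-based alternative could be sketched roughly as follows. The renderDataElement helper, the data-json attribute name and the page-data id are assumptions made for this example.

```javascript
// Same five-character escape as in the short answer, repeated so this runs on its own.
function htmlEscape(text) {
  return String(text)
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;")
    .replaceAll('"', "&quot;")
    .replaceAll("'", "&#39;");
}

// Hypothetical helper: serialises a value as JSON into a quoted data attribute
// on an element that a static script can look up by id.
function renderDataElement(id, value) {
  const json = JSON.stringify(value);
  return `<div id="${htmlEscape(id)}" data-json="${htmlEscape(json)}" hidden></div>`;
}

console.log(renderDataElement("page-data", { user: "</script><b>", n: 1 }));
// <div id="page-data" data-json="{&quot;user&quot;:&quot;&lt;/script&gt;&lt;b&gt;&quot;,&quot;n&quot;:1}" hidden></div>

// In the browser, a static script would then read the value back with:
//   const data = JSON.parse(document.getElementById("page-data").dataset.json);
```

    Because the value sits in a quoted attribute, the ordinary five-character escape is all that's needed, and the script itself never changes.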

    style tag contents

    There's no generally useful way to escape values to include in the contents of <style> tags (which also do not support normal entity encoding), and it would also create difficulties if you're using a Content Security Policy. If you're trying to include a value as part of a CSS selector, you might be able to use the approach of CSS.escape. If you want to include a simple number (not NaN or an Infinity), you might be able to simply convert it to a string. If you want to include a value in a CSS string expression, you might be able to use CSS's own string escaping rules plus a special case to ensure < is encoded when it appears in the sequence </style>. Consider alternatives instead.
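
    For the CSS string case only, one possible sketch is below. It uses CSS hexadecimal escapes for every character it touches, and for simplicity escapes every < rather than only the ones in </style>. The cssString name is hypothetical, and this is a starting point rather than a vetted implementation.

```javascript
// Hypothetical helper: returns a double-quoted CSS string literal.
function cssString(text) {
  // Hex-escape the characters that could end the string or the <style> tag:
  // backslash, double quote, line breaks, and <. The trailing space ends
  // the hex escape so that following hex digits aren't absorbed into it.
  const escaped = String(text).replace(/[\\"\n\r\f<]/g, char =>
    `\\${char.codePointAt(0).toString(16).toUpperCase()} `
  );
  return `"${escaped}"`;
}

console.log(`.note::after { content: ${cssString('a "b" </style>')}; }`);
// .note::after { content: "a \22 b\22  \3C /style>"; }
```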

    Comments

    Comments are an unusual case. If you escape < and > and your value isn't immediately following the sequence <! or immediately preceding the sequence > or !>, the value will be properly contained by the comment and won't affect the rest of the document. However, in terms of the DOM API, comment nodes do actually have a data value, just like text nodes, but they don't support any escape sequences. That means that the data/node value of the comment <!--&gt;--> is &gt;, not >, and there's no way to avoid that.

    CDATA sections

    In CDATA sections, the only thing that needs to be escaped is the closing sequence ]]> by splitting into two CDATA sections in the middle, replacing it with ]]]]><![CDATA[>. However, the only place a CDATA section can appear in modern HTML syntax is within the <svg> and <math> tags, which follow more XML-like rules.
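
    The splitting trick described above can be sketched in a couple of lines (toCdata is a hypothetical helper name):

```javascript
// Hypothetical helper: wraps text in a CDATA section, splitting any embedded
// "]]>" so that its "]]" and ">" end up in two adjacent sections.
function toCdata(text) {
  return `<![CDATA[${String(text).replaceAll("]]>", "]]]]><![CDATA[>")}]]>`;
}

console.log(toCdata("if (a[b[0]]>1) {}"));
// <![CDATA[if (a[b[0]]]]><![CDATA[>1) {}]]>
```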

    Unquoted attributes

    Sometimes people think that escaping > is useful in case the value is used in an unquoted attribute value by accident. While that can help in some very specific cases, actually safely escaping unquoted attribute values requires a dozen other characters to be escaped, including spaces, and nobody is doing that, so escaping > only provides very marginal protection there. The only good reasons to escape it are for compatibility with XML and with bad parsers, or if the value is being used inside of a comment.

    Other escaped representations

    The escaped representations used above are the most common, and there's no need to use anything else. However, other representations are sometimes used. The hexadecimal forms are case-insensitive, but the named forms are case-sensitive in HTML and should be written in lowercase as shown.

    • & may be escaped as &amp;, &#38; or &#x26;.
    • < may be escaped as &lt;, &#60; or &#x3C;.
    • > may be escaped as &gt;, &#62; or &#x3E;.
    • " may be escaped as &quot;, &#34; or &#x22;.
    • ' may be escaped as &#39;, &#x27; or &apos;, but &apos; wasn't valid in HTML prior to HTML 5.
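
    The numeric forms in the list above are just the character's code point written in decimal or hexadecimal, which a small sketch can confirm (numericReferences is a hypothetical helper name):

```javascript
// Hypothetical helper: returns the decimal and hexadecimal character
// references for the first code point of a string.
function numericReferences(char) {
  const codePoint = String(char).codePointAt(0);
  return [`&#${codePoint};`, `&#x${codePoint.toString(16).toUpperCase()};`];
}

for (const char of ["&", "<", ">", '"', "'"]) {
  console.log(char, numericReferences(char).join(" "));
}
// & &#38; &#x26;
// < &#60; &#x3C;
// > &#62; &#x3E;
// " &#34; &#x22;
// ' &#39; &#x27;
```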