How to escape HTML

I have a string that contains HTML text. I need to escape only the text content, not the tags. For example, the string contains:

<ul class="main_nav">
  <li>
    <a class="className1" id="idValue1" tabindex="2">Test & Sample</a>
  </li>
 <li>
  <a class="className2" id="idValue2" tabindex="2">Test & Sample2</a>
  </li>
</ul>

How do I escape just the text, to get

<ul class="main_nav">
  <li>
    <a class="className1" id="idValue1" tabindex="2">Test &amp; Sample</a>
  </li>
  <li>
    <a class="className2" id="idValue2" tabindex="2">Test &amp; Sample2</a>
  </li>
</ul>

without modifying the tags?

Can this be handled with HTML DOM and javascript?

Solution

(See further down for an answer to the question as updated by comments from the OP below)

Can this be handled with HTML DOM and javascript?

No. Once the text is in the DOM, the concept of "escaping" it doesn't apply. The HTML source text needs to be escaped so that it's parsed into the DOM correctly; once it's in the DOM, it isn't escaped.

This can be a bit tricky to understand, so let's use an example. Here's some HTML source text (such as in an HTML file that you would view with your browser):

<div>This &amp; That</div>

Once that's parsed into the DOM by the browser, the text within the div is This & That, because the & has been interpreted at that point.

So you'll need to catch this earlier, before the text is parsed into the DOM by the browser. You can't handle it after the fact; by then it's too late.

Separately, the string you're starting with is invalid if it has things like <div>This & That</div> in it. Pre-processing that invalid string will be tricky. You can't just use built-in features of your environment (PHP or whatever you're using server-side) because they'll escape the tags as well. You'll need to do text processing, extracting only the parts that you want to process and then running those through an escaping process. That process will be tricky. An & followed by whitespace is easy enough, but if there are unescaped entities in the source text, how do you know whether to escape them or not? Do you assume that if the string contains &amp;, you leave it alone? Or turn it into &amp;amp;? (Which is perfectly valid; it's how you show the actual string &amp; in an HTML page.)
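To make the text-processing idea concrete, here is a minimal sketch of one possible approach. It is not a real HTML parser: the function name escapeTextOutsideTags is made up for illustration, it assumes no tag contains a ">" inside an attribute value, and it resolves the ambiguity described above by leaving alone anything that already looks like an entity (so a literal "AT&T;" would be left as-is, rightly or wrongly).

```javascript
// Hypothetical helper, not from any library: escape bare ampersands in the
// text between tags, leaving the tags themselves untouched.
function escapeTextOutsideTags(html) {
    // The capturing group makes split() keep the tags in the result, so the
    // array alternates: text, tag, text, tag, ..., text.
    // Assumption: no ">" appears inside an attribute value.
    var parts = html.split(/(<[^>]*>)/);
    for (var i = 0; i < parts.length; i++) {
        // Odd indexes hold the captured tags; skip them
        if (i % 2 === 1) {
            continue;
        }
        // Escape "&" unless it already starts something shaped like an entity
        // (named, decimal, or hexadecimal)
        parts[i] = parts[i].replace(
            /&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/g,
            "&amp;"
        );
    }
    return parts.join("");
}

var escaped = escapeTextOutsideTags(
    '<a class="className1" id="idValue1" tabindex="2">Test & Sample</a>'
);
// escaped is '<a class="className1" id="idValue1" tabindex="2">Test &amp; Sample</a>'
```

A stray "<" in the text cannot be told apart from the start of a tag by this sketch, which is one more reason to fix whatever produces the half-encoded string in the first place.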

What you really need to do is correct the underlying problem: The thing creating these invalid, half-encoded strings.

Edit: From our comment stream below, the question is totally different than it seemed from your example (that's not meant critically). To recap the comments for those coming to this fresh, you said that you were getting these strings from WebKit's innerHTML, and I said that was odd, innerHTML should encode & correctly (and pointed you at a couple of test pages that suggested it did). Your reply was:

This works for &. But the same test page do not work for entities like ©, ®, « and many more.

That changes the nature of the question. You want to make entities out of characters that, while perfectly valid when used literally (provided you have your text encoding right), could be expressed as entities instead and therefore made more resilient to text encoding changes.

We can do that. According to the spec, the character values in a JavaScript string are UTF-16 (using Unicode Normalized Form C) and any conversion from the source character encoding (ISO 8859-1, Windows-1252, UTF-8, whatever) is performed before the JavaScript runtime sees it. (If you're not 100% sure you know what I mean by character encoding, it's well worth stopping now, going off and reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, then coming back.) So that's the input side. On the output side, HTML entities identify Unicode code points. So we can convert from JavaScript strings to HTML entities reliably.

The devil is in the detail, though, as always. JavaScript explicitly assumes that each 16-bit value is a character (see section 8.4 in the spec), even though that's not actually true of UTF-16 — one 16-bit value might be a "surrogate" (such as 0xD800) that only makes sense when combined with the next value, meaning that two "characters" in the JavaScript string are actually one character. This isn't uncommon for far Eastern languages.
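A quick way to see this for yourself, as a small sketch (the variable names are just for illustration): the single character U+1F600 is stored as two 16-bit values, and the string reports a length of 2.

```javascript
// U+1F600 written out as its two UTF-16 code units (a surrogate pair)
var face = "\uD83D\uDE00";

var unitCount  = face.length;                      // 2: counts 16-bit units, not characters
var firstUnit  = face.charCodeAt(0).toString(16);  // "d83d", the first (high) surrogate
var secondUnit = face.charCodeAt(1).toString(16);  // "de00", the second (low) surrogate
```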

So a robust conversion that starts with a JavaScript string and results in an HTML entity can't assume that a JavaScript "character" actually equals a character in the text, it has to handle surrogates. Fortunately, doing so is dead easy because the smart people defining Unicode made it dead easy: The first surrogate value is always in the range 0xD800-0xDBFF (inclusive), and the second surrogate is always in the range 0xDC00-0xDFFF (inclusive). So any time you see a pair of "characters" in a JavaScript string that match those ranges, you're dealing with a single character defined by a surrogate pair. The formulae for converting from the pair of surrogate values to a code point value are given in the above links, although fairly obtusely; I find this page much more approachable.
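As a sketch of that arithmetic, here is the pair-to-code-point conversion written out on its own. The name codePointFromSurrogates is hypothetical; the ranges are the ones given above.

```javascript
// Hypothetical helper: combine a surrogate pair into a single code point.
function codePointFromSurrogates(high, low) {
    // Reject values outside the two surrogate ranges
    if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
        throw new RangeError("Not a valid surrogate pair");
    }
    // Each surrogate carries 10 bits; the pair encodes (code point - 0x10000)
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

var smallest = codePointFromSurrogates(0xD800, 0xDC00); // 0x10000
var largest  = codePointFromSurrogates(0xDBFF, 0xDFFF); // 0x10FFFF
```

In engines that support ES2015, String.prototype.codePointAt performs the same calculation, so it makes a handy cross-check.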

Armed with all of this information, we can write a function that will take a JavaScript string and search for characters (real characters, which may be one or two "characters" long) you might want to turn into entities, replacing them with named entities from a map or numeric entities if we don't have them in our named map:

// A map of the entities we want to handle.
// The numbers on the left are the Unicode code point values; their
// matching named entity strings are on the right.
var entityMap = {
    "160": "&nbsp;",
    "161": "&iexcl;",
    "162": "&cent;",
    "163": "&pound;",
    "164": "&curren;",
    "165": "&yen;",
    "166": "&brvbar;",
    "167": "&sect;",
    "168": "&uml;",
    "169": "&copy;",
    // ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
    "8364": "&euro;"    // Last one must not have a comma after it, IE doesn't like trailing commas
};

// The function to do the work.
// Accepts a string, returns a string with replacements made.
function prepEntities(str) {
    // The regular expression below uses an alternation to look for a surrogate pair _or_
    // a single character that we might want to make an entity out of. The first part of the
    // alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
    // alone, it searches for the surrogates. The second part of the alternation you can
    // adjust as you see fit, depending on how conservative you want to be. The example
    // below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
    // character with a value from 0 to 31 ("control characters") or above 127 -- i.e., if
    // it's not "printable ASCII" (in the old parlance), convert it. That's probably
    // overkill, but you said you wanted to make entities out of things, so... :-)
    return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
        var high, low, charValue, rep;

        // Get the character value, handling surrogate pairs
        if (match.length == 2) {
            // It's a surrogate pair, calculate the Unicode code point
            high = match.charCodeAt(0) - 0xD800;
            low  = match.charCodeAt(1) - 0xDC00;
            charValue = (high * 0x400) + low + 0x10000;
        }
        else {
            // Not a surrogate pair, the value *is* the Unicode code point
            charValue = match.charCodeAt(0);
        }

        // See if we have a mapping for it
        rep = entityMap[charValue];
        if (!rep) {
            // No, use a numeric entity. Here we brazenly (and possibly mistakenly)
            // assume that any code point can be written as a decimal numeric entity.
            rep = "&#" + charValue + ";";
        }

        // Return replacement
        return rep;
    });
}

You should be fine passing all of the HTML through it, since if these characters appear in attribute values, you almost certainly want to encode them there as well.

I have not used the above in production (I actually wrote it for this answer, because the problem intrigued me) and it is supplied without warranty of any kind. I have tried to ensure that it handles surrogate pairs because that's necessary for far Eastern languages, and supporting them is something we should all be doing now that the world has gotten smaller.
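For comparison, in an environment that supports ES2015 the surrogate arithmetic can be left to the runtime. The following is a rough sketch of that variant, not a drop-in replacement: toNumericEntities is a made-up name, it relies on the `u` regex flag and codePointAt, and it skips the named-entity map entirely, emitting only numeric entities.

```javascript
// Hypothetical ES2015+ variant: numeric entities only, no named-entity map.
function toNumericEntities(str) {
    // With the "u" flag the regex matches whole code points, so a surrogate
    // pair arrives in the callback as a single two-unit match.
    return str.replace(/[\u0000-\u001f\u0080-\u{10FFFF}]/gu, function(ch) {
        return "&#" + ch.codePointAt(0) + ";";
    });
}

var converted = toNumericEntities("\u00A9 \uD800\uDC00 a");
// converted is "&#169; &#65536; a"
```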

Complete example page:

<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>Test Page</title>
<style type='text/css'>
body {
    font-family: sans-serif;
}
#log p {
    margin:     0;
    padding:    0;
}
</style>
<script type='text/javascript'>

// Make the function available as a global, but define it within a scoping
// function so we can have data (the `entityMap`) that only it has access to
var prepEntities = (function() {

    // A map of the entities we want to handle.
    // The numbers on the left are the Unicode code point values; their
    // matching named entity strings are on the right.
    var entityMap = {
        "160": "&nbsp;",
        "161": "&iexcl;",
        "162": "&cent;",
        "163": "&pound;",
        "164": "&curren;",
        "165": "&yen;",
        "166": "&brvbar;",
        "167": "&sect;",
        "168": "&uml;",
        "169": "&copy;",
        // ...and lots and lots more, see http://www.w3.org/TR/REC-html40/sgml/entities.html
        "8364": "&euro;"    // Last one must not have a comma after it, IE doesn't like trailing commas
    };

    // The function to do the work.
    // Accepts a string, returns a string with replacements made.
    function prepEntities(str) {
        // The regular expression below uses an alternation to look for a surrogate pair _or_
        // a single character that we might want to make an entity out of. The first part of the
        // alternation (the [\uD800-\uDBFF][\uDC00-\uDFFF] before the |), you want to leave
        // alone, it searches for the surrogates. The second part of the alternation you can
        // adjust as you see fit, depending on how conservative you want to be. The example
        // below uses [\u0000-\u001f\u0080-\uFFFF], meaning that it will match and convert any
        // character with a value from 0 to 31 ("control characters") or above 127 -- i.e., if
        // it's not "printable ASCII" (in the old parlance), convert it. That's probably
        // overkill, but you said you wanted to make entities out of things, so... :-)
        return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\u0000-\u001f\u0080-\uFFFF]/g, function(match) {
            var high, low, charValue, rep;

            // Get the character value, handling surrogate pairs
            if (match.length == 2) {
                // It's a surrogate pair, calculate the Unicode code point
                high = match.charCodeAt(0) - 0xD800;
                low  = match.charCodeAt(1) - 0xDC00;
                charValue = (high * 0x400) + low + 0x10000;
            }
            else {
                // Not a surrogate pair, the value *is* the Unicode code point
                charValue = match.charCodeAt(0);
            }

            // See if we have a mapping for it
            rep = entityMap[charValue];
            if (!rep) {
                // No, use a numeric entity. Here we brazenly (and possibly mistakenly)
                // assume that any code point can be written as a decimal numeric entity.
                rep = "&#" + charValue + ";";
            }

            // Return replacement
            return rep;
        });
    }

    // Return the function reference out of the scoping function to publish it
    return prepEntities;
})();

function go() {
    var d = document.getElementById('d1');
    var s = d.innerHTML;
    alert("Before: " + s);
    s = prepEntities(s);
    alert("After: " + s);
}

</script>
</head>
<body>
<div id='d1'>Copyright: &copy; Yen: &yen; Cedilla: &cedil; Surrogate pair: &#65536;</div>
<input type='button' id='btnGo' value='Go' onclick="return go();">
</body>
</html>

There I've included the cedilla as an example of converting to a numeric entity rather than a named one (since I left cedil out of my very small example map). And note that the surrogate pair at the end shows up in the first alert as two "characters" because of the way JavaScript handles UTF-16.