Tags: javascript, html, encoding, prototypejs

How can I replace escaped Unicode characters on a page more efficiently?


I have a page that includes escaped Unicode characters (for example, the characters 漢字 are escaped as \u6F22\u5B57). This page shows how you can use the unescape() method to convert the escaped \u6F22\u5B57 to 漢字. I have a function that converts all of the escaped characters, but it is not very fast.

function DecodeAllUnicodeCharacters (strID)
{
    var strstr = $(strID).innerHTML;
    var arrEncodeChars = strstr.match(/\\u[0-9A-Z]{4,6}/g);
    for (var ii = 0; ii < arrEncodeChars.length; ii++) {
        var sUnescaped = eval("unescape('"+arrEncodeChars[ii]+"')");
        strstr = strstr.replace(arrEncodeChars[ii], sUnescaped);
    }
    $(strID).innerHTML = strstr;
}

The part that takes longest is setting the innerHTML here: $(strID).innerHTML = strstr; Is there a good way to replace the characters without redoing the innerHTML of the whole page?


Solution

  • Setting innerHTML is slow because the browser has to parse the whole string as HTML, and any child elements are destroyed and recreated, which is slower still. Instead, find just the text nodes and update only those that contain escaped content. The following is based on a previous question and is demonstrated in a fiddle.

    Element.addMethods({
        // element is Prototype-extended HTMLElement
        // nodeType is a Node.* constant
        // callback is a function where first argument is a Node
        forEachDescendant: function (element, nodeType, callback)
        {
            element = $(element);
            if (!element) return;
            var node = element.firstChild;
            while (node != null) {
                if (node.nodeType == nodeType) {
                    callback(node);
                }
    
                if(node.hasChildNodes()) {
                    node = node.firstChild;
                }
                else {
                    while(node.nextSibling == null && node.parentNode != element) {
                        node = node.parentNode;
                    }
                    node = node.nextSibling;
                }
            }
        },
        decodeUnicode: function (element)
        {
            // \u escapes are exactly four hex digits, in either case
            var regex = /\\u([0-9A-Fa-f]{4})/g;
            Element.forEachDescendant(element, Node.TEXT_NODE, function(node) {
                // regex.test fails faster than regex.exec for non-matching nodes
                if (regex.test(node.data)) {
                    // only update when necessary
                    node.data = node.data.replace(regex, function(_, code) {
                        // code is the hexadecimal value captured by the regex
                        return String.fromCharCode(parseInt(code, 16));
                    });
                }
            });
        }
    });
    

    The benefit of Element.addMethods, aside from aesthetics, is that each method becomes available both on extended elements and as a static function on Element. You can use decodeUnicode in several ways:

    // single element
    $('element_id').decodeUnicode();
    // or
    Element.decodeUnicode('element_id');
    
    // multiple elements
    $$('p').each(Element.decodeUnicode);
    // or
    $$('p').invoke('decodeUnicode');
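
For pages that do not load Prototype, the same idea (touch only text nodes, never innerHTML) could probably be expressed with the browser's built-in TreeWalker instead of a hand-written traversal. This is a swapped-in technique, not the code from the fiddle. The names decodeEscapes and decodeTextNodes are hypothetical, and the sketch assumes every escape is written as exactly four hex digits. The string decoding is kept in its own function so it can be checked outside a browser.

```javascript
// Hypothetical helper names; neither function is part of Prototype.

// Decode \uXXXX escapes (assumed to be exactly four hex digits) in a string.
function decodeEscapes(text) {
    return text.replace(/\\u([0-9A-Fa-f]{4})/g, function (_, hex) {
        // Each escape is one UTF-16 code unit, so a surrogate pair
        // written as two consecutive escapes recombines on its own.
        return String.fromCharCode(parseInt(hex, 16));
    });
}

// Visit only the text nodes under root and rewrite those that change.
// doc is optional; it defaults to the document that owns root.
function decodeTextNodes(root, doc) {
    doc = doc || root.ownerDocument;
    var walker = doc.createTreeWalker(root, NodeFilter.SHOW_TEXT);
    var node;
    while ((node = walker.nextNode())) {
        // Cheap substring check before running the regex at all.
        if (node.data.indexOf('\\u') === -1) continue;
        var decoded = decodeEscapes(node.data);
        // Assign only when something was actually decoded.
        if (decoded !== node.data) node.data = decoded;
    }
}
```

In a browser this would be called as decodeTextNodes(document.body), or with any smaller container element to limit how much of the page is walked.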