How to decode UTF-16 emoji surrogate pairs into UTF-8 and display them correctly in HTML?


I have a string that contains XML. It includes the following substring:

<Subject>&amp;#55357;&amp;#56898;&amp;#55357;&amp;#56838;&amp;#55357;&amp;#56846;&amp;#55357;&amp;#56838;&amp;#55357;&amp;#56843;&amp;#55357;&amp;#56838;&amp;#55357;&amp;#56843;&amp;#55357;&amp;#56832;&amp;#55357;&amp;#56846;</Subject>

I'm pulling the XML from a server and need to display it to the user. I've noticed that the ampersands have been escaped and that the numeric values are UTF-16 surrogate pairs. How do I ensure the emojis/emoticons are displayed correctly in a browser?

Currently I'm just getting these characters: �������������� instead of the actual emojis.

I'm looking for a simple way to fix this without external libraries or third-party code if possible: just plain old JavaScript, HTML or CSS.


Solution

  • You can convert UTF-16 code units, including surrogates, to a JavaScript string with String.fromCharCode. The following snippet should give you an idea.

    var str = '&amp;#55357;&amp;#56898;ABC&amp;#55357;&amp;#56838;&amp;#55357;&amp;#56846;&amp;#55357;&amp;#56838;&amp;#55357;&amp;#56843;&amp;#55357;&amp;#56838;&amp;#55357;&amp;#56843;&amp;#55357;&amp;#56832;&amp;#55357;&amp;#56846;';
    
    // Regex matching either an escaped numeric reference (&amp;#NNNN;)
    // or a single literal character other than '&'.
    var re = /&amp;#(\d+);|([^&])/g;
    var match;
    var charCodes = [];
    
    // Find successive matches
    while ((match = re.exec(str)) !== null) {
      if (match[1] != null) {
        // Numeric reference: here, one UTF-16 code unit (half of a surrogate pair)
        charCodes.push(parseInt(match[1], 10));
      }
      else {
        // Unescaped character (assuming the code point is below 0x10000)
        charCodes.push(match[2].charCodeAt(0));
      }
    }
    
    // Create string from UTF-16 code units.
    var result = String.fromCharCode.apply(null, charCodes);
    console.log(result);
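
A more compact variant of the same idea, as a rough sketch rather than a definitive implementation: replace each numeric reference in place with String.prototype.replace and leave everything else untouched. The helper name decodeNumericEntities is made up for illustration. It assumes every number is a single UTF-16 code unit (0 to 65535), as in the question's data, and it accepts both the double-escaped form (&amp;#NNNN;) and the plain form (&#NNNN;).

```javascript
// Hypothetical helper (not from the original answer): decodes decimal
// numeric references such as "&amp;#55357;" or "&#55357;" into the
// UTF-16 code units they name. Assumes each value is at most 0xFFFF.
function decodeNumericEntities(input) {
  return input.replace(/&(?:amp;)?#(\d+);/g, function (whole, digits) {
    // Each reference becomes one code unit; two adjacent surrogate
    // halves in the resulting string form a single emoji.
    return String.fromCharCode(parseInt(digits, 10));
  });
}

var subject = '&amp;#55357;&amp;#56898;ABC&amp;#55357;&amp;#56838;';
var decoded = decodeNumericEntities(subject);
console.log(decoded);
```

In a browser, assigning the decoded string to an element's textContent (rather than building HTML by hand) should be enough to show the emojis, provided the page itself is served as UTF-8.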