How to Decode UTF-8 Text Sequence \ud83e\udd14

I'm reading UTF-8 text that contains "\ud83e\udd14". Reading the specification, it says that U+D800 to U+DFFF are not used. Yet if I run this through a decoder such as Microsoft's System.Web.Helpers.Json.Decode, it yields the correct result of an emoticon of a face with a tongue hanging out. The text originates through Twitter's search api.

My question: how should this sequence be decoded? I'm looking for what the final hex sequence would be and how it is obtained. Thanks for any guidance. If my question isn't clear, please let me know and I will try to improve it.

Solution

You are coming at this from an interesting perspective. The first thing to note is that you're dealing with two levels of text: a JSON document and a string within it.

Synopsis: You don't need to write code to decode it. Use a library that deserializes JSON into objects, such as Newtonsoft's JSON.Net.

But, first, Unicode. Unicode is a character set with a bit of a history. Unlike almost every character set, 1) it has more than one encoding, and 2) it is still growing. A couple of decades ago, it had <65636 codepoints and that was thought to be enough. So, encoding each codepoint with as 2-byte integer was the plan. It was called UCS-2 or, simply, the Unicode encoding. (Microsoft has stuck with Encoding.Unicode in .NET, which causes some confusion.)

Aside: Codepoints are identified for discussion using the U+ABCD (hexadecimal) format.

Then the Unicode consortium decided to add more codepoints: all the way to U+10FFFF. For that, encodings need at least 21 bits. UTF-32, integers with 32 bits, is an obvious solution but not very dense. So, encodings that use a variable number of code units where invented. UTF-8 uses one to four 8-bit code units, depending on the codepoint.

But a lot of languages were adopting UCS-2 in the 1990s. Documents, of course, can be transformed at will but code that processes UCS-2 would break without a compatible encoding for the expanded character set. Since U+D800 to U+DFFF where unassigned, UCS-2 could stay the same and those "surrogate codepoints" could be used to encode new codepoints. The result is UTF-16. Each codepoint is encoded in one or two 16-bit code units. So, programs that processed UCS-2 could automatically process UTF-16 as long as they didn't need to understand it. Programs written in the same system could be considered to be processing UTF-16, especially with libraries that do understand it. There is still the hazard of things like string length giving the number of UTF-16 code units rather than the number of codepoints, but it has otherwise worked out well.

As for the \ud83e\udd14 notation, languages use Unicode in their syntax or literal strings desired a way to accept source files in a non-Unicode encoding and still support all the Unicode codepoints. Being designed in the 1990s, they simply wrote the UCS-2 code units in hexadecimal. Of course, that too is extended to UTF-16. This UTF-16 code unit escaped syntax allows intermediary systems to handle source code files with a non-Unicode encoding.

Now, JSON is based on JavaScript and JavaScript's strings are sequences of UTF-16 code units. So JSON has adopted th UTF-16 code unit escaped syntax from JavaScript. However, it's not very useful (unless you have to deal with intermediary systems that can't be made to use UTF-8 or treat files they don't understand as binary). The old JSON standard requires JSON documents exchanged between systems to be encoded with UTF-8, UTF-16 or UTF-32. The new RFC8259 requires UTF-8.

So, you don't have "UTF-8 text", you have Unicode text encoding with UTF-8. The text itself is a JSON document. JSON documents have names and values that are Unicode text as sequences of UTF-16 code units with escapes allowed. Your document has the codepoint U+1F914 written, not as "🤔" but as "\ud83e\udd14".

There are plenty of libraries that transform JSON to objects so you shouldn't need to decode the names or values in a JSON document. To do it manually, you'd recognize the escape prefix and take the next 4 characters as the bits of a surrogate, extracting the data bits, then combine them with the bits from the paired surrogate that should follow.