According to its language specification, JavaScript has some problems with Unicode (if I understand it correctly): internally, text is handled as a sequence of 16-bit units, each of which is treated as one character.
JavaScript: The Good Parts makes a similar point.
When you search Google for V8's support of UTF-8, you get contradictory statements.
So: What is the state of Unicode support in Node.js (0.10.26 was the current version when this question was asked)? Does it handle UTF-8 with all possible code points correctly, or doesn't it?
If not: What are possible workarounds?
The two sources you cite, the language specification and Crockford's “JavaScript: The Good Parts” (page 103), say the same thing, although the latter says it much more concisely (and more clearly, if you already know the subject). For reference, I'll quote Crockford:
JavaScript was designed at a time when Unicode was expected to have at most 65,536 characters. It has since grown to have a capacity of more than 1 million characters.
JavaScript's characters are 16 bits. That is enough to cover the original 65,536 (which is now known as the Basic Multilingual Plane). Each of the remaining million characters can be represented as a pair of characters. Unicode considers the pair to be a single character. JavaScript thinks the pair is two distinct characters.
The language specification calls the 16-bit unit a “character” and a “code unit”. A “Unicode character”, or “code point”, on the other hand, can (in rare cases) need two 16-bit “code units” to be represented.
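To make the distinction concrete, here is a small sketch (plain JavaScript string behaviour, nothing Node-specific): U+1D306 ('𝌆') lies outside the Basic Multilingual Plane, so JavaScript stores it as two 16-bit code units, a surrogate pair.

```js
// One Unicode code point, two JavaScript "characters" (code units).
var s = '\uD834\uDF06';        // the single code point U+1D306
console.log(s.length);         // 2  -- counts code units, not code points
console.log(s.charCodeAt(0));  // 55348 (0xD834, high surrogate)
console.log(s.charCodeAt(1));  // 57094 (0xDF06, low surrogate)
```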
All of JavaScript's string properties and methods, like `length`, `substr()`, etc., work with 16-bit “characters” (working with variable-width 16-bit/32-bit Unicode characters, i.e., with UTF-16 code points, would be much less efficient). This means, for example, that if you are not careful, `substr()` can leave you with one half of a 32-bit UTF-16 Unicode character on its own. JavaScript won't complain as long as you don't display it, and maybe won't complain even if you do. This is because, as the specification says, JavaScript does not check that strings are valid UTF-16; it only assumes they are.
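A small sketch of that hazard, using the same code point as above:

```js
// substr() counts 16-bit code units, so it can cut straight through a
// surrogate pair without any error.
var s = 'a\uD834\uDF06b';        // 'a' + U+1D306 + 'b'; length is 4, not 3
var cut = s.substr(0, 2);        // 'a' plus a lone high surrogate (\uD834)
console.log(cut.length);         // 2
console.log(cut.charCodeAt(1).toString(16)); // 'd834' -- half a character
// No error is raised; the string simply is no longer valid UTF-16.
```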
In your question you ask:
Does [Node.js] handle UTF-8 with all possible code points correctly, or doesn't it?
Since every UTF-8 code point is converted to UTF-16 (as one or two 16-bit “characters”) on input, before anything else happens, and converted back on output, the answer depends on what you mean by “correctly”. If you accept JavaScript's interpretation of “correctly”, the answer is yes.
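You can watch that conversion happen with Node's Buffer API; this sketch uses the `new Buffer(...)` constructor style of the 0.10-era API mentioned in the question (on modern Node you would use `Buffer.from` instead):

```js
// The 4 UTF-8 bytes of U+1D306 decode to a 2-code-unit UTF-16 string
// and re-encode to the same 4 bytes.
var buf = new Buffer([0xF0, 0x9D, 0x8C, 0x86]); // UTF-8 encoding of U+1D306
var str = buf.toString('utf8');
console.log(str.length);                        // 2 (two 16-bit code units)
var back = new Buffer(str, 'utf8');
console.log(back);                              // <Buffer f0 9d 8c 86>
```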
For further reading and head-scratching: https://mathiasbynens.be/notes/javascript-unicode