Tags: javascript, string, unicode, codepoint, surrogate-pairs

Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters")


Splitting a JavaScript string into "characters" can be done trivially but there are problems if you care about Unicode (and you should care about Unicode).

JavaScript strings are sequences of 16-bit code units (UCS-2 or UTF-16), so a single "character" in this sense cannot represent a Unicode character outside the BMP (Basic Multilingual Plane).

To deal with Unicode characters beyond the BMP, code must take into account "surrogate pairs", which basic string operations such as indexing, length and split('') do not do.

I'm looking for a way to split a JavaScript string by codepoint, whether each codepoint requires one or two JavaScript "characters" (code units).
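
To make the problem concrete, here is a minimal sketch of what goes wrong with code-unit based operations. The sample string and the variable names are made up for illustration; U+1F4A9 is just one convenient codepoint outside the BMP.

```javascript
// U+1F4A9 lies outside the BMP, so UTF-16 stores it as two code units.
const sample = "a\u{1F4A9}b";

// length and split("") both operate on 16-bit code units, not codepoints.
const unitCount = sample.length;        // 4, although there are 3 codepoints
const units = sample.split("");         // the pair is torn into two lone surrogates

// charCodeAt returns a single code unit, here the lead surrogate on its own.
const leadOnly = sample.charCodeAt(1);  // 0xD83D

// codePointAt reads the whole pair when given the index of its lead unit.
const astral = sample.codePointAt(1);   // 0x1F4A9
```

The two elements in the middle of units are half a character each, which is why a naive split is not usable for text outside the BMP.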

Depending on your needs, splitting by codepoint might not be enough, and you might want to split by "grapheme cluster", where a cluster is a base codepoint followed by all its non-spacing modifier codepoints, such as combining accents and diacritics.
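
As a small illustration of that difference (a sketch only, with made-up variable names), a letter followed by a combining accent is one grapheme cluster but two codepoints, so a codepoint split separates them:

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT renders as a single "é".
const decomposed = "e\u0301";

// Splitting by codepoint yields two elements: the base letter and the accent.
const points = Array.from(decomposed);

// NFC normalization happens to compose this particular pair into U+00E9,
// but many clusters have no precomposed form, so this is not a general fix.
const composed = Array.from(decomposed.normalize("NFC"));
```

Here points has two elements while composed has one; whether that matters depends on what the resulting array is used for.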

For the purposes of this question I do not require splitting by grapheme cluster.


Solution

  • @bobince's answer has (luckily) become a bit dated; since ES2015 you can simply use

    var chars = Array.from( text )
    

    to obtain a list of single-codepoint strings that respects astral (non-BMP) Unicode characters, which are stored as surrogate pairs.
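
A short sketch of how this behaves, assuming an ES2015 or later environment. The input string and variable names are illustrative only. Array.from works here because the string iterator walks by codepoint, so the spread syntax gives the same result.

```javascript
// One BMP letter, one astral codepoint (U+1F4A9), one BMP letter.
const input = "a\u{1F4A9}b";

// Array.from uses the string iterator, which advances one codepoint at a time.
const viaArrayFrom = Array.from(input);

// Spread syntax relies on the same iterator and should give the same array.
const viaSpread = [...input];

// The optional map function can turn each element into its numeric codepoint.
const codePointNumbers = Array.from(input, (ch) => ch.codePointAt(0));
```

With this input, both arrays hold three strings (the middle one being two code units long), and codePointNumbers holds 0x61, 0x1F4A9 and 0x62.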