unicode cjk codepoint surrogate-pairs astral-plane

What are the most common non-BMP Unicode characters in actual use?


In your experience, which Unicode characters, code points, or ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones that require 4 bytes in UTF-8 or a surrogate pair in UTF-16.
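To make the "4 bytes in UTF-8 or surrogates in UTF-16" point concrete, here is a minimal Python 3 sketch. The helper name `describe` is just something I made up for illustration; it reports how a single character is encoded.

```python
def describe(ch: str) -> dict:
    """Report how one character is encoded (hypothetical helper for illustration)."""
    cp = ord(ch)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
    # Split the UTF-16 bytes into 16-bit code units.
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    return {
        "codepoint": f"U+{cp:04X}",
        "non_bmp": cp > 0xFFFF,  # anything above U+FFFF lies outside the BMP
        "utf8_bytes": len(utf8),
        "utf16_units": [f"{u:04X}" for u in units],
    }


print(describe("A"))           # BMP: 1 UTF-8 byte, 1 UTF-16 code unit
print(describe("\U00010330"))  # GOTHIC LETTER AHSA: 4 UTF-8 bytes, 2 code units
print(describe("\U0001F602"))  # FACE WITH TEARS OF JOY: 4 UTF-8 bytes, 2 code units
```

Any character whose `non_bmp` flag is true comes out as 4 UTF-8 bytes and two UTF-16 code units (a high surrogate in D800..DBFF followed by a low surrogate in DC00..DFFF).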

I would have expected the answer to be Chinese and Japanese characters that are used in names but not included in the most widespread CJK multibyte character sets. However, on the project I do most of my work on, the English Wiktionary, we have found that the Gothic alphabet is far more common so far.

UPDATE

I've written a couple of software tools to scan entire Wikipedias for non-BMP characters and found, to my surprise, that even in the Japanese Wikipedia the Gothic alphabet is the most common. This is also true of the Chinese Wikipedia, but it also had many Chinese characters used up to 50 or 70 times each, including "𨭎", "𠬠", and "𩷶".
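The tools themselves aren't shown here, but a scan of this kind could look roughly like the Python 3 sketch below. The function name `count_non_bmp` and the sample string are made up for illustration. It relies on Python 3 strings being sequences of code points, so iterating never splits a surrogate pair.

```python
from collections import Counter


def count_non_bmp(text: str) -> Counter:
    """Count each character above U+FFFF in text (hypothetical helper)."""
    return Counter(ch for ch in text if ord(ch) > 0xFFFF)


# Made-up sample: two Gothic letters (one repeated) and one emoji among ASCII text.
sample = "abc \U00010330\U00010331\U00010330 xyz \U0001F602"
for ch, n in count_non_bmp(sample).most_common():
    print(f"U+{ord(ch):04X} {n}")
```

For a real dump you would feed the text through in chunks and add the per-chunk counters together (`Counter` supports `+` and `update`), rather than loading the whole file at once.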


Solution

  • Emoji are now the most common non-BMP characters by far. 😂, otherwise known as U+1F602 FACE WITH TEARS OF JOY, is the most common one on Twitter's public stream. It occurs more frequently than the tilde!