utf-8 astral-plane

Are there languages that require three or more bytes per character when encoded in UTF-8? Which ones?


Commonly used ones, of course; Klingon doesn't count :-)

Thanks, guys, let me run my willItFit() test cases.

OK, now I've figured out that saving bytes with UTF-8 causes more problems than it solves. Thanks again.


Solution

  • Every character from U+0800 onward needs at least 3 bytes in UTF-8: 3 bytes for U+0800 through U+FFFF, and 4 bytes for U+10000 and above. That's a HUGE number of potential characters. It includes East and Southeast Asian scripts such as Japanese, Chinese, Korean, and Thai.

    For a complete list of script ranges, you can refer to Unicode's block data. Only the following blocks can be represented with 1 or 2 bytes (and of these, only Basic Latin fits in 1 byte); characters from all other blocks require 3 or 4 bytes:

    0000..007F Basic Latin
    0080..00FF Latin-1 Supplement
    0100..017F Latin Extended-A
    0180..024F Latin Extended-B
    0250..02AF IPA Extensions
    02B0..02FF Spacing Modifier Letters
    0300..036F Combining Diacritical Marks
    0370..03FF Greek and Coptic
    0400..04FF Cyrillic
    0500..052F Cyrillic Supplement
    0530..058F Armenian
    0590..05FF Hebrew
    0600..06FF Arabic
    0700..074F Syriac
    0750..077F Arabic Supplement
    0780..07BF Thaana
    07C0..07FF NKo
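
The boundaries above can be checked with a few lines of Python. This is only a minimal sketch: `utf8_len` and `fits_in_bytes` are hypothetical helper names introduced here for illustration (the second is a guess at what a check like `willItFit()` might do, not the asker's actual code), and the sample code points are arbitrary picks from a few scripts.

```python
def utf8_len(code_point: int) -> int:
    """Bytes needed to encode one code point in UTF-8, by range boundary."""
    if code_point < 0x80:       # U+0000..U+007F (Basic Latin)
        return 1
    if code_point < 0x800:      # U+0080..U+07FF (the remaining blocks listed above)
        return 2
    if code_point < 0x10000:    # U+0800..U+FFFF (rest of the Basic Multilingual Plane)
        return 3
    return 4                    # U+10000 and above (astral planes)


def fits_in_bytes(text: str, limit: int) -> bool:
    """Hypothetical helper: does `text` fit in `limit` bytes once UTF-8 encoded?"""
    return len(text.encode("utf-8")) <= limit


# Arbitrary sample code points; surrogates (U+D800..U+DFFF) are deliberately
# avoided because they cannot be encoded in UTF-8 on their own.
samples = [
    (0x0041, "Latin capital A"),
    (0x03A9, "Greek capital Omega"),
    (0x05D0, "Hebrew Alef"),
    (0x07FF, "last 2-byte code point"),
    (0x0800, "first 3-byte code point"),
    (0x0E01, "Thai Ko Kai"),
    (0x3042, "Hiragana A"),
    (0x4E2D, "CJK ideograph"),
    (0xAC00, "Hangul syllable"),
    (0x1F600, "emoji (astral plane)"),
]

for cp, label in samples:
    # Cross-check the range-based answer against Python's own encoder.
    assert utf8_len(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X} {label}: {utf8_len(cp)} byte(s)")
```

Measuring the encoded length, as `fits_in_bytes` does, is usually safer than multiplying a character count by a fixed factor, since a single string can mix 1-, 2-, 3-, and 4-byte characters.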