Tags: text, unicode, encoding, utf-8, utf-16

Will UTF-8 strings always be shorter than UTF-16?


Suppose I have two strings containing the same text, one encoded as UTF-8 and the other as UTF-16.
Is it safe to assume that the UTF-8 string will always be smaller than, or the same size as, the UTF-16 one (in bytes)?


Solution

  • No. While the UTF-8 text will usually be shorter, that is not always the case.

    Any character between U+0000 and U+FFFF is represented with 2 bytes (one UTF-16 code unit) in UTF-16.

    Characters between U+0800 and U+FFFF will be represented with 3 bytes in UTF-8.

    Therefore, a text that contains only (or mostly) characters in that range can easily be longer when represented in UTF-8 than in UTF-16.
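
As a quick check of that claim, here is a minimal Python sketch that encodes the same text both ways and compares the byte counts. The helper name `encoded_sizes` and the sample strings are illustrative assumptions, not part of the original answer. The `utf-16-le` codec is used so that Python does not prepend a 2-byte byte order mark, which would otherwise inflate the UTF-16 count.

```python
# Compare the encoded size of the same text in UTF-8 and UTF-16.
# "utf-16-le" is used so that no byte order mark is prepended.

def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Hypothetical sample texts: ASCII only vs. hiragana (U+3040 - U+309F).
samples = {
    "English": "hello",
    "Japanese": "\u3053\u3093\u306b\u3061\u306f",  # "konnichiwa" in hiragana
}

for label, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label}: UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes")
# English: UTF-8 = 5 bytes, UTF-16 = 10 bytes
# Japanese: UTF-8 = 15 bytes, UTF-16 = 10 bytes
```

The hiragana sample sits entirely in the U+0800 to U+FFFF range, so every character costs 3 bytes in UTF-8 and 2 in UTF-16.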

    Put differently:

    • U+0000 - U+007F: UTF-8 is shorter (1 < 2)
    • U+0080 - U+07FF: Both are the same size (2 = 2)
    • U+0800 - U+FFFF: UTF-8 is longer (3 > 2)
    • U+10000 - U+10FFFF: Both are the same size (4 = 4)
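
The four ranges in the list can be spot-checked with one representative code point each. This is only a sketch: the chosen characters and the helper name `bytes_per_char` are assumptions made for illustration, and `utf-16-le` is again used to avoid counting a byte order mark.

```python
# One representative code point from each of the four ranges.
representatives = [
    ("U+0000 - U+007F", "\u0041"),        # LATIN CAPITAL LETTER A
    ("U+0080 - U+07FF", "\u00e9"),        # LATIN SMALL LETTER E WITH ACUTE
    ("U+0800 - U+FFFF", "\u20ac"),        # EURO SIGN
    ("U+10000 - U+10FFFF", "\U0001f600"), # GRINNING FACE (surrogate pair in UTF-16)
]

def bytes_per_char(ch):
    """Return (UTF-8 bytes, UTF-16 bytes) needed for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for range_label, ch in representatives:
    utf8_size, utf16_size = bytes_per_char(ch)
    print(f"{range_label}: UTF-8 = {utf8_size}, UTF-16 = {utf16_size}")
# U+0000 - U+007F: UTF-8 = 1, UTF-16 = 2
# U+0080 - U+07FF: UTF-8 = 2, UTF-16 = 2
# U+0800 - U+FFFF: UTF-8 = 3, UTF-16 = 2
# U+10000 - U+10FFFF: UTF-8 = 4, UTF-16 = 4
```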

    Note that 5- and 6-byte sequences were defined in the original UTF-8 specification, but they are no longer valid: RFC 3629 restricted UTF-8 to the range U+0000 to U+10FFFF, which needs at most 4 bytes. They were never necessary to represent Unicode code points.
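
To illustrate that longer sequences are rejected today, the sketch below feeds a legacy-style 5-byte sequence to Python's built-in UTF-8 decoder. The helper `is_valid_utf8` and the variable names are hypothetical. The byte string is what the original 5-byte form would have used for the value 0x200000, which lies beyond the last Unicode code point.

```python
# 0xF8 was the lead byte of a 5-byte sequence in the original UTF-8 design.
# Here it is followed by four continuation bytes (value 0x200000).
legacy_five_byte = b"\xf8\x88\x80\x80\x80"

# The highest Unicode code point, U+10FFFF, fits in 4 bytes.
highest_code_point = "\U0010ffff".encode("utf-8")

def is_valid_utf8(data):
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(legacy_five_byte))    # False
print(is_valid_utf8(highest_code_point))  # True
print(len(highest_code_point))            # 4
```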