Tags: unicode, encoding, character-encoding, utf-32

How can there be a fixed width Unicode encoding?


I have heard many times, when reading about Unicode, that UTF-32 is a fixed width encoding.

Taking fixed width encoding to mean "a code which maps source symbols to a set number of bits," and assuming that the source symbols in question are Unicode code points, this all makes sense. However, if you think of the underlying source symbols as graphemes, things get a lot more complicated.

So my question is this: in the sense of graphemes, is UTF-32 truly a fixed length encoding? And if not, is there a possible fixed length encoding in that sense?


Solution

  • One of the comments referenced Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) article, which was written in 2003. At the time, it served as a wake-up call (it probably still does in some places). However, it is not without its (minor, but significant) technical problems — though the overall thesis ('you need to know about Unicode, and you need to know which encoding a string is in') remains valid. The comment then continued:

    And yes, UTF-16 and UTF-32 are both fixed width. UTF-8 … isn't.

    UTF-16 isn't really fixed width: some Unicode code points are one 16-bit code unit, and others require two 16-bit code units. Likewise, UTF-8 isn't fixed width: some Unicode code points require one 8-bit code unit, and others require two, three or even four 8-bit code units (but not five or six, despite the comment from Joel's article that mentions the possibility). UTF-32, on the other hand, is fixed width; all Unicode code points can be encoded in a single 32-bit code unit. (Indeed, the maximum possible Unicode code point is U+10FFFF, so Unicode is a 21-bit code set, though it does not use all possible combinations of 21 bits.)
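The code-unit counts described above can be checked with a short Python sketch. The helper name `code_units` and the sample characters are illustrative choices, not part of the original answer.

```python
def code_units(ch):
    """Count the code units needed to encode one code point in each UTF form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# One sample per UTF-8 length: U+0041, U+00E9, U+20AC, U+1F600
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", code_units(ch))

# The largest code point, U+10FFFF, fits in 21 bits.
max_code_point_bits = (0x10FFFF).bit_length()
print(max_code_point_bits)
```

Only the last sample, which lies outside the Basic Multilingual Plane, needs two UTF-16 code units; every sample needs exactly one UTF-32 code unit.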

    However, code points are not identical to characters, let alone graphemes. The Unicode FAQ has a section on Characters and Combining Marks that discusses graphemes, referencing the glossary definition.

    The better word for what end-users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.

    Graphemes are not necessarily combining character sequences, and combining character sequences are not necessarily graphemes.

    Q: How are characters counted when measuring the length or position of a character in a string?

    A: Computing the length or position of a "character" in a Unicode string can be a little complicated, as there are four different approaches to doing so, plus the potential confusion caused by combining characters. The correct choice of which counting method to use depends on what is being counted and what the count or position is used for.

    To address the question here:

    If you mean something to do with 'it can take multiple Unicode code points to get a complete character (grapheme) with associated diacritics (combining markers, etc.)' then yes, even UTF-32 isn't necessarily fixed width and there is no fixed width encoding for Unicode.

    UTF-32 employs a fixed-width encoding for each Unicode code point, but since it can take multiple code points to create a complete grapheme, even UTF-32 does not have a 1:1 mapping between code points and graphemes.
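As a small illustration of that point, the sketch below encodes what a reader would call one character in UTF-32 and gets different sizes depending on how it is composed. The helper `utf32_bytes` and the sample strings are assumptions chosen for the example.

```python
import unicodedata

def utf32_bytes(s):
    """Size of s in UTF-32 (little-endian, no byte order mark)."""
    return len(s.encode("utf-32-le"))

composed = "\u00e9"      # e with acute as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
stacked = "x\u0328"      # x followed by COMBINING OGONEK

print(utf32_bytes(composed))    # 4
print(utf32_bytes(decomposed))  # 8

# Normalization to NFC merges e + acute into the precomposed form...
assert unicodedata.normalize("NFC", decomposed) == composed

# ...but x + ogonek has no precomposed form, so it stays two code points.
stacked_nfc = unicodedata.normalize("NFC", stacked)
print(len(stacked_nfc), utf32_bytes(stacked_nfc))
```

Normalization can shrink some sequences, but it cannot make every grapheme a single code point, so the size per grapheme still varies.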

    Of course, you can also find interesting character stacks in some comments on SO. For example:



    @̮̘̮̜̤͓͓̓ͪ̓͆͗̑Ṷ̫̠̤̙̻͚̗ͭs̹͓̰̫͉̲̺̈̏̽̅̑ͩ̇̓̉e͖̝̦̦̿r͔̒̿̋̂̓n̹͖̥ͥͦͤ̍͊̏ä͇͖͚͖̃̎͊m̭͇̂͆͋̋͒e̫̠͇̰̱̦̹͗͋̓̿͒ ͔͖̫̬̗̪̪̳ͧ̄ͫB̜̥̣̬̮͈͒̄ͪ͊l̮͉̣̟̪̪̿̍ͫ͋͐̑a̜̦̪͗͗̈́ͣ͊ḫ̘̯͈̠̞͒ͯ ̣͕͚̗̠͖̫̆͌͒̓͛b̖̣͇̖̦̃̑ͬͭͥl͔͍͚͕̲̪̼͎ͧ̇̏ạ̖̪͚̯̊ͤͣͦͮ̌h̘͓͔̟͔͍̏ͣͦ̓̓ ̫̼̫ͮ͌̄ͤ̿̈͆b̙͍̼̜͍̹̬̬͎ͥ̓ͯ̂ḽ̜̟̲̾̅̆ͦ̃ͨa͇̰̝̺͊ͧͫ͛h̯̻͉̉̒̉̈́́ͥ̀.̖̩̭͇̭͔̹̈́̇͐ͬͦͦͨ̾̇.͍̪̣͂ͬ.̞͍̥̪̺̤̣̜͆ͫ̈́͑ͦ͂͑͑



    Why/how do "Zalgo pings" work?

    How does Zalgo text work?



    Ȩ̸҉̟͎͚̹͚̙̟̖x̨͙̰͕̖͉̼̜̲̦̟͈́ͅͅą̷̘͕͈̹͓̣̮̼̣̠̹́c̼͙̠̭̫̰͈͍̮͢͡ţ̢̛̠͇̬̖̟̺͈̲̻̣̲͙͈̼͍̘̱ͅl̶͘‌​̷̨̲͙͖̻̲̗̦͚͙̮͠y̭̖̰͚̞̣̗̳̠͕̻̼͡ͅ!̛͖̮͔͍̰͉͢ ̭̙̖͔̩̗̠͕̦̬͓͞͝ͅO҉҉̣̜̺̪̳͕̖͔̠͙͎͕̙̦ͅn̩͓͖̝̟̭͙͙͓͚̼͖͖͜͞ȩ̧̬̱̦̠̙̥͇͔̪́ ҉̸̗̦͇̰̪̰̭̘̹͘͢i̴͞͏̩̤̹̗̖̰͎̖̲̲̘͓̗̯͚̞͖̥̻͝s͞҉̲͈̙̹̤̫͇ ͚̭͎͉̠̺͉̮̞̻̣̰̺̖͖̀́͢͞e̷̪̭̯̼͓͎̹̠͖̲͔̪͈̦͈̱͍̭̩͠ņ͞҉̮̳͓͙͈̼͉̬͕͈̺͈̭̩̪o͇̗̱̠̱̠̯̕͢u̸̳̦̩̳̫̖̜ͅ‌​ǵ̢̲̣͎̮̮̼̫̥̠͙̱̝̘͕͎̳̜̲̖h̸̛̩͚̮̤̖̹͙.̶̨̳̖̠̗̼̩͕͇͉͓̟̦͜͞ͅ





    What you see, of course, depends on the quality of the Unicode support in your browser (which, in turn, depends in part on the quality of the O/S support). I get to see different results on two different Macs running rather different versions of Firefox, even though they're running the same base O/S version (10.10.4 Yosemite).

    The second of those examples can be decoded from UTF-8 into the following sequence of Unicode code points (357 code points, including the trailing newline, in only 700 bytes on disk):

    0xC8 0xA8 = U+0228
    0xCC 0xB8 = U+0338
    0xD2 0x89 = U+0489
    0xCC 0x9F = U+031F
    0xCD 0x8E = U+034E
    0xCD 0x9A = U+035A
    0xCC 0xB9 = U+0339
    0xCD 0x9A = U+035A
    0xCC 0x99 = U+0319
    0xCC 0x9F = U+031F
    0xCC 0x96 = U+0316
    0x78 = U+0078
    0xCC 0xA8 = U+0328
    0xCD 0x99 = U+0359
    0xCC 0xB0 = U+0330
    0xCD 0x95 = U+0355
    0xCC 0x96 = U+0316
    0xCD 0x89 = U+0349
    0xCC 0xBC = U+033C
    0xCC 0x9C = U+031C
    0xCC 0xB2 = U+0332
    0xCC 0xA6 = U+0326
    0xCC 0x9F = U+031F
    0xCD 0x88 = U+0348
    0xCC 0x81 = U+0301
    0xCD 0x85 = U+0345
    0xCD 0x85 = U+0345
    0xC4 0x85 = U+0105
    0xCC 0xB7 = U+0337
    0xCC 0x98 = U+0318
    0xCD 0x95 = U+0355
    0xCD 0x88 = U+0348
    0xCC 0xB9 = U+0339
    0xCD 0x93 = U+0353
    0xCC 0xA3 = U+0323
    0xCC 0xAE = U+032E
    0xCC 0xBC = U+033C
    0xCC 0xA3 = U+0323
    0xCC 0xA0 = U+0320
    0xCC 0xB9 = U+0339
    0xCC 0x81 = U+0301
    0x63 = U+0063
    0xCC 0xBC = U+033C
    0xCD 0x99 = U+0359
    0xCC 0xA0 = U+0320
    0xCC 0xAD = U+032D
    0xCC 0xAB = U+032B
    0xCC 0xB0 = U+0330
    0xCD 0x88 = U+0348
    0xCD 0x8D = U+034D
    0xCC 0xAE = U+032E
    0xCD 0xA2 = U+0362
    0xCD 0xA1 = U+0361
    0xC5 0xA3 = U+0163
    0xCC 0xA2 = U+0322
    0xCC 0x9B = U+031B
    0xCC 0xA0 = U+0320
    0xCD 0x87 = U+0347
    0xCC 0xAC = U+032C
    0xCC 0x96 = U+0316
    0xCC 0x9F = U+031F
    0xCC 0xBA = U+033A
    0xCD 0x88 = U+0348
    0xCC 0xB2 = U+0332
    0xCC 0xBB = U+033B
    0xCC 0xA3 = U+0323
    0xCC 0xB2 = U+0332
    0xCD 0x99 = U+0359
    0xCD 0x88 = U+0348
    0xCC 0xBC = U+033C
    0xCD 0x8D = U+034D
    0xCC 0x98 = U+0318
    0xCC 0xB1 = U+0331
    0xCD 0x85 = U+0345
    0x6C = U+006C
    0xCC 0xB6 = U+0336
    0xCD 0x98 = U+0358
    0xE2 0x80 0x8C = U+200C
    0xE2 0x80 0x8B = U+200B
    0xCC 0xB7 = U+0337
    0xCC 0xA8 = U+0328
    0xCC 0xB2 = U+0332
    0xCD 0x99 = U+0359
    0xCD 0x96 = U+0356
    0xCC 0xBB = U+033B
    0xCC 0xB2 = U+0332
    0xCC 0x97 = U+0317
    0xCC 0xA6 = U+0326
    0xCD 0x9A = U+035A
    0xCD 0x99 = U+0359
    0xCC 0xAE = U+032E
    0xCD 0xA0 = U+0360
    0x79 = U+0079
    0xCC 0xAD = U+032D
    0xCC 0x96 = U+0316
    0xCC 0xB0 = U+0330
    0xCD 0x9A = U+035A
    0xCC 0x9E = U+031E
    0xCC 0xA3 = U+0323
    0xCC 0x97 = U+0317
    0xCC 0xB3 = U+0333
    0xCC 0xA0 = U+0320
    0xCD 0x95 = U+0355
    0xCC 0xBB = U+033B
    0xCC 0xBC = U+033C
    0xCD 0xA1 = U+0361
    0xCD 0x85 = U+0345
    0x21 = U+0021
    0xCC 0x9B = U+031B
    0xCD 0x96 = U+0356
    0xCC 0xAE = U+032E
    0xCD 0x94 = U+0354
    0xCD 0x8D = U+034D
    0xCC 0xB0 = U+0330
    0xCD 0x89 = U+0349
    0xCD 0xA2 = U+0362
    0x20 = U+0020
    0xCC 0xAD = U+032D
    0xCC 0x99 = U+0319
    0xCC 0x96 = U+0316
    0xCD 0x94 = U+0354
    0xCC 0xA9 = U+0329
    0xCC 0x97 = U+0317
    0xCC 0xA0 = U+0320
    0xCD 0x95 = U+0355
    0xCC 0xA6 = U+0326
    0xCC 0xAC = U+032C
    0xCD 0x93 = U+0353
    0xCD 0x9E = U+035E
    0xCD 0x9D = U+035D
    0xCD 0x85 = U+0345
    0x4F = U+004F
    0xD2 0x89 = U+0489
    0xD2 0x89 = U+0489
    0xCC 0xA3 = U+0323
    0xCC 0x9C = U+031C
    0xCC 0xBA = U+033A
    0xCC 0xAA = U+032A
    0xCC 0xB3 = U+0333
    0xCD 0x95 = U+0355
    0xCC 0x96 = U+0316
    0xCD 0x94 = U+0354
    0xCC 0xA0 = U+0320
    0xCD 0x99 = U+0359
    0xCD 0x8E = U+034E
    0xCD 0x95 = U+0355
    0xCC 0x99 = U+0319
    0xCC 0xA6 = U+0326
    0xCD 0x85 = U+0345
    0x6E = U+006E
    0xCC 0xA9 = U+0329
    0xCD 0x93 = U+0353
    0xCD 0x96 = U+0356
    0xCC 0x9D = U+031D
    0xCC 0x9F = U+031F
    0xCC 0xAD = U+032D
    0xCD 0x99 = U+0359
    0xCD 0x99 = U+0359
    0xCD 0x93 = U+0353
    0xCD 0x9A = U+035A
    0xCC 0xBC = U+033C
    0xCD 0x96 = U+0356
    0xCD 0x96 = U+0356
    0xCD 0x9C = U+035C
    0xCD 0x9E = U+035E
    0xC8 0xA9 = U+0229
    0xCC 0xA7 = U+0327
    0xCC 0xAC = U+032C
    0xCC 0xB1 = U+0331
    0xCC 0xA6 = U+0326
    0xCC 0xA0 = U+0320
    0xCC 0x99 = U+0319
    0xCC 0xA5 = U+0325
    0xCD 0x87 = U+0347
    0xCD 0x94 = U+0354
    0xCC 0xAA = U+032A
    0xCC 0x81 = U+0301
    0x20 = U+0020
    0xD2 0x89 = U+0489
    0xCC 0xB8 = U+0338
    0xCC 0x97 = U+0317
    0xCC 0xA6 = U+0326
    0xCD 0x87 = U+0347
    0xCC 0xB0 = U+0330
    0xCC 0xAA = U+032A
    0xCC 0xB0 = U+0330
    0xCC 0xAD = U+032D
    0xCC 0x98 = U+0318
    0xCC 0xB9 = U+0339
    0xCD 0x98 = U+0358
    0xCD 0xA2 = U+0362
    0x69 = U+0069
    0xCC 0xB4 = U+0334
    0xCD 0x9E = U+035E
    0xCD 0x8F = U+034F
    0xCC 0xA9 = U+0329
    0xCC 0xA4 = U+0324
    0xCC 0xB9 = U+0339
    0xCC 0x97 = U+0317
    0xCC 0x96 = U+0316
    0xCC 0xB0 = U+0330
    0xCD 0x8E = U+034E
    0xCC 0x96 = U+0316
    0xCC 0xB2 = U+0332
    0xCC 0xB2 = U+0332
    0xCC 0x98 = U+0318
    0xCD 0x93 = U+0353
    0xCC 0x97 = U+0317
    0xCC 0xAF = U+032F
    0xCD 0x9A = U+035A
    0xCC 0x9E = U+031E
    0xCD 0x96 = U+0356
    0xCC 0xA5 = U+0325
    0xCC 0xBB = U+033B
    0xCD 0x9D = U+035D
    0x73 = U+0073
    0xCD 0x9E = U+035E
    0xD2 0x89 = U+0489
    0xCC 0xB2 = U+0332
    0xCD 0x88 = U+0348
    0xCC 0x99 = U+0319
    0xCC 0xB9 = U+0339
    0xCC 0xA4 = U+0324
    0xCC 0xAB = U+032B
    0xCD 0x87 = U+0347
    0x20 = U+0020
    0xCD 0x9A = U+035A
    0xCC 0xAD = U+032D
    0xCD 0x8E = U+034E
    0xCD 0x89 = U+0349
    0xCC 0xA0 = U+0320
    0xCC 0xBA = U+033A
    0xCD 0x89 = U+0349
    0xCC 0xAE = U+032E
    0xCC 0x9E = U+031E
    0xCC 0xBB = U+033B
    0xCC 0xA3 = U+0323
    0xCC 0xB0 = U+0330
    0xCC 0xBA = U+033A
    0xCC 0x96 = U+0316
    0xCD 0x96 = U+0356
    0xCC 0x80 = U+0300
    0xCC 0x81 = U+0301
    0xCD 0xA2 = U+0362
    0xCD 0x9E = U+035E
    0x65 = U+0065
    0xCC 0xB7 = U+0337
    0xCC 0xAA = U+032A
    0xCC 0xAD = U+032D
    0xCC 0xAF = U+032F
    0xCC 0xBC = U+033C
    0xCD 0x93 = U+0353
    0xCD 0x8E = U+034E
    0xCC 0xB9 = U+0339
    0xCC 0xA0 = U+0320
    0xCD 0x96 = U+0356
    0xCC 0xB2 = U+0332
    0xCD 0x94 = U+0354
    0xCC 0xAA = U+032A
    0xCD 0x88 = U+0348
    0xCC 0xA6 = U+0326
    0xCD 0x88 = U+0348
    0xCC 0xB1 = U+0331
    0xCD 0x8D = U+034D
    0xCC 0xAD = U+032D
    0xCC 0xA9 = U+0329
    0xCD 0xA0 = U+0360
    0xC5 0x86 = U+0146
    0xCD 0x9E = U+035E
    0xD2 0x89 = U+0489
    0xCC 0xAE = U+032E
    0xCC 0xB3 = U+0333
    0xCD 0x93 = U+0353
    0xCD 0x99 = U+0359
    0xCD 0x88 = U+0348
    0xCC 0xBC = U+033C
    0xCD 0x89 = U+0349
    0xCC 0xAC = U+032C
    0xCD 0x95 = U+0355
    0xCD 0x88 = U+0348
    0xCC 0xBA = U+033A
    0xCD 0x88 = U+0348
    0xCC 0xAD = U+032D
    0xCC 0xA9 = U+0329
    0xCC 0xAA = U+032A
    0x6F = U+006F
    0xCD 0x87 = U+0347
    0xCC 0x97 = U+0317
    0xCC 0xB1 = U+0331
    0xCC 0xA0 = U+0320
    0xCC 0xB1 = U+0331
    0xCC 0xA0 = U+0320
    0xCC 0xAF = U+032F
    0xCC 0x95 = U+0315
    0xCD 0xA2 = U+0362
    0x75 = U+0075
    0xCC 0xB8 = U+0338
    0xCC 0xB3 = U+0333
    0xCC 0xA6 = U+0326
    0xCC 0xA9 = U+0329
    0xCC 0xB3 = U+0333
    0xCC 0xAB = U+032B
    0xCC 0x96 = U+0316
    0xCC 0x9C = U+031C
    0xCD 0x85 = U+0345
    0xE2 0x80 0x8C = U+200C
    0xE2 0x80 0x8B = U+200B
    0xC7 0xB5 = U+01F5
    0xCC 0xA2 = U+0322
    0xCC 0xB2 = U+0332
    0xCC 0xA3 = U+0323
    0xCD 0x8E = U+034E
    0xCC 0xAE = U+032E
    0xCC 0xAE = U+032E
    0xCC 0xBC = U+033C
    0xCC 0xAB = U+032B
    0xCC 0xA5 = U+0325
    0xCC 0xA0 = U+0320
    0xCD 0x99 = U+0359
    0xCC 0xB1 = U+0331
    0xCC 0x9D = U+031D
    0xCC 0x98 = U+0318
    0xCD 0x95 = U+0355
    0xCD 0x8E = U+034E
    0xCC 0xB3 = U+0333
    0xCC 0x9C = U+031C
    0xCC 0xB2 = U+0332
    0xCC 0x96 = U+0316
    0x68 = U+0068
    0xCC 0xB8 = U+0338
    0xCC 0x9B = U+031B
    0xCC 0xA9 = U+0329
    0xCD 0x9A = U+035A
    0xCC 0xAE = U+032E
    0xCC 0xA4 = U+0324
    0xCC 0x96 = U+0316
    0xCC 0xB9 = U+0339
    0xCD 0x99 = U+0359
    0x2E = U+002E
    0xCC 0xB6 = U+0336
    0xCC 0xA8 = U+0328
    0xCC 0xB3 = U+0333
    0xCC 0x96 = U+0316
    0xCC 0xA0 = U+0320
    0xCC 0x97 = U+0317
    0xCC 0xBC = U+033C
    0xCC 0xA9 = U+0329
    0xCD 0x95 = U+0355
    0xCD 0x87 = U+0347
    0xCD 0x89 = U+0349
    0xCD 0x93 = U+0353
    0xCC 0x9F = U+031F
    0xCC 0xA6 = U+0326
    0xCD 0x9C = U+035C
    0xCD 0x9E = U+035E
    0xCD 0x85 = U+0345
    0x0A = U+000A
    

    It gets tricky to decipher which parts of that are graphemes, but clearly, with all the stacked characters, this is not a fixed amount of data per grapheme, and there is no sane way to make Unicode work with a fixed width encoding per grapheme because, as the 'Zalgo' examples show, combining marks can basically be combined in arbitrary sequences.
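A rough way to see the variable size per grapheme is to decode the first bytes of the listing and group each base code point with the marks that follow it. This is only a sketch: `approx_clusters` is a hypothetical helper that uses the general category (`M*`) as a stand-in for real grapheme cluster rules, which are defined in Unicode Standard Annex #29 and are more involved.

```python
import unicodedata

# The first 27 bytes of the second example, copied from the listing above:
# the E cluster, followed by x and its first two combining marks.
raw = bytes([
    0xC8, 0xA8, 0xCC, 0xB8, 0xD2, 0x89, 0xCC, 0x9F, 0xCD, 0x8E, 0xCD, 0x9A,
    0xCC, 0xB9, 0xCD, 0x9A, 0xCC, 0x99, 0xCC, 0x9F, 0xCC, 0x96,
    0x78, 0xCC, 0xA8, 0xCD, 0x99,
])

def approx_clusters(text):
    """Attach each mark (general category Mn, Mc or Me) to the preceding code point.

    A simplification for illustration, not the full UAX #29 algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

text = raw.decode("utf-8")
clusters = approx_clusters(text)
for cluster in clusters:
    print(len(cluster), "code points,",
          len(cluster.encode("utf-32-le")), "bytes in UTF-32")
```

Even in UTF-32, the first cluster takes 44 bytes and the truncated second one takes 12, so the width per grapheme is not fixed.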

    The first grapheme in the second 'Zalgo' example contains:

    0xC8 0xA8 = U+0228    LATIN CAPITAL LETTER E WITH CEDILLA
    0xCC 0xB8 = U+0338    COMBINING LONG SOLIDUS OVERLAY
    0xD2 0x89 = U+0489    COMBINING CYRILLIC MILLIONS SIGN
    0xCC 0x9F = U+031F    COMBINING PLUS SIGN BELOW
    0xCD 0x8E = U+034E    COMBINING UPWARDS ARROW BELOW
    0xCD 0x9A = U+035A    COMBINING DOUBLE RING BELOW
    0xCC 0xB9 = U+0339    COMBINING RIGHT HALF RING BELOW
    0xCD 0x9A = U+035A    COMBINING DOUBLE RING BELOW
    0xCC 0x99 = U+0319    COMBINING RIGHT TACK BELOW
    0xCC 0x9F = U+031F    COMBINING PLUS SIGN BELOW
    0xCC 0x96 = U+0316    COMBINING GRAVE ACCENT BELOW
    

    The next code point is U+0078 LATIN SMALL LETTER X, the start of a new grapheme. Two of the combining marks, U+031F and U+035A, each appear twice in that list.
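The character names and the repeated marks in that first grapheme can be checked with Python's standard `unicodedata` module. The variable names here are illustrative.

```python
import unicodedata
from collections import Counter

# The eleven code points of the first grapheme, as listed above.
first_cluster = (
    "\u0228\u0338\u0489\u031F\u034E\u035A\u0339\u035A\u0319\u031F\u0316"
)

# Print each code point with its official Unicode character name.
for ch in first_cluster:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Collect the code points that occur more than once.
repeats = {
    f"U+{ord(ch):04X}": count
    for ch, count in Counter(first_cluster).items()
    if count > 1
}
print(repeats)
```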