Tags: utf-8, character-encoding, windows-1252

Projecting number of bytes for double, triple etc encoding cp1252->UTF-8


So I recently came across this question:

How do I recreate these special characters when testing a website?

where the user describes a situation in which a client sends a string with accented characters in it, and then the string gets repeatedly encoded in UTF-8 and decoded in Windows-1252.

For example, here is what one client sent:

Kovovýroba Navalany

And here is what that string turned into after being misinterpreted eight times:

[486-character string omitted: the 18 ASCII characters come through unchanged, while the single ý has expanded into 468 characters of mojibake]

in which the single accented character (ý) became 468 characters of gibberish.

Building off another user's response, I finally figured out the root problem and the number of misinterpretations that occurred by purposely simulating the errors in Python. This is what I came up with:

Number of                       Result
misinterpretations
----------------------------------------
0                               ý
1                               Ã½
2                               ÃƒÂ½
3                               ÃƒÆ’Ã‚Â½
4                               ÃƒÆ’Ã†â€™Ãƒâ€šÃ‚Â½
5-8                             (42, 95, 211 and 468 characters of mojibake, omitted here)
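
For reference, here is a minimal sketch of that simulation, tracking just the ý:

text = "ý"
for n in range(9):
    print(n, text, len(text))
    # encode as UTF-8, then (incorrectly) decode those bytes as Windows-1252
    text = text.encode("utf-8").decode("cp1252")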

I later began wondering if there was some sort of pattern in these strings, beginning with their length. So, I wrote a short piece of Python code to return the length of a particular character after n misencodings:

def misencodinglength(text, n):
    # encode as UTF-8, then (incorrectly) decode those bytes as Windows-1252, n times
    for _ in range(n):
        text = text.encode("utf-8").decode("cp1252")
    return len(text)
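
For example, the lengths table below can be produced with:

for n in range(25):
    print(n, misencodinglength("ý", n))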

Using this function, I cycled through this process 24 times. The lengths of these strings are:

Misencodings    Length
--------------------------------
0               1
1               2
2               4
3               8
4               18
5               42
6               95
7               211
8               468
9               1040
10              2314
11              5152
12              11472
13              25541
14              56857
15              126566
16              281744
17              627188
18              1396186
19              3108058
20              6918859
21              15402083
22              34286592
23              76325420
24              169908118

I then input the first 15 of these lengths into Wolfram Alpha, and it also tried to predict a possible continuation for the sequence. I was amazed that its predictions for terms 15 through 24 were exactly the same as what I got.

It was also able to determine a generating function for this sequence:

G_n(a_n)(z) = (-z^5 - z^4 + z^3 - z^2 + z - 1)/(z^6 + z^5 - z^4 + 3 z^3 - 3 z^2 + 3 z - 1)
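
For what it's worth, the denominator of this generating function corresponds to the linear recurrence a_n = 3 a_(n-1) - 3 a_(n-2) + 3 a_(n-3) - a_(n-4) + a_(n-5) + a_(n-6) for n >= 6, and that recurrence really does reproduce the whole table above:

lengths = [1, 2, 4, 8, 18, 42, 95, 211, 468, 1040, 2314, 5152, 11472, 25541,
           56857, 126566, 281744, 627188, 1396186, 3108058, 6918859,
           15402083, 34286592, 76325420, 169908118]
# seed with the first six observed lengths, then apply the recurrence
a = lengths[:6]
for n in range(6, 25):
    a.append(3*a[n-1] - 3*a[n-2] + 3*a[n-3] - a[n-4] + a[n-5] + a[n-6])
print(a == lengths)   # True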

Another user in the same discussion had a similar problem: a right single quotation mark (’) used as an apostrophe was being repeatedly encoded as UTF-8 and decoded as Windows-1252, producing â€™ and, eventually, much longer strings of the same kind of gibberish.

Number of                       Result
misinterpretations
----------------------------------------
0                               ’
1                               â€™
2                               Ã¢â‚¬â„¢
3                               ÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢
4-8                             (38, 82, 182, 407 and 909 characters of mojibake, omitted here)

After generating the subsequent misinterpretations, their lengths are:

Misencodings    Length
--------------------------------
0               1
1               3
2               8
3               18
4               38
5               82
6               182
7               407
8               909
9               2026
10              4510
11              10036
12              22336
13              49720
14              110685
15              246403
16              548524
17              1221070
18              2718218
19              6051018
20              13470174
21              29985979
22              66751845
23              148596390
24              330790634

After inputting the first 15 of these terms into Wolfram Alpha, the subsequent terms 15 through 24 that it predicted were exactly the same as what I got above.

The generating function in this case was

G_n(a_n)(z) = (-2 z^2 - 1)/(z^6 + z^5 - z^4 + 3 z^3 - 3 z^2 + 3 z - 1)
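
The denominator is identical to the one for the ý sequence, so the same recurrence applies; only the numerator (and hence the initial terms) differs. The analogous check:

lengths2 = [1, 3, 8, 18, 38, 82, 182, 407, 909, 2026, 4510, 10036, 22336,
            49720, 110685, 246403, 548524, 1221070, 2718218, 6051018,
            13470174, 29985979, 66751845, 148596390, 330790634]
b = lengths2[:6]
for n in range(6, 25):
    b.append(3*b[n-1] - 3*b[n-2] + 3*b[n-3] - b[n-4] + b[n-5] + b[n-6])
print(b == lengths2)   # True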

My question is: it seems that it is possible to determine the exact number of characters in these misinterpretations in advance. Is this a coincidence, or is there something else going on from which we can derive these results?


Solution

  • The explanation is rather simple, actually.

    Characters in the range 0-127 are the same in ASCII, in code page 1252, and in UTF-8.

    Generally, code page 1252 characters in the range 128-255 correspond to Unicode code points with the same numbers, which are represented as two bytes in UTF-8.

    However, a few of the characters in code page 1252 do not straightforwardly map to the Unicode code points with the same number. In your example, the sequence ‚ (which occurs twice), i.e. the three bytes 0xE2 0x80 0x9A, is the UTF-8 encoding of cp1252 character 0x82 (decimal 130) which maps to U+201A SINGLE LOW-9 QUOTATION MARK (used as an opening single quote mark in some languages, e.g. German; similar to, but in some fonts clearly distinct from, a comma).
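
    A quick way to see this round trip in Python (the byte values are the ones quoted above):

        c = b"\x82".decode("cp1252")              # cp1252 byte 0x82
        print(c, hex(ord(c)))                     # ‚ 0x201a
        print(c.encode("utf-8"))                  # b'\xe2\x80\x9a'
        print(b"\xe2\x80\x9a".decode("cp1252"))   # â€š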

    Notice how the characters in your example string whose code points encode to two bytes in UTF-8 turn into sequences beginning with Ã or Â (in the general case, one of Â through ß), whereas the three-byte sequence starts with â (in the general case, one of à through ï).

    The Wikipedia article for Windows code page 1252 contains a table which nicely documents the Unicode code point for each character which does not have an identity mapping in Unicode. Of particular interest here are the ones in the range 0x80-0x9F (decimal 128-159), many of which map to Unicode code points which encode into three bytes in UTF-8. (Code points from U+0800 through U+FFFF have this property.)

    Screenshot of interesting part of Wikipedia table

    For more details about the UTF-8 encoding, see also its Wikipedia page.

    The formula for how many bytes a string will require when double, triple etc. encoded depends on the precise input. Statistically, 17 characters in code page 1252 correspond to 3-byte sequences in UTF-8, 106 correspond to 2-byte sequences, and the low 128 map identically to a 1-byte sequence (the remaining 5 byte values are undefined), so on average you could project a factor of (17×3 + 106×2 + 128) / 251 = 391/251 ≈ 1.5578 for each encoding round; however, see below for some complications and corrections.
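
    Here is a small sketch which derives those counts directly from Python's cp1252 codec (the five undefined byte values raise an error on a strict decode):

        from collections import Counter

        counts = Counter()
        for b in range(256):
            try:
                ch = bytes([b]).decode("cp1252")
            except UnicodeDecodeError:
                counts["undefined"] += 1
                continue
            counts[len(ch.encode("utf-8"))] += 1
        print(counts)   # Counter({1: 128, 2: 106, 3: 17, 'undefined': 5})
        total = 1*counts[1] + 2*counts[2] + 3*counts[3]
        print(total / (256 - counts["undefined"]))   # 391/251 ≈ 1.5578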

    It's not hard to find sequences which, for the first round, require exactly the same number of bytes, or exactly double; but from there on you will get some triplication in subsequent rounds whenever one of the resulting bytes in the UTF-8 representation happens to be one of the problematic ones, and this becomes increasingly likely the further you go. Conversely, if you want to specifically target character codes which maximize the length when you repeatedly re-encode them, the character codes which map to a Unicode code point in the range slightly above U+2000 will all exhibit at least one triplication already in the first round. For those, the middle byte of the UTF-8 encoding falls in the range which will exhibit another triplication on the next round of encoding (so two of the resulting characters map to two bytes and one maps to three, for a total factor of 7/3); and a few of them have two bytes which map to three bytes each (namely 146, 150, and 151), for a yield of 8/3.

    Examining character code 146 (0x92), for example: it encodes into the UTF-8 bytes 0xE2 0x80 0x99, and on the next round the code page 1252 character 226 (0xE2) maps to 0xC3 0xA2, character 128 (0x80) maps to 0xE2 0x82 0xAC, and character 153 (0x99) maps to 0xE2 0x84 0xA2. You'll note that 0x82 and 0x84 will exhibit triplication again on the next round of encoding.
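
    The same walk-through in code, starting from cp1252 byte 0x92 and printing the raw bytes after each round (the lengths 1, 3, 8, 18 match the apostrophe sequence above):

        data = b"\x92"   # cp1252 for the right single quotation mark (U+2019)
        for rounds in range(4):
            print(len(data), data.hex(" "))
            # reinterpret the bytes as cp1252 text, then write that text back out as UTF-8
            data = data.decode("cp1252").encode("utf-8")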

    Also of interest is 148, whose UTF-8 encoding contains a byte which is not a valid character code in code page 1252; this could map to U+FFFD (which represents an unknown or unrepresentable character, and itself requires three bytes in UTF-8), or cause your program to crash, depending on what options exactly you use for producing the mojibake. On subsequent rounds of encoding, several other code page 1252 characters land on byte codes which are undefined in code page 1252 when you map them to UTF-8.
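
    For example, with Python's error handlers (the default strict mode is the "crash" case):

        data = b"\x94".decode("cp1252").encode("utf-8")   # 148 = 0x94 -> U+201D -> b'\xe2\x80\x9d'
        print(data)
        # 0x9D is undefined in cp1252: a strict decode raises, errors="replace" substitutes U+FFFD
        try:
            data.decode("cp1252")
        except UnicodeDecodeError as exc:
            print("strict:", exc)
        print("replace:", data.decode("cp1252", errors="replace"))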

    Returning again to your example Czech string Kovovýroba Navalany: it contains 19 characters, 18 of which are simple ASCII characters which map back to themselves (so they contribute a simple constant 1), plus the ý whose polynomial you discovered. Some non-ASCII characters will have a much simpler polynomial (doubling on each iteration, i.e. a_n = 2^n), but many will have more complex ones. If you wanted to, you could map out which constant growth factor each character converges towards, and produce an even more precise statistical projection for the entire character range. Basically, regard each character as the root of a tree: some branches always have two children, doubling each time you proceed down the tree regardless of which branch you take; some are trivially linear; still others have a different structure, where some branches have three children (indicating a triplication in the next step) with a self-similar tree further down that branch; and some branches terminate with an error (for the five undefined character codes). But I'm leaving exploring this in full detail as an exercise.

    Here is a simple demo which tabulates the first three rounds of repeated re-encoding for character codes 128 to 255: https://ideone.com/wiruJ0
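
    The linked code is not reproduced here, but a rough local equivalent of that tabulation might look like this (using errors="replace" so the problematic codes don't stop the loop, which may differ slightly from the demo):

        for code in range(128, 256):
            try:
                text = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                continue   # skip the five undefined byte values
            sizes = []
            for _ in range(3):
                raw = text.encode("utf-8")
                sizes.append(len(raw))
                text = raw.decode("cp1252", errors="replace")
            print(code, sizes)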

    The most common pattern in the demo's output is "2 4 9", i.e. duplication on the first two rounds but one triplication in the third (41 occurrences), though the regular duplication pattern "2 4 8" is also prevalent (38 occurrences). If you add one more layer, you'll see that those too will inevitably have at least two triplications in the fourth round.

    In fact, here is another demo which shows two sets of 37 character codes which each stabilize on one convergent series, and two sets of 11 each on another, with all other variations containing only one or two character codes. It explores 16 rounds of re-encoding (with a slightly different representation, which throws away 10 character codes for encoding errors after the first round: five because they are undefined in the first place, and another five because they map to an undefined character code in the first round): https://ideone.com/vnDKtJ - actually, these patterns emerge already after five rounds, and remain stable even if you extend to 20 rounds. You'll notice that the ones which start with "1,2,4,8" then jump to either 18 or 19, not 16.