Tags: encoding, character-encoding, windows-1252

How can the Euro sign have a value higher than 0xFF in Windows-1252 and Latin-9?


There's a gap in my knowledge about charsets, encodings, etc. In the Windows-1252 and ISO/IEC 8859-15 (Latin-9) code pages, the value of the Euro sign (€) is given as 0x20AC, which is 8364 in decimal! But how can that be, when the idea is that every character in these encodings must fit into a single unsigned byte (i.e. the maximum value is 255)? US-ASCII covers the values 0-127, and 128-255 is the range where the different single-byte encodings vary.
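
Just to spell out the arithmetic (a quick check in a Python 3 interpreter):

>>> 0x20AC   # the value the tables list for the Euro sign
8364
>>> 0xFF     # the largest value a single unsigned byte can hold
255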

When I enter the text into a text editor (vim):

a € b

and save it to a file with encoding 'latin9', I see that the file consists of the following bytes:

$ xxd euro-file.txt
0000000: 6120 e282 ac20 620a                      a ... b.

OK so:

0x61 = 'a' character
0x20 = space character
0xE282 = ???
0xAC20 = This is the value of the Euro symbol, but the bytes are backwards; the reference said the value should be 0x20AC
0x62 = 'b' character

Could someone please explain how the Euro character can have a value higher than 255? And why are the bytes written for the Euro character backwards (0xAC20 instead of 0x20AC)?


Solution

  • The character is merely denoted by its Unicode code point, which is U+20AC. That number is not the byte value used in the Latin-9/CP1252 encoding tables; it is presumably listed this way to make unambiguous which character is meant, since the Unicode code chart is a good canonical reference. In the encoded files themselves the Euro sign is still a single byte: 0xA4 in Latin-9 and 0x80 in Windows-1252 (see the snippet at the end of this answer).

    That file you're running through xxd is apparently encoded in UTF-8 rather than Latin-9: in UTF-8, "€" is encoded as the three bytes E2 82 AC. The bytes are not backwards, either; xxd merely groups its output into 16-bit columns, so the actual byte sequence is 61 20 E2 82 AC 20 62 0A, and the 0x20 following 0xAC is the space before 'b'.

    You may want to start here: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
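
    For concreteness, here is a minimal check in a Python 3 interpreter ('iso-8859-15' and 'cp1252' are Python's codec names for Latin-9 and Windows-1252), showing the difference between the code point and the encoded bytes:

    >>> ord('€'), hex(ord('€'))    # the Unicode code point
    (8364, '0x20ac')
    >>> '€'.encode('iso-8859-15')  # Latin-9: one byte, 0xA4
    b'\xa4'
    >>> '€'.encode('cp1252')       # Windows-1252: one byte, 0x80
    b'\x80'
    >>> '€'.encode('utf-8')        # UTF-8: the three bytes in your xxd dump
    b'\xe2\x82\xac'

    If vim had actually saved the file as Latin-9, the dump would show a single A4 byte between the two spaces instead of E2 82 AC.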