There's a gap in my knowledge about charsets, encodings, etc. In the Windows-1252 and ISO/IEC 8859-15 (Latin-9) code pages, the value of the Euro sign (€) is given as 0x20AC, which is 8364 in decimal! But how can that be, when the idea is that every character in these encodings must fit into a single unsigned byte (i.e. the maximum value can be 255)? US-ASCII takes up values 0-127, and 128-255 is where the different character encodings vary.
When I enter the text into a text editor (vim):
a € b
and save it to a file with encoding 'latin9', I see that the file consists of the following bytes:
$ xxd euro-file.txt
0000000: 6120 e282 ac20 620a a ... b.
OK so:
0x61 = 'a' character
0x20 = space character
0xE282 = ???
0xAC20 = This is the value of the Euro symbol, but the bytes are backwards; the reference said the value should be 0x20AC
0x62 = 'b' character
Could someone please explain how the Euro character can have a value higher than 255? And why are the bytes for the Euro character written backwards (0xAC20 instead of 0x20AC)?
The character is merely denoted by its Unicode code point, which is U+20AC. That number is not the byte value in the Latin-9/CP1252 encoding tables; in Latin-9 the Euro is actually the single byte 0xA4, and in Windows-1252 it is 0x80. It's presumably listed by code point to disambiguate exactly which character is meant; the Unicode table is a pretty good canonical reference.
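A quick Python sketch (purely illustrative; "iso-8859-15" and "cp1252" are just Python's codec names for Latin-9 and Windows-1252) shows the difference between the code point and the single-byte encodings:

>>> euro = "€"
>>> hex(ord(euro))                # the Unicode code point
'0x20ac'
>>> euro.encode("iso-8859-15")    # Latin-9: one byte
b'\xa4'
>>> euro.encode("cp1252")         # Windows-1252: one byte
b'\x80'
>>> euro.encode("utf-8")          # UTF-8: three bytes
b'\xe2\x82\xac'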
That file you're running through xxd is apparently encoded in UTF-8, where "€" is encoded using the three bytes E2 82 AC.
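If you want to check this yourself, here is a small sketch (again illustrative, using the byte string from the xxd dump above) that decodes the dumped bytes as UTF-8 and re-encodes the same text as genuine Latin-9:

>>> data = bytes.fromhex("6120e282ac20620a")    # the bytes xxd showed
>>> data.decode("utf-8")                        # they decode cleanly as UTF-8
'a € b\n'
>>> data.decode("utf-8").encode("iso-8859-15")  # the same text saved as Latin-9
b'a \xa4 b\n'

Saved as real Latin-9 the file would be six bytes (61 20 a4 20 62 0a) instead of the eight in your dump, which is another way to tell the file is actually UTF-8.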
You may want to start here: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.