I ran hexdump on a data file and it printed the following:
> hexdump myFile.data
a4c3
After switching the byte order I have the following:
c3a4
Do I assume those HEX values are actual Unicode values? If so, the values are :
and
Or do I take the c3a4 and treat it as UTF-8 data (since my PuTTY session is set to UTF-8), then convert it to Unicode?
If so, it results in E4, which is ä.
Which is the proper interpretation?
You cannot assume those hex values are Unicode values. In fact, hexdump will never (well, see below...) give you Unicode values.
Those hex values represent the binary data as it was written to disk when the file was created. But in order to translate that data back to any specific characters/symbols/glyphs, you need to know what specific character encoding was used when the file was created (ASCII, UTF-8, and so on).
Also, I recommend using hexdump with the -C option (that's the uppercase C) to give the so-called "canonical" representation of the hex data:
c3 a4 0a
In my case, there is also a 0a representing a newline character.
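The a4c3 in the question most likely comes from hexdump's default display, which groups the input into two-byte units and prints each unit as a single 16-bit number in the machine's byte order. Here is a minimal Python sketch of both views, assuming a little-endian machine (typical for x86); the variable names are only illustrative:

```python
import struct

# The two bytes as they sit on disk (trailing newline left out for brevity).
raw = b"\xc3\xa4"

# Default hexdump view: the pair is read as one 16-bit word.
# "<H" means little-endian unsigned short (assumed host byte order).
(word,) = struct.unpack("<H", raw)
default_view = f"{word:04x}"

# hexdump -C view: every byte printed on its own, in file order.
canonical_view = " ".join(f"{b:02x}" for b in raw)

print(default_view)    # a4c3
print(canonical_view)  # c3 a4
```

So no byte swapping is needed with -C: the bytes appear in the order they are stored in the file.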
So, in the above example we have 0xc3 followed by 0xa4 (I added the 0x part to indicate we are dealing with hex values). I happen to know that this file used UTF-8 when it was created. I can therefore determine that the character in the file is ä (also referred to by Unicode U+00E4).
But the key point is: you must know how the file was encoded, to know with certainty how to interpret the bytes provided by hexdump.
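To illustrate why the encoding matters, here is a small Python sketch that decodes the same two bytes under two different assumptions. Latin-1 is used only as an example of a "wrong guess"; the helper name code_points is made up for this illustration:

```python
def code_points(text):
    """Return the Unicode code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# The bytes reported by hexdump -C (without the trailing newline).
raw = b"\xc3\xa4"

# Assumption 1: the file was written as UTF-8 -> one character.
as_utf8 = raw.decode("utf-8")

# Assumption 2: the file was written as Latin-1 -> two characters.
as_latin1 = raw.decode("latin-1")

print(code_points(as_utf8))    # ['U+00E4']
print(code_points(as_latin1))  # ['U+00C3', 'U+00A4']
```

Both decodings succeed without an error, so the bytes alone cannot tell you which one the author of the file intended.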
Unicode is (amongst other things) an abstract numbering system for characters, separate from any specific encoding. That is one of the reasons why it is so useful. But it just so happens that its designers gave the first 128 characters the same code values they have in ASCII. That is why the ASCII letter a has the same code value as the Unicode a (0x61 in both). Once you get beyond that initial ASCII range, a character's Unicode code point and its UTF-8 bytes are no longer the same, as the ä example shows: U+00E4 versus the bytes c3 a4.
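The difference between a code point and its UTF-8 bytes can be checked with a short Python sketch. The function name describe is hypothetical, and bytes.hex with a separator argument assumes Python 3.8 or later:

```python
def describe(ch):
    """Show a character's Unicode code point next to its UTF-8 bytes."""
    code_point = ord(ch)              # abstract Unicode number
    utf8_bytes = ch.encode("utf-8")   # bytes as stored in a UTF-8 file
    return f"U+{code_point:04X} -> {utf8_bytes.hex(' ')}"

print(describe("a"))       # U+0061 -> 61
print(describe("\u00e4"))  # U+00E4 -> c3 a4
```

For a, the code point and the single UTF-8 byte have the same value; for ä, the code point E4 is stored as the two bytes c3 a4.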