Using hexdump and how to find associated character?


I ran hexdump on a data file and it printed the following:

        > hexdump myFile.data
          a4c3

After switching the byte order I have the following:

          c3a4 

Do I assume those hex values are actual Unicode code points? If so, they would be the characters at U+A4C3 and U+C3A4.

Or do I take c3a4 and treat it as UTF-8 data (since my PuTTY session is set to UTF-8), then convert it to a Unicode code point?

If so, the result is E4, which is ä.

Which is the proper interpretation?


Solution

  • You cannot assume those hex values are Unicode code points. hexdump shows the raw bytes of the file, not code points; the two coincide only for characters in the ASCII range, as explained at the end of this answer.

    Those hex values represent the binary data as it was written to disk when the file was created. But in order to translate that data back to any specific characters/symbols/glyphs, you need to know what specific character encoding was used when the file was created (ASCII, UTF-8, and so on).

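    Also, I recommend using hexdump with the -C option (that's an uppercase C) to get the so-called "canonical" representation of the hex data. By default, hexdump prints the data as 16-bit words in the machine's byte order, which is presumably why the two bytes looked swapped in your output (a little-endian machine shows the bytes c3 a4 as a4c3). With -C, the bytes are printed one at a time, in the order they appear in the file: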

    c3 a4 0a
    

    In my case, there is also a 0a representing a newline character.
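The difference between the two views can be reproduced with a short Python sketch. It assumes a little-endian machine and the same three bytes as my file; it imitates the two display styles and is not hexdump's actual implementation.

```python
import struct

data = bytes([0xC3, 0xA4, 0x0A])  # the three bytes from the -C output above

# Canonical view: one byte at a time, in file order.
canonical = " ".join(f"{b:02x}" for b in data)

# Word view: pad to an even length, then read each pair of bytes as a
# little-endian 16-bit integer, which swaps the two bytes when printed.
padded = data + b"\x00" * (len(data) % 2)
words = struct.unpack(f"<{len(padded) // 2}H", padded)
word_view = " ".join(f"{w:04x}" for w in words)

print(canonical)  # c3 a4 0a
print(word_view)  # a4c3 000a
```

The file itself never changes; only the grouping used for display does.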

    So, in the above example we have 0xc3 followed by 0xa4 (I added the 0x prefix to indicate we are dealing with hex values). I happen to know that this file used UTF-8 when it was created. I can therefore determine that the character in the file is ä (Unicode code point U+00E4).

    But the key point is: you must know how the file was encoded, to know with certainty how to interpret the bytes provided by hexdump.
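To see how much the result depends on the assumed encoding, here is a minimal Python sketch that decodes the same two bytes in four ways. The list of encodings is only an illustrative selection, not an exhaustive one.

```python
raw = bytes([0xC3, 0xA4])  # the two bytes in question, in file order

# Candidate encodings chosen for illustration only.
encodings = ("utf-8", "latin-1", "utf-16-le", "utf-16-be")

decoded = {}
for encoding in encodings:
    text = raw.decode(encoding)
    decoded[encoding] = text
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{encoding:10} {code_points}")
```

Only the UTF-8 reading gives U+00E4 (ä). Treating the bytes as one 16-bit value gives U+A4C3 or U+C3A4 depending on byte order, which corresponds to the first interpretation in the question, and Latin-1 gives two separate characters.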


    Unicode is (amongst other things) an abstract numbering system for characters, separate from any specific encoding. That is one of the reasons why it is so useful. Its designers reused the ASCII values for the first 128 code points, and UTF-8 encodes each of those code points as a single byte with the same value. That is why the letter a is 0x61 in ASCII, U+0061 in Unicode, and the single byte 61 in a UTF-8 file. Beyond that ASCII range, a code point and its UTF-8 bytes are no longer the same: ä is U+00E4, but UTF-8 stores it as the two bytes c3 a4.
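The step from the bytes c3 a4 to the code point U+00E4 can be worked through by hand. The sketch below uses a hypothetical helper, decode_two_byte_utf8, written only for this illustration; it handles two-byte sequences only and does not reject overlong forms, so it is not a full UTF-8 decoder.

```python
def decode_two_byte_utf8(first: int, second: int) -> int:
    """Return the code point encoded by a two-byte UTF-8 sequence."""
    # The lead byte must look like 110xxxxx and the continuation like 10xxxxxx.
    if first & 0b1110_0000 != 0b1100_0000 or second & 0b1100_0000 != 0b1000_0000:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation.
    return ((first & 0b0001_1111) << 6) | (second & 0b0011_1111)

code_point = decode_two_byte_utf8(0xC3, 0xA4)
print(f"U+{code_point:04X}")            # U+00E4

# In the ASCII range the code point and the UTF-8 byte are the same value.
print("a".encode("utf-8").hex())        # 61
print("\u00e4".encode("utf-8").hex())   # c3a4
```

In practice you would call bytes.decode("utf-8") rather than do this by hand; the manual version is only there to show where E4 comes from.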