unicode disassembly period

Why is Unicode stored with periods in-between characters?


So, right now I'm making a small package reader in Java. All the Unicode strings have periods between the characters (or at least that's how they are presented in a hex editor), so when I read them I need to go to the offset and read the bytes allocated for that field. For example, if it's a game name from an Xbox 360 file, I need to read 80 bytes and remove the '.'s to get a readable string.

So why is Unicode stored like this in files? Is it to signify that it's Unicode, or is it allocation padding, or what?

I'm not sure if my question is valid; it's just always been on my mind. Thanks.


Solution

  • Create a file containing "A" in Notepad and save it as Unicode; Windows will use the UTF-16 (LE) encoding, which stores the character in 2 bytes: 0x41 0x00.

    When you view this file in a hex editor (which neither knows nor cares about text encoding), 0x41 can be displayed as A, but 0x00 maps to no printable character, so a . (or equivalent) is displayed to let you know there is a byte there. The periods are therefore not stored in the file at all: they are how the editor draws the zero byte that forms the other half of each 2-byte character.
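Given that, a reader can decode the field as UTF-16 instead of stripping periods by hand. The following is only a minimal sketch: the class name `Utf16FieldDemo` and the helper `decodeField` are made up for illustration, and it assumes the field is padded with NUL characters after the text. The byte order is also an assumption to check against the real file: if the "." appears after each letter in the hex editor the data is little-endian (UTF-16LE), and if it appears before each letter it is big-endian (UTF-16BE).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16FieldDemo {

    // Hypothetical helper: decodes a fixed-size UTF-16 field and cuts it at the
    // first NUL character, assuming the rest of the field is padding.
    static String decodeField(byte[] data, int offset, int length, Charset charset) {
        String s = new String(data, offset, length, charset);
        int end = s.indexOf('\0');
        return end >= 0 ? s.substring(0, end) : s;
    }

    public static void main(String[] args) {
        // "A" as Notepad's "Unicode" (UTF-16LE) stores it: 0x41 0x00, then NUL padding.
        byte[] littleEndian = {0x41, 0x00, 0x00, 0x00};
        // The same text in UTF-16BE: 0x00 0x41, then NUL padding.
        byte[] bigEndian = {0x00, 0x41, 0x00, 0x00};

        System.out.println(decodeField(littleEndian, 0, littleEndian.length, StandardCharsets.UTF_16LE));
        System.out.println(decodeField(bigEndian, 0, bigEndian.length, StandardCharsets.UTF_16BE));
    }
}
```

For an 80-byte name field, the same call would take the field's offset and a length of 80. Unlike removing every '.', this keeps real periods in the text and also handles characters whose second byte is not zero.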