Tags: encoding, utf-8, decode, encode

What type of encoding is this?


I have a dump file with a lot of lines like this:

$0414$0436$0435$0434$0430$0439
$05DE$05E1$05D3$05E8_$05D4$05D2$0027$05D3$05D9$05D9

I assume the strings above mean "Джедаи" (Russian) and "מסדר_הג'דיי" (Hebrew).

How can I decode these strings?
Which encoding is this?


Solution

  • The file contains UTF-16 code units written as 16-bit hex values, each prefixed with $. The one exception is the ASCII character _ (U+005F) in מסדר_הג'דיי, which was written to the file as-is instead of being hex-encoded. Oddly, the ASCII character ' (U+0027) in the same string was hex-encoded.

    To decode this, read the file one character at a time. When you encounter a $ character, skip it and hex-decode the next 4 characters into a 16-bit value; otherwise, treat the character itself as a 16-bit value. Collect these 16-bit values in order, and you will have a UTF-16 encoded string.
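
The steps above could be sketched in Python roughly as follows. This is a minimal sketch rather than a definitive decoder: the function name `decode_dump_line` is hypothetical, and it assumes that a `$` not followed by exactly four hex digits is a literal character and that any literal (unescaped) character fits in a single 16-bit code unit, as the `_` in the sample does.

```python
import string
import struct

HEX_DIGITS = set(string.hexdigits)


def decode_dump_line(line: str) -> str:
    """Decode one line of $XXXX-escaped UTF-16 code units (hypothetical helper)."""
    units = []
    i = 0
    while i < len(line):
        chunk = line[i + 1:i + 5]
        if line[i] == "$" and len(chunk) == 4 and all(c in HEX_DIGITS for c in chunk):
            # "$" followed by four hex digits: one 16-bit code unit.
            units.append(int(chunk, 16))
            i += 5
        else:
            # Anything else is taken as-is (assumed to fit in 16 bits).
            units.append(ord(line[i]))
            i += 1
    # Pack the code units as little-endian 16-bit values and let the
    # UTF-16 codec assemble them (including surrogate pairs) into text.
    raw = struct.pack("<%dH" % len(units), *units)
    return raw.decode("utf-16-le")


if __name__ == "__main__":
    print(decode_dump_line("$0414$0436$0435$0434$0430$0439"))
    print(decode_dump_line("$05DE$05E1$05D3$05E8_$05D4$05D2$0027$05D3$05D9$05D9"))
```

Going through the UTF-16 codec instead of calling `chr()` on each value means that a supplementary character stored as two escaped surrogates should come out as one character. When reading the dump file, strip the trailing newline from each line before passing it in, or it will be kept as a literal character.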