Tags: python, character-encoding, cjk

How do I decode b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"?


[Summary]: The data grabbed from the file is

b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"

How to decode these bytes into readable Chinese characters please?

======

I extracted some game scripts from an exe file. The file is packed with Enigma Virtual Box and I unpacked it.

Then I was able to see the scripts' names just fine, in English, as they are supposed to be.

When analyzing these scripts, I get an error that looks like this:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 0: invalid start byte

I changed the decoding to GBK, and the error disappeared.

But the output file is not readable. It contains readable English characters mixed with unreadable content that is supposed to be Chinese. Example:

chT0002>pDIӘIʆ

I tried different encodings for saving the file and they show the same result, so the problem might be on the decoding part.
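To confirm the problem is on the decoding side, you can decode the raw bytes with errors="replace", which substitutes U+FFFD (�) for each undecodable sequence instead of raising. A minimal sketch using the bytes from the question:

```python
data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"

# Strict UTF-8 decoding fails immediately: 0x95 is a continuation
# byte and can never start a UTF-8 character.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)

# errors="replace" swaps each undecodable sequence for U+FFFD,
# which shows how much of the data is valid UTF-8 at all.
print(data.decode("utf-8", errors="replace"))
```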

The data grabbed from the file is

b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"

I tried many ways, but I just can't decode these bytes into readable Chinese characters. Is there something wrong with the file itself, or somewhere else? I really need help, please.

One of the scripts is attached here.


Solution

  • To reliably decode bytes, you must know how the bytes were encoded. To borrow a quote from the Python codecs docs:

    Without external information it’s impossible to reliably determine which encoding was used for encoding a string.

    Without that information, there are still ways to try to detect the encoding (chardet seems to be the most widely used library). Here's how you could approach that:

    import chardet
    
    data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
    
    # detect() returns a dict like {"encoding": ..., "confidence": ..., "language": ...};
    # "encoding" is None when chardet cannot make a guess, so guard before decoding.
    detected = chardet.detect(data)
    if detected["encoding"] is not None:
        decoded = data.decode(detected["encoding"])
    

    The above example, however, does not work in this case because chardet isn't able to detect the encoding of these bytes. At that point, you'll have to either use trial-and-error or try other libraries.

    One method is to simply try every standard encoding, print each result, and see which one makes sense.

    codecs = [
        "ascii", "big5", "big5hkscs", "cp037", "cp273", "cp424", "cp437", "cp500", "cp720", 
        "cp737", "cp775", "cp850", "cp852", "cp855", "cp856", "cp857", "cp858", "cp860",
        "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874", "cp875",
        "cp932", "cp949", "cp950", "cp1006", "cp1026", "cp1125", "cp1140", "cp1250",
        "cp1251", "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257",
        "cp1258", "cp65001", "euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312",
        "gbk", "gb18030", "hz", "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2",
        "iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr", "latin_1",
        "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5", "iso8859_6", "iso8859_7",
        "iso8859_8", "iso8859_9", "iso8859_10", "iso8859_11", "iso8859_13", "iso8859_14",
        "iso8859_15", "iso8859_16", "johab", "koi8_r", "koi8_t", "koi8_u", "kz1048",
        "mac_cyrillic", "mac_greek", "mac_iceland", "mac_latin2", "mac_roman",
        "mac_turkish", "ptcp154", "shift_jis", "shift_jis_2004", "shift_jisx0213",
        "utf_32", "utf_32_be", "utf_32_le", "utf_16", "utf_16_be", "utf_16_le", "utf_7",
        "utf_8", "utf_8_sig",
    ]
    
    data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"
    
    for codec in codecs:
        try:
            print(f"{codec}, {data.decode(codec)}")
        except UnicodeDecodeError:
            continue
    

    Output

    cp037, nC«^ýËfimb«[
    cp273, nC«¢ýËfimb«¬
    cp437, ò├è░ìsåëöéè║
    cp500, nC«¢ýËfimb«¬
    cp720, ـ├è░së¤éè║
    cp737, Χ├Λ░ΞsΗΚΦΓΛ║
    cp775, Ģ├Ŗ░ŹsåēöéŖ║
    cp850, ò├è░ìsåëöéè║
    cp852, Ľ├Ő░ŹsćëöéŐ║
    cp855, Ћ├і░ЇsєЅћѓі║
    cp856, ץ├ך░םsזיפגך║
    cp857, ò├è░ısåëöéè║
    cp858, ò├è░ìsåëöéè║
    cp860, ò├è░ìsÁÊõéè║
    cp861, þ├è░Þsåëöéè║
    cp862, ץ├ך░םsזיפגך║
    cp863, Ï├è░‗s¶ëËéè║
    cp864, ¼ﺃ├٠┌s│┬½∙├ﻑ
    cp865, ò├è░ìsåëöéè║
    cp866, Х├К░НsЖЙФВК║
    cp875, nCα£δΉfimbας
    cp949, 빩뒺뛱냹봻듆
    cp1006, ﺣﺍsﭦ
    cp1026, nC«¢`Ëfimb«¬
    cp1125, Х├К░НsЖЙФВК║
    cp1140, nC«^ýËfimb«[
    cp1250, •ĂŠ°Ťs†‰”‚Šş
    cp1251, •ГЉ°Ќs†‰”‚Љє
    cp1256, •أٹ°چs†‰”‚ٹ؛
    gbk, 暶姲峴唹攤姾
    gb18030, 暶姲峴唹攤姾
    latin_1, ðsº
    iso8859_2, Ă°sş
    iso8859_4, ðsē
    iso8859_5, УАsК
    iso8859_7, Γ°sΊ
    iso8859_9, ðsº
    iso8859_10, ðsš
    iso8859_11, รฐsบ
    iso8859_13, Ć°sŗ
    iso8859_14, ÃḞsẃ
    iso8859_15, ðsº
    iso8859_16, Ă°sș
    koi8_r, ∙ц┼╟█s├┴■┌┼╨
    koi8_u, ∙ц┼╟█s├┴■┌┼╨
    kz1048, •ГЉ°Қs†‰”‚Љғ
    mac_cyrillic, Х√К∞НsЖЙФВКЇ
    mac_greek, ïΟäΑçsÜâî²äΚ
    mac_iceland, ï√ä∞çsÜâîÇä∫
    mac_latin2, ē√äįćsÜČĒāäļ
    mac_roman, ï√ä∞çsÜâîÇä∫
    mac_turkish, ï√ä∞çsÜâîÇä∫
    ptcp154, •ГҠ°ҚsҶү”ӮҠә
    shift_jis_2004, 陛寛行̹狽桓
    shift_jisx0213, 陛寛行̹狽桓
    utf_16, 쎕낊玍覆芔몊
    utf_16_be, 闃誰赳蚉钂誺
    utf_16_le, 쎕낊玍覆芔몊
    

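    Since the target language is known to be Chinese, the candidate list can also be narrowed automatically. A heuristic sketch (the codec shortlist and the 80% threshold are assumptions, not part of the original approach): keep only results whose characters fall mostly in the CJK Unified Ideographs block.

    ```python
    data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"

    def mostly_cjk(text, threshold=0.8):
        """True if at least `threshold` of the characters are CJK ideographs."""
        if not text:
            return False
        cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
        return cjk / len(text) >= threshold

    # A shortlist of codecs commonly seen for CJK text (illustrative only).
    for codec in ("gbk", "gb18030", "big5", "utf_16_be", "utf_16_le"):
        try:
            text = data.decode(codec)
        except UnicodeDecodeError:
            continue
        if mostly_cjk(text):
            print(codec, text)
    ```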
    Edit: After running all of the seemingly legible results through Google Translate, I suspect the encoding is UTF-16 big-endian. Here are the results:

    Encoding    Decoded         Language Detected   English Translation
    gbk         暶姲峴唹攤姾     Chinese             Jian Xian Jiao Tan Jiao
    gb18030     暶姲峴唹攤姾     Chinese             Jian Xian Jiao Tan Jiao
    utf_16      쎕낊玍覆芔몊     Korean              (none)
    utf_16_be   闃誰赳蚉钂誺     Chinese             "Who is the epiphysis?"
    utf_16_le   쎕낊玍覆芔몊     Korean              (none)
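    The UTF-16 big-endian guess is easy to check directly: pairing the bytes up as 16-bit big-endian code units reproduces the utf_16_be row above character for character.

    ```python
    data = b"\x95\xc3\x8a\xb0\x8ds\x86\x89\x94\x82\x8a\xba"

    # Each pair of bytes is one big-endian UTF-16 code unit:
    # 95 C3 -> U+95C3, 8A B0 -> U+8AB0, 8D 73 -> U+8D73, ...
    text = data.decode("utf_16_be")
    print(text)  # 闃誰赳蚉钂誺
    ```

    Whether that string is the *intended* text is another question; if the surrounding script file mixes single-byte English with these sequences, the bytes may instead be a custom or obfuscated encoding used by the game.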