Tags: encoding, utf-8, translation, ansi

Fixing file encoding


Today I ordered a translation for 7 different languages, and 4 of them appear to be great, but when I opened the other 3, namely Greek, Russian, and Korean, the text that was there wasn't related to any language at all. It looked like a bunch of error characters, like the kind you get when you have the wrong encoding on a file.

For instance, here is part of the output of the Korean translation:

½Ì±ÛÇ÷¹À̾î

¸ÖƼÇ÷¹À̾î

¿É¼Ç

I may not even speak a hint of Korean, but I can tell you with all certainty that is not Korean.

I assume this is a file encoding issue, and when I open the file in Notepad, the encoding is listed as ANSI, which is clearly a problem; the same can be said for the other two languages.

Does anyone have any ideas on how to fix the encoding of these 3 files? I have asked the translators to re-upload them in UTF-8, but in the meantime I thought I might try to fix it myself.

If anyone is interested in seeing the actual files, you can get them from my Dropbox.


Solution

  • If you look at the byte stream as pairs of bytes, they look vaguely Korean, but I cannot tell if they are what you would expect or not.

    bash$ python3.4
    Python 3.4.3 (v3.4.3:b4cbecbc0781, May 30 2015, 15:45:01)
    [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> buf = '½Ì±ÛÇ÷¹À̾î'
    >>> [hex(ord(b)) for b in buf]
    ['0xbd', '0xcc', '0xb1', '0xdb', '0xc7', '0xc3', '0xb7', '0xb9', '0xc0', '0xcc', '0xbe', '0xee']
    >>> u'\uBDCC\uB1DB\uC7C3\uB7B9\uC0CC\uBEEE'
    '뷌뇛쟃랹샌뻮'
    

    Your best bet is to wait for the translator to upload UTF-8 versions or have them tell you the encoding of the file. I wouldn't assume that the bytes are simply 16-bit characters.

    Update

    I passed this through the chardet module and it detected the character set as EUC-KR.

    >>> import chardet
    >>> chardet.detect(b'\xBD\xCC\xB1\xDB\xC7\xC3\xB7\xB9\xC0\xCC\xBE\xEE')
    {'confidence': 0.833333333333334, 'encoding': 'EUC-KR'}
    >>> b'\xBD\xCC\xB1\xDB\xC7\xC3\xB7\xB9\xC0\xCC\xBE\xEE'.decode('EUC-KR')
    '싱글플레이어'
    

    According to Google Translate, the first line is "Single Player". Try opening the file with Notepad and using EUC-KR as the encoding.
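
If you would rather convert the file yourself than wait for a re-upload, something along these lines may work. It is only a sketch: the function name `reencode_to_utf8` and the file names in the usage note are made up for illustration, and it assumes the chardet guess of EUC-KR is right for the whole Korean file. The Greek and Russian files would need their own source encodings, which have not been identified here.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="euc_kr"):
    """Read src as source_encoding and write the same text to dst as UTF-8.

    source_encoding defaults to EUC-KR only because that is what chardet
    guessed for the Korean sample; it is an assumption, not a known fact.
    """
    raw = Path(src).read_bytes()
    # Strict decoding (the default) raises UnicodeDecodeError if the guessed
    # encoding does not fit the bytes, instead of silently writing garbage.
    text = raw.decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

For example, `reencode_to_utf8("korean.txt", "korean-utf8.txt")` would leave the original untouched and write a UTF-8 copy next to it (both file names are hypothetical). Writing to a separate file means you can compare the result with the translator's UTF-8 version once it arrives.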