
Broken CJK data when reading an ISO-8859-1 file in Python


I'm parsing some files that are ISO-8859-1 and have Chinese, Japanese, and Korean characters in them.

import os

base_path = 'data/'
cwd = os.path.abspath(os.getcwd())
for f in os.listdir(base_path):
    path = os.path.join(cwd, base_path, f)
    cnt = 0
    with open(path, 'r', encoding='ISO-8859-1') as file:
        for line in file:
            print('line {}: {}'.format(cnt, line))
            cnt += 1

The code runs, but it prints broken characters. Other Stack Overflow questions suggest using encode and decode. For example, for Korean text I tried file.read().encode('latin1').decode('euc-kr'), but that didn't help. I also tried converting the files to UTF-8 with iconv, but the characters are still broken in the converted files.

Any suggestions would be much appreciated.


Solution

  • Sorry, no. ISO-8859-1 cannot contain any Chinese, Japanese, or Korean characters. The code page doesn't support them in the first place.

    What your code does is ask Python to assume the file is in ISO-8859-1 encoding and return the characters as Unicode (which is how Python strings are represented). If you do not pass the encoding parameter to open(), Python falls back to the platform's preferred locale encoding (often UTF-8) and still returns Unicode, i.e. logical characters without any particular byte encoding.
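This is also why the round trip you tried in the question can work at all: decoding EUC-KR bytes as ISO-8859-1 never raises, because ISO-8859-1 assigns a character to all 256 byte values, so you get mojibake instead of an error; re-encoding that mojibake as ISO-8859-1 restores the original bytes, which you can then decode with the right codec. A minimal sketch (the sample string '안녕' is just an illustration):

```python
original = '안녕'                      # Korean sample text
raw = original.encode('euc-kr')       # the actual bytes on disk
mojibake = raw.decode('iso-8859-1')   # what open(..., encoding='ISO-8859-1') shows you
# Re-encode to recover the original bytes, then decode with the right codec.
recovered = mojibake.encode('iso-8859-1').decode('euc-kr')
print(recovered)                      # 안녕
```

Note this trick only works when the wrong decoding was lossless. ISO-8859-1 is, but a codec such as CP1252, which leaves some byte values undefined, is not.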

    Now the question is how those CJK characters are actually encoded in the file. If you know the answer, you can just pass the right encoding to open() and it works right away. Say it is EUC-KR, as you mentioned; the code would be:

    cnt = 0
    with open(path, 'r', encoding='euc-kr') as file:
        for line in file:
            print('line {}: {}'.format(cnt, line))
            cnt += 1
    

    If you don't know the encoding, take a look at chardet. It can guess the encoding from the raw bytes. Example:

    import chardet

    with open(path, 'rb') as file:
        rawdata = file.read()

    guess = chardet.detect(rawdata)  # e.g. {'encoding': 'EUC-KR', 'confidence': 0.99}
    text = rawdata.decode(guess['encoding'])  # decode the raw bytes, not the dict
    cnt = 0
    for line in text.splitlines():
        print('line {}: {}'.format(cnt, line))
        cnt += 1
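If you cannot install chardet, a cruder stdlib-only alternative is to try a few plausible codecs in order and keep the first one that decodes without error. This is only a sketch: the function name and the candidate list are assumptions you should adapt to your data.

```python
def decode_with_candidates(rawdata, candidates=('utf-8', 'euc-kr', 'shift_jis', 'gb2312')):
    """Return (text, encoding) for the first candidate codec that decodes cleanly."""
    for encoding in candidates:
        try:
            return rawdata.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: ISO-8859-1 never raises, but may yield mojibake.
    return rawdata.decode('iso-8859-1'), 'iso-8859-1'

text, encoding = decode_with_candidates('안녕'.encode('euc-kr'))
print('decoded as', encoding)
```

Be aware that many legacy CJK encodings overlap, so the same bytes may decode "cleanly" under more than one codec; order the candidates by how likely each is in your data.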