Tags: python-3.x, unicode, encoding, byte, decoding

Encoding issues related to Python and foreign languages


Here's a problem I am facing with encoding and decoding text.

I am trying to write code that finds a string (or a byte sequence) in a file and returns the path of that file.

The files I am opening are reported as 'windows-1252' (also known as 'cp1252'), so I have been trying to:

1. encode my string into bytes using the encoding of the file
2. match those bytes against the file contents and get the path of that file

I have a file, say 'f', that is reported as 'windows-1252' (cp1252). It contains Chinese text: '[跑Online農場]'

with open(os.path.join(root, filename), mode='rb') as f:
    text = f.read()
    # encoding() is a separate function that I wrote; it returns the encoding of the file
    print(encoding(text))
    print(text)

This prints:

Windows-1252
b'\x00StaticText\x00\x00\x12\x00[\xb6]Online\xb9A\xb3\xf5]\x00\x01\x00\x ...

As you can see, the bytes for [跑Online農場] are [\xb6]Online\xb9A\xb3\xf5]

However, the funny thing is that if I encode the string directly as cp1252, I get:

enter_text = '[跑Online農場]'
print(bytes(enter_text, 'cp1252'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u8dd1' in position 1: character maps to <undefined>

On the other hand, opening the file using

with open(os.path.join(root, filename), mode='r', encoding='cp-1252') as f ...

I get:

StaticText   [¶]Online¹A³õ]   €?‹  Œ  î...

I am not sure how I would 'translate' '[跑Online農場]' into '[¶]Online¹A³õ]'. An answer to this might also solve the problem.

What should I do to correctly encode the Chinese (or other non-Latin) characters so that they match the bytes Python returns when the file is opened in 'rb' mode?

Thank you!


Solution

  • Your encoding function is wrong: the codec of the file is probably CP950, but certainly not CP1252.

    Note: guessing the encoding of a given byte string is always approximate. There's no safe way of determining the encoding for sure.

    If you have a byte string like

    b'[\xb6]Online\xb9A\xb3\xf5]'
    

    and you know it must translate (be decoded) into

    '[跑Online農場]'
    

    then what you can do is trial and error with a few codecs.

    I did this with the list of codecs supported by Python, searching for codecs for Chinese.

    When using CP-1252 (the Windows version of Latin-1), as you did, you get mojibake:

    >>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp1252')
    '[¶]Online¹A³õ]'
    

    When using CP-950 (the Windows codepage for Traditional Chinese), you get the expected output:

    >>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp950')
    '[跑Online農場]'
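
The trial-and-error step can also be scripted. The following is only a rough sketch: `matching_codecs` is a hypothetical helper name, and the candidate list is an assumption (a handful of codecs commonly used for Chinese text, plus the two already discussed). It reports which candidates decode the known bytes into the known text.

```python
def matching_codecs(raw, expected, candidates):
    """Return the candidate codec names that decode `raw` into `expected`."""
    matches = []
    for name in candidates:
        try:
            decoded = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # The bytes are invalid in this codec, or the codec name is unknown.
            continue
        if decoded == expected:
            matches.append(name)
    return matches


raw = b'[\xb6]Online\xb9A\xb3\xf5]'
expected = '[跑Online農場]'
# Assumed candidate list; extend it with any codec you suspect.
candidates = ['cp1252', 'utf-8', 'gbk', 'gb18030', 'big5', 'cp950']
print(matching_codecs(raw, expected, candidates))
```

Given the decode results shown above, 'cp950' should be among the matches and 'cp1252' should not. More than one codec can match a short sample, so a longer sample of known text makes the guess more trustworthy.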
    

    So: use CP-950 for reading the file, and encode your search string with 'cp950' as well if you want to match it against the raw bytes.
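
For the original goal (finding which files contain the string), one possible approach is to encode the search text with CP-950 and look for those bytes in each file opened in 'rb' mode. This is a minimal sketch, not a definitive implementation: `find_files_containing` is a hypothetical name, it assumes every file under the directory uses the same encoding, and it reads each file fully into memory.

```python
import os


def find_files_containing(root, text, encoding='cp950'):
    """Yield paths under `root` whose raw bytes contain `text` encoded with `encoding`."""
    # Encode once; raises UnicodeEncodeError if the codec cannot represent the text.
    needle = text.encode(encoding)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, mode='rb') as f:
                    data = f.read()
            except OSError:
                # Skip files that cannot be read (permissions, broken links, ...).
                continue
            if needle in data:
                yield path
```

Calling `list(find_files_containing(root, '[跑Online農場]'))` would then give the matching paths. A raw byte search in a multi-byte encoding can in principle match across character boundaries, so for short search strings it may be worth decoding the surrounding bytes to confirm a hit.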