Decoding a gb-2312 file in Colab


I am trying to open a file in Colab that uses gb-2312 encoding. Here is the code I successfully ran in my IDE to read and decode:

file = open(r'file.txt')
opened = file.read()
decoded = opened.encode('latin1').decode('gb2312')
print(decoded)

When I run this code in Colab, I get the following error:

'utf-8' codec can't decode byte 0xc6 in position 67: invalid continuation byte

But I can't decode without calling read() or list() on the file first; otherwise I get the following error:

'_io.TextIOWrapper' object has no attribute 'encode'

This seems like a catch-22. Is this a bug with Colab or is there some better way to approach the problem?


Solution

  • open() defaults to mode rt (read, text mode) and decodes with the platform-dependent encoding returned by locale.getpreferredencoding(False). On Colab that default appears to be utf-8, so read() fails on the gb2312 bytes before your encode/decode line ever runs. Pass the encoding parameter to override the default:

    with open('file.txt', encoding='gb2312') as file:
        data = file.read()
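
To check this outside Colab, here is a minimal, self-contained sketch. It assumes a throwaway temporary file and a made-up sample string (neither comes from the question), writes the string as gb2312 bytes, and reads it back two ways: in text mode with an explicit encoding, and in binary mode followed by a manual decode. It also prints the default that open() would typically fall back to on the current machine.

```python
import locale
import os
import tempfile

# Hypothetical sample text; any string that gb2312 can encode would do.
sample = "你好世界"

# Write the sample as raw gb2312 bytes to a temporary file.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("gb2312"))
    path = tmp.name

# The platform default that open() typically uses when no encoding is given.
default_encoding = locale.getpreferredencoding(False)
print("default encoding:", default_encoding)

# Option 1: text mode with an explicit encoding.
with open(path, encoding="gb2312") as f:
    text_mode = f.read()

# Option 2: binary mode, then decode the bytes yourself.
with open(path, "rb") as f:
    binary_mode = f.read().decode("gb2312")

# Clean up the temporary file.
os.remove(path)

print(text_mode == binary_mode == sample)
```

Either option avoids the latin1 round trip from the question, which only works when the platform default happens to map every byte to a character that latin1 can encode back.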