python encoding

Python: reading a text file that starts with the bytes FF FE


I have a file that opens in the VSCode editor as a normal text file. But if I try to read it in Python:

with open("file.ass") as f:
   for line in f.readlines():
       ...

it raises an exception:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

If I open it in binary mode, the first few bytes look like this:

f = open("file.ass", "rb")
b = f.read()
print(b[:50])

Out[36]: b'\xff\xfe[\x00S\x00c\x00r\x00i\x00p\x00t\x00 \x00I\x00n\x00f\x00o\x00]\x00\r\x00\n\x00;\x00 \x00S\x00c\x00r\x00i\x00p\x00t\x00 \x00'

If I do decode('utf-16'), I can see the correct characters.

b[:50].decode('utf-16')
Out[58]: '[Script Info]\r\n; Script '

But I am wondering whether there is a more elegant way to handle such files like normal text files. In other words, how can I tell that the file needs decode('utf-16'), and how can I still use readlines() as with a normal text file? Thanks.


Solution

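  • You can pass the encoding argument when opening the file. The 'utf-16' codec reads the leading FF FE byte order mark (BOM) to pick the byte order and strips it from the text.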

    with open("file.ass", encoding='utf-16') as f:
       for line in f.readlines():
           ...
    

    Since your file opens in VSCode, you can check the encoding VSCode detected in the status bar at the bottom of the window.

    [Screenshot: VSCode status bar showing the file encoding]

    You can then use the same encoding with Python.

    Note that VSCode can guess the encoding when "files.autoGuessEncoding": true is set (you can check it in the settings), and it also recognizes a byte order mark such as FF FE at the start of a file. That is why it shows your file as "normal text".
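
To answer the "how could I know" part without looking at VSCode, one option is to peek at the first few bytes and map a known BOM to a codec name. The sketch below is only an illustration: sniff_encoding and open_text are hypothetical helper names, not standard library functions, and the fallback to 'utf-8' for files without a BOM is an assumption that will not suit every file.

```python
import codecs

# BOM signatures, longest first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE), so it must be tested earlier.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(path, default="utf-8"):
    """Return a codec name chosen from the file's BOM, or `default` if none is found.

    Hypothetical helper: it only recognizes BOMs and does not guess
    the encoding of files that lack one.
    """
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM checked here is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


def open_text(path, default="utf-8"):
    """Open `path` in text mode using the encoding suggested by its BOM."""
    return open(path, encoding=sniff_encoding(path, default))
```

With these helpers, `with open_text("file.ass") as f:` followed by `for line in f:` reads the file line by line like any other text file. The 'utf-16', 'utf-32' and 'utf-8-sig' codecs all consume the BOM, so it does not show up in the first line. One caveat: a UTF-16 LE file whose first character is U+0000 would be mistaken for UTF-32 here, which should be rare for text files.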