I have a file that opens in the VSCode editor as a normal text file. But if I try to read it in Python:
with open("file.ass") as f:
    for line in f.readlines():
        ...
it will throw an exception:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
If I open it in binary mode, the first few bytes look like this:
f = open("file.ass", "rb")
b = f.read()
print(b[:50])
Out[36]: b'\xff\xfe[\x00S\x00c\x00r\x00i\x00p\x00t\x00 \x00I\x00n\x00f\x00o\x00]\x00\r\x00\n\x00;\x00 \x00S\x00c\x00r\x00i\x00p\x00t\x00 \x00'
If I do decode('utf-16'), I can see the correct characters.
b[:50].decode('utf-16')
Out[58]: '[Script Info]\r\n; Script '
But I am wondering whether there is a more elegant way to handle such files like normal text files. In other words, how can I know whether I need to do decode('utf-16'), and can I still use readlines() as when reading a normal text file? Thanks.
You can pass the encoding argument when opening the file.
with open("file.ass", encoding='utf-16') as f:
    for line in f.readlines():
        ...
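To illustrate what the encoding argument does with the leading \xff\xfe bytes, here is a minimal sketch. It assumes a throwaway sample file (the path and contents are made up for the demo, standing in for file.ass) and compares the 'utf-16' codec with 'utf-16-le'.

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for file.ass: a BOM followed by UTF-16 little-endian text.
path = os.path.join(tempfile.mkdtemp(), "sample.ass")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + "[Script Info]\n; Script\n".encode("utf-16-le"))

# 'utf-16' uses the BOM to pick the byte order and drops it from the text.
with open(path, encoding="utf-16") as f:
    lines = [line.rstrip("\n") for line in f]

# 'utf-16-le' decodes the same bytes but keeps the BOM as a leading '\ufeff'.
with open(path, encoding="utf-16-le") as f:
    first_le = f.readline()

print(lines)
print(repr(first_le))
```

So for a file that starts with a byte order mark, 'utf-16' is usually the more convenient choice, since the text you iterate over does not carry the marker character.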
Since your file opens in VSCode, you can check the encoding VSCode detected for it in the status bar at the bottom of the window. You can then use the same encoding in Python.
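Note that VSCode can guess the encoding when "files.autoGuessEncoding": true is set (you can check it in the settings), and it also recognizes a byte order mark such as the \xff\xfe at the start of your file. That is why it shows your file as "normal text".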
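If you do not know the encoding in advance, one option is to look for a byte order mark yourself, like the \xff\xfe at the start of the bytes shown in the question. The following is only a sketch: guess_encoding is a hypothetical helper, not a standard library function, and it only recognizes files that actually start with a BOM, falling back to a default otherwise.

```python
import codecs

# Known byte order marks and the codec that consumes them.
# UTF-32 is checked first because BOM_UTF32_LE begins with BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(path, default="utf-8"):
    """Return a codec name based on the file's BOM, or `default` if none is found."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

With this helper, open("file.ass", encoding=guess_encoding("file.ass")) would open the file from the question as UTF-16. A file without a BOM cannot be identified this way, so the default is just an assumption.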