I receive files from an undocumented resource that can contain data that looks like:
16058637149881541301278JA1コノマンガガスゴイヘンシュウブ4
#recordsWritten:1293462
The above is just an example; the files I'm working with contain all kinds of different languages (and thus encodings). I'm opening the file with Python 3.6 (an inherited code base that I've upgraded from Python 2 to Python 3) using the following code:
import os
f = open(file_path, "r")
f.seek(0, os.SEEK_END)
f.seek(f.tell() -40, os.SEEK_SET)
records_str = f.read()
print(records_str)
Using this code, I receive: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte
If I change it to include an encoding, f = open(file_path, "r", encoding='utf-8'), I receive the same error.
Changing the encoding to utf-16 results in it printing:
랂菣Ꚃ菣Ɩȴ⌊敲潣摲坳楲瑴湥ㄺ㤲㐳㈶ਂ
Which appears to be wrong.
Switching it to open the file in binary mode, f = open(file_path, "rb"), results in it outputting:
b'\x82\xb7\xe3\x83\xa5\xe3\x82\xa6\xe3\x83\x96\x014\x02\n#recordsWritten:1293462\x02\n'
This is slightly better. However, when I eventually come to process the file, I don't want to add \x82\xb7\xe3\x83\xa5\ to my database; I'd rather add ガガスゴイヘンシ. So, is there a way to handle Unicode-encoded files? I've also looked at the Mozilla chardet project to try to determine the encoding but, following its code examples, it reports the file as UTF-8 encoded.
If you seek into the middle of a UTF-8 sequence, the error message doesn't necessarily mean the data isn't UTF-8, just that you can't seek to that exact position and get a useful decoding. "Invalid start byte" means that the byte at that position cannot begin a valid UTF-8 sequence; 0x82 is a continuation byte, which can only appear inside a multi-byte character.
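As a rough illustration, the bytes from the question's binary-mode output can be decoded once the leading continuation bytes are skipped. This is only a sketch: the helper name is_continuation and the variable names are made up for the example, and it assumes the data really is UTF-8.

```python
# The 40 tail bytes shown in the question's binary-mode output.
tail = (b'\x82\xb7\xe3\x83\xa5\xe3\x82\xa6\xe3\x83\x96\x014\x02\n'
        b'#recordsWritten:1293462\x02\n')


def is_continuation(byte):
    """Hypothetical helper: UTF-8 continuation bytes look like 10xxxxxx."""
    return byte & 0xC0 == 0x80


# Decoding from the seek position fails, exactly as in the question.
try:
    tail.decode('utf-8')
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

# Skip the continuation bytes left over from the character we landed inside.
start = 0
while start < len(tail) and is_continuation(tail[start]):
    start += 1

text = tail[start:].decode('utf-8')
print(error)
print(repr(text))
```

Here the first two bytes (0x82 and 0xb7) are the tail end of a three-byte character whose lead byte sits just before the seek position, so decoding succeeds from offset 2 onwards.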
If you only need to retrieve the last line of the file, maybe just read the entire file and pluck off the last line, or use try / except until you find a position you can safely seek to. Or simply read part or all of the file as bytes and then decode only the last line.
import os

with open(file_path, "rb") as f:  # notice "b" in "rb"
    f.seek(0, os.SEEK_END)
    f.seek(f.tell() - 40, os.SEEK_SET)  # back up 40 bytes from the end
    records_bytes = f.read()

# Split on raw newline bytes, then decode only the last (ASCII) line
records_str = records_bytes.split(b'\n')[-2].decode('ascii')
print(records_str)
We use [-2] on the assumption that the file ends with a final newline (i.e. it is a well-formed text file), so [-1] is simply an empty bytes string and [-2] retrieves the last actual line.
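The try / except alternative mentioned earlier could look roughly like the following. This is a sketch rather than a definitive implementation: read_tail_text is a hypothetical helper, the sample data is reconstructed from the question, and stripping the trailing \x02 is an assumption based on the binary output shown there.

```python
import os
import tempfile


def read_tail_text(path, nbytes=40, encoding='utf-8'):
    """Hypothetical helper: decode the last nbytes of a file as text.

    A UTF-8 character is at most 4 bytes long, so at most 3 leading
    bytes can belong to a character that the seek position split.
    """
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        f.seek(max(f.tell() - nbytes, 0), os.SEEK_SET)
        data = f.read()
    for skip in range(4):
        try:
            return data[skip:].decode(encoding)
        except UnicodeDecodeError:
            continue  # still inside a multi-byte character; move on one byte
    raise ValueError('tail could not be decoded as ' + encoding)


# Sample data reconstructed from the question (an assumption, for illustration).
sample = ('コノマンガガスゴイヘンシュウブ\x014\x02\n'
          '#recordsWritten:1293462\x02\n').encode('utf-8')
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

tail_text = read_tail_text(tmp.name)
os.remove(tmp.name)

# [-2] for the same reason as above; the \x02 strip is an assumption.
last_line = tail_text.split('\n')[-2].rstrip('\x02')
print(last_line)
```

Compared with decoding only the last line as ASCII, this keeps whatever non-ASCII text falls inside the tail, at the cost of possibly dropping one partial character at the start.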
(Posting this as a separate answer so as not to pollute my other answer, which I hope might also be more useful to future visitors.)