I've found lots of posts describing how to parse/ignore BOMs but can't find anything on how to simply output a true/false as to whether a file contains a BOM. Can anyone point me in the right direction to do this in Python?
The simple answer is: read the first 4 bytes and look at them.
with open("utf32le.file", "rb") as file:
    beginning = file.read(4)
    # The order of these if-statements is important
    # otherwise UTF32 LE may be detected as UTF16 LE as well
    if beginning == b'\x00\x00\xfe\xff':
        print("UTF-32 BE")
    elif beginning == b'\xff\xfe\x00\x00':
        print("UTF-32 LE")
    elif beginning[0:3] == b'\xef\xbb\xbf':
        print("UTF-8")
    elif beginning[0:2] == b'\xff\xfe':
        print("UTF-16 LE")
    elif beginning[0:2] == b'\xfe\xff':
        print("UTF-16 BE")
    else:
        print("Unknown or no BOM")
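Since the question asks for a plain true/false, here is a minimal sketch of the same check wrapped in a function. The name `has_bom` and the `_BOMS` tuple are my own, not from any library; the BOM constants come from the standard `codecs` module. For a yes/no result the order of the BOMs no longer matters, because any match returns True.

```python
import codecs

# Known BOMs, taken from the standard library's codecs module.
_BOMS = (
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)

def has_bom(path):
    """Return True if the file at `path` starts with one of the known BOMs."""
    with open(path, "rb") as f:
        beginning = f.read(4)  # the longest BOM is 4 bytes
    # bytes.startswith accepts a tuple of candidate prefixes
    return beginning.startswith(_BOMS)
```

Calling `has_bom("utf32le.file")` on the file from the example above would then give you the boolean directly.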
The not-so-simple answer is:
A binary file can start with bytes that merely happen to look like a BOM, so a match does not prove that the file is text.
Other than that, you can typically treat text files without a BOM as UTF-8.
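One way to act on both caveats is to actually try decoding the file: use the encoding the BOM suggests, fall back to UTF-8 when there is no BOM, and treat a decoding error as a hint that the file is probably not text in that encoding. This is only a rough sketch; `read_text_guess` and `_BOM_CODECS` are names I made up, and it reads the whole file into memory.

```python
import codecs

# (BOM, codec) pairs. The 4-byte BOMs come first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOM_CODECS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def read_text_guess(path):
    """Decode a file using its BOM if present, otherwise assume UTF-8.

    Raises UnicodeDecodeError if the bytes do not decode, which suggests
    the file is binary or uses some other encoding.
    """
    with open(path, "rb") as f:
        data = f.read()
    for bom, codec in _BOM_CODECS:
        if data.startswith(bom):
            # Strip the BOM so it does not end up in the decoded text
            return data[len(bom):].decode(codec)
    return data.decode("utf-8")
```

A successful decode still does not guarantee the file is meaningful text, it just rules out the obvious false positives.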