Detect Byte Order Mark (BOM) in Python


I've found lots of posts describing how to parse/ignore BOMs but can't find anything on how to simply output a true/false as to whether a file contains a BOM. Can anyone point me in the right direction to do this in Python?


Solution

  • The simple answer is: read the first 4 bytes and look at them.

    with open("utf32le.file", "rb") as file:
        beginning = file.read(4)
        # The order of these checks matters: the UTF-32 LE BOM starts with
        # the UTF-16 LE BOM, so the UTF-32 cases must be tested first.
        if beginning == b'\x00\x00\xfe\xff':
            print("UTF-32 BE")
        elif beginning == b'\xff\xfe\x00\x00':
            print("UTF-32 LE")
        elif beginning[0:3] == b'\xef\xbb\xbf':
            print("UTF-8")
        elif beginning[0:2] == b'\xff\xfe':
            print("UTF-16 LE")
        elif beginning[0:2] == b'\xfe\xff':
            print("UTF-16 BE")
        else:
            print("Unknown or no BOM")
    

    The not-so-simple answer is:

    A binary file may happen to start with bytes that match a BOM, so a match does not prove that the file is text in that encoding.

    Conversely, a text file without a BOM can typically be treated as UTF-8.
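
Since the question asks for a plain true/false, the check above can be wrapped in a small function. This is a minimal sketch, not a definitive implementation: the names `detect_bom`, `has_bom` and `_BOMS` are made up for illustration, while the BOM constants come from the standard `codecs` module. It inherits the caveat above: a match only means the file starts with BOM-like bytes.

```python
import codecs

# Hypothetical lookup table. Longer signatures come first because the
# UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
)


def detect_bom(path):
    """Return a label for the BOM at the start of the file, or None."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def has_bom(path):
    """Return True if the file starts with a known BOM, else False."""
    return detect_bom(path) is not None
```

Calling `has_bom("some.file")` then gives the true/false result directly. If the goal is only to read a UTF-8 file that may or may not carry a BOM, opening it with `encoding="utf-8-sig"` decodes it and drops a leading BOM when one is present.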