
Reading Unicode file data with BOM chars in Python


I'm reading a series of source code files using Python and running into a unicode BOM error. Here's my code:

import os
import chardet

# Sniff the encoding from the first 32 bytes (or fewer, for tiny files)
bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
result = chardet.detect(raw)
encoding = result['encoding']

infile = open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()

print(data)

As you can see, I'm detecting the encoding using chardet, then reading the file in memory and attempting to print it. The print statement fails on Unicode files containing a BOM with the error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2:
character maps to <undefined>

I'm guessing it's trying to decode the BOM using the default character set and it's failing. How do I remove the BOM from the string to prevent this?


Solution

  • There is no reason to check whether a BOM exists: the utf-8-sig codec handles that for you, and it behaves exactly like utf-8 when no BOM is present:

    # Standard UTF-8 without BOM
    >>> b'hello'.decode('utf-8')
    'hello'
    >>> b'hello'.decode('utf-8-sig')
    'hello'
    
    # BOM encoded UTF-8
    >>> b'\xef\xbb\xbfhello'.decode('utf-8')
    '\ufeffhello'
    >>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
    'hello'
    

    In the example above, you can see that utf-8-sig correctly decodes the given bytes regardless of whether a BOM is present. If you think there is even a small chance that a BOM might exist in the files you are reading, just use utf-8-sig and stop worrying about it.
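
    Applied to reading whole files, as in the question, a minimal sketch might look like this (the file path and contents are made-up for illustration; codecs.BOM_UTF8 is the standard library constant for the UTF-8 BOM bytes):

    ```python
    import codecs

    path = "example.txt"  # hypothetical file for the demo

    # Write a UTF-8 file that starts with a BOM
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "hello".encode("utf-8"))

    # utf-8-sig strips the BOM when it is present...
    with open(path, encoding="utf-8-sig") as f:
        print(f.read())  # hello

    # ...and reads plain UTF-8 files unchanged when it is not
    with open(path, "wb") as f:
        f.write(b"hello")

    with open(path, encoding="utf-8-sig") as f:
        print(f.read())  # hello
    ```

    Passing encoding="utf-8-sig" to open() this way means the BOM never reaches your string at all, so the print call no longer has a U+FEFF character to choke on.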