I'm reading a series of source code files using Python and running into a unicode BOM error. Here's my code:
bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
result = chardet.detect(raw)
encoding = result['encoding']
infile = open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()
print(data)
As you can see, I'm detecting the encoding using chardet
, then reading the file in memory and attempting to print it. The print statement fails on Unicode files containing a BOM with the error:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2:
character maps to <undefined>
I'm guessing it's trying to decode the BOM using the default character set and it's failing. How do I remove the BOM from the string to prevent this?
There is no reason to check if a BOM exists or not, utf-8-sig
manages that for you and behaves exactly as utf-8
if the BOM does not exist:
# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
In the example above, you can see utf-8-sig
correctly decodes the given string regardless of the existence of BOM. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig
and not worry about it