u'\ufeff' in Python string


I got an error with the following exception message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in
position 155: ordinal not in range(128)

I'm not sure what u'\ufeff' is; it shows up when I'm web scraping. How can I fix this? The .replace() string method doesn't work on it.


Solution

  • The Unicode character U+FEFF is the byte order mark (BOM). In UTF-16 it tells a decoder whether the data is big- or little-endian. If you decode the web page using the right codec, Python will remove it for you. Examples:

    #!python2
    #coding: utf8
    u = u'ABC'
    e8 = u.encode('utf-8')        # encode without BOM
    e8s = u.encode('utf-8-sig')   # encode with BOM
    e16 = u.encode('utf-16')      # encode with BOM
    e16le = u.encode('utf-16le')  # encode without BOM
    e16be = u.encode('utf-16be')  # encode without BOM
    print 'utf-8     %r' % e8
    print 'utf-8-sig %r' % e8s
    print 'utf-16    %r' % e16
    print 'utf-16le  %r' % e16le
    print 'utf-16be  %r' % e16be
    print
    print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
    print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
    print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
    print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')
    

    Note that EF BB BF is the UTF-8 encoding of the BOM. UTF-8 has no byte-order ambiguity, so the BOM is not required; it serves only as a signature (most often written by Windows programs).

    Output:

    utf-8     'ABC'
    utf-8-sig '\xef\xbb\xbfABC'
    utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
    utf-16le  'A\x00B\x00C\x00'
    utf-16be  '\x00A\x00B\x00C'
    
    utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
    utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
    utf-16 w/ BOM decoded with utf-16    u'ABC'          # reads BOM to detect byte order, then removes it.
    utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.
    

    Note that the utf-16 codec relies on the BOM to detect whether the data is big- or little-endian. Without a BOM, Python cannot tell and assumes the platform's native byte order, which may not match the data.
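
Applied to the scraping case in the question, the idea might look like the sketch below. It assumes the page body is available as raw bytes and is UTF-8 encoded; the helper names `decode_utf8_page` and `strip_bom` and the sample `raw` bytes are made up for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
import codecs

BOM_CHAR = u'\ufeff'  # U+FEFF as it appears in an already-decoded string


def decode_utf8_page(raw_bytes):
    """Decode UTF-8 bytes; 'utf-8-sig' drops a leading BOM if one is present."""
    return raw_bytes.decode('utf-8-sig')


def strip_bom(text):
    """Remove a leading U+FEFF from text that was decoded with plain 'utf-8'."""
    if text.startswith(BOM_CHAR):
        return text[len(BOM_CHAR):]
    return text


# Hypothetical page body: a UTF-8 BOM (EF BB BF) followed by the content.
raw = codecs.BOM_UTF8 + u'ABC'.encode('utf-8')

clean = decode_utf8_page(raw)      # BOM removed by the codec
kept = raw.decode('utf-8')         # BOM survives as u'\ufeff'
repaired = strip_bom(kept)         # BOM removed after the fact

print(clean == u'ABC', kept.startswith(BOM_CHAR), repaired == u'ABC')
```

If the text has already been decoded elsewhere, `text.replace(u'\ufeff', u'')` removes the character as well, provided the argument is a unicode literal. In Python 2 a plain `'\ufeff'` is a byte string in which `\u` is not an escape, which is one possible reason `.replace()` appeared not to work.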