Python Codecs package not able to decode byte


I am using Python 2.7.3 and BeautifulSoup to grab data from a website's table, then using codecs to write the content to a file. One of the variables I collect occasionally has garbled characters in it. For example, suppose the website table looks like this:

 Year    Name   City             State
 2000    John   D’Iberville    MS
 2001    Steve  Arlington        VA

So when I generate my City variable, I always encode it as UTF-8:

 Year = foo.text
 Name = foo1.text
 City = foo3.text.encode('utf-8').strip()
 State = foo4.text

 RowsData = ("{0},{1},{2},{3}").format(Year, Name, City, State)

So the lists of comma-separated strings I create, RowHeaders and RowsData, look like this:

 RowHeaders = ['Year,Name,City,State']

 RowsData = ['2000, John, D\xc3\xa2\xe2\x82\xac\xe2\x84\xa2Iberville, MS', 
            '2001, Steve, Arlington, VA']

Then I attempt to write this to a file using the following code

 file1 = codecs.open("Outfile.csv", "wb", "utf8")
 file1.write(RowHeaders[0] + u'\n')
 line = "\n".join(RowsData)
 file1.write(line + u'\r\n')
 file1.close()

and I get the following error

 Traceback (most recent call last):  
     File "HSRecruitsFBByPosition.py", line 141, in <module>
       file1.write(line + u'\r\n')

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6879: ordinal not in range(128)

I can use the csv writer module on RowsData and it works fine. For reasons I don't want to get into, I need to use codecs to output the CSV file. I can't figure out what is going on. Can anyone help me fix this issue? Thanks in advance.


Solution

  • codecs.open() encodes for you. Don't hand it already-encoded data, because then Python first has to decode that data again just so it can re-encode it to UTF-8. That implicit decoding uses the ASCII codec, and since your encoded byte string contains non-ASCII data, it fails:

    >>> u'D’Iberville'.encode('utf8')
    'D\xc3\xa2\xe2\x82\xac\xe2\x84\xa2Iberville'
    >>> u'D’Iberville'.encode('utf8').encode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
    

    The solution is to *not* encode manually:

    Year = foo.text
    Name = foo1.text
    City = foo3.text.strip()
    State = foo4.text
    

    Note that codecs.open() is not the most efficient implementation of a file stream. In Python 2.7 I'd use io.open() instead; it offers the same functionality but is implemented more robustly. The io module is the default I/O implementation in Python 3, and is also available in Python 2 for forward compatibility.
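    Putting that together, here is a minimal sketch of the fixed write path (the file name and row values are made up for illustration); io.open() accepts unicode strings and performs the UTF-8 encoding on the way out:

```python
# -*- coding: utf-8 -*-
import io

# Hypothetical header and rows; note that everything stays *unencoded* unicode.
header = u"Year,Name,City,State"
rows = [u"2000,John,D\u2019Iberville,MS",
        u"2001,Steve,Arlington,VA"]

# io.open() encodes on write, so no manual .encode('utf-8') anywhere.
with io.open("outfile.csv", "w", encoding="utf-8") as f:
    f.write(header + u"\n")
    f.write(u"\n".join(rows) + u"\n")
```

    The same code runs unchanged on Python 2.7 and Python 3.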

    However, you appear to be reinventing CSV handling; Python has an excellent csv module that can produce CSV files for you. In Python 2 it cannot handle Unicode, however, so there you do need to encode manually:

    import csv
    
    # ...
    
    year = foo.text
    name = foo1.text
    city = foo3.text.strip()
    state = foo4.text
    
    row = [year, name, city, state]
    
    with open("Outfile.csv", "wb") as outf:
        writer = csv.writer(outf)
        writer.writerow(['Year', 'Name', 'City', 'State'])
        writer.writerow([c.encode('utf8') for c in row])
    
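    As an aside: if this script ever moves to Python 3, the csv module handles unicode natively and the manual encoding step disappears. A short Python 3 sketch (the in-memory buffer and values are made up for illustration):

```python
import csv
import io

# Python 3: the csv module writes text, so unicode passes straight through
# and the underlying file (or buffer) handles the encoding.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Year", "Name", "City", "State"])
writer.writerow(["2000", "John", u"D\u2019Iberville", "MS"])
```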

    Last but not least, if your HTML page produced the text D’Iberville, then you have a Mojibake on your hands, one where UTF-8 data was misinterpreted as CP-1252:

    >>> u'D’Iberville'.encode('cp1252').decode('utf8')
    u'D\u2019Iberville'
    >>> print u'D’Iberville'.encode('cp1252').decode('utf8')
    D’Iberville
    

    This is usually caused by bypassing BeautifulSoup's encoding detection (pass in byte strings, not Unicode).
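    To make the failure mode concrete, here is the full round trip, using the string from the question: the UTF-8 encoding of ’ (U+2019) is the three bytes E2 80 99, and reading those bytes as CP-1252 produces â, € and ™:

```python
# -*- coding: utf-8 -*-

correct = u"D\u2019Iberville"                         # D’Iberville
# The damage: UTF-8 bytes misread as CP-1252.
mojibake = correct.encode("utf-8").decode("cp1252")   # D’Iberville
# The repair: run the same two steps in reverse.
repaired = mojibake.encode("cp1252").decode("utf-8")
```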

    You could try to 'fix' these after the fact with:

    try:
        City = City.encode('cp1252').decode('utf8')
    except UnicodeError:
        # Not a value that could be de-mojibaked, so probably
        # not a Mojibake in the first place.
        pass
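    Wrapped up as a small helper (the name demojibake is mine, not from any library), with the same guard for strings that were never damaged; note that pure-ASCII text round trips through the try branch harmlessly:

```python
def demojibake(text):
    """Reverse UTF-8-read-as-CP-1252 mojibake; return text unchanged otherwise."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # The round trip failed, so this probably was not a Mojibake at all.
        return text
```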