Python Codecs package not able to decode byte


I am using Python 2.7.3 and BeautifulSoup to grab data from a website's table, then using codecs to write the content to a file. One of the variables I collect occasionally has garbled characters in it. For example, suppose the website table looks like this:

 Year    Name   City             State
 2000    John   D’Iberville    MS
 2001    Steve  Arlington        VA

So when I generate my City variable, I always encode it as UTF-8:

 Year = foo.text
 Name = foo1.text
 City = foo3.text.encode('utf-8').strip()
 State = foo4.text

 RowsData = ("{0},{1},{2},{3}").format(Year, Name, City, State)

So the lists of comma-separated strings I create, RowHeaders and RowsData, look like this:

 RowHeaders = ['Year,Name,City,State']

 RowsData = ['2000, John, D\xc3\xa2\xe2\x82\xac\xe2\x84\xa2Iberville, MS', 
            '2001, Steve, Arlington, VA']

Then I attempt to write this to a file using the following code

 file1 = codecs.open("Outfile.csv", "wb", "utf8")
 file1.write(RowHeaders[0] + u'\n')
 line = "\n".join(RowsData)
 file1.write(line + u'\r\n')
 file1.close()

and I get the following error

 Traceback (most recent call last):  
     File "HSRecruitsFBByPosition.py", line 141, in <module>
       file1.write(line + u'\r\n')

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6879: ordinal not in range(128)

I can use the csv writer module on RowsData and it works fine. For reasons I don't want to get into, I need to use codecs to output the CSV file. I can't figure out what is going on. Can anyone help me fix this issue? Thanks in advance.


Solution

  • codecs.open() encodes for you. Don't hand it already-encoded data, because then Python first has to decode that data again just so it can re-encode it to UTF-8. That implicit decoding uses the ASCII codec, and since your encoded byte string contains non-ASCII data, it fails:

    >>> u'D’Iberville'.encode('utf8')
    'D\xc3\xa2\xe2\x82\xac\xe2\x84\xa2Iberville'
    >>> u'D’Iberville'.encode('utf8').encode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
    

    The solution is to *not* encode manually:

    Year = foo.text
    Name = foo1.text
    City = foo3.text.strip()
    State = foo4.text
    

    Note that codecs.open() is not the most efficient implementation of a file stream. In Python 2.7 I'd use io.open() instead; it offers the same functionality but is implemented more robustly. The io module is the default I/O implementation in Python 3, and is also available in Python 2 for forward compatibility.
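    Putting that together, here is a minimal sketch of the fixed write path (the file name and row values are made up for illustration); io.open() accepts unicode strings and performs the UTF-8 encoding on the way out:

```python
# -*- coding: utf-8 -*-
import io

# Hypothetical header and rows; note that everything stays *unencoded* unicode.
header = u"Year,Name,City,State"
rows = [u"2000,John,D\u2019Iberville,MS",
        u"2001,Steve,Arlington,VA"]

# io.open() encodes on write, so no manual .encode('utf-8') anywhere.
with io.open("outfile.csv", "w", encoding="utf-8") as f:
    f.write(header + u"\n")
    f.write(u"\n".join(rows) + u"\n")
```

    The same code runs unchanged on Python 2.7 and Python 3.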

    However, you appear to be reinventing CSV handling; Python has an excellent csv module that can produce CSV files for you. In Python 2 it cannot handle Unicode, however, so there you do need to encode manually:

    import csv
    
    # ...
    
    year = foo.text
    name = foo1.text
    city = foo3.text.strip()
    state = foo4.text
    
    row = [year, name, city, state]
    
    with open("Outfile.csv", "wb") as outf:
        writer = csv.writer(outf)
        writer.writerow(['Year', 'Name', 'City', 'State'])
        writer.writerow([c.encode('utf8') for c in row])
    
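    As an aside: if this script ever moves to Python 3, the csv module handles unicode natively and the manual encoding step disappears. A short Python 3 sketch (the in-memory buffer and values are made up for illustration):

```python
import csv
import io

# Python 3: the csv module writes text, so unicode passes straight through
# and the underlying file (or buffer) handles the encoding.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Year", "Name", "City", "State"])
writer.writerow(["2000", "John", u"D\u2019Iberville", "MS"])
```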

    Last but not least, if your HTML page produced the text D’Iberville, then you have a Mojibake on your hands, one where UTF-8 data was misinterpreted as CP-1252:

    >>> u'D’Iberville'.encode('cp1252').decode('utf8')
    u'D\u2019Iberville'
    >>> print u'D’Iberville'.encode('cp1252').decode('utf8')
    D’Iberville
    

    This is usually caused by bypassing BeautifulSoup's encoding detection (pass in byte strings, not Unicode).
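    To make the failure mode concrete, here is the full round trip, using the string from the question: the UTF-8 encoding of ’ (U+2019) is the three bytes E2 80 99, and reading those bytes as CP-1252 produces â, € and ™:

```python
# -*- coding: utf-8 -*-

correct = u"D\u2019Iberville"                         # D’Iberville
# The damage: UTF-8 bytes misread as CP-1252.
mojibake = correct.encode("utf-8").decode("cp1252")   # D’Iberville
# The repair: run the same two steps in reverse.
repaired = mojibake.encode("cp1252").decode("utf-8")
```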

    You could try to 'fix' these after the fact with:

    try:
        City = City.encode('cp1252').decode('utf8')
    except UnicodeError:
        # Not a value that could be de-mojibaked, so probably
        # not a Mojibake in the first place.
        pass
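    Wrapped up as a small helper (the name demojibake is mine, not from any library), with the same guard for strings that were never damaged; note that pure-ASCII text round trips through the try branch harmlessly:

```python
def demojibake(text):
    """Reverse UTF-8-read-as-CP-1252 mojibake; return text unchanged otherwise."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # The round trip failed, so this probably was not a Mojibake at all.
        return text
```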