I am using Python 2.7.3 and BeautifulSoup to grab data from a website's table, then using codecs
to write the content to a file. One of the variables I collect occasionally has garbled characters in it. For example, if the website table looks like this:
Year Name City State
2000 John D’Iberville MS
2001 Steve Arlington VA
So when I generate my City variable, I always encode it as UTF-8:
Year = foo.text
Name = foo1.text
City = foo3.text.encode('utf-8').strip()
State = foo4.text
RowsData = ("{0},{1},{2},{3}").format(Year, Name, City, State)
So the lists of comma-separated strings I create, called RowsData and RowHeaders, look like this:
RowHeaders = ['Year,Name,City,State']
RowsData = ['2000, John, D\xc3\xa2\xe2\x82\xac\xe2\x84\xa2Iberville, MS',
'2001, Steve, Arlington, VA']
Then I attempt to write this to a file using the following code
file1 = codecs.open("Outfile.csv", "wb", "utf8")
file1.write(RowHeaders[0] + u'\n')
line = "\n".join(RowsData)
file1.write(line + u'\r\n')
file1.close()
and I get the following error
Traceback (most recent call last):
File "HSRecruitsFBByPosition.py", line 141, in <module>
file1.write(line + u'\r\n')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6879: ordinal not in range(128)
I can use the csv writer package on RowsData
and it works fine. For reasons that I don't want to get into, I need to use codecs to output the csv file. I can't figure out what is going on. Can anyone help me fix this issue? Thanks in advance.
codecs.open()
encodes for you. Don't hand it already-encoded data, because then Python will try to decode the data again just so it can encode it to UTF-8. That implicit decoding uses the ASCII codec, and since you have non-ASCII data in your encoded byte string, this fails:
>>> u'D’Iberville'.encode('utf8')
'D\xc3\xa2\xe2\x82\xac\xe2\x84\xa2Iberville'
>>> u'D’Iberville'.encode('utf8').encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
The solution is to *not* encode manually:
Year = foo.text
Name = foo1.text
City = foo3.text.strip()
State = foo4.text
Note that codecs.open()
is not the most efficient implementation of a file stream. In Python 2.7, I'd use io.open()
instead; it offers the same functionality, but is implemented more robustly. The io
module is the default I/O implementation for Python 3, but it is also available in Python 2 for forward compatibility.
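A minimal sketch of that approach (assuming a hypothetical file name out.csv and rows that are already Unicode strings):

```python
import io

# Hypothetical data, already Unicode -- no manual .encode() anywhere.
header = u'Year,Name,City,State'
rows = [u'2000,John,D\u2019Iberville,MS',
        u'2001,Steve,Arlington,VA']

# io.open() encodes on write, just like codecs.open(), so you only
# ever hand it Unicode text.
with io.open('out.csv', 'w', encoding='utf-8') as outf:
    outf.write(header + u'\n')
    outf.write(u'\n'.join(rows) + u'\n')
```

The same code runs unchanged on Python 3, where io.open() is simply the built-in open().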
However, you appear to be re-inventing CSV handling; Python has an excellent csv
module that can produce CSV files for you. In Python 2 it cannot handle Unicode, however, so there you do need to encode manually:
import csv
# ...
year = foo.text
name = foo1.text
city = foo3.text.strip()
state = foo4.text
row = [year, name, city, state]
with open("Outfile.csv", "wb") as outf:
    writer = csv.writer(outf)
    writer.writerow(['Year', 'Name', 'City', 'State'])
    writer.writerow([c.encode('utf8') for c in row])
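For what it's worth, if you ever move to Python 3, the csv module handles Unicode natively: open the file in text mode with an explicit encoding and drop the manual encode step entirely (a sketch with the same hypothetical data):

```python
import csv

rows = [[u'2000', u'John', u'D\u2019Iberville', u'MS'],
        [u'2001', u'Steve', u'Arlington', u'VA']]

# Python 3 only: text mode, explicit encoding, and newline='' as the
# csv documentation requires.
with open('Outfile.csv', 'w', encoding='utf-8', newline='') as outf:
    writer = csv.writer(outf)
    writer.writerow(['Year', 'Name', 'City', 'State'])
    writer.writerows(rows)
```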
Last but not least, if your HTML page produced the text D’Iberville
then you produced a Mojibake; one where you misinterpreted UTF-8 as CP-1252:
>>> u'D’Iberville'.encode('cp1252').decode('utf8')
u'D\u2019Iberville'
>>> print u'D’Iberville'.encode('cp1252').decode('utf8')
D’Iberville
This is usually caused by bypassing BeautifulSoup's encoding detection (pass in byte strings, not Unicode).
You could try and 'fix' these after the fact with:
try:
    City = City.encode('cp1252').decode('utf8')
except UnicodeError:
    # Not a value that could be de-mojibaked, so probably
    # not a Mojibake in the first place.
    pass
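Wrapped into a small helper (the name demojibake is mine, not a library call), that repair looks like this; clean text round-trips or raises, so it comes back untouched:

```python
def demojibake(text):
    """Undo a UTF-8-as-CP-1252 misread; leave other text untouched."""
    try:
        return text.encode('cp1252').decode('utf8')
    except UnicodeError:
        # Can't be round-tripped, so it probably wasn't a Mojibake.
        return text

print(demojibake(u'D\u00e2\u20ac\u2122Iberville'))  # the garbled form
print(demojibake(u'Arlington'))                     # already-clean text
```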