I'm looking to either ignore the Unicode in my XML or change it somehow while processing the output.
My Python:
import urllib2, os, zipfile
from lxml import etree
doc = etree.XML(item)
docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
target = doc.xpath('//references-cited/citation/nplcit/*/text()')
#target = '-'.join(target).replace('\n-','')
print "docID: {0}\nCitation: {1}\n".format(docID,target)
outFile.write(str(docID) +"|"+ str(target) +"\n")
Creates an output of:
docID: US-D0607176-S1-20100105
Citation: [u"\u201cThe birth of Lee Min Ho's donuts.\u201d Feb. 25, 2009. Jazzholic. Apr. 22, 2009 <http://www
But if I try to add back in the '-'.join(target).replace('\n-','') line, I get this error for both print and outFile.write:
Traceback (most recent call last):
File "C:\Documents and Settings\mine\Desktop\test_lxml.py", line 77, in <module>
print "docID: {0}\nCitation: {1}\n".format(docID,target)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)
How can I ignore the Unicode so I can write target out as a string with outFile.write?
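You are getting this error because you have a unicode string containing non-ASCII characters (here u'\u201c', a left curly quotation mark) and you are trying to output it using the ASCII codec, which only covers code points 0 to 127. When you print the list, you get the repr of the list and of the strings inside it; the repr escapes those characters as \u201c, which avoids the problem.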
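To see the failure in isolation, here is a minimal sketch. The sample string is just the visible start of the citation from the question, and the variable names are made up for illustration.

```python
# Sample text: the visible start of the citation from the question.
text = u"\u201cThe birth of Lee Min Ho's donuts.\u201d Feb. 25, 2009."

try:
    # This is what happens implicitly when a unicode string is
    # pushed through the ASCII codec.
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc  # keep the exception so it can be inspected below

print(repr(error))
```

The exception reports position 0 because the very first character, the curly quote, is already outside the ASCII range.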
You need to either encode to a different character set (UTF-8, for instance), or strip out or replace the characters that cannot be represented when encoding.
I recommend reading Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), followed by the relevant chapters on encoding and decoding strings in the Python docs.
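As a rough sketch of those options, the second argument to encode() selects what happens to characters the target codec cannot represent. The sample string and variable names are illustrative only.

```python
# Sample text: the visible start of the citation from the question.
text = u"\u201cThe birth of Lee Min Ho's donuts.\u201d Feb. 25, 2009."

# Option 1: encode to a character set that can hold every character.
as_utf8 = text.encode("utf-8")

# Option 2: stay with ASCII and decide what to do with the rest.
dropped = text.encode("ascii", "ignore")    # drop unencodable characters
replaced = text.encode("ascii", "replace")  # substitute "?" for each one
escaped = text.encode("ascii", "xmlcharrefreplace")  # e.g. &#8220;

for label, value in [("utf-8", as_utf8), ("ignore", dropped),
                     ("replace", replaced), ("xmlcharrefreplace", escaped)]:
    print(label + ": " + repr(value))
```

"ignore" and "replace" lose information, so UTF-8 is usually the safer choice unless whatever reads the output really does require plain ASCII.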
Here's a small hint to get you started:
# target must be the joined unicode string here, not the list
print "docID: {0}\nCitation: {1}\n".format(docID.encode("UTF-8"),
target.encode("UTF-8"))
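The same idea applies to outFile.write. One possible approach, sketched here under assumptions, is to open the file with io.open and an explicit encoding so that unicode strings are encoded on the way out. The file path, the sample values, and the names doc_id, out_path, and out_file are hypothetical stand-ins for the ones in the question.

```python
import io
import os
import tempfile

# Illustrative stand-ins for the values produced by the xpath calls.
doc_id = u"US-D0607176-S1-20100105"
target = [u"\u201cThe birth of Lee Min Ho's donuts.\u201d", u"Feb. 25, 2009."]

# Join the list into one unicode string before writing it.
citation = u"-".join(target)

# Hypothetical output location; use your own path here.
out_path = os.path.join(tempfile.mkdtemp(), "citations.txt")

# io.open encodes unicode text as UTF-8 when writing.
with io.open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write(doc_id + u"|" + citation + u"\n")

# Read the line back to confirm the characters survived.
with io.open(out_path, encoding="utf-8") as in_file:
    round_trip = in_file.read()

print(repr(round_trip))
```

With this approach there is no need to call str() on the values, which is what triggers the implicit ASCII encoding in the original write call.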