I am parsing an XML file from an online source but am having trouble reading UTF-8 characters. I have read through some of the other questions that deal with a similar problem, but none of the solutions has worked so far. Currently the code looks like this:
class XMLParser(webapp2.RequestHandler):
    def get(self):
        url = fetch('some.xml.online')
        xml = parseString(url.content)
        vouchers = xml.getElementsByTagName("VoucherCode")
        for voucher in vouchers:
            if voucher.getElementsByTagName("ActivePartnership")[0].firstChild.data == "true":
                coupon = Coupon()
                coupon.description = str(voucher.getElementsByTagName("Description")[0].firstChild.data.decode('utf-8'))
                coupon.prov_key = str(voucher.getElementsByTagName("Id")[0].firstChild.data)
                coupon.put()
        self.redirect('/admin/coupon')
The error that I get from this is displayed below. It is caused by a "ü" in the description field, which I will also need to display later on when using the data.
File "C:\Users\Vincent\Documents\www\Sparkompass\Website\main.py", line 217, in get
    coupon.description = str(voucher.getElementsByTagName("Description")[0].firstChild.data.decode('utf-8'))
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)
If I take out the description everything works as it should. In the database model definition I have defined the description as follows:
description = db.StringProperty(multiline=True)
Attempt 2
I have also tried to do it like this:
coupon.description = str(voucher.getElementsByTagName("Description")[0].firstChild.data).decode('utf-8')
This also gave me:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)
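Both attempts appear to fail at the same step: in Python 2, calling str() on a unicode object (or .decode() on one, as in the first attempt) implicitly encodes it with the ASCII codec first. The sketch below reproduces just that step with an explicit .encode("ascii"), so it does not need the XML or GAE. The sample text and the helper name to_ascii_or_error are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample value containing "ü" (u'\xfc'), like the Description field.
description = u"Gutschein f\xfcr B\xfccher"


def to_ascii_or_error(text):
    """Mimic the implicit ASCII encode that Python 2's str() performs on unicode."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        # exc.reason holds the short explanation shown at the end of the traceback
        return "UnicodeEncodeError: %s" % exc.reason


result = to_ascii_or_error(description)
print(result)
```

If this is indeed the cause, the str() and .decode('utf-8') calls are what trigger the error, not the XML parsing itself.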
Any help would be very much appreciated!
UPDATE
The XML file contains German text, so many more of its characters are non-ASCII. I am therefore now thinking that it might be better to do the decoding at a higher level, e.g. at
xml = parseString(url.content)
However, so far I haven't got that to work either. The aim is to get the characters in ASCII, because this is what GAE requires to register it as a string in the database model.
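As far as I can tell, parseString already performs the decoding when it is given UTF-8 bytes, so the text nodes come back as decoded text (unicode in Python 2). A minimal sketch, assuming url.content is a UTF-8 byte string; the XML snippet below is an invented stand-in for the real feed:

```python
# -*- coding: utf-8 -*-
from xml.dom.minidom import parseString

# Hypothetical stand-in for url.content: UTF-8 encoded bytes.
content = (
    u'<?xml version="1.0" encoding="utf-8"?>'
    u'<Vouchers><VoucherCode><Id>42</Id>'
    u'<Description>10% Rabatt f\xfcr Neukunden</Description>'
    u'</VoucherCode></Vouchers>'
).encode("utf-8")

xml = parseString(content)
voucher = xml.getElementsByTagName("VoucherCode")[0]

# The parser has already decoded the bytes; no .decode() or str() is needed here.
description = voucher.getElementsByTagName("Description")[0].firstChild.data

# If a byte string is required somewhere, encode explicitly instead of using str().
as_bytes = description.encode("utf-8")
```

Here description holds the text including the "ü", and as_bytes is its UTF-8 encoded form.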
I solved the problem for now by changing the description to a TextProperty, which didn't give any error. I am aware that this means I won't be able to sort or filter on it, for example, but for the description this should be OK.
Background info: https://developers.google.com/appengine/docs/python/datastore/typesandpropertyclasses#TextProperty