GAE Python: Importing UTF-8 Characters from an XML file to a database model


I am parsing an XML file from an online source but am having trouble reading UTF-8 characters. I have read through other questions about similar problems, but none of the solutions has worked so far. The code currently looks like this:

class XMLParser(webapp2.RequestHandler):

    def get(self):
        url = fetch('some.xml.online')
        xml = parseString(url.content)
        vouchers = xml.getElementsByTagName("VoucherCode")

        for voucher in vouchers:
            if voucher.getElementsByTagName("ActivePartnership")[0].firstChild.data == "true":
                coupon = Coupon()
                coupon.description = str(voucher.getElementsByTagName("Description")[0].firstChild.data.decode('utf-8'))
                coupon.prov_key = str(voucher.getElementsByTagName("Id")[0].firstChild.data)
                coupon.put()
                self.redirect('/admin/coupon')

The error that I get from this is displayed below. It is caused by a "ü" in the description field, which I will also need to display later on when using the data.

File "C:\Users\Vincent\Documents\www\Sparkompass\Website\main.py", line 217, in get
    coupon.description = str(voucher.getElementsByTagName("Description")[0].firstChild.data.decode('utf-8'))
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)

If I take out the description everything works as it should. In the database model definition I have defined the description as follows:

description = db.StringProperty(multiline=True)

Attempt 2

I have also tried to do it like this:

coupon.description = str(voucher.getElementsByTagName("Description")[0].firstChild.data).decode('utf-8')

Which also gave me:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)

Any help would be very much appreciated!
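
Both attempts seem to fail for the same reason. A minimal sketch, assuming the feed is UTF-8 encoded and using a made-up XML snippet in place of the real one: parseString already decodes the bytes, so firstChild.data is a Unicode string. On Python 2, calling str() on it, or .decode('utf-8') (which first has to turn the Unicode string back into bytes), implicitly encodes with the ASCII codec, and that is the step that raises. The explicit .encode('ascii') below only stands in for that implicit conversion so the sketch also runs on Python 3.

```python
from xml.dom.minidom import parseString

# Hypothetical stand-in for url.content: UTF-8 bytes as returned by the fetch.
SAMPLE_XML = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<VoucherCode><Description>10% f\u00fcr alle</Description></VoucherCode>'
).encode('utf-8')

doc = parseString(SAMPLE_XML)
description = doc.getElementsByTagName("Description")[0].firstChild.data

# The parser has already decoded the bytes: this is text, not a byte string.
is_text = isinstance(description, type(u''))

# Stand-in for the implicit ASCII conversion that Python 2 performs inside
# str(description) and description.decode('utf-8').
try:
    description.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
```

If that reading is right, assigning firstChild.data to the property directly, without str() or .decode(), should avoid the conversion altogether.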

UPDATE

The XML file is in German, so many more of its characters fall outside ASCII. I am therefore now thinking it might be better to do the decoding at a higher level, e.g. at

xml = parseString(url.content)

However, so far I haven't gotten that to work either. As far as I understand, the aim is to get the characters into ASCII, because this is what GAE requires to store the value as a string in the database model.
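
On decoding at a higher level: as far as I can tell this already happens inside parseString, which reads the encoding from the XML declaration. A small sketch with a made-up document, serialized once as UTF-8 and once as ISO-8859-1, shows both yielding the same Unicode text. The helper name description_of and the sample content are only for illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document; {enc} is filled in per serialization below.
TEMPLATE = (
    u'<?xml version="1.0" encoding="{enc}"?>'
    u'<VoucherCode><Description>Gutschein f\u00fcr M\u00fcnchen</Description></VoucherCode>'
)


def description_of(raw_bytes):
    """Parse raw XML bytes and return the Description text as Unicode."""
    doc = parseString(raw_bytes)
    return doc.getElementsByTagName("Description")[0].firstChild.data


# The same document as two different byte sequences.
utf8_bytes = TEMPLATE.format(enc="UTF-8").encode("utf-8")
latin1_bytes = TEMPLATE.format(enc="ISO-8859-1").encode("iso-8859-1")

from_utf8 = description_of(utf8_bytes)
from_latin1 = description_of(latin1_bytes)

# Bytes only come back when encoding explicitly, e.g. for output.
as_utf8_bytes = from_utf8.encode("utf-8")
```

So there appears to be nothing left to decode after parsing: the text nodes are Unicode regardless of the byte encoding of the file, and no ASCII conversion is involved unless the code asks for one.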


Solution

  • I solved the problem for now by changing the description to a TextProperty, which did not raise any error. I am aware that I won't be able to sort or filter on it, for example, but for the description this should be OK.

    Background info: https://developers.google.com/appengine/docs/python/datastore/typesandpropertyclasses#TextProperty