python unicode minidom

How to parse unicode strings with minidom?


I'm trying to parse a bunch of XML files with the xml.dom.minidom library, to extract some data and write it to a text file. Most of the files parse fine, but for some of them I get the following error when calling minidom.parseString():

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)

It happens for some other non-ASCII characters too. My question is: what are my options here? Am I supposed to somehow strip or replace all those non-ASCII characters before I can parse the XML files?


Solution

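  • Try encoding the unicode string to UTF-8 bytes before passing it to minidom.parseString(). In Python 2 the parser expects a byte string; a unicode object is implicitly encoded with the 'ascii' codec, which fails on characters such as u'\u2019'. Encoding explicitly avoids that:

    > print u'abcdé'.encode('utf-8')
    > abcdé

    Decoding those bytes with the same codec gives back the original text, so nothing is lost:

    > print u'abcdé'.encode('utf-8').decode('utf-8')
    > abcdé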
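
Applied to minidom, the encode step above might look like the following. This is a minimal sketch rather than a definitive fix: the helper name parse_unicode_xml and the sample document are made up for illustration, and it assumes the XML either has no encoding declaration or declares UTF-8.

```python
# -*- coding: utf-8 -*-
from xml.dom import minidom


def parse_unicode_xml(xml_text):
    """Parse XML held in a unicode (text) string.

    Hypothetical helper, not part of minidom: it encodes the text to
    UTF-8 bytes first, so the parser never has to fall back to an
    implicit ASCII encode.
    """
    return minidom.parseString(xml_text.encode("utf-8"))


# Illustrative document containing the character from the error message.
sample = u"<doc><title>It\u2019s working</title></doc>"

dom = parse_unicode_xml(sample)
title = dom.getElementsByTagName("title")[0].firstChild.data
```

Here title comes back as a unicode string with the right single quotation mark intact, so there is no need to strip or replace non-ASCII characters. If a file carries an XML declaration naming a different encoding, the bytes handed to the parser should be in that encoding instead of UTF-8.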