Search code examples
pythonunicodeencoding

Convert numeric character reference notation to unicode string


Is there a standard, preferably Pythonic, way to convert the &#xxxx; notation to a proper unicode string?

For example,

מפגשי

Should be converted to:

מפגשי

It can be done - quite easily - using string manipulations, but I wonder if there's a standard library for this.


Solution

  • Use HTMLParser.HTMLParser():

    >>> from HTMLParser import HTMLParser
    >>> h = HTMLParser()
    >>> s = "מפגשי"
    >>> print h.unescape(s)
    מפגשי
    

    It's part of the standard library, too.


    However, if you're using Python 3, you have to import from html.parser:

    >>> from html.parser import HTMLParser
    >>> h = HTMLParser()
    >>> s = 'מפגשי'
    >>> print(h.unescape(s))
    מפגשי