Tags: unicode, python-2.7, escaping, cherrypy, mako

Decoding URL containing unicode characters


I have the following code in a Mako template:

<a href="#" onclick='getCompanyHTML("${fund.investments[inv_name].name | u}"); return false;'>${inv_name}</a>

The u filter applies URL escaping to the name string of an object representing a company. The resulting escaped string is then used in a URL. The Mako documentation states that URL encoding is provided using urllib.quote_plus(string.encode('utf-8')).
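
For reference, the escaping step can be reproduced outside the template. This is a minimal sketch that assumes the filter behaves as the Mako documentation describes; the variable names are illustrative, and the try/except import is only there so the snippet also runs on Python 3.

```python
try:
    from urllib import quote_plus        # Python 2
except ImportError:
    from urllib.parse import quote_plus  # Python 3

# Company name as Unicode text (\u00ed is the accented "i").
name = u"Eptisa Servicios de Ingenier\u00eda S.L."

# Encode to UTF-8 bytes first, then percent-escape, as the Mako docs describe.
escaped = quote_plus(name.encode("utf-8"))
print(escaped)  # Eptisa+Servicios+de+Ingenier%C3%ADa+S.L.
```

The two escapes %C3%AD are the two UTF-8 bytes of the single accented character, which is why the server side has to decode UTF-8 after unquoting.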

On the server I receive the company name part into the argument investment_name:

def Investment(client, fund_name, investment_name, **kwargs):
    client          = urllib.unquote_plus(client)
    fund_name       = urllib.unquote_plus(fund_name)
    investment_name = urllib.unquote_plus(investment_name)

I then use investment_name as a key back into the same dictionary from which it was extracted in the template.

This works fine for all the standard cases, such as spaces, slashes, and single quotes in the company name. However, it fails if the company name contains Unicode characters outside the ASCII character set.

For instance, the URL for company name "Eptisa Servicios de Ingeniería S.L." is rendered as "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L." When this value arrives back at the server, I'm reversing the URL escaping but clearly failing to decode the Unicode properly, because my attempt to use the result as a dictionary key raises a KeyError.
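
One plausible way to see this failure is sketched below, under the assumption that the dictionary keys are Unicode strings while the unquoted value is still UTF-8 encoded bytes. The dictionary is a made-up stand-in, not the real fund.investments.

```python
# Hypothetical stand-in for fund.investments, keyed by Unicode company names.
investments = {u"Eptisa Servicios de Ingenier\u00eda S.L.": "record"}

# What unquoting yields before any decoding: UTF-8 bytes (\xc3\xad is the accented "i").
raw = b"Eptisa Servicios de Ingenier\xc3\xada S.L."

print(raw in investments)                  # False: the bytes do not match the text key
print(raw.decode("utf-8") in investments)  # True: the decoded text matches
```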

I've tried adding Unicode decoding in these two forms, without luck:

    investment_name = urllib.unquote_plus(investment_name.decode('utf-8'))
    investment_name = urllib.unquote_plus(investment_name.encode('raw_unicode_escape').decode('utf-8'))

Can anyone suggest what I must do to "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L." to turn it back into "Eptisa Servicios de Ingeniería S.L."?


Solution

  • Do it in the reverse order: first unquote the byte string, then call .decode('utf-8') on the result.

    Do not mix bytes and Unicode strings.

    Example

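    import urllib

    # The percent-escaped input is a plain ASCII byte string.
    q = "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L."
    # Unquoting yields bytes: the UTF-8 encoded company name.
    b = urllib.unquote_plus(q)
    # Decoding the bytes yields a unicode object.
    u = b.decode("utf-8")
    print u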
    

    Note: print u might raise UnicodeEncodeError if the console encoding cannot represent the text. To fix it, encode explicitly:

    print u.encode(character_encoding_your_console_understands)
    

    Or set the PYTHONIOENCODING environment variable.

    On Unix you could try locale.getpreferredencoding() as the character encoding; on Windows, see the output of chcp.
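
The unquote-then-decode advice can be wrapped in a small helper for the server-side handler. This is only a sketch: unquote_to_unicode and the investments dictionary are hypothetical names, and the version check exists so the same snippet also runs on Python 3, where unquote_plus already works on text.

```python
import sys

try:
    from urllib import unquote_plus        # Python 2
except ImportError:
    from urllib.parse import unquote_plus  # Python 3

PY2 = sys.version_info[0] == 2


def unquote_to_unicode(value, encoding="utf-8"):
    """Hypothetical helper: percent-decode `value` and return Unicode text."""
    if PY2:
        # Python 2: stay in bytes until the final decode. Percent-escaped
        # text is pure ASCII, so a unicode input can be encoded safely.
        if not isinstance(value, bytes):
            value = value.encode("ascii")
        return unquote_plus(value).decode(encoding)
    # Python 3: unquote_plus takes text and decodes the escapes itself.
    if isinstance(value, bytes):
        value = value.decode("ascii")
    return unquote_plus(value, encoding=encoding)


# Hypothetical stand-in for the dictionary used in the template.
investments = {u"Eptisa Servicios de Ingenier\u00eda S.L.": "record"}

key = unquote_to_unicode("Eptisa+Servicios+de+Ingenier%C3%ADa+S.L.")
print(key in investments)  # True
```

The helper assumes the incoming value is still percent-escaped, and therefore pure ASCII. If the framework has already decoded the parameter into text containing non-ASCII characters, the encode("ascii") step on Python 2 raises UnicodeEncodeError, which signals that no further unquoting is needed.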