
python url unquote followed by unicode decode


I have a unicode string like '%C3%A7%C3%B6asd+fjkls%25asd' and I want to decode it.
I used urllib.unquote_plus(str), but it gives the wrong result.

  • expected : çöasd+fjkls%asd
  • result : çöasd fjkls%asd

The two-byte UTF-8 characters (%C3%A7 and %C3%B6) are decoded incorrectly.
I am using Python 2.7 on a Linux distro. What is the best way to get the expected result?


Solution

  • You have 3 or 4 or 5 problems ... but repr() and unicodedata.name() are your friends; they unambiguously show you exactly what you have got, without the confusion engendered by people with different console encodings communicating the results of print fubar.

    Summary: either (a) you start with a unicode object and apply the unquote function to that or (b) you start off with a str object and your console encoding is not UTF-8.
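Both cases in the summary appear to come down to the same thing: UTF-8 bytes being read as one character per byte. A minimal sketch of that effect, assuming latin-1 as the wrong interpretation (the variable names are illustrative only):

```python
import unicodedata

# The bytes that %C3%A7%C3%B6 unquotes to.
utf8_bytes = b'\xc3\xa7\xc3\xb6'

# Wrong: one character per byte, which produces the A-tilde symptom.
misread = utf8_bytes.decode('latin-1')

# Right: two characters, c-cedilla and o-diaeresis.
correct = utf8_bytes.decode('utf-8')

# An already-misread unicode object can be repaired by reversing the wrong decode.
repaired = misread.encode('latin-1').decode('utf-8')

# Name of the first character in the misread text.
first_misread_name = unicodedata.name(misread[0])
```

Comparing repr(misread) with repr(correct) shows four characters against two, which is easier to reason about than whatever a particular console happens to print.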

    If, as you say, you start off with a unicode object:

    >>> s0 = u'%C3%A7%C3%B6asd+fjkls%25asd'
    >>> print repr(s0)
    u'%C3%A7%C3%B6asd+fjkls%25asd'
    

    This is accidental nonsense. If you apply urllibX.unquote_YYYY() to it, you get another nonsense unicode object (u'\xc3\xa7\xc3\xb6asd+fjkls%asd'), which would cause the symptoms you showed when printed. You should convert your original unicode object to a str object immediately:

    >>> s1 = s0.encode('ascii')
    >>> print repr(s1)
    '%C3%A7%C3%B6asd+fjkls%25asd'
    

    then you should unquote it:

    >>> import urllib  # unquote is documented in urllib; urllib2 only re-exports it
    >>> s2 = urllib.unquote(s1)
    >>> print repr(s2)
    '\xc3\xa7\xc3\xb6asd+fjkls%asd'
    

    Looking at the first 4 bytes of that, it's encoded in UTF-8. If you do print s2, it will look OK if your console is expecting UTF-8, but if it's expecting ISO-8859-1 (aka latin1) you'll see your symptomatic rubbish (first char will be A-tilde). Let's park that thought for a moment and convert it to a Unicode object:

    >>> s3 = s2.decode('utf8')
    >>> print repr(s3)
    u'\xe7\xf6asd+fjkls%asd'
    

    and inspect it to see what we've actually got:

    >>> import unicodedata
    >>> for c in s3[:6]:
    ...     print repr(c), unicodedata.name(c)
    ...
    u'\xe7' LATIN SMALL LETTER C WITH CEDILLA
    u'\xf6' LATIN SMALL LETTER O WITH DIAERESIS
    u'a' LATIN SMALL LETTER A
    u's' LATIN SMALL LETTER S
    u'd' LATIN SMALL LETTER D
    u'+' PLUS SIGN
    

    Looks like what you said you expected. Now we come to the question of displaying it on your console. Note: don't freak out when you see "cp850"; I'm doing this portably and just happen to be doing this in a Command Prompt on Windows.

    >>> import sys
    >>> sys.stdout.encoding
    'cp850'
    >>> print s3
    çöasd+fjkls%asd
    

    Note: print encoded the unicode object implicitly, using sys.stdout.encoding. Fortunately, all the unicode characters in s3 are representable in that encoding (and in cp1252 and latin1).
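The steps above (encode to ASCII, unquote, decode as UTF-8, then encode for the console) can be collected into two small helpers. This is only a sketch: the names unquote_utf8 and encode_for_console are made up for illustration, and the import fallback is an assumption added so the same code also runs on Python 3. It uses unquote rather than unquote_plus so that '+' is left alone, as the question expects.

```python
import sys

try:
    # Python 2: str in, str out
    from urllib import unquote as _unquote_bytes
except ImportError:
    # Python 3 counterpart that returns bytes
    from urllib.parse import unquote_to_bytes as _unquote_bytes


def unquote_utf8(quoted):
    """Percent-decode `quoted` and return text, treating the bytes as UTF-8."""
    if not isinstance(quoted, bytes):
        # A unicode object holding a URL-quoted value should be pure ASCII.
        quoted = quoted.encode('ascii')
    return _unquote_bytes(quoted).decode('utf-8')


def encode_for_console(text, encoding=None):
    """Encode text for a console, using '?' for unrepresentable characters."""
    # Fall back to ASCII if stdout reports no encoding (e.g. when piped).
    encoding = encoding or sys.stdout.encoding or 'ascii'
    return text.encode(encoding, 'replace')


decoded = unquote_utf8(u'%C3%A7%C3%B6asd+fjkls%25asd')
```

The 'replace' error handler is a design choice rather than part of the recipe above: it swaps in '?' for anything the console encoding cannot represent instead of raising UnicodeEncodeError.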