
Python Requests Strange URL %-Encoding


Using Python 3.6.1 and Requests 2.13.0, I am getting strange encoding of the URL being requested. I have a URL with Chinese characters in the query string, for example huà 話 用, which should %-encode to hu%C3%A0%20%E8%A9%B1%20%E7%94%A8 or even hu%C3%A0+%E8%A9%B1+%E7%94%A8, but for some reason it is %-encoding to hu%C3%83%C2%A0%20%C3%A8%C2%A9%C2%B1%20%C3%A7%C2%94%C2%A8. This is not correct. I've been using the http://r12a.github.io/apps/conversion/ page to help me work out the encodings. I've also tried JavaScript's encodeURI and PHP's urlencode, and neither produces anything near what I see the Requests library doing.

Am I doing something wrong such that the encoding is so far off?

UPDATE: I looked into Mojibake encoding and dug into the Requests library a little more and found out what the problem is, but I'm still not sure how to fix it.

I'm making a call against an internal server, using a simple .get(url) call. The call goes to the server and gets a redirect response. The redirect page has a meta charset="UTF-8" in the header and the URL listed in it is correct. The location header leaving the server is ok; it is encoded as UTF-8 and the Content-Type header has a charset=UTF-8 on it. However, when I debug the redirect response in Python, I see that the header on the response object is incorrect; it doesn't seem to be decoded correctly. The headers property contains this in location: huÃ\xa0 è©± ç\x94¨. As said above, it should be decoded as: huà 話 用. So that strange URL query string gets %-encoded by Requests and sent back to the server, which then rejects that URL (obviously).

Is there something I can do to prevent Requests from messing this up? Or get it to correctly decode the location header? Web browsers don't seem to have trouble with this.


Solution

  • You have a case of Mojibake. The bytes being percent-encoded are the UTF-8 encoding of the text you get when the original UTF-8 bytes are misread as Latin-1:

    >>> from urllib.parse import quote
    >>> text = 'huà 話 用'
    >>> quote(text)
    'hu%C3%A0%20%E8%A9%B1%20%E7%94%A8'
    >>> quote(text.encode('utf8').decode('latin1'))
    'hu%C3%83%C2%A0%20%C3%A8%C2%A9%C2%B1%20%C3%A7%C2%94%C2%A8'
    

    You can reverse the process by manually encoding to Latin-1 again, then decoding from UTF-8:

    >>> from urllib.parse import unquote
    >>> unquote('hu%C3%83%C2%A0%20%C3%A8%C2%A9%C2%B1%20%C3%A7%C2%94%C2%A8').encode('latin1').decode('utf8')
    'huà 話 用'
    

    or you could use the ftfy library to automate fixing the wrong encoding (ftfy usually does a much better job, especially when Windows codepages are involved in the Mojibake):

    >>> from ftfy import fix_text
    >>> fix_text(unquote('hu%C3%83%C2%A0%20%C3%A8%C2%A9%C2%B1%20%C3%A7%C2%94%C2%A8'))
    'huà 話 用'
    

    You said this about the source of the URL:

    The location header leaving the server is ok; it is encoded as UTF-8

    That's your problem, right there. HTTP headers are always encoded as Latin-1(*). The server MUST set the Location header to a fully percent-encoded URL, so that all UTF-8 bytes are represented as %HH escape sequences. These are just ASCII characters, perfectly safe in a Latin-1 context.
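As a rough illustration of what "fully percent-encoded" means on the server side, here is a minimal sketch using only the standard library. The function name `ascii_location` and the example URL are hypothetical; the sketch assumes any `%` already in the URL belongs to a valid escape, and it does not handle non-ASCII host names.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def ascii_location(url: str) -> str:
    """Return `url` with non-ASCII characters and spaces replaced by
    %HH escapes of their UTF-8 bytes (hypothetical helper)."""
    parts = urlsplit(url)
    # '%' stays in the safe set so existing escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


location = ascii_location("http://example.com/search?q=huà 話 用")
# The result contains ASCII only, so it is safe to send as a header value.
location.encode("ascii")
```

A server that builds its Location header this way sends the same bytes no matter which single-byte encoding the client assumes for headers.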

    If your server sends the header as un-escaped UTF-8 bytes, then HTTP clients (including requests) will decode that as Latin-1 instead, resulting in the exact Mojibake problem you observed. And because the URL contains invalid URL characters, requests escapes the Mojibake result to the percent-encoded version.
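If the server cannot be fixed, one possible client-side workaround is to follow redirects by hand and repair the Location value before requesting it. The sketch below is only that: `repair_location` and `get_following_repaired_redirects` are hypothetical names, `session` is assumed to behave like a `requests.Session`, and every hop is re-issued as a plain GET.

```python
from urllib.parse import urljoin


def repair_location(value: str) -> str:
    """Undo a Latin-1 misreading of UTF-8 bytes; return `value` unchanged
    when it does not look like Mojibake."""
    try:
        return value.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside Latin-1, or the bytes are
        # not valid UTF-8, so leave the value alone.
        return value


def get_following_repaired_redirects(session, url, max_hops=5):
    """Follow up to `max_hops` redirects manually, repairing each Location."""
    response = session.get(url, allow_redirects=False)
    for _ in range(max_hops):
        if not response.is_redirect:
            break
        location = repair_location(response.headers["location"])
        # Location may be relative, so resolve it against the current URL.
        response = session.get(urljoin(response.url, location),
                               allow_redirects=False)
    return response
```

A correctly percent-encoded Location is pure ASCII and passes through `repair_location` unchanged, so the workaround should not interfere with well-behaved servers.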


    (*) Actually, the Location header should be an absoluteURI as per RFC 2396, which is always ASCII (7-bit) clean data, but because some other HTTP headers allow for 'descriptive' text, Latin-1 (ISO-8859-1) is the accepted default encoding for header data. See the TEXT rule in section 2.2 of the HTTP/1.1 RFC; the http.client module, which ultimately decodes the headers for requests, follows this RFC in this regard when decoding non-ASCII data in any header. You can provide non-Latin-1 data only if wrapped as per the Message Header Extensions RFC 2047, but this doesn't apply to the Location header.
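The Latin-1 decoding can be observed without a network connection. This sketch assumes the module-level `parse_headers` helper in `http.client` reflects how response headers are decoded; the raw header bytes are invented for the demonstration.

```python
import io
from http.client import parse_headers

# A header block as a misbehaving server might send it: raw UTF-8 bytes
# in the Location value instead of %HH escapes.
raw = "Location: /search?q=huà 話 用\r\n\r\n".encode("utf-8")

message = parse_headers(io.BytesIO(raw))
value = message["Location"]

# The value is the Latin-1 reading of the UTF-8 bytes (Mojibake)...
is_mojibake = value == "/search?q=huà 話 用".encode("utf-8").decode("latin-1")
# ...and re-encoding it as Latin-1 recovers the bytes the server sent.
recovered = value.encode("latin-1").decode("utf-8")
```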