python html encoding python-requests mojibake

HTTP 301 redirect url encoding issue

I'm using Python's requests.get() to fetch the HTML of some Facebook profiles. Some of them redirect the request to a new URL. When this new URL contains special characters, such as 'á', requests.get() enters a redirect loop until an exception is raised. I found a workaround that corrects the redirect URL string, taken from the "Location" response header, but it is far from an elegant solution.

import requests

# This case works. Response [200]
r = requests.get('https://www.facebook.com/profile.php?id=4')
print(r)

# This fails. Redirect location has special characters.
# raises requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
not_working_url = 'https://www.facebook.com/profile.php?id=100010922979377'
try:
    r = requests.get(not_working_url)
except requests.exceptions.TooManyRedirects as e:
    print(e)  # Exceeded 30 redirects.

# Workaround
r = requests.get(not_working_url,
                 allow_redirects=False)
redirect_url = r.headers["Location"]
print(redirect_url)
# "https://www.facebook.com/people/TomÃ¡s-Navarro-Febre/100010922979377"
# The special character 'á' in "/Tomás-Navarro-Febre/" is displayed as 'Ã¡'.

# This fixes the string.
redirect_url = redirect_url.encode('raw_unicode_escape').decode('utf-8')
print(redirect_url)
# "https://www.facebook.com/people/Tomás-Navarro-Febre/100010922979377"

# Now it works. Response [200]
r = requests.get(redirect_url)
print(r)

There must be a better way to deal with this. I tried a bunch of different headers, and using requests.Session(), but none of them worked. Thanks in advance for any help.

Solution

Headers are normally encoded as Latin-1, so that's what requests uses to decode all headers. However, in practice, the Location header usually uses UTF-8 instead. What you are seeing is mojibake: UTF-8 data that was decoded as Latin-1.
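
To make the mechanism concrete, here is a small sketch that reproduces the garbling without any network access. It only simulates the wrong decode on the URL from the question; the variable names are illustrative and not part of requests.

```python
# The redirect target as the server intends it.
original = "https://www.facebook.com/people/Tomás-Navarro-Febre/100010922979377"

# On the wire, the server sends the header value as UTF-8 bytes.
wire_bytes = original.encode("utf-8")

# A client that assumes Latin-1 maps every byte to one character,
# so the two-byte UTF-8 sequence for 'á' becomes the two characters 'Ã¡'.
garbled = wire_bytes.decode("latin-1")
print(garbled)

# Encoding back to Latin-1 recovers the original bytes unchanged,
# and decoding those bytes as UTF-8 restores the intended text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

Latin-1 maps all 256 byte values to characters one-to-one, which is why the round trip is lossless and the repair is always possible when the underlying bytes were valid UTF-8.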

As of requests 2.14.0 (released May 2017), the library specifically decodes the Location header as UTF-8, precisely to avoid the problem you encountered. Upgrade your requests library.

If you can't upgrade, you can subclass the Session class to 'patch' the issue locally:

import requests

class UTF8RedirectingSession(requests.Session):
    def get_redirect_target(self, resp):
        # requests decoded the header as Latin-1; undo that and decode as UTF-8.
        if resp.is_redirect:
            return resp.headers['location'].encode('latin1').decode('utf8')
        return None

Then use it like this:

with UTF8RedirectingSession() as session:
    response = session.get(...)
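
One caveat with the subclass above is that the re-decode raises an error if a server really did send a Latin-1 Location header. A slightly more defensive variant might fall back to the unchanged value in that case. The helper below is only a sketch; `repair_location` is a hypothetical name, not something requests provides.

```python
def repair_location(value):
    """Re-decode a Latin-1-decoded header value as UTF-8 when that is possible."""
    try:
        return value.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text contains characters outside Latin-1, or the bytes
        # are not valid UTF-8. In both cases, keep the value as it was.
        return value


print(repair_location("/people/TomÃ¡s-Navarro-Febre/100010922979377"))
print(repair_location("/people/Tomás-Navarro-Febre/100010922979377"))
```

Inside `get_redirect_target`, the return line could then call `repair_location(resp.headers['location'])` instead of chaining `encode` and `decode` directly.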