
URL with umlaut (non-ASCII), dashes and brackets is not parseable by requests


I am trying to get the HTML content of a page with requests, but the call fails with a UnicodeDecodeError. Reproducible code:

import requests
import urllib.parse

url = "https://www.unique.nl/vacature/coördinator-facilitair-(v2037635)"

Attempt 1:

requests.get(url)

Attempt 2:

requests.get(requests.utils.requote_uri(url))

Both raise UnicodeDecodeError.

Attempt 3:

requests.get(urllib.parse.quote(url))

Attempt 4:

requests.get(urllib.parse.quote(url.encode("Latin-1"), ":/"))

What am I missing here? Encoding the URL as utf-8, latin1 or unicode_escape does not work either.

Full error message:

File /usr/local/lib/python3.9/site-packages/requests/api.py:75, in get(url, params, **kwargs)
     64 def get(url, params=None, **kwargs):
     65     r"""Sends a GET request.
     66
     67     :param url: URL for the new :class:`Request` object.
   (...)
     72     :rtype: requests.Response
     73     """
---> 75     return request('get', url, params=params, **kwargs)

File /usr/local/lib/python3.9/site-packages/requests/api.py:61, in request(method, url, **kwargs)
     57 # By using the 'with' statement we are sure the session is closed, thus we
     58 # avoid leaving sockets open which can trigger a ResourceWarning in some
     59 # cases, and look like a memory leak in others.
     60 with sessions.Session() as session:
---> 61     return session.request(method=method, url=url, **kwargs)

File /usr/local/lib/python3.9/site-packages/requests/sessions.py:542, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    537 send_kwargs = {
    538     'timeout': timeout,
    539     'allow_redirects': allow_redirects,
    540 }
    541 send_kwargs.update(settings)
--> 542 resp = self.send(prep, **send_kwargs)
    544 return resp

File /usr/local/lib/python3.9/site-packages/requests/sessions.py:677, in Session.send(self, request, **kwargs)
    674 if allow_redirects:
    675     # Redirect resolving generator.
    676     gen = self.resolve_redirects(r, request, **kwargs)
--> 677     history = [resp for resp in gen]
    678 else:
    679     history = []

File /usr/local/lib/python3.9/site-packages/requests/sessions.py:677, in <listcomp>(.0)
    674 if allow_redirects:
    675     # Redirect resolving generator.
    676     gen = self.resolve_redirects(r, request, **kwargs)
--> 677     history = [resp for resp in gen]
    678 else:
    679     history = []

File /usr/local/lib/python3.9/site-packages/requests/sessions.py:150, in SessionRedirectMixin.resolve_redirects(self, resp, req, stream, timeout, verify, cert, proxies, yield_requests, **adapter_kwargs)
    146 """Receives a Response. Returns a generator of Responses or Requests."""
    148 hist = []  # keep track of history
--> 150 url = self.get_redirect_target(resp)
    151 previous_fragment = urlparse(req.url).fragment
    152 while url:

File /usr/local/lib/python3.9/site-packages/requests/sessions.py:116, in SessionRedirectMixin.get_redirect_target(self, resp)
    114     if is_py3:
    115         location = location.encode('latin1')
--> 116     return to_native_string(location, 'utf8')
    117 return None

File /usr/local/lib/python3.9/site-packages/requests/_internal_utils.py:25, in to_native_string(string, encoding)
     23         out = string.encode(encoding)
     24     else:
---> 25         out = string.decode(encoding)
     27 return out

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 29: invalid start byte

Solution

  • The problem is not the request URL; it is the response, which requests can't parse. Here are the response headers for that URL:

    HTTP/2 301 
    content-type: text/html; charset=utf-8
    date: Tue, 27 Dec 2022 07:37:34 GMT
    server: Microsoft-IIS/10.0
    location: https://unique.nl/vacature/co?rdinator-facilitair-(v2037635)
    content-length: 184
    arr-disable-session-affinity: true
    

    The location header contains a URL with an unencoded non-ASCII character: the server sends "ö" as the raw Latin-1 byte 0xf6, which requests then fails to decode as UTF-8 (see the last frames of the traceback). That is the problem. By specification, URLs may not contain non-ASCII characters; they must be percent-encoded. Standards-conforming HTTP clients are within their rights to crash on this malformed response.

    Other clients may not crash because they handle the header in a way that happens not to trigger the problem, but it is still the response that deviates from the standard.
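
To make the failure concrete, here is a minimal offline sketch that replays the decode step from the traceback on the bytes of that location header, then shows what a percent-encoded version of the same URL could look like. The helper name percent_encode_location is hypothetical, and the choice of UTF-8 percent-encoding (the default of urllib.parse.quote) is an assumption about what the server accepts, not something the response confirms.

```python
from urllib.parse import quote

# Bytes of the Location header as the server sent them: "ö" arrives as the
# single Latin-1 byte 0xF6 instead of being percent-encoded.
raw_location = b"https://unique.nl/vacature/co\xf6rdinator-facilitair-(v2037635)"

# Replay the failing step from the traceback: decoding those bytes as UTF-8.
try:
    raw_location.decode("utf8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc  # 0xF6 is not a valid UTF-8 start byte


def percent_encode_location(location: str) -> str:
    """Percent-encode non-ASCII characters in a redirect target.

    Hypothetical helper. Assumes the server accepts UTF-8 percent-encoding,
    which is what urllib.parse.quote produces by default.
    """
    # Leave URL delimiters and any existing percent-escapes untouched.
    return quote(location, safe=":/?#[]@!$&'()*+,;=%")


# Python's HTTP stack decodes header values as Latin-1, so this is the str
# form in which a client would typically see the header.
header_value = raw_location.decode("latin1")
fixed = percent_encode_location(header_value)

print(decode_error)
print(fixed)
```

The first print shows the same kind of UnicodeDecodeError as in the question, and the second shows the form the server should have sent in the first place. Whether the site actually serves the page at the re-encoded URL is untested here.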