I am trying to get the HTML content of a page with requests, but it results in a UnicodeDecodeError. The reproducible code:
import requests
import urllib.parse
url = "https://www.unique.nl/vacature/coördinator-facilitair-(v2037635)"
Attempt 1:
requests.get(url)
Attempt 2:
requests.get(requests.utils.requote_uri(url))
Both result in UnicodeDecodeError
Attempt 3:
requests.get(urllib.parse.quote(url))
Attempt 4:
requests.get(urllib.parse.quote(url.encode("Latin-1"), ":/"))
What am I missing here? Encoding the URL to utf-8, latin1 or unicode_escape does not work either.
Full error message:
File /usr/local/lib/python3.9/site-packages/requests/api.py:75, in get(url, params, **kwargs)
64 def get(url, params=None, **kwargs):
65 r"""Sends a GET request.
66
67 :param url: URL for the new :class:`Request` object.
(...)
72 :rtype: requests.Response
73 """
---> 75 return request('get', url, params=params, **kwargs)
File /usr/local/lib/python3.9/site-packages/requests/api.py:61, in request(method, url, **kwargs)
57 # By using the 'with' statement we are sure the session is closed, thus we
58 # avoid leaving sockets open which can trigger a ResourceWarning in some
59 # cases, and look like a memory leak in others.
60 with sessions.Session() as session:
---> 61 return session.request(method=method, url=url, **kwargs)
File /usr/local/lib/python3.9/site-packages/requests/sessions.py:542, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
537 send_kwargs = {
538 'timeout': timeout,
539 'allow_redirects': allow_redirects,
540 }
541 send_kwargs.update(settings)
--> 542 resp = self.send(prep, **send_kwargs)
544 return resp
File /usr/local/lib/python3.9/site-packages/requests/sessions.py:677, in Session.send(self, request, **kwargs)
674 if allow_redirects:
675 # Redirect resolving generator.
676 gen = self.resolve_redirects(r, request, **kwargs)
--> 677 history = [resp for resp in gen]
678 else:
679 history = []
File /usr/local/lib/python3.9/site-packages/requests/sessions.py:677, in <listcomp>(.0)
674 if allow_redirects:
675 # Redirect resolving generator.
676 gen = self.resolve_redirects(r, request, **kwargs)
--> 677 history = [resp for resp in gen]
678 else:
679 history = []
File /usr/local/lib/python3.9/site-packages/requests/sessions.py:150, in SessionRedirectMixin.resolve_redirects(self, resp, req, stream, timeout, verify, cert, proxies, yield_requests, **adapter_kwargs)
146 """Receives a Response. Returns a generator of Responses or Requests."""
148 hist = [] # keep track of history
--> 150 url = self.get_redirect_target(resp)
151 previous_fragment = urlparse(req.url).fragment
152 while url:
File /usr/local/lib/python3.9/site-packages/requests/sessions.py:116, in SessionRedirectMixin.get_redirect_target(self, resp)
114 if is_py3:
115 location = location.encode('latin1')
--> 116 return to_native_string(location, 'utf8')
117 return None
File /usr/local/lib/python3.9/site-packages/requests/_internal_utils.py:25, in to_native_string(string, encoding)
23 out = string.encode(encoding)
24 else:
---> 25 out = string.decode(encoding)
27 return out
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 29: invalid start byte
It's not the request URL that's the problem; it's the response that requests can't parse. Here are the response headers of that URL:
HTTP/2 301
content-type: text/html; charset=utf-8
date: Tue, 27 Dec 2022 07:37:34 GMT
server: Microsoft-IIS/10.0
location: https://unique.nl/vacature/co?rdinator-facilitair-(v2037635)
content-length: 184
arr-disable-session-affinity: true
The location header contains a URL with unencoded non-ASCII characters (the byte 0xf6 from the error message, which is "ö" in Latin-1). That is the problem. By specification, URLs may not contain non-ASCII characters. Standards-conforming HTTP clients are within their rights to crash on this malformed response. The URL must be percent-encoded.
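The failure can be reproduced without any network access. The sketch below mimics the two steps shown in the traceback for get_redirect_target (encode as latin1, then decode as utf8). It assumes the "?" in the header dump above stands for the raw byte 0xf6, as the error message reports.

```python
# Header value as Python 3 hands it over: header bytes decoded as Latin-1,
# so the raw byte 0xf6 shows up as "\xf6" (ö). Assumed from the error message.
location = "https://unique.nl/vacature/co\xf6rdinator-facilitair-(v2037635)"

# Step 1 from the traceback: back to the original bytes.
raw = location.encode("latin1")

# Step 2 from the traceback: interpret those bytes as UTF-8, which fails
# because 0xf6 is not a valid UTF-8 start byte.
error = None
try:
    raw.decode("utf8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
    # 'utf-8' codec can't decode byte 0xf6 in position 29: invalid start byte
```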
Other clients may not crash because they treat the response in some other way that doesn't happen to cause a problem, but it's still the response that's deviating from the standard.
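If you still need the page, one possible workaround is to turn off automatic redirects and repair the location header yourself before following it. This is a minimal sketch, not something verified against that site: it assumes the server accepts the UTF-8 percent-encoded form of the path, and repair_location and get_with_repaired_redirects are hypothetical helper names.

```python
from urllib.parse import quote, urljoin

# Characters left untouched: RFC 3986 reserved characters, plus "%" so that
# escapes already present in the header are not encoded a second time.
SAFE = ":/?#[]@!$&'()*+,;=%"


def repair_location(location: str) -> str:
    """Percent-encode the non-ASCII characters in a Location header value.

    The raw byte 0xf6 arrives as the Latin-1 character "ö". Assumption:
    the server expects it UTF-8 percent-encoded, i.e. as %C3%B6.
    """
    return quote(location, safe=SAFE)


def get_with_repaired_redirects(url, max_redirects=5, **kwargs):
    """Follow redirects by hand, repairing each Location header first."""
    import requests  # imported lazily so repair_location works on its own

    for _ in range(max_redirects):
        resp = requests.get(url, allow_redirects=False, **kwargs)
        if not resp.is_redirect:
            return resp
        # Resolve relative redirects against the current URL.
        url = urljoin(url, repair_location(resp.headers["Location"]))
    raise requests.TooManyRedirects(f"more than {max_redirects} redirects")
```

Calling get_with_repaired_redirects(url) with the URL from the question would then replace the plain requests.get(url). If the server turns out to expect Latin-1 escapes instead (%F6), passing encoding="latin-1" to quote would be the thing to try.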