UTF-8 encoding issue with Python 3

I wrote a Wikipedia scraper in Python last week.

It scrapes French pages, so I need to handle UTF-8 correctly to avoid errors. To do that, I put these lines at the top of my script:

#!/usr/bin/python
# -*- coding: utf-8 -*-

I also encode the scraped string like this:

adresse = monuments[1].get_text().encode('utf-8')

My first script worked perfectly with Python 2.7, but after I rewrote it for Python 3 (mainly to use urllib.request), the UTF-8 handling no longer works.

I got these errors after scraping the first few elements:

File "scraper_monu_historiques_ge_py3.py", line 19, in <module>
    url = urllib.request.urlopen(url_ville).read() # et on ouvre chacune d'entre elles
File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 455, in open
    response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 473, in _open
    '_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 433, in _call_chain
    result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1217, in https_open
    context=self._context, check_hostname=self._check_hostname)
File "/usr/lib/python3.4/urllib/request.py", line 1174, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/lib/python3.4/http/client.py", line 1090, in request
    self._send_request(method, url, body, headers)
File "/usr/lib/python3.4/http/client.py", line 1118, in _send_request
    self.putrequest(method, url, **skips)
File "/usr/lib/python3.4/http/client.py", line 975, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 58: ordinal not in range(128)

I don't understand why, because it worked fine in Python 2.7. I published a version of this WIP on GitHub.

Solution

You are passing a string that contains non-ASCII characters to urllib.request.urlopen. Such a string isn't a valid URI (it is a valid IRI, or Internationalized Resource Identifier, though).
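To see where the exception comes from, here is a minimal sketch that reproduces it without any network access. The last frame of the traceback shows http.client calling request.encode('ascii') on the request line; the request_line value below is a made-up stand-in for that string, not the one your script actually sent.

```python
# Hypothetical request line containing a non-ASCII character in the path.
request_line = "GET /wiki/Genève HTTP/1.1"

error = None
try:
    # http.client does the equivalent of this before sending the request.
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

# The exception records which character could not be encoded.
offending_char = request_line[error.start]
print(repr(offending_char), "at position", error.start)
```

Any character outside the ASCII range in the URL triggers the same failure, which is why the URL has to be converted before it reaches urlopen.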

You need to convert the IRI into a valid URI before passing it to urlopen. How to do this depends on which part of the IRI contains non-ASCII characters: the domain part should be encoded using Punycode, while the path should use percent-encoding.
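As a rough sketch of both conversions in one place, the helper below encodes the host with Python's built-in "idna" codec and percent-encodes the rest. The name iri_to_uri and the choice of "safe" characters are my own assumptions, and the function deliberately ignores user:password parts and IPv6 hosts, so treat it as an illustration rather than a complete IRI converter.

```python
import urllib.parse


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (simplified, hypothetical helper)."""
    parts = urllib.parse.urlsplit(iri)

    # Domain part: Punycode via the built-in "idna" codec.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else host + ":" + str(parts.port)

    # Other parts: percent-encoding. "%" is kept as safe so that escapes
    # already present in the input are not encoded a second time.
    path = urllib.parse.quote(parts.path, safe="/%")
    query = urllib.parse.quote(parts.query, safe="=&+%")
    fragment = urllib.parse.quote(parts.fragment, safe="%")

    return urllib.parse.urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("https://fr.wikipedia.org/wiki/Genève"))
print(iri_to_uri("http://bücher.example/"))
```

Keeping "%" in the safe set is a trade-off: it makes the function harmless on URLs that are already encoded, but a literal "%" in a page title would be left unescaped.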

Since your problem is exclusively due to the path containing Unicode characters, assuming your IRI is stored in the variable iri, you can fix it using the following:

import urllib.parse
import urllib.request

split_url = list(urllib.parse.urlsplit(iri))
split_url[2] = urllib.parse.quote(split_url[2])    # the third component is the path of the URL/IRI
url = urllib.parse.urlunsplit(split_url)

urllib.request.urlopen(url).read()

However, if you have the option of using the requests library instead of urllib, I would recommend doing so. The library is easier to use and handles IRIs automatically.