I wrote a Wikipedia scraper in Python last week.
It scrapes French pages, so I must manage UTF-8 encoding to avoid errors. I did this with these lines at the beginning of my script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
I also encode the scraped string like this:
adresse = monuments[1].get_text().encode('utf-8')
My first script worked perfectly fine with Python 2.7, but I rewrote it for Python 3 (especially to use urllib.request) and UTF-8 doesn't work anymore.
I got these errors after scraping the first few elements:
File "scraper_monu_historiques_ge_py3.py", line 19, in <module>
url = urllib.request.urlopen(url_ville).read() # et on ouvre chacune d'entre elles
File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 455, in open
response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 473, in _open
'_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 433, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1217, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/usr/lib/python3.4/urllib/request.py", line 1174, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/lib/python3.4/http/client.py", line 1090, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python3.4/http/client.py", line 1118, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib/python3.4/http/client.py", line 975, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 58: ordinal not in range(128)
I don't understand why, because it worked fine in Python 2.7... I published a version of this WIP on GitHub.
You are passing urllib.request.urlopen a string that contains non-ASCII characters, so it is not a valid URI (it is, however, a valid IRI, or Internationalized Resource Identifier).
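The last frame of your traceback shows http.client encoding the request line as ASCII, and that step can be reproduced without any network access. This is only a minimal sketch: the request line and the page name in it are made up for illustration and are not taken from your script.

```python
# Hypothetical request line; the path contains a non-ASCII character ("é").
request_line = "GET /wiki/Céligny HTTP/1.1"

error_message = None
try:
    # http.client does the equivalent of this before sending the request.
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)
```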
You need to convert the IRI to a valid URI before passing it to urlopen. How to do this depends on which part of the IRI contains non-ASCII characters: the domain part should be encoded using Punycode, while the path should use percent-encoding.
Since your problem is exclusively due to the path containing non-ASCII characters, and assuming your IRI is stored in the variable iri, you can fix it with the following:
import urllib.parse
import urllib.request
split_url = list(urllib.parse.urlsplit(iri))
split_url[2] = urllib.parse.quote(split_url[2]) # the third component is the path of the URL/IRI
url = urllib.parse.urlunsplit(split_url)
urllib.request.urlopen(url).read()
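If the domain could also contain non-ASCII characters, both encodings can be combined in one function. The following is a rough sketch rather than a complete IRI implementation: iri_to_uri is a hypothetical helper name, it assumes the IRI has no user:password part and no IPv6 host, and it keeps existing "%" signs so that an already encoded path is not encoded twice.

```python
import urllib.parse


def iri_to_uri(iri):
    """Return an ASCII-only URI for the given IRI (hypothetical helper)."""
    parts = urllib.parse.urlsplit(iri)

    # Domain part: Punycode, via Python's built-in "idna" codec.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else "{}:{}".format(host, parts.port)

    # Path, query and fragment: percent-encoding of the UTF-8 bytes.
    # "%" is kept as-is so existing escapes are not encoded a second time.
    path = urllib.parse.quote(parts.path, safe="/%")
    query = urllib.parse.quote(parts.query, safe="=&%+")
    fragment = urllib.parse.quote(parts.fragment, safe="%")

    return urllib.parse.urlunsplit((parts.scheme, netloc, path, query, fragment))


# Example IRI chosen for illustration only.
print(iri_to_uri("https://fr.wikipedia.org/wiki/Genève"))
```

The result can then be passed to urllib.request.urlopen in place of the original string.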
However, if you can avoid urllib and have the option of using the requests library instead, I would recommend doing so. The library is easier to use and handles IRIs automatically.