I am trying to get data from different APIs. The responses are received in JSON format, stored in SQLite and parsed afterwards.
The issue I am having is that when sending many requests I eventually receive an error, even though I use time.sleep
between requests.
My code looks like the one below, where this would be inside a loop and the url to be opened would be changing:
base_url = 'https://www.whateversite.com/api/index.php?'
custom_url = 'variable_text1' + '&' + 'variable_text2'
url = base_url + custom_url #url will be changing
time.sleep(1)
data = urllib2.urlopen(url).read()
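As a side note, a safer way to build the query string than concatenating '&' by hand is urlencode. This is just a sketch: 'param1' and 'param2' are hypothetical placeholder names, not the API's real parameters.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

base_url = 'https://www.whateversite.com/api/index.php?'

# 'param1'/'param2' are made-up names; substitute the API's real parameters
params = {'param1': 'variable_text1', 'param2': 'variable_text2'}
url = base_url + urlencode(params)  # values are percent-escaped automatically
```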
This runs thousands of times inside the loop. The problem comes after the script has been running for a while (up to a couple of hours), when I get the following errors or similar ones:
data = urllib2.urlopen(url).read()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1222, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
or
uh = urllib.urlopen(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
return opener.open(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 208, in open
return getattr(self, name)(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 437, in open_https
h.endheaders(data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 969, in endheaders
self._send_output(message_body)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 829, in _send_output
self.send(msg)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 791, in send
self.connect()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1172, in connect
self.timeout, self.source_address)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 8] nodename nor servname provided, or not known
I believe this happens because the modules start failing at some point when used too often in a short period of time (both tracebacks end in a failed getaddrinfo call, i.e. the DNS lookup for the hostname is what fails).
From what I've read in many different threads about which module is better, I think any of them would work for my needs, so the main criterion for choosing one is that it can open as many URLs as possible. In my experience, urllib and urllib2 held up better than requests, which crashed sooner.
Assuming that I do not want to increase the waiting time used in time.sleep, these are the solutions I have thought of so far:
Combine the different modules: start with requests, switch to urllib2 when it fails, then to httplib2 (or urllib), and eventually back to requests.
Use a try .. except block to handle the exception, as suggested here.
I've also read about sending multiple requests at once or in parallel. I don't know exactly how that works or whether it could actually be useful here.
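The try .. except idea could be combined with the sleep in a small retry helper with a growing delay. A sketch, with made-up names (fetch_with_retry, open_url) and arbitrary retry counts; urllib2.URLError is a subclass of IOError, so catching IOError covers both tracebacks above:

```python
import time

def fetch_with_retry(open_url, url, retries=3, delay=1.0):
    """Try open_url(url) up to `retries` times, doubling the pause after each failure."""
    for attempt in range(retries):
        try:
            return open_url(url)
        except IOError:           # urllib2.URLError is a subclass of IOError
            if attempt == retries - 1:
                raise             # give up after the last attempt
            time.sleep(delay)
            delay *= 2            # exponential backoff

# usage would be something like:
# data = fetch_with_retry(lambda u: urllib2.urlopen(u).read(), url)
```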
However, I am not convinced about any of these solutions.
Can you think of any more elegant and/or efficient solution to deal with this error?
I'm on Python 2.7.
Even though I was not convinced, I ended up implementing the try .. except block, and I'm quite happy with the result:
for url in list_of_urls:
    time.sleep(2)
    try:
        response = urllib2.urlopen(url)
        data = response.read()
        time.sleep(0.1)
        response.close()  # as suggested by zachyee in the comments
        # code to save data in the SQLite database
    except urllib2.URLError as e:
        print '***** urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known> *****'
        # save the error in SQLite
        cur.execute('''INSERT INTO Errors (error_type, error_ts, url_queried)
                       VALUES (?, ?, ?)''', ('urllib2.URLError', timestamp, url))
        conn.commit()
        time.sleep(30)  # give it a small break
The script did run until the end.
Out of thousands of requests I got 8 errors, which were saved in my database together with the related URL. This way I can try to retrieve those URLs again afterwards, if needed.
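Retrieving the failed URLs later is then a single query. A self-contained sketch of the logging-and-retry idea, using an in-memory database and a made-up URL for illustration (the real script uses a database file):

```python
import sqlite3
import time

conn = sqlite3.connect(':memory:')  # in-memory for illustration only
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS Errors
               (error_type TEXT, error_ts REAL, url_queried TEXT)''')

# pretend one request just failed and was logged
cur.execute('''INSERT INTO Errors (error_type, error_ts, url_queried)
               VALUES (?, ?, ?)''',
            ('urllib2.URLError', time.time(),
             'https://www.whateversite.com/api/index.php?x=1'))
conn.commit()

# later: collect the URLs to retry
urls_to_retry = [row[0] for row in
                 cur.execute('SELECT url_queried FROM Errors')]
```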