Tags: python, python-requests, urllib2, urllib, httplib2

Multiple URL requests to API without getting error from urllib2 or requests


I am trying to get data from different APIs. The responses arrive in JSON format, are stored in SQLite and are parsed afterwards.

The issue I am having is that when sending many requests I eventually receive an error, even though I am using time.sleep between requests.

Usual approach

My code looks like the snippet below; in the real script it sits inside a loop, and the URL to be opened changes on every iteration:

import time
import urllib2

base_url = 'https://www.whateversite.com/api/index.php?'
custom_url = 'variable_text1' + '&' + 'variable_text2'

url = base_url + custom_url #url will be changing on every iteration

time.sleep(1)
data = urllib2.urlopen(url).read()
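
As an aside, query strings like this are usually safer to build with urllib.urlencode, which inserts the separators and escapes the values for you. A minimal sketch, with hypothetical parameter names:

    import urllib

    base_url = 'https://www.whateversite.com/api/index.php?'
    params = {'param1': 'variable_text1', 'param2': 'variable_text2'} #hypothetical names
    url = base_url + urllib.urlencode(params)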

That request code runs thousands of times inside the loop. The problem comes after the script has been running for a while (up to a couple of hours), at which point I get the following errors or similar ones:

    data = urllib2.urlopen(url).read()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1222, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

or

    uh = urllib.urlopen(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 437, in open_https
    h.endheaders(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 829, in _send_output
    self.send(msg)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 791, in send
    self.connect()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1172, in connect
    self.timeout, self.source_address)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 8] nodename nor servname provided, or not known

I believe this happens because the modules eventually throw an error if they are used too often in a short period of time.

From what I've read in many different threads about which module is better, I think any of them would work for my needs, and the main criterion for choosing one is how many URLs it can open without failing. In my experience, urllib and urllib2 held up better than requests, which crashed sooner.

Assuming that I do not want to increase the waiting time used in time.sleep, these are the solutions I have thought of so far:

Possible solutions?

A

I thought of combining the different modules (see the sketch after this list). That would be:

  • Start, for instance, with requests.
  • After a specific time or when an error is thrown, switch automatically to urllib2
  • After a specific time or when an error is thrown, switch automatically to another module (such as httplib2 or urllib) or back to requests
  • And so on...
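
A minimal sketch of what that switching could look like, assuming just one fallback (urllib2 first, then the third-party requests library) wrapped in a hypothetical fetch helper:

    import urllib2
    import requests #third-party; assumed to be installed

    def fetch(url):
        #Hypothetical helper: try urllib2 first, fall back to requests on error
        try:
            response = urllib2.urlopen(url)
            try:
                return response.read()
            finally:
                response.close()
        except urllib2.URLError:
            r = requests.get(url, timeout=10)
            r.raise_for_status()
            return r.text

Note that both libraries go through the same socket layer underneath, so a fallback like this mostly buys a second attempt rather than a fundamentally different code path.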

B

Use a try .. except block to handle that exception, as suggested here.
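
A minimal sketch of option B with retries added, using a hypothetical fetch_with_retries helper (the solution further down uses a simpler version that just logs the error and moves on):

    import time
    import urllib2

    def fetch_with_retries(url, max_tries=3):
        #Hypothetical helper: retry with exponential backoff, then give up
        for attempt in range(max_tries):
            try:
                response = urllib2.urlopen(url)
                try:
                    return response.read()
                finally:
                    response.close()
            except urllib2.URLError:
                if attempt == max_tries - 1:
                    raise #out of retries; let the caller handle it
                time.sleep(2 ** attempt) #wait 1s, then 2s, then 4s, ...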

C

I've also read about sending multiple requests at once or in parallel. I don't know exactly how that works, or whether it could actually be useful.
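
For what it's worth, a minimal sketch of option C using a thread pool from the standard library (multiprocessing.dummy), assuming list_of_urls is the same list used in the solution below. Errors are returned rather than raised, so one bad URL does not kill the whole batch:

    from multiprocessing.dummy import Pool #thread pool from the stdlib
    import urllib2

    def fetch(url):
        #Return (url, data, error) instead of raising, so pool.map completes
        try:
            response = urllib2.urlopen(url)
            try:
                return url, response.read(), None
            finally:
                response.close()
        except urllib2.URLError as e:
            return url, None, e

    pool = Pool(4) #4 worker threads; keep this low to stay polite to the APIs
    results = pool.map(fetch, list_of_urls)
    pool.close()
    pool.join()

Keep in mind that parallel requests make rate limits easier to hit, not harder, so this would help throughput but not the original error.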


However, I am not convinced about any of these solutions.

Can you think of any more elegant and/or efficient solution to deal with this error?

I'm using Python 2.7.


Solution

  • Even though I was not convinced, I ended up implementing the try .. except block, and I'm quite happy with the result:

    import time
    import urllib2
    from datetime import datetime

    #cur and conn are the SQLite cursor and connection opened earlier

    for url in list_of_urls:
        time.sleep(2)
        try:
            response = urllib2.urlopen(url)
            data = response.read()
            time.sleep(0.1)
            response.close() #as suggested by zachyee in the comments

            #code to save data in SQLite database

        except urllib2.URLError as e:
            print '***** urllib2.URLError: %s *****' % e
            #save error in SQLite
            timestamp = datetime.now().isoformat() #error_ts value, not defined in the original
            cur.execute('''INSERT INTO Errors (error_type, error_ts, url_queried)
            VALUES (?, ?, ?)''', ('urllib2.URLError', timestamp, url))
            conn.commit()
            time.sleep(30) #give it a small break
    

    The script did run until the end.

    Out of thousands of requests I got only 8 errors, which were saved in my database along with the related URL. This way I can try to retrieve those URLs again afterwards, if needed.
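
    A possible follow-up pass, assuming the database file is called data.sqlite (a hypothetical name) and reusing the Errors table from above to retry just the failed URLs:

    import sqlite3
    import time
    import urllib2

    conn = sqlite3.connect('data.sqlite') #hypothetical file name
    cur = conn.cursor()

    cur.execute('SELECT DISTINCT url_queried FROM Errors')
    failed_urls = [row[0] for row in cur.fetchall()]

    for url in failed_urls:
        time.sleep(2)
        try:
            response = urllib2.urlopen(url)
            data = response.read()
            response.close()
            #save data in SQLite as in the main loop
        except urllib2.URLError as e:
            print 'still failing: %s (%s)' % (url, e)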