python, python-requests, concurrent.futures

What is causing concurrent.futures deadlock? Code included


I have a concurrent.futures scraping script that I use for low-level tasks. Lately, however, it has started acting up: it gets stuck and never finishes.

I was able to narrow the problem down to 17 URLs (from a batch of 18k, so you can imagine how much fun that was). Something about one or more of these 17 URLs causes a stall (deadlock?), despite my using a timeout both for the requests and for the futures. The strange thing is that it doesn't appear to be a single URL causing it: I log each URL as it finishes, and the set of URLs that actually finish changes from run to run, so there is no single URL I can point to as the culprit.

Any help is welcome.

(Run the function as is. Don't use runBad = False, since that code path expects a list of tuples.)

EDIT1: this happens with ProcessPoolExecutor as well.

EDIT2: the issue seems to be tied to Retry. When I comment out the three lines below and use a plain requests.get, it finishes without a problem. But why is that? Could it be a compatibility issue between how Retry is implemented and concurrent.futures?

#    s = requests.Session()
#    retries = Retry(total=1, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504], raise_on_status=False) # raise_on_status=False: return the response instead of raising RetryError
#    s.mount("https://", HTTPAdapter(max_retries=retries))
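
For reference, the plain-get substitution described above is just the following (a sketch using the same argument names as the function below; no Session and no mounted HTTPAdapter/Retry):

# plain requests.get instead of the Session with a mounted Retry adapter
response = requests.get(url, headers=HEADERS, timeout=tTimeout, verify=False)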

EDIT3: even this simple request doesn't work, so it really has to do with mounting the HTTPAdapter / max_retries. I even tried a version without urllib3's Retry(), just with max_retries=2, and it still didn't work. I raised an issue to check whether we're missing anything: https://github.com/psf/requests/issues/5538

import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) # disabled SSL warnings
 
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
TIMEOUT = 5

s = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[503])
s.mount("https://", HTTPAdapter(max_retries=retries))
response = s.get('https://employdentllc.com', headers=HEADERS, timeout=TIMEOUT, verify=False)
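
The max_retries=2 variant mentioned in EDIT3 (without urllib3's Retry()) was roughly this sketch, and it still didn't finish:

s = requests.Session()
s.mount("https://", HTTPAdapter(max_retries=2))  # bare integer retry count, no Retry() object
response = s.get('https://employdentllc.com', headers=HEADERS, timeout=TIMEOUT, verify=False)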

This is the original concurrent.futures code:

import requests
import concurrent.futures
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from requests.exceptions import HTTPError
from requests.exceptions import SSLError
from requests.exceptions import ConnectionError
from requests.exceptions import Timeout
from requests.exceptions import TooManyRedirects
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) # disabled SSL warnings

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
TIMEOUT = 5

def getMultiRequest(url, runBad, bad_request, tTimeout):
    #print("url = ", url)
    s = requests.Session()
    retries = Retry(total=3, backoff_factor=5, status_forcelist=[429, 500, 502, 503, 504], raise_on_status=False) # raise_on_status=False = instead of RetryError returns a response
    s.mount("https://", HTTPAdapter(max_retries=retries))
    if runBad == False:
        try:
            response = s.get(url, headers=HEADERS, timeout=tTimeout, verify=False)
           
            # Processing stuff // some can be pretty long (Levenshtein etc.)
               
            ret = (url, response.url, response.status_code, "", len(response.content), "", "", "")
        except HTTPError as e:
            ret = (url, "", e.response.status_code, "", 0, "", "", False)
        except SSLError:
            ret = (url, "", 0, "SSL certificate verification failed", 0, "", "", False)
        except ConnectionError:
            ret = (url, "", 0, "Cannot establish connection", 0, "", "", False)
        except Timeout:
            ret = (url, "", 0, "Request timed out", 0, "", "", False)
        except TooManyRedirects:
            ret = (url, "", 0, "Too many redirects", 0, "", "", False)
        except Exception:
            ret = (url, "", 0, "Undefined exception", 0, "", "", False)
        return ret
    else:
        try:
            response = s.get(url, headers=HEADERS, timeout=tTimeout, verify=False)
           
            # Processing stuff // some can be pretty long (Levenshtein etc.)
               
            ret = (url, response.url, response.status_code, "", "")
        except Exception:
            ret = (url, "", 0, "", "")
        return ret

def getMultiRequestThreaded(urlList, runBad, logURLs, tOut):
    responseList = []
    if runBad == True:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future_to_url = {executor.submit(getMultiRequest, url, runBad, "", tOut): url for url in urlList}
            for future in concurrent.futures.as_completed(future_to_url):
                url = future_to_url[future]
                try:
                    data = future.result(timeout=30)
                except Exception as exc:
                    data = (url, 0, str(type(exc)))
                finally:
                    if logURLs == True:
                        print("BAD URL done: '" + url + "'.")
                    responseList.append(data)
    else:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future_to_url = {executor.submit(getMultiRequest, url[0], runBad, url[1], tOut): url for url in urlList}
            for future in concurrent.futures.as_completed(future_to_url):
                url = future_to_url[future][0]
                try:
                    data = future.result(timeout=30)
                except Exception as exc:
                    data = (url, 0, str(type(exc)))
                finally:
                    if logURLs == True:
                        print("LEGIT URL done: '" + url + "'.")
                    responseList.append(data)
    return responseList

URLs = [
    'https://www.appyhere.com/en-us',
    'https://jobilant.work/da',
    'https://www.iworkjobsite.com.au/jobseeker-home.htm',
    'https://youtrust.jp/lp',
    'https://passioneurs.net/ar',
    'https://employdentllc.com',
    'https://www.ivvajob.com/default/index',
    'https://praceapp.com/en',
    'https://www.safecook.be/en/home-en',
    'https://www.ns3a.com/en',
    'https://www.andjaro.com/en/home',
    'https://sweatcoin.club/',
    'https://www.pursuitae.com',
    'https://www.jobpal.ai/en',
    'https://www.clinicoin.io/en',
    'https://www.tamrecruiting.com/applicant-tracking-system-software-recruitment-management-system-talent-management-software-from-the-applicant-manager',
    'https://dott.one/index.html'
]

output = getMultiRequestThreaded(URLs, True, True, TIMEOUT)

Solution

  • I modified the program to add all the URLs to a set and, as each URL's fetch completed (for better or for worse) in the loop for future in concurrent.futures.as_completed(future_to_url):, I removed the URL from the set and printed out the set's current contents. That way, when the program eventually hung, I would know which URLs remained to be completed: it was always https://employdentllc.com and https://www.pursuitae.com.
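
    A minimal sketch of that instrumentation, dropped into the runBad == True branch of getMultiRequestThreaded (remaining is a name introduced here for illustration):

    remaining = set(urlList)
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        remaining.discard(url)               # this URL is done, for better or for worse
        print("Still pending:", remaining)   # whatever is left when the hang occurs is the culprit
        # ... result handling as before ...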

    When I tried fetching these URLs on my own, each returned a 503 Service Unavailable error. And indeed, when I comment out the following two lines, the program runs to completion.

    retries = Retry(total=3, backoff_factor=5, status_forcelist=[429, 500, 502, 503, 504], raise_on_status=False) # raise_on_status=False = instead of RetryError returns a response
    s.mount("https://", HTTPAdapter(max_retries=retries))
    

    It does not help just to remove code 503 from the list. Either something else is wrong in this specification (although it appears correct, apart from a fairly large backoff_factor, which I reduced just to be sure the hang wasn't simply me not waiting long enough), or something is wrong in requests or urllib3.
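
    For example, the variant with 503 dropped from the forcelist, as described above, still hangs (a sketch of what was tried, keeping the other values from the original line):

    retries = Retry(total=3, backoff_factor=5, status_forcelist=[429, 500, 502, 504], raise_on_status=False)  # 503 removed; the hang persisted
    s.mount("https://", HTTPAdapter(max_retries=retries))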

    Below is a printout of each result in variable output:

    ('https://www.appyhere.com/en-us', 'https://www.appyhere.com/en-us', 200, '', '')
    ('https://www.iworkjobsite.com.au/jobseeker-home.htm', 'https://www.iworkjobsite.com.au/jobseeker-home.htm', 200, '', '')
    ('https://passioneurs.net/ar', 'https://passioneurs.net/ar', 404, '', '')
    ('https://youtrust.jp/lp', 'https://youtrust.jp/lp', 200, '', '')
    ('https://jobilant.work/da', 'https://jobilant.work/da/', 200, '', '')
    ('https://employdentllc.com', 'https://employdentllc.com/', 503, '', '')
    ('https://www.ivvajob.com/default/index', 'https://www.ivvajob.com/default/index', 200, '', '')
    ('https://www.ns3a.com/en', 'https://www.ns3a.com/en', 200, '', '')
    ('https://www.safecook.be/en/home-en', 'https://www.safecook.be/en/home-en/', 200, '', '')
    ('https://sweatcoin.club/', 'https://sweatcoin.club/', 200, '', '')
    ('https://www.andjaro.com/en/home', 'https://www.andjaro.com/en/home/', 200, '', '')
    ('https://praceapp.com/en', 'https://praceapp.com/en/', 200, '', '')
    ('https://www.clinicoin.io/en', 'https://www.clinicoin.io/en', 200, '', '')
    ('https://www.jobpal.ai/en', 'https://www.jobpal.ai/en/', 200, '', '')
    ('https://dott.one/index.html', 'https://dott.one/index.html', 200, '', '')
    ('https://www.tamrecruiting.com/applicant-tracking-system-software-recruitment-management-system-talent-management-software-from-the-applicant-manager', 'https://www.tamrecruiting.com/applicant-tracking-system-software-recruitment-management-system-talent-management-software-from-the-applicant-manager', 404, '', '')
    ('https://www.pursuitae.com', 'https://www.pursuitae.com/', 503, '', '')
    

    UPDATE

    I found the problem. You need the respect_retry_after_header=False parameter. By default, urllib3's Retry honors any Retry-After header the server sends and sleeps for however long it advertises; that sleep is not bounded by the request timeout, so a very long Retry-After on these 503 responses can leave a worker blocked and make the whole program look hung:

    retries = Retry(total=3, backoff_factor=5, status_forcelist=[429, 500, 502, 503, 504], raise_on_status=False, respect_retry_after_header=False) # raise_on_status=False = instead of RetryError returns a response
    

    You might also wish to reduce the backoff_factor to 1.
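
    Putting it together, the corrected session setup would look roughly like this (same values as above, with backoff_factor reduced to 1):

    s = requests.Session()
    retries = Retry(total=3, backoff_factor=1,
                    status_forcelist=[429, 500, 502, 503, 504],
                    raise_on_status=False,             # return the response instead of raising RetryError
                    respect_retry_after_header=False)  # do not sleep for the server's Retry-After value
    s.mount("https://", HTTPAdapter(max_retries=retries))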

    This now appears to be a duplicate of "Retry for python requests module hanging".