Tags: python, multiprocessing, urllib, python-multiprocessing

request.urlretrieve in multiprocessing Python gets stuck


I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.

The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.

Here is the code that I am using

...
import multiprocessing as mp
import urllib.request

def getImages(val):

    # Download images
    try:
        url= # preprocess the url from the input val
        local= # Filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':

    files = "urls.txt"
    with open(files) as f:
        lst = [line.strip() for line in f]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)

    print("tempw")

It often gets stuck halfway through the list: it prints DONE or CAN'T DOWNLOAD for about half of the URLs it has processed, but I don't know what is happening with the rest of them. Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.

Thanks in advance


Solution

  • OK, I have found an answer.

    A likely culprit was that the script was getting stuck connecting to or downloading from a URL. So I added a socket timeout to limit the time spent connecting and downloading each image.

    And now the issue no longer bothers me.

    Here is my complete code

    ...
    import multiprocessing as mp
    import socket
    import urllib.request

    # Set the default timeout in seconds for all new socket connections
    timeout = 20
    socket.setdefaulttimeout(timeout)

    def getImages(val):

        # Download images
        try:
            url= # preprocess the url from the input val
            local= # Filename generation from global variables and random stuff...
            urllib.request.urlretrieve(url, local)
            print("DONE - " + url)
            return 1
        except Exception as e:
            print("CAN'T DOWNLOAD - " + url)
            return 0

    if __name__ == '__main__':

        files = "urls.txt"
        with open(files) as f:
            lst = [line.strip() for line in f]

        pool = mp.Pool(processes=4)
        res = pool.map(getImages, lst)
        pool.close()   # no more tasks; let the workers exit
        pool.join()    # wait for all workers to finish

        print("tempw")
    

    Hope this solution helps others who are facing the same issue
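As an alternative to the process-wide `socket.setdefaulttimeout`, each request can carry its own timeout: `urllib.request.urlopen` accepts a `timeout` argument, so one slow server cannot stall a worker indefinitely and no global socket state is needed. A minimal sketch (the `fetch` helper name and the `(url, filename)` pair layout are my own, not from the answer above):

```python
import multiprocessing as mp
import urllib.request

def fetch(url, local, timeout=20):
    """Download url to the path `local`, bounding this request individually.

    Returns True on success, False on any error (including a timeout).
    """
    try:
        # urlopen takes a per-call timeout, unlike urlretrieve,
        # which only honors the process-wide socket default.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            with open(local, "wb") as out:
                out.write(resp.read())
        return True
    except Exception:
        return False

if __name__ == "__main__":
    # e.g. (url, local_filename) pairs built from urls.txt
    jobs = []
    with mp.Pool(processes=4) as pool:
        results = pool.starmap(fetch, jobs)
```

Using the pool as a context manager also takes care of terminating the workers, which removes one more way for the script to hang at exit.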