Tags: python, multithreading, download, python-multithreading

How to use multithreading to download 1000+ .txt files in Python quickly


I have a .txt file with a list of 1000+ URLs of .txt files that I need to download and then index by word. The indexing is fast, but the downloading is a huge bottleneck. I tried using urllib2 and urllib.request, but downloading a single text file with either of these libraries takes 0.25-0.5 seconds per file (on average the files are around 600 words / 3000 characters of text).

I realize I need to utilize multithreading (as a concept) at this point, but I don't know how I would go about doing this in Python. At this point I'm downloading them one at a time, and it looks like this:

        import urllib.request

        with open('urls.txt', 'r') as f:                 # urls.txt is the .txt file of urls, one on each line
            for url in f:
                response = urllib.request.urlopen(url)   # blocking request: ~0.25-0.5 s per file
                data = response.read()
                text = data.decode()
                # .. and then index the text

This project prompt allowed me to select any language. I chose Python because I thought it would be faster. The sample output I received listed the total indexing time at around 1.5 seconds, so I assume this is approximately the benchmark they would like applicants to reach. Is it even possible to achieve such a fast runtime for this number of downloadable files in Python, on my machine (which has 4 cores)?

edit (including more information about the indexing):

Each .txt file I'm downloading contains a review of a college. Ultimately I want to index all of the terms in each review, such that when you search by term, you get back a list of all of the colleges whose reviews contain the term, and how many times that term was used in reviews for a given college. I have nested dictionaries: the outer key is the search term, and the outer value is an inner dictionary. The inner dictionary's key is the name of a college, and its value is the number of times the term has appeared in reviews for that college.
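
To illustrate the structure (the college names and terms here are just made-up examples):

        # index[term][college] -> number of times `term` appears in reviews of that college
        index = {
            "tuition": {"Example College": 4, "Sample University": 1},
            "campus": {"Example College": 2},
        }

        def add_occurrence(index, term, college):
            # increment the count for one (term, college) pair
            index.setdefault(term, {})
            index[term][college] = index[term].get(college, 0) + 1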


Solution

  • I don't know exactly what you are doing while indexing the text, whether data needs to be exchanged between threads, or whether you are writing to a single file/variable (in which case a lock should be used), but something like this should work:

    import threading
    import urllib.request
    
    with open('urls.txt', 'r') as f:
        urls = f.read().splitlines()
        
    def download_index(url, lock):
        response = urllib.request.urlopen(url)
        data = response.read()
        text = data.decode()
        # ... index the text ...
        with lock:
            pass  # access shared resources (e.g. update the shared index) here
    
    n = 2  # number of parallel connections per batch
    chunks = [urls[i * n:(i + 1) * n] for i in range((len(urls) + n - 1) // n)]  # split the url list into batches of n
    
    lock = threading.Lock()
                    
    # download each batch concurrently; wait for the whole batch to finish before starting the next
    for chunk in chunks:
        threads = []
        for url in chunk:
            thread = threading.Thread(target=download_index, args=(url, lock,))
            thread.start()
            threads.append(thread)
        for thread in threads:
            thread.join()
    

    Note that you should think about how many connections to open at a time, as you would most likely get blocked for firing off 1000+ requests simultaneously. I don't know the ideal number, but experiment with n and see what works, or use proxies. There is also a thread-pool sketch after the edit note below that caps concurrency for you.

    Edit: added a lock
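
    If you would rather not manage batches and threads by hand, the standard-library thread pool can cap the number of simultaneous connections for you. A minimal sketch, where max_workers=20 is just an arbitrary starting value to tune and the indexing step is left as a placeholder:

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url):
        # download one file and return its decoded text
        with urllib.request.urlopen(url) as response:
            return response.read().decode()

    with open('urls.txt', 'r') as f:
        urls = f.read().splitlines()

    # the pool never runs more than max_workers downloads at once
    with ThreadPoolExecutor(max_workers=20) as executor:
        for url, text in zip(urls, executor.map(fetch, urls)):
            pass  # index `text` for `url` here (use a lock if the index is shared across threads)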