I have a .txt file with a list of 1000+ URLs of .txt files that I need to download and then index by word. The indexing is fast, but the downloading is a huge bottleneck. I tried urllib2 and urllib.request, and downloading a single text file with either library takes 0.25-0.5 seconds (on average the files are around 600 words / 3000 characters of text).
I realize I need to use multithreading (or some other form of concurrency) at this point, but I don't know how to go about it in Python. Right now I'm downloading the files one at a time, and it looks like this:
import urllib.request

with open('urls.txt', 'r') as f:  # urls.txt is the .txt file of URLs, one on each line
    for url in f:
        url = url.strip()  # drop the trailing newline before opening the URL
        response = urllib.request.urlopen(url)
        data = response.read()
        text = data.decode()
        # ... and then index the text
This project prompt allowed me to select any language. I chose Python because I thought it would be faster. The sample output I received listed a total indexing time of around 1.5 seconds, so I assume that is roughly the benchmark applicants are expected to reach. Is it even possible to download and index this many files that quickly in Python on my machine (which has 4 cores)?
Edit (more information about the indexing):
Each .txt file I'm downloading contains a review of a college. Ultimately I want to index all of the terms in each review, so that when you search by term you get back a list of all of the colleges whose reviews contain that term, along with how many times the term was used in reviews for each college. I use nested dictionaries: the outer key is the search term, and the outer value is a dictionary whose keys are college names and whose values are the number of times the term has appeared in reviews for that college.
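Roughly, the index looks like this (a minimal sketch; index_review and search are just illustrative names, not part of the actual project):

from collections import defaultdict

# index[term][college] -> number of times `term` appears in reviews of `college`
index = defaultdict(lambda: defaultdict(int))

def index_review(college, text):
    for term in text.lower().split():
        index[term][college] += 1

def search(term):
    # returns {college: count} for every college whose reviews contain `term`
    return dict(index[term]) if term in index else {}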
I don't know exactly what you are doing while indexing the text, or whether data needs to be exchanged between threads or written to a single file/variable (in which case a lock should be used), but something like this should work:
import threading
import urllib.request

with open('urls.txt', 'r') as f:
    urls = f.read().splitlines()

def download_index(url, lock):
    response = urllib.request.urlopen(url)
    data = response.read()
    text = data.decode()
    # indexing
    with lock:
        # access shared resources (e.g. update the shared index) here
        pass

n = 2  # number of parallel connections
# split the URL list into chunks of n, so at most n downloads run at once
chunks = [urls[i * n:(i + 1) * n] for i in range((len(urls) + n - 1) // n)]
lock = threading.Lock()

for chunk in chunks:
    threads = []
    for url in chunk:
        thread = threading.Thread(target=download_index, args=(url, lock))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
Note that you should limit how many connections you open at a time; you would most likely get blocked (or throttled) for firing off 1000+ requests at once. I don't know the ideal number, so experiment with n and see what works, or use proxies.
Edit: added a lock
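If you'd rather not manage the threads and chunks by hand, a thread pool caps the number of simultaneous connections for you. A minimal sketch using concurrent.futures (max_workers=20 is just a starting point to experiment with, and download is an illustrative name):

import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def download(url):
    with urllib.request.urlopen(url) as response:
        return url, response.read().decode()

with open('urls.txt', 'r') as f:
    urls = f.read().splitlines()

# at most max_workers downloads run at the same time
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(download, url) for url in urls]
    for future in as_completed(futures):
        url, text = future.result()
        # index `text` here; this loop runs in the main thread, so no lock is needed

This also moves the indexing back into the main thread, which sidesteps the lock entirely.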