I have a .txt file with a list of 1000+ URLs of .txt files that I need to download and then index by word. The indexing is fast, but the downloading is a huge bottleneck. I tried urllib2 and urllib.request, and downloading a single text file with either library takes 0.25-0.5 seconds (on average the files are around 600 words / 3000 characters of text).
I realize I need to use multithreading (or some other form of concurrency) at this point, but I don't know how to go about it in Python. Right now I'm downloading the files one at a time, and it looks like this:
import urllib.request

with open('urls.txt', 'r') as f:  # urls.txt is the .txt file of URLs, one on each line
    for url in f:
        url = url.strip()  # drop the trailing newline before opening the URL
        response = urllib.request.urlopen(url)
        data = response.read()
        text = data.decode()
        # ... and then index the text
This project prompt allowed me to select any language. I chose Python because I thought it would be faster. The sample output I received listed a total indexing time of around 1.5 seconds, so I assume that is roughly the benchmark applicants are expected to reach. Is it even possible to download and index this many files that quickly in Python on my machine (which has 4 cores)?
Edit (more information about the indexing):
Each .txt file I'm downloading contains a review of a college. Ultimately I want to index all of the terms in each review, so that when you search by term you get back a list of all of the colleges whose reviews contain that term, along with how many times the term was used in reviews for each college. I use nested dictionaries: the outer key is the search term, and the outer value is a dictionary whose keys are college names and whose values are the number of times the term has appeared in reviews for that college.
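Roughly, the index looks like this (a minimal sketch; index_review and search are just illustrative names, not part of the actual project):

from collections import defaultdict

# index[term][college] -> number of times `term` appears in reviews of `college`
index = defaultdict(lambda: defaultdict(int))

def index_review(college, text):
    for term in text.lower().split():
        index[term][college] += 1

def search(term):
    # returns {college: count} for every college whose reviews contain `term`
    return dict(index[term]) if term in index else {}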
I don't know exactly what you are doing while indexing the text, or whether data needs to be exchanged between threads or written to a single file/variable (in which case a lock should be used), but something like this should work:
import threading
import urllib.request

with open('urls.txt', 'r') as f:
    urls = f.read().splitlines()

def download_index(url, lock):
    response = urllib.request.urlopen(url)
    data = response.read()
    text = data.decode()
    # indexing
    with lock:
        # access shared resources (e.g. update the shared index) here
        pass

n = 2  # number of parallel connections
# split the URL list into chunks of n, so at most n downloads run at once
chunks = [urls[i * n:(i + 1) * n] for i in range((len(urls) + n - 1) // n)]
lock = threading.Lock()

for chunk in chunks:
    threads = []
    for url in chunk:
        thread = threading.Thread(target=download_index, args=(url, lock))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
Note that you should limit how many connections you open at a time; you would most likely get blocked (or throttled) for firing off 1000+ requests at once. I don't know the ideal number, so experiment with n and see what works, or use proxies.
Edit: added a lock
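If you'd rather not manage the threads and chunks by hand, a thread pool caps the number of simultaneous connections for you. A minimal sketch using concurrent.futures (max_workers=20 is just a starting point to experiment with, and download is an illustrative name):

import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def download(url):
    with urllib.request.urlopen(url) as response:
        return url, response.read().decode()

with open('urls.txt', 'r') as f:
    urls = f.read().splitlines()

# at most max_workers downloads run at the same time
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(download, url) for url in urls]
    for future in as_completed(futures):
        url, text = future.result()
        # index `text` here; this loop runs in the main thread, so no lock is needed

This also moves the indexing back into the main thread, which sidesteps the lock entirely.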