I'm writing a program that downloads thousands of images using the 'map' method of a multiprocessing 'Pool' in Python. It goes a bit like this:
import os
import requests
from multiprocessing import Pool

def download_image(image):
    save_dir = "[PATH TO SAVE IMAGES]"
    image_url = image['url']
    image_name = image['name']
    image_data = requests.get(image_url).content
    with open(os.path.join(save_dir, f"{image_name}.jpg"), 'wb') as f:
        f.write(image_data)

pool = Pool(8)
downloads = pool.map(download_image, images)
pool.close()
pool.join()
I want to track the "downloads per second" of the program for (1) curiosity and (2) to optimize the number of processes required. It has been a while, but I remember hearing that accomplishing things like this is difficult due to the processes of Python's multiprocessing module operating independently.
One thought I've had (while writing this) is to simply time the program's runtime from 'Pool' creation to 'Pool' closing, and then divide the number of images downloaded by this time. Something about this approach seems unappealing, but if there are no better options I suppose it will have to do.
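A rough sketch of that whole-run measurement is below. The names `measure_throughput` and `fake_download` are made up for illustration, and `ThreadPool` is swapped in only so the sketch runs without spawning processes or touching the network; `multiprocessing.Pool` offers the same 'map' call.

```python
from multiprocessing.pool import ThreadPool
from time import perf_counter, sleep


def fake_download(item):
    # Hypothetical stand-in for download_image: sleeps instead of using the network.
    sleep(0.01)
    return item


def measure_throughput(func, items, workers=8):
    """Run one pool.map over items and return (results, items per second)."""
    start = perf_counter()
    with ThreadPool(workers) as pool:
        results = pool.map(func, items)  # blocks until every item is done
    elapsed = perf_counter() - start
    # The rate is count divided by time, not time divided by count.
    return results, len(items) / elapsed


if __name__ == "__main__":
    _, rate = measure_throughput(fake_download, list(range(40)))
    print(f"{rate:.1f} downloads per second")
```

This only gives one average for the entire run, so it says nothing about how the rate changes while the downloads are in progress.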
Even though you seem to be heading in an alternate direction (Threading), I figured I'd answer the original question anyway:
I'm going to go out on a limb and assume you don't need the output of downloads, because you don't return anything from the function download_image anyway. It's easy to alter this example to, for example, append results to a list should you need that. I'm also going to assume the order is not important, again because you're not keeping the results. Given those things, I'd suggest using imap_unordered instead of map so you can effectively get a "message" every time a task is completed by one of the workers in the pool:
import os
from multiprocessing import Pool
from time import time

import requests

def download_image(image):
    save_dir = "[PATH TO SAVE IMAGES]"
    image_url = image['url']
    image_name = image['name']
    image_data = requests.get(image_url).content
    with open(os.path.join(save_dir, f"{image_name}.jpg"), 'wb') as f:
        f.write(image_data)

if __name__ == "__main__":
    # Get in the habit of never calling anything that could create a child process,
    # such as creating a "Pool" or calling "multiprocessing.Process", without
    # guarding execution by "if __name__ == '__main__':". This is necessary on
    # Windows, it is needed on macOS with Python 3.8 and above, and it is highly
    # encouraged everywhere else.
    pool = Pool(8)  # <- child processes are created here, which can't be allowed
                    #    to happen when this file is `import`ed (which is what
                    #    `if __name__ == "__main__":` protects against).
    completed = 0
    t = time()
    # `images` is the same list of dicts used in the question
    for result in pool.imap_unordered(download_image, images):
        # `result` is unused in this case, but could easily be put to some use
        completed += 1
        if time() >= t + 60:  # once a minute
            rate = completed / (time() - t)
            print(f'{rate} operations per second')
            t = time()
            completed = 0
    print("done")
    pool.close()
    pool.join()
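If you'd rather keep the bookkeeping out of the loop, the counting logic above could be pulled into a small helper along these lines. This is only a sketch: `RateTracker` and its `clock` parameter are names I made up, not anything provided by `multiprocessing`. It reports the rate for each interval and can also give the average over the whole run.

```python
from time import monotonic


class RateTracker:
    """Counts completed tasks and reports a rate once per interval (hypothetical helper)."""

    def __init__(self, interval=60.0, clock=monotonic):
        self.interval = interval
        self.clock = clock              # injectable so it can be tested without waiting
        self.start = clock()            # start of the whole run
        self.window_start = self.start  # start of the current reporting interval
        self.window_count = 0
        self.total = 0

    def tick(self):
        """Record one completed task; return the interval rate if it is due, else None."""
        self.window_count += 1
        self.total += 1
        now = self.clock()
        elapsed = now - self.window_start
        if elapsed >= self.interval:
            rate = self.window_count / elapsed
            self.window_start = now
            self.window_count = 0
            return rate
        return None

    def overall(self):
        """Average tasks per second since the tracker was created."""
        elapsed = self.clock() - self.start
        return self.total / elapsed if elapsed > 0 else 0.0
```

In the loop above you would call `tracker.tick()` once per result and print the value whenever it is not `None`. Appending `result` to a list in that same loop is all it takes to keep the return values, and `tracker.overall()` after the loop gives the whole-run average.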