Tags: python, python-multiprocessing

Using multiprocessing to double the speed of working on a list


Let's say I have a list like this:

list_base = ['a','b','c','d']

If I iterate with for xxx in list_base:, the loop processes the list one value at a time. To double the speed of this work, I split the list into pairs and use multiprocessing to process both values of each pair at the same time.
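The pairing step on its own can be sketched as a small helper. This is only an illustration: `chunk_pairs` is a hypothetical name, and the slicing assumes the input is a list.

```python
def chunk_pairs(items):
    """Split a list into consecutive chunks of at most two items."""
    # Stepping through range(0, len(items), 2) also covers an odd-length
    # list: the final slice then holds a single item.
    return [items[i:i + 2] for i in range(0, len(items), 2)]


print(chunk_pairs(['a', 'b', 'c', 'd']))  # [['a', 'b'], ['c', 'd']]
print(chunk_pairs(['a', 'b', 'c']))       # [['a', 'b'], ['c']]
```

Note that a chunk may hold one item, so code that unpacks each chunk as `a, b` would fail on the last chunk of an odd-length list.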

Basic example

Code 1 (main_code.py):

import api_values

if __name__ == '__main__':
    list_base = ['a','b','c','d']
    api_values.main(list_base)

Code 2 (api_values.py):

import multiprocessing
import datetime

def add_hour(x):
    return str(x) + ' - ' + datetime.datetime.now().strftime('%d/%m/%Y %H:%M')

def main(list_base):
    a = list_base
    a_pairs = [a[i:i+2] for i in range(0, len(a)-1, 2)]
    if (len(a) % 2) != 0:
        a_pairs.append([a[-1]])  

    final_list = []

    for a, b in a_pairs:
        mp_1 = multiprocessing.Process(target=add_hour, args=(a,))
        mp_2 = multiprocessing.Process(target=add_hour, args=(b,))
        mp_1.start()
        mp_2.start()
        mp_1.join()
        mp_2.join()
        final_list.append(mp_1)
        final_list.append(mp_2)

    print(final_list)

When I print final_list, it contains values like this:

[
<Process name='Process-1' pid=9564 parent=19136 stopped exitcode=0>, 
<Process name='Process-2' pid=5400 parent=19136 stopped exitcode=0>, 
<Process name='Process-3' pid=13396 parent=19136 stopped exitcode=0>, 
<Process name='Process-4' pid=5132 parent=19136 stopped exitcode=0>
]

I can't get at the values returned by the add_hour(x) function.

I found some answers in this question:
How can I recover the return value of a function passed to multiprocessing.Process?

But I couldn't adapt those answers to my scenario, where the multiprocessing has to happen inside a function rather than directly under if __name__ == '__main__':

Whenever I try, I get errors related to where the code sits in the structure. I would appreciate help seeing how to apply it to my case.

Note:
This code is only a basic example; my real use case is extracting data from an API that allows at most two simultaneous calls.

Additional code:

Following @Timus's comment ("You might want to look into a **Pool** and **.apply_async**"), I arrived at the code below. It seems to work, but I don't know whether it is reliable, whether it needs any improvement, or whether this is the best option. Feel free to address that in an answer:

import multiprocessing
import datetime

final_list = []

def foo_pool(x):
    return str(x) + ' - ' + datetime.datetime.now().strftime('%d/%m/%Y %H:%M:%S')

def log_result(result):
    final_list.append(result)

def main(list_base):
    pool = multiprocessing.Pool()
    a = list_base
    a_pairs = [a[i:i+2] for i in range(0, len(a)-1, 2)]
    if (len(a) % 2) != 0:
        a_pairs.append([a[-1]])

    for a, b in a_pairs:
        pool.apply_async(foo_pool, args = (a, ), callback = log_result)
        pool.apply_async(foo_pool, args = (b, ), callback = log_result)
    pool.close()
    pool.join()

    print(final_list)

Solution

  • You don't have to use a callback: Pool.apply_async() returns an AsyncResult object, whose .get() method retrieves the result of the submitted call. An extension of your attempt:

    import time
    import multiprocessing
    import datetime
    from os import getpid
    
    def foo_pool(x):
        print(getpid())
        time.sleep(2)
        return str(x) + ' - ' + datetime.datetime.now().strftime('%d/%m/%Y %H:%M:%S')
    
    def main(list_base):
        # Stepping to len(list_base) also covers an odd-length list:
        # the last chunk then holds a single item.
        pairs = [list_base[i:i+2] for i in range(0, len(list_base), 2)]
    
        final_list = []
        with multiprocessing.Pool(processes=2) as pool:
            for pair in pairs:
                # Submit the whole pair before waiting, so both calls run at once.
                results = [pool.apply_async(foo_pool, args=(x,)) for x in pair]
                final_list.extend(res.get() for res in results)
    
        print(final_list)
    
    if __name__ == '__main__':
        list_base = ['a','b','c','d']
        start = time.perf_counter()
        main(list_base)
        end = time.perf_counter()
        print(end - start)
    

    I have added the print(getpid()) to foo_pool to show that you're actually using different processes. And I've used time to illustrate the speed-up: despite the time.sleep(2) in foo_pool, each pair takes about 2 seconds rather than 4, so main needs roughly 4 seconds for the four items instead of the roughly 8 seconds a sequential loop would take.
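Because each pair is collected before the next one is submitted, a slow call holds up its partner's slot until both are done. A pool of two workers already limits concurrency to two, so one possible variation is to submit every item up front and call .get() afterwards, which also keeps the results in input order. The sketch below makes one loud assumption: it uses multiprocessing.pool.ThreadPool, which mirrors Pool's interface but runs the workers as threads, so the snippet runs without an `if __name__ == '__main__':` guard. The shortened sleep and the ' - done' suffix are placeholders for the real work.

```python
import time
from multiprocessing.pool import ThreadPool  # assumption: threads, not processes


def foo_pool(x):
    time.sleep(0.2)  # placeholder for the real work, e.g. an API call
    return str(x) + ' - done'


def main(list_base):
    with ThreadPool(processes=2) as pool:
        # Submit every item first; the pool never runs more than two at once.
        pending = [pool.apply_async(foo_pool, args=(x,)) for x in list_base]
        # Collect in submission order, so results line up with list_base.
        return [res.get() for res in pending]


print(main(['a', 'b', 'c', 'd', 'e']))
```

Threads are usually enough when the work is waiting on a network API. Swapping in multiprocessing.Pool(processes=2) should behave the same way, provided the script's entry point keeps its `if __name__ == '__main__':` guard.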