Tags: python, multithreading, multiprocessing, concurrent.futures, process-pool

Python multithreading (concurrent.futures) results in repeated output. How do I set up multithreading properly?


I put together the following Python script, which uses multithreading to execute a function that returns a dictionary. (My actual application does loading and parsing, but I reduced it here to a string operation so that it is simpler to show.)

The only way I found to get the multithreading to work on Windows is to put if "__main__" == __name__: before the execution. However, this seems to create an issue: everything after the function definition is repeated multiple times, even though it is outside both the function and the part of the script that runs the pool.

How do I update the script so that I don't get this repetition? (I want the function to return the dictionary only once.) What am I doing wrong?

Here is my repurposed script:

import concurrent.futures
from itertools import product
from time import process_time

# This function generates a dictionary with the string as key and a list of its letters as the value
def genDict (in_value):
    out_dict = {}
    out_dict[in_value] = list(in_value)
    return(out_dict)

# Generate a list of all four-letter strings over a five-letter alphabet
# this is not necessarily the best example for multithreading, but it makes the point
# an I/O example would really accelerate under multithreading
alphabets = ['a', 'b', 'c', 'd', 'e']
listToProcess = [''.join(i) for i in product(alphabets, repeat = 4)]
print('Lenght of List to Process:', len(listToProcess))

# Send each element of the list to the genDict function through the executor
t1_start = process_time()
dictResult = {}
if "__main__" == __name__:
    with concurrent.futures.ProcessPoolExecutor(4) as executor:
        futures = [executor.submit(genDict, elem) for elem in listToProcess]
        for future in futures:
            dictResult.update(future.result())
t1_stop = process_time()
print('Multithreaded Completion time =', t1_stop-t1_start, 'sec.')

print('\nThis print statement is outside the loop and function but still gets wrapped in')
print('This is the size of the dictionary: ', len(dictResult))

And here is the output I am getting (note that the time calculation and the print statements towards the end are executed multiple times):

PS >> & C://multithread_test.py
Lenght of List to Process: 625
Lenght of List to Process: 625
Lenght of List to Process: 625
Multithreaded Completion time = 0.0 sec.
Multithreaded Completion time = 0.0 sec.

This print statement is outside the loop and function but still gets wrapped in
This print statement is outside the loop and function but still gets wrapped in

This is the size of the dictionary:  0
This is the size of the dictionary:  0
Lenght of List to Process: 625
Multithreaded Completion time = 0.0 sec.

This print statement is outside the loop and function but still gets wrapped in
This is the size of the dictionary:  0
Lenght of List to Process: 625
Multithreaded Completion time = 0.0 sec.

This print statement is outside the loop and function but still gets wrapped in
This is the size of the dictionary:  0
Multithreaded Completion time = 0.140625 sec.

This print statement is outside the loop and function but still gets wrapped in
This is the size of the dictionary:  625
PS >>

Solution

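  • The ONLY things that should be outside of your if __name__ guard are the imports, the setting of global inputs, and the function to be executed. THAT'S IT. Note that ProcessPoolExecutor is multiprocessing, not multithreading. Remember that, with multiprocessing on Windows, each new worker process starts a brand-new interpreter, which re-runs your file, but with __name__ set to a value other than "__main__". Anything outside the guard will be executed again in every process, which is why the prints repeat and the extra copies report an empty dictionary.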

    Here is the way to organize this kind of code. This works.

    import concurrent.futures
    from itertools import product
    from time import process_time
    
    # This function generates a dictionary with the string as key and a list of its letters as the value
    def genDict(in_value):
        out_dict = {}
        out_dict[in_value] = list(in_value)
        return out_dict
    
    def main():
        # Generate a list of all four-letter strings over a five-letter alphabet.
        # This is not necessarily the best example for a process pool, but it makes the point;
        # an I/O-bound example would benefit much more from concurrency.
        alphabets = ['a', 'b', 'c', 'd', 'e']
        listToProcess = [''.join(i) for i in product(alphabets, repeat=4)]
        print('Lenght of List to Process:', len(listToProcess))
    
        # Submit each element of the list to genDict through the process pool.
        # Note: process_time() counts CPU time of this parent process only.
        t1_start = process_time()
        dictResult = {}
        with concurrent.futures.ProcessPoolExecutor(4) as executor:
            futures = [executor.submit(genDict, elem) for elem in listToProcess]
            for future in futures:
                dictResult.update(future.result())
        t1_stop = process_time()
        print('Multithreaded Completion time =', t1_stop-t1_start, 'sec.')
    
        print('\nThis print statement is outside the loop and function but still gets wrapped in')
        print('This is the size of the dictionary: ', len(dictResult))
    
    # Only the parent process has __name__ == "__main__", so main() runs exactly once
    if "__main__" == __name__:
        main()
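
Since the question says the real workload is loading and parsing, which is typically I/O-bound, an actual thread pool may be enough. Threads live inside the one interpreter, so the file is not re-imported and nothing at module level is repeated. The following is only a minimal sketch under that assumption, not a drop-in replacement: gen_dict and build_all are hypothetical names chosen for illustration, and for CPU-heavy work the ProcessPoolExecutor layout above is likely the better fit.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def gen_dict(in_value):
    """Map a string to the list of its characters (same job as genDict)."""
    return {in_value: list(in_value)}


def build_all(items, max_workers=4):
    """Run gen_dict over items in a thread pool and merge the partial dicts."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs
        for partial in executor.map(gen_dict, items):
            merged.update(partial)
    return merged


# The guard is not strictly required for threads, but keeping it means the
# script still behaves if it is later switched back to a process pool.
if __name__ == "__main__":
    letters = ["a", "b", "c", "d", "e"]
    items = ["".join(combo) for combo in product(letters, repeat=4)]
    result = build_all(items)
    print("Size of the dictionary:", len(result))
```

Because build_all returns the merged dictionary instead of filling a module-level variable, the caller receives the result once, which is what the question asks for.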