I need to run some permutation generation asynchronously in order to reduce the time it takes to produce a file of all possible permutations of a list. I have made several attempts at multiprocessing this, with no success.
Required result: a file containing a list of strings in the following format: PRE + JOINEDPERMUTATION, where:

- PRE is taken from the list 'prefix'
- JOINEDPERMUTATION is "".join(x)
- x is yielded by permutations(items, repetitions)
- items is the list of values whose permutations I need
- repetitions: I want every permutation of this list for each length in range(8)
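For example, a minimal serial sketch of the lines I am after (using repetitions=2 for brevity):

from itertools import permutations

items = ['a', 'b', 'c']
prefix = ['one', 'two', 'three']

for pre in prefix:
    for x in permutations(items, 2):
        print(pre + "".join(x))  # "oneab", "oneac", "oneba", ...

My multiprocessing attempt is below.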
items = ['a', 'b', 'c']
prefix = ['one', 'two', 'three']

from itertools import permutations
from multiprocessing import Pool

pool = Pool(14)

def permutations(pre, repetitions, items):
    PERMS = [pre + "".join(x) for x in permutations(items, repetitions)]
    return PERMS

def result_collection(result):
    results.extend(result)
    return results

results = []
args = ((pre, repetitions, items) for pre in prefix for repetitions in range(5))

for pre, repetitions, items in args:
    pool.apply_async(permutations, (pre, repetitions, items), callback=result_collection)

pool.close()
pool.join()

with open('file.txt', 'a', encoding='utf-8') as file:
    file.writelines(results)
I am not getting an error per se, but after running this program with a list where items had 50 elements and prefix had 5, it still wasn't finished after 8 hours, and I have no real idea how to investigate further.
A quick aside query as well: am I right in thinking that there is basically no use for pool.map in the multiprocessing module, given that it will only ever take advantage of one worker? Why is it there?
Hard to believe that you are not getting an error per se; this thing should raise RuntimeError like crazy.
Inside a newly spawned process, the module from which it is spawned is loaded, i.e. executed. That means your code tries to create 14 processes, each of which tries to create 14 processes, each of which tries to create 14 ... you might see the pattern developing here :)
You'll have to put everything that may only be executed by the main process inside an if __name__ == '__main__': block. This prevents execution of those parts of the code in the workers, because for them __name__ is '__mp_main__'.
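A minimal sketch that makes this visible (explicitly forcing the 'spawn' start method, which is the default on Windows): running this file prints __name__ once per process.

import multiprocessing as mp

# This top-level line runs in the parent, and runs again when the
# child re-imports the module under the spawn start method.
print("loading module, __name__ =", __name__)

def work():
    pass

if __name__ == '__main__':
    mp.set_start_method('spawn')  # force the behaviour described above
    p = mp.Process(target=work)
    p.start()
    p.join()

The parent prints __main__ and the worker prints __mp_main__, so the guarded block never runs in the worker.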
Doing that will fix the multiprocessing part, but there is another issue. You import permutations from itertools, then you define a function with the same name in your namespace, effectively overwriting the function from itertools. When your processes call your function permutations, the line PERMS = [pre + "".join(x) for x in permutations(items, repetitions)] will raise a TypeError, because it calls your own permutations function there with two arguments instead of the three that your function definition requires.
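The shadowing is easy to reproduce in isolation; a minimal sketch:

from itertools import permutations

def permutations(pre, repetitions, items):  # silently replaces the import above
    return [pre + "".join(x) for x in permutations(items, repetitions)]

permutations('one', 2, ['a', 'b', 'c'])
# TypeError: permutations() missing 1 required positional argument: 'items'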
This should do what you want:
from itertools import permutations as it_perms
from multiprocessing import Pool

items = ['a', 'b', 'c']
prefix = ['one', 'two', 'three']

def permutations(pre, repetitions, items):
    # Uses the renamed itertools function, so no shadowing occurs.
    PERMS = [pre + "".join(x) for x in it_perms(items, repetitions)]
    return PERMS

def result_collection(result):
    # Runs in the main process whenever a worker finishes a task.
    results.extend(result)
    return results

if __name__ == '__main__':
    # Only the main process creates the pool and submits the work.
    pool = Pool(14)
    results = []
    args = ((pre, repetitions, items) for pre in prefix for repetitions in range(5))

    for pre, repetitions, items in args:
        pool.apply_async(permutations, (pre, repetitions, items), callback=result_collection)

    pool.close()
    pool.join()

    with open('file.txt', 'a', encoding='utf-8') as file:
        file.writelines(results)
As for your side query: where did you get the idea that pool.map() will only ever take advantage of one worker? You may want to check the answers on this question, especially this one.
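In fact map (and its tuple-unpacking sibling starmap) splits the iterable into chunks and distributes those chunks across all workers. A sketch of your job using pool.starmap instead, reusing items, prefix and the corrected permutations function from the script above:

if __name__ == '__main__':
    with Pool(14) as pool:
        arg_list = [(pre, reps, items) for pre in prefix for reps in range(5)]
        # Each tuple is unpacked into permutations(pre, reps, items);
        # one result list per task comes back, in submission order.
        chunks = pool.starmap(permutations, arg_list)
    results = [perm for chunk in chunks for perm in chunk]
    with open('file.txt', 'a', encoding='utf-8') as file:
        file.writelines(results)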