Tags: python, parallel-processing, python-itertools, parallelism-amdahl

Parallelize process of generating combinations


The Problem

I need to create a list of combinations of items from a master list of length n. With a small number of items in the master list, this can be done without parallelization and happen quickly. However, when I try to use the multiprocessing library to parallelize the generation of the list, it seemingly takes longer. Here's an example:

Pretend I have a list of items (I'll use ice cream flavors) called item_list

item_list = ['chocolate', 'strawberry', 'vanilla', 'pineapple']

I want to generate all of the combinations of the flavors, with combinations of size n_items, so I wrote a function to do this:

import itertools
def make_all_combinations_n_items(item_list, n_items):
    out = []
    for combo in itertools.combinations(item_list, n_items):
        out.append(combo)

    return out

And if I call it for size n_items = 2, it produces the list of tuples quickly.

make_all_combinations_n_items(item_list, 2)

I know that as the number of items grows, the total number of possible combinations grows as 2^n, and I want to parallelize this process to generate the combinations faster (essentially have each core work on a different value of n_items). I have 4 cores available.
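
To illustrate the growth, here's a quick check of the per-size counts (math.comb needs Python 3.8+):

import math

item_list = ['chocolate', 'strawberry', 'vanilla', 'pineapple']
n = len(item_list)
counts = [math.comb(n, k) for k in range(n + 1)]  # one count per combination size k
print(counts)                                     # [1, 4, 6, 4, 1]
print(sum(counts) == 2 ** n)                      # True: all sizes together sum to 2^n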

To try to parallelize it, I used the multiprocessing library, guided by this example, as follows:

import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())

new_combos = [pool.apply(
    make_all_combinations_n_items,
    args = (item_list, n_items))
    for n_items in range(1, len(item_list) + 1)
]

pool.close()

The process doesn't happen nearly as quickly, and in fact, I can't tell if the process is working at all. When I copied the example code I was following and ran it, I had the same results.

Questions

I have two questions:

1) Is this the proper way to parallelize this function? Or is there a better/more efficient/faster way?

2) Is there a better/faster/more efficient way to produce all of these combinations?

I can provide more information as needed. Thanks in advance for the help.


Solution

  • Q : 1) Is this the proper way to parallelize this function? Or is there a better/more efficient/faster way?
    Q : 2) Is there a better/faster/more efficient way to produce all of these combinations?

    A great case for demonstrating a true double-trouble Hell inside the Complexity-ZOO :

    As many will have already detected, the actual complexity here grows approximately as O( n! / ( k! ( n - k )! ) ) in [SPACE] (RAM) and, similarly, approximately as O( n! / ( k! ( n - k )! ) ) in [TIME]. Worse, jumps into ~ 100.000 x slower access times are near to come, as soon as the [SPACE] ceases to provide enough RAM for pure in-RAM computing ( one can easily hang a 64-bit O/S just by exhausting its memory management with the ill-designed production and storage of combinatorics data on N-tuples from as few as 100 candidates ).


    While we may, and will, unload as much as possible of the standard python overheads, using hashing tricks and smart iterators and processing set()-s instead of list-manipulations ( a list is expensive here, as it has to maintain order, which combinations do not need ), the RAM consumption will still drive the resulting costs in both the [SPACE] and the [TIME] dimensions.
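
    ( A minimal sketch of that iterator-only idea - assuming the downstream work only needs to consume each combination once, here via a hypothetical score_fn(), and never needs the whole list resident in RAM - keeps the [SPACE] footprint flat : )

    import itertools

    def consume_combinations( item_list, n_items, score_fn ):
        """Stream the combinations one at a time - nothing gets stored,
           so the [SPACE] footprint stays flat, no matter how many tuples exist."""
        best = None
        for combo in itertools.combinations( item_list, n_items ):
            score = score_fn( combo )               # hypothetical per-combination work
            if best is None or score > best[0]:
                best = ( score, combo )             # keep only the running best
        return best

    # ex.: keep just the best-scoring pair, instead of materialising all ~ 5000 of them
    # best = consume_combinations( set( range( 100 ) ), 2, lambda combo: sum( combo ) )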

    Selecting pairs from just 10 candidates takes almost no RAM in [SPACE] and costs ~ 30 [us] in [TIME]:

    >>> from zmq import Stopwatch     # Stopwatch methods {.start()|.stop()} used
    >>> aClk = Stopwatch()            # for benchmarking ~ [us]-resolution timer
    >>> import itertools              # for .combinations()
    >>>
    >>> aSet = set( range( int( 1E1 ) ) )
    >>> aClk.start();_=[i for i in itertools.combinations(aSet,2)];aClk.stop()
    30
    30
    30
    53
    30
    

    Selecting triples, again from 10 candidates, takes almost the same amount of time:

    >>> aSet = set( range( int( 1E1 ) ) )
    >>> aClk.start(); _ = [ i for i in itertools.combinations( aSet, 3 ) ]; aClk.stop()
    56
    54
    53
    52
    51
    

    Now, the costs of moving data into [SPACE] start to matter, and a lot :

    Selecting pairs from about a hundred candidates is still rather cheap, at 1.3 ~ 1.6 [ms] = 1300 ~ 1600 [us], producing almost ~ 5000 legal pairs:

    >>> aSet = set( range( int( 1E2 ) ) )
    >>> aClk.start();_=[i for i in itertools.combinations(aSet,2)];aClk.stop()
    1963
    1591
    1322
    1326
    1547
    1601
    

    Selecting triples goes into the range of about ~ 33 ~ 35 [ms] as it produces and stores about ~ 162.000 legal triples :

    >>> aSet = set( range( int( 1E2 ) ) )
    >>> aClk.start();_=[i for i in itertools.combinations(aSet,3)];aClk.stop()
    34741
    33505
    35580
    35932
    32989
    33434
    

    Selecting quartets will get us to ~ 600 ~ 730 [ms] as it must store all ~ 3.921.225 legal quartets :

    >>> aSet = set( range( int( 1E2 ) ) )
    >>> aClk.start();_=[i for i in itertools.combinations(aSet,4)];aClk.stop()
    592453
    723026
    721529
    729759
    699055
    704901
    

    Selecting 5-tuples from just a hundred candidates will need about 8 GB of RAM ( or devastating swapping will start to appear, at an additional cost of about 100.000 x slower processing compared to the in-RAM case ), still taking some ~ 452.000.000 [us] just to generate and store the ~ 75.287.520 legal 5-tuples of the cheapest data-type ( small int-s ) in about 6.1 GB:

    >>> aSet = set( range( int( 1E2 ) ) )
    >>> aClk.start();_=[i for i in itertools.combinations(aSet,5)];aClk.stop()
    451758912
    

    A CPU-monitor confirms this, showing no more than about 30% - 40% CPU-load. The problem soon turns into a memory-[SPACE]-bound problem, where the [TIME] costs are related more to memory management, memory allocation and mem-I/O than to the nature of the calculations.
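
    ( Since the tuple-count is a closed-form binomial, the [SPACE] bill can be estimated up-front, before ever touching itertools - the ~ 80 bytes-per-tuple figure below is just a rough CPython assumption, not a measured constant : )

    import math

    def combo_budget( n_candidates, k, bytes_per_tuple = 80 ):
        """Return ( tuple-count, rough RAM estimate in [GB] ) for C( n, k ),
           so one can refuse to materialise a list that will never fit in RAM."""
        count = math.comb( n_candidates, k )
        return count, count * bytes_per_tuple / 1E9

    print( combo_budget( 100, 5 ) )   # ( 75287520, ~6.0 ) - close to the ~ 6.1 GB observed above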

    If, in this situation, one tries to instantiate a few ( the more, the worse ) [CONCURRENT] processes to share the work, things only get worse. A thread-based attempt will not help here, due to python's central GIL-lock, which is known to re-[SERIAL]-ise all threaded work; that prevents any form of concurrent processing, and only corner cases, where latency-masking may help, will ever generate a positive reward for the effort of going multi-threaded in standard python. On top of that, beware all the multiprocessing-related costs of instantiation - again, the [SPACE]-related and the [TIME]-related costs matter, multiplied by how many instances one tries to spawn ( and how ).
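
    ( If one still wants to see those instantiation + transfer add-on costs in hard numbers, a small A/B harness like this sketch - the pure-[SERIAL] itertools baseline v/s a 4-process Pool doing the very same per-size work - will show them on any box; the exact figures will differ per platform : )

    import itertools, multiprocessing as mp, time

    def make_all_combinations_n_items( item_list, n_items ):
        return list( itertools.combinations( item_list, n_items ) )

    def bench( item_list ):
        sizes = range( 1, len( item_list ) + 1 )

        t0 = time.perf_counter()                 # pure-[SERIAL] baseline
        _  = [ make_all_combinations_n_items( item_list, k ) for k in sizes ]
        t1 = time.perf_counter()

        with mp.Pool( 4 ) as pool:               # 4-process attempt, paying spawn + pickling costs
            _ = pool.starmap( make_all_combinations_n_items,
                              [ ( item_list, k ) for k in sizes ] )
        t2 = time.perf_counter()

        print( "pure-[SERIAL]  %.6f [s]" % ( t1 - t0 ) )
        print( "Pool( 4 )      %.6f [s]" % ( t2 - t1 ) )

    if __name__ == '__main__':                   # mandatory guard for spawn-based O/S-es
        bench( list( range( 20 ) ) )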


    If one tries to split the work ( which may seem attractive ), so as to reduce the problem of generating N-tuples to a coordinated generation of (N-1)-tuples over 4 CPU-cores in a just-[CONCURRENT] pipeline fashion, the expected acceleration ( from going to (N-1)-tuple generation instead of full-scale N-tuples ) gets eaten by the immense add-on costs of re-assembling the partial results, which still arrive at a remarkable [SPACE]-size from the just-[CONCURRENT] pipeline; those memory-management costs are expensive to pay just for re-joining the partial results, while many still-to-run, memory-bound processing tasks wait to get processed through the artificially split-join, 4-CPU-core orchestrated work-flow.
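
    ( For completeness, a sketch of what such a split-by-fixed-head / re-join work-flow could look like - each worker produces the (k-1)-tuples for one fixed head, and the main process pays the pickling + re-joining bill at the end; it is illustrative only and, per the argument above, will typically lose to the plain pure-[SERIAL] itertools call : )

    import itertools, multiprocessing as mp

    def combos_with_fixed_head( args ):
        """Worker: every k-tuple that starts with items[ i ] - i.e. the (k-1)-tuples
           drawn from the tail, re-assembled with the fixed head prepended."""
        items, i, k = args
        head, tail  = items[ i ], items[ i + 1 : ]
        return [ ( head, ) + rest for rest in itertools.combinations( tail, k - 1 ) ]

    def parallel_combinations( items, k, n_procs = 4 ):
        with mp.Pool( n_procs ) as pool:
            parts = pool.map( combos_with_fixed_head,
                              [ ( items, i, k ) for i in range( len( items ) ) ] )
        return [ combo for part in parts for combo in part ]   # the re-joining step: every
                                                               # tuple gets pickled back here

    if __name__ == '__main__':
        assert ( parallel_combinations( list( range( 10 ) ), 3 )
                 == list( itertools.combinations( range( 10 ), 3 ) ) )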

    A few hard facts, demonstrated above on N-tuples created from as few as 100 candidates, are a great opportunity to test any better approach, if there is one, that could beat this double-trouble in the [SPACE]-and-[TIME]-bound, factorially-scaled school-book problem.

    Best regards from me to your Computing Science professor for making you go into this corner of the Complexity-ZOO Hell ... and making you get your hands dirty.


    RESUME :

    While for small sizes ( pairs, triples ) both the [SPACE] and the [TIME] costs, as shown above for the pure-[SERIAL] processing, are so small, if not negligible, they will never justify an attempt to spawn 4 new [CONCURRENT] processes ( at remarkable add-on costs ) to harness all 4 CPU-cores, plus the additional expensive add-on costs of re-assembling the results from the "distributed" processes back into the main python process.

    This will never justify the lump sum of all the add-on costs.

    Smart, vectorised code, like numpy, can "cooperate" with multi-core productivity in non-colliding operations, yet at close-to-zero add-on costs, because it does not instantiate separate processes and transfer jobs to them for "remote" execution; instead it uses low-level, compiled functions that work across cores without ever transferring any data to remote processes, letting the other CPU-cores compute independently, right on multiple regions of the original data, exploiting smart-aligned, cache-related benefits of working inside pre-fetched, coherent memory blocks.
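
    ( A tiny example of that vectorised, zero-process-spawning route - the index-pairs of all C( n, 2 ) combinations come out of a single numpy call, np.triu_indices, without ever pickling anything to a remote process; higher k-s need a different construction, so take it as a sketch of the principle only : )

    import numpy as np

    n = 100
    i, j  = np.triu_indices( n, k = 1 )       # all index pairs with i < j, computed in C
    pairs = np.stack( ( i, j ), axis = 1 )    # shape ( 4950, 2 ) - the C( 100, 2 ) pairs at once
    print( pairs.shape )                      # ( 4950, 2 )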

    At larger sizes, where the computing might somewhat enjoy a split of the work, the immense [SPACE]-related costs of standard python code will soon devastate any heavily sub-linear speedup, due to the add-on costs - mostly of re-assembling the results back into the main python process - while all 4 [CONCURRENT] work-units devastatingly compete for a shared pool of resources ( the RAM ) during their split-computing "ride", again at an immense adverse cost of memory management ( not to mention swap-originated system suffocation, which may easily hang up all the O/S efforts to fair-trade resources among all the asking processes while obeying all the system-designed priority schedules ).


    Last but not least - a BONUS part :

    For syntax, check this.

    For ideas on performance tuning, check this.

    For benchmarking the add-on costs of transferring data to/from spawned processes, check this ( the benchmarking facts were surprisingly hated there, so do not get distracted by the hysteria of minus-votes; the benchmarking code for cases B, C and D is the important value to re-use ).

    For even more details about the proper accounting of all the add-on costs of instantiating more processes, about the ( parameters-there + results-back ) communication costs, and about the impact of atomicity-of-work on the final, realistically achievable speedups, do not miss, and feel free to read about, the contemporary criticism of the overhead-naive Amdahl's Law.