Tags: python, parallel-processing, python-itertools, parallelism-amdahl

Parallelize process of generating combinations


The Problem

I need to create a list of combinations of items from a master list of length n. With a small number of items in the master list, this can be done without parallelization and happen quickly. However, when I try to use the multiprocessing library to parallelize the generation of the list, it seemingly takes longer. Here's an example:

Pretend I have a list of items (I'll use ice cream flavors) called item_list

item_list = ['chocolate', 'strawberry', 'vanilla', 'pineapple']

I want to generate all of the combinations of the flavors, with combinations of size n_items, so I wrote a function to do this:

import itertools
def make_all_combinations_n_items(item_list, n_items):
    out = []
    for combo in itertools.combinations(item_list, n_items):
        out.append(combo)

    return out

And if I call it for size n_items = 2, it produces the list of tuples quickly.

make_all_combinations_n_items(item_list, 2)

I know that as the number of items grows, the total number of possible combinations grows as 2^n, and I want to parallelize this process to generate the combinations faster (essentially have each core work on a different value of n_items). I have 4 cores available.
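
To illustrate the growth, here's a quick check of the per-size counts (math.comb needs Python 3.8+):

import math

item_list = ['chocolate', 'strawberry', 'vanilla', 'pineapple']
n = len(item_list)
counts = [math.comb(n, k) for k in range(n + 1)]  # one count per combination size k
print(counts)                                     # [1, 4, 6, 4, 1]
print(sum(counts) == 2 ** n)                      # True: all sizes together sum to 2^n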

To try to parallelize it, I used the multiprocessing library, guided by this example, as follows:

import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())

new_combos = [pool.apply(
    make_all_combinations_n_items,
    args = (item_list, n_items))
    for n_items in range(1, len(item_list) + 1)
]

pool.close()

The process doesn't happen nearly as quickly, and in fact, I can't tell if the process is working at all. When I copied the example code I was following and ran it, I had the same results.

Questions

I have two questions:

1) Is this the proper way to parallelize this function? Or is there a better/more efficient/faster way?

2) Is there a better/faster/more efficient way to produce all of these combinations?

I can provide more information as needed. Thanks in advance for the help.


Solution

  • Q : 1) Is this the proper way to parallelize this function? Or is there a better/more efficient/faster way?
    Q : 2) Is there a better/faster/more efficient way to produce all of these combinations?

    A great case for demonstrating a true double-trouble Hell inside the Complexity-ZOO :

    As many will have already detected, the actual complexity here grows approximately as O( n! / ( k! ( n - k )! ) ) in [SPACE] (RAM) and, similarly, approximately as O( n! / ( k! ( n - k )! ) ) in [TIME]. Worse, jumps into ~ 100.000 x slower access times are near to come, as soon as the [SPACE] ceases to provide enough RAM for pure in-RAM computing ( one can easily hang a 64-bit O/S just by exhausting its memory management with the ill-designed production and storage of combinatorics data on N-tuples from as few as 100 candidates ).


    While we may, and will, unload as much as possible of the standard python overheads, using hashing tricks and smart iterators and processing set()-s instead of list-manipulations ( a list is expensive here, as it has to maintain order, which combinations do not need ), the RAM consumption will still drive the resulting costs in both the [SPACE] and the [TIME] dimensions.
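
    ( A minimal sketch of that iterator-only idea - assuming the downstream work only needs to consume each combination once, here via a hypothetical score_fn(), and never needs the whole list resident in RAM - keeps the [SPACE] footprint flat : )

    import itertools

    def consume_combinations( item_list, n_items, score_fn ):
        """Stream the combinations one at a time - nothing gets stored,
           so the [SPACE] footprint stays flat, no matter how many tuples exist."""
        best = None
        for combo in itertools.combinations( item_list, n_items ):
            score = score_fn( combo )               # hypothetical per-combination work
            if best is None or score > best[0]:
                best = ( score, combo )             # keep only the running best
        return best

    # ex.: keep just the best-scoring pair, instead of materialising all ~ 5000 of them
    # best = consume_combinations( set( range( 100 ) ), 2, lambda combo: sum( combo ) )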

    Selecting pairs from just 10 candidates takes almost no RAM in [SPACE] and costs ~ 30 [us] in [TIME]:

    >>> from zmq import Stopwatch     # Stopwatch methods {.start()|.stop()} used
    >>> aClk = Stopwatch()            # for benchmarking ~ [us]-resolution timer
    >>> import itertools              # for .combinations()
    >>>
    >>> aSet = set( range( int( 1E1 ) ) )
    >>> aClk.start();_=[i for i in itertools.combinations(aSet,2)];aClk.stop()
    30
    30
    30
    53
    30
    

    Selecting triples, again from 10 candidates, takes almost the same amount of time:

    >>> aSet = set( range( int( 1E1 ) ) )
    >>> aClk.start(); _ = [ i for i in itertools.combinations( aSet, 3 ) ]; aClk.stop()
    56
    54
    53
    52
    51
    

    Now, the costs of moving data into [SPACE] start to matter, and a lot :

    Selecting pairs from about a hundred candidates is still rather cheap, at 1.3 ~ 1.6 [ms] = 1300 ~ 1600 [us], producing almost ~ 5000 legal pairs:

    >>> aSet = set( range( int( 1E2 ) ) )
    >>> aClk.start();_=[i for i in itertools.combinations(aSet,2)];aClk.stop()
    1963
    1591
    1322
    1326
    1547
    1601
    

    Selecting triples goes into the range of about ~ 33 ~ 35 [ms] as it produces and stores about ~ 162.000 legal triples :

    >>> aSet = set( range( int( 1E2 ) ) )
    >>> aClk.start();_=[i for i in itertools.combinations(aSet,3)];aClk.stop()
    34741
    33505
    35580
    35932
    32989
    33434
    

    Selecting quartets will get us to ~ 600 ~ 730 [ms] as it must store all ~ 3.921.225 legal quartets :

    >>> aSet = set( range( int( 1E2 ) ) )
    >>> aClk.start();_=[i for i in itertools.combinations(aSet,4)];aClk.stop()
    592453
    723026
    721529
    729759
    699055
    704901
    

    Selecting 5-tuples from just a hundred candidates will need about 8 GB of RAM ( or devastating swapping will start to appear, at an additional cost of about 100.000 x slower processing compared to the in-RAM case ), still taking some ~ 452.000.000 [us] just to generate and store the ~ 75.287.520 legal 5-tuples of the cheapest data-type ( small int-s ) in about 6.1 GB:

    >>> aSet = set( range( int( 1E2 ) ) )
    >>> aClk.start();_=[i for i in itertools.combinations(aSet,5)];aClk.stop()
    451758912
    

    A CPU-monitor confirms this, showing no more than about 30% - 40% CPU-load. The problem soon turns into a memory-[SPACE]-bound problem, where the [TIME] costs are related more to memory management, memory allocation and mem-I/O than to the nature of the calculations.
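
    ( Since the tuple-count is a closed-form binomial, the [SPACE] bill can be estimated up-front, before ever touching itertools - the ~ 80 bytes-per-tuple figure below is just a rough CPython assumption, not a measured constant : )

    import math

    def combo_budget( n_candidates, k, bytes_per_tuple = 80 ):
        """Return ( tuple-count, rough RAM estimate in [GB] ) for C( n, k ),
           so one can refuse to materialise a list that will never fit in RAM."""
        count = math.comb( n_candidates, k )
        return count, count * bytes_per_tuple / 1E9

    print( combo_budget( 100, 5 ) )   # ( 75287520, ~6.0 ) - close to the ~ 6.1 GB observed above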

    If, in this situation, one tries to instantiate a few ( the more, the worse ) [CONCURRENT] processes to share the work, things only get worse. A thread-based attempt will not help here, due to python's central GIL-lock, which is known to re-[SERIAL]-ise all threaded work; that prevents any form of concurrent processing, and only corner cases, where latency-masking may help, will ever generate a positive reward for the effort of going multi-threaded in standard python. On top of that, beware all the multiprocessing-related costs of instantiation - again, the [SPACE]-related and the [TIME]-related costs matter, multiplied by how many instances one tries to spawn ( and how ).
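
    ( If one still wants to see those instantiation + transfer add-on costs in hard numbers, a small A/B harness like this sketch - the pure-[SERIAL] itertools baseline v/s a 4-process Pool doing the very same per-size work - will show them on any box; the exact figures will differ per platform : )

    import itertools, multiprocessing as mp, time

    def make_all_combinations_n_items( item_list, n_items ):
        return list( itertools.combinations( item_list, n_items ) )

    def bench( item_list ):
        sizes = range( 1, len( item_list ) + 1 )

        t0 = time.perf_counter()                 # pure-[SERIAL] baseline
        _  = [ make_all_combinations_n_items( item_list, k ) for k in sizes ]
        t1 = time.perf_counter()

        with mp.Pool( 4 ) as pool:               # 4-process attempt, paying spawn + pickling costs
            _ = pool.starmap( make_all_combinations_n_items,
                              [ ( item_list, k ) for k in sizes ] )
        t2 = time.perf_counter()

        print( "pure-[SERIAL]  %.6f [s]" % ( t1 - t0 ) )
        print( "Pool( 4 )      %.6f [s]" % ( t2 - t1 ) )

    if __name__ == '__main__':                   # mandatory guard for spawn-based O/S-es
        bench( list( range( 20 ) ) )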


    If one tries to split the work ( which may seem attractive ), so as to reduce the problem of generating N-tuples to a coordinated generation of (N-1)-tuples over 4 CPU-cores in a just-[CONCURRENT] pipeline fashion, the expected acceleration ( from going to (N-1)-tuple generation instead of full-scale N-tuples ) gets eaten by the immense add-on costs of re-assembling the partial results, which still arrive at a remarkable [SPACE]-size from the just-[CONCURRENT] pipeline; those memory-management costs are expensive to pay just for re-joining the partial results, while many still-to-run, memory-bound processing tasks wait to get processed through the artificially split-join, 4-CPU-core orchestrated work-flow.
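
    ( For completeness, a sketch of what such a split-by-fixed-head / re-join work-flow could look like - each worker produces the (k-1)-tuples for one fixed head, and the main process pays the pickling + re-joining bill at the end; it is illustrative only and, per the argument above, will typically lose to the plain pure-[SERIAL] itertools call : )

    import itertools, multiprocessing as mp

    def combos_with_fixed_head( args ):
        """Worker: every k-tuple that starts with items[ i ] - i.e. the (k-1)-tuples
           drawn from the tail, re-assembled with the fixed head prepended."""
        items, i, k = args
        head, tail  = items[ i ], items[ i + 1 : ]
        return [ ( head, ) + rest for rest in itertools.combinations( tail, k - 1 ) ]

    def parallel_combinations( items, k, n_procs = 4 ):
        with mp.Pool( n_procs ) as pool:
            parts = pool.map( combos_with_fixed_head,
                              [ ( items, i, k ) for i in range( len( items ) ) ] )
        return [ combo for part in parts for combo in part ]   # the re-joining step: every
                                                               # tuple gets pickled back here

    if __name__ == '__main__':
        assert ( parallel_combinations( list( range( 10 ) ), 3 )
                 == list( itertools.combinations( range( 10 ), 3 ) ) )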

    A few hard facts, demonstrated above on N-tuples created from as few as 100 candidates, are a great opportunity to test any better approach, if there is one, that could beat this double-trouble in the [SPACE]-and-[TIME]-bound, factorially-scaled school-book problem.

    Best regards from me to your Computing Science professor for making you go into this corner of the Complexity-ZOO Hell ... and making you get your hands dirty.


    RESUME :

    While for small sizes ( pairs, triples ) both the [SPACE] and the [TIME] costs, as shown above for the pure-[SERIAL] processing, are so small, if not negligible, they will never justify an attempt to spawn 4 new [CONCURRENT] processes ( at remarkable add-on costs ) to harness all 4 CPU-cores, plus the additional expensive add-on costs of re-assembling the results from the "distributed" processes back into the main python process.

    This will never justify the lump sum of all the add-on costs.

    Smart, vectorised code, like numpy, can "cooperate" with multi-core productivity in non-colliding operations, yet at close-to-zero add-on costs, because it does not instantiate separate processes and transfer jobs to them for "remote" execution; instead it uses low-level, compiled functions that work across cores without ever transferring any data to remote processes, letting the other CPU-cores compute independently, right on multiple regions of the original data, exploiting smart-aligned, cache-related benefits of working inside pre-fetched, coherent memory blocks.
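
    ( A tiny example of that vectorised, zero-process-spawning route - the index-pairs of all C( n, 2 ) combinations come out of a single numpy call, np.triu_indices, without ever pickling anything to a remote process; higher k-s need a different construction, so take it as a sketch of the principle only : )

    import numpy as np

    n = 100
    i, j  = np.triu_indices( n, k = 1 )       # all index pairs with i < j, computed in C
    pairs = np.stack( ( i, j ), axis = 1 )    # shape ( 4950, 2 ) - the C( 100, 2 ) pairs at once
    print( pairs.shape )                      # ( 4950, 2 )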

    At larger sizes, where the computing might somewhat enjoy a split of the work, the immense [SPACE]-related costs of standard python code will soon devastate any heavily sub-linear speedup, due to the add-on costs - mostly of re-assembling the results back into the main python process - while all 4 [CONCURRENT] work-units devastatingly compete for a shared pool of resources ( the RAM ) during their split-computing "ride", again at an immense adverse cost of memory management ( not to mention swap-originated system suffocation, which may easily hang up all the O/S efforts to fair-trade resources among all the asking processes while obeying all the system-designed priority schedules ).


    Last but not least - a BONUS part :

    For syntax, check this.

    For ideas on performance tuning, check this.

    For benchmarking the add-on costs of transferring data to/from spawned processes, check this ( the benchmarking facts were surprisingly hated there, so do not get distracted by the hysteria of minus-votes; the benchmarking code for cases B, C and D is the important value to re-use ).

    For even more details about the proper accounting of all the add-on costs of instantiating more processes, about the ( parameters-there + results-back ) communication costs, and about the impact of atomicity-of-work on the final, realistically achievable speedups, do not miss, and feel free to read about, the contemporary criticism of the overhead-naive Amdahl's Law.