The Problem
I need to create a list of combinations of items from a master list of length n. With a small number of items in the master list, this can be done quickly without parallelization. However, when I try to use the multiprocessing library to parallelize the generation of the list, it seemingly takes longer. Here's an example:
Pretend I have a list of items (I'll use ice cream flavors) called item_list:

item_list = ['chocolate', 'strawberry', 'vanilla', 'pineapple']

I want to generate all of the combinations of the flavors, with combinations of size n_items, so I wrote a function to do this:
import itertools

def make_all_combinations_n_items(item_list, n_items):
    out = []
    for combo in itertools.combinations(item_list, n_items):
        out.append(combo)
    return out
And if I call it for size n_items = 2, it produces the list of tuples quickly:

make_all_combinations_n_items(item_list, 2)
I know that when the number of items grows, the number of possible combinations grows as 2^n, and I want to parallelize this process to generate the combinations faster (essentially have each core work on a different value of n_items). I have 4 cores available.
To try to parallelize it, I used the multiprocessing library, guided by this example, as follows:

import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())

new_combos = [pool.apply(
                  make_all_combinations_n_items,
                  args = (item_list, n_items))
              for n_items in range(1, len(item_list) + 1)
             ]

pool.close()
The process doesn't happen nearly as quickly, and in fact, I can't tell if the process is working at all. When I copied the example code I was following and ran it, I had the same results.
Questions
I have two questions:
1) Is this the proper way to parallelize this function? Or is there a better/more efficient/faster way?
2) Is there a better/faster/more efficient way to produce all of these combinations?
I can provide more information as needed. Thanks in advance for the help.
Q : 1) Is this the proper way to parallelize this function? Or is there a better/more efficient/faster way?
Q : 2) Is there a better/faster/more efficient way to produce all of these combinations?
All might have already detected that the actual complexity here grows approx. O( n! / ( k! * ( n - k )! ) ) in [SPACE] ( RAM ) and similarly approx. O( n! / ( k! * ( n - k )! ) ) in [TIME], but jumps into ~ 100.000 x slower access times are near to come, as the [SPACE] will soon cease to provide RAM enough for pure in-RAM computing ( one can easily hang a 64-bit O/S just by exhausting its memory-management with the ill-designed producing and storing of combinatorics data on N-tuples from as few as 100-Candidates ).
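To see how fast those counts explode, a minimal sketch ( assuming Python 3.8+ for math.comb; the per-tuple byte estimate is a rough CPython approximation, not a measurement ) that tallies the number of legal k-tuples from 100-Candidates and a lower bound on the [SPACE] needed to store them all:

from math import comb                            # C( n, k ) = n! / ( k! * ( n - k )! )

N_CANDIDATES = 100                               # the 100-Candidates case used below

for k in range( 1, 6 ):
    n_tuples = comb( N_CANDIDATES, k )           # how many legal k-tuples exist at all
    # rough lower bound: a 64-bit CPython tuple of k small int-s costs
    # about ( 56 + 8 * k ) [B]; the outer list's pointer array is not counted
    approx_GB = n_tuples * ( 56 + 8 * k ) / 1E9
    print( k, n_tuples, round( approx_GB, 3 ), "GB ( lower bound )" )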
While we may, and will, unload as much as possible of the standard python overheads, using hashed-tricks and smart-iterators and processing set()-s instead of list-manipulations ( a list is expensive here, as it has to maintain an order, which combinations do not ), still the RAM consumption will drive the resulting costs in both the [SPACE] and the [TIME] dimensions.
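One possible reading of the smart-iterators remark, as a minimal sketch: itertools.combinations() is itself a lazy iterator, so wherever the downstream processing can consume the k-tuples one at a time, the [SPACE] cost of materialising the full list can be avoided altogether ( the per-tuple generation [TIME] remains, of course ):

import itertools

# streaming consumption: counting ( or filtering, or reducing ) the k-tuples
# one at a time keeps the [SPACE] footprint at O( k ) per tuple, instead of
# storing all of them in a list at once
n_triples = sum( 1 for _ in itertools.combinations( range( 100 ), 3 ) )
print( n_triples )                               # 161700 legal triples, no list kept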
Selecting pairs from just 10 candidates takes almost no RAM in [SPACE] and costs ~ 30 [us] in [TIME]:
>>> from zmq import Stopwatch # Stopwatch methods {.start()|.stop()} used
>>> aClk = Stopwatch() # for benchmarking ~ [us]-resolution timer
>>> import itertools # for .combinations()
>>>
>>> aSet = set( range( int( 1E1 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,2)];aClk.stop()
30
30
30
53
30
Selecting triples, again from 10 candidates, takes almost the same:
>>> aSet = set( range( int( 1E1 ) ) )
>>> aClk.start(); _ = [ i for i in itertools.combinations( aSet, 3 ) ]; aClk.stop()
56
54
53
52
51
From about a hundred candidates on, the [SPACE] costs start to matter, and a lot: selecting pairs is still rather cheap, 1.3 ~ 1.6 [ms] = 1300 ~ 1600 [us], producing almost ~ 5000 legal pairs:
>>> aSet = set( range( int( 1E2 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,2)];aClk.stop()
1963
1591
1322
1326
1547
1601
Selecting triples goes into the range of ~ 33 ~ 35 [ms], as it produces and stores about ~ 162.000 legal triples:
>>> aSet = set( range( int( 1E2 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,3)];aClk.stop()
34741
33505
35580
35932
32989
33434
Selecting quartets will get us to ~ 600 ~ 730 [ms], as it must store all of the ~ 3.921.225 legal quartets:
>>> aSet = set( range( int( 1E2 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,4)];aClk.stop()
592453
723026
721529
729759
699055
704901
Selecting 5-tuples from just a hundred candidates will need about 8 GB of RAM ( or devastating swaps will start to appear, at an additional cost of about 100.000 x slower processing compared to the in-RAM case ), still taking some ~ 452.000.000 [us] to generate and store the ~ 75.287.520 legal 5-tuples of the cheapest data-type, small int-s, in about 6.1 GB:
>>> aSet = set( range( int( 1E2 ) ) )
>>> aClk.start();_=[i for i in itertools.combinations(aSet,5)];aClk.stop()
451758912
A CPU-monitor confirms this, showing not more than about 30% - 40% CPU-loads. The problem soon turns into a memory-[SPACE]-bound problem, where the [TIME] costs are related more to the memory management, memory allocation and mem-I/O than to the nature of the calculations.
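A minimal sketch of how one might verify that memory-bound claim on a smaller case ( sys.getsizeof() figures are CPython implementation details, so treat the numbers as indicative only ):

import itertools, sys

aList = [ t for t in itertools.combinations( range( 100 ), 3 ) ]      # ~ 162.000 triples

bytes_total  = sys.getsizeof( aList )                                 # the list's pointer-array alone
bytes_total += sum( sys.getsizeof( t ) for t in aList )               # plus each stored 3-tuple object

print( len( aList ), "tuples ~", round( bytes_total / 1E6, 1 ), "MB" )
# each extra tuple element adds ~ 8 [B]; scaling the per-tuple cost up to the
# ~ 75.287.520 legal 5-tuples lands well into the multi-GB range, where swapping
# ( ~ 100.000 x slower access ) starts to rule the [TIME] costs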
If, in this case, one tried to instantiate a few ( the more, the worse ) [CONCURRENT] processes to do the work, it would not help much. A thread-based attempt will not help at all here, due to the python central GIL-lock, which is known to re-[SERIAL]-ise all threaded work; that actually prevents any form of concurrent processing, and only the corner-cases, where latency-masking may help, will ever generate a positive reward for the effort spent on going multi-threaded in standard python. On top of that, beware all the multiprocessing-related costs of instantiation: again, the [SPACE]-related and the [TIME]-related costs matter, multiplied by how many instances one tries to spawn ( and how ).
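A minimal sketch of how one might measure those instantiation plus parameter-/result-transfer add-on costs on one's own box, before spawning anything in production ( standard library only; the absolute numbers will differ per platform and per python version ):

import itertools, multiprocessing as mp, time

def make_all_combinations_n_items( item_list, n_items ):
    return list( itertools.combinations( item_list, n_items ) )

if __name__ == "__main__":
    args = ( list( range( 100 ) ), 3 )

    t0 = time.perf_counter()                     # pure-[SERIAL], in-process baseline
    _  = make_all_combinations_n_items( *args )
    t1 = time.perf_counter()

    t2 = time.perf_counter()                     # spawn + pickle args there + pickle ~ 162.000 tuples back
    with mp.Pool( 1 ) as pool:
        _ = pool.apply( make_all_combinations_n_items, args )
    t3 = time.perf_counter()

    print( "in-process    :", round( ( t1 - t0 ) * 1E6 ), "[us]" )
    print( "via 1 process :", round( ( t3 - t2 ) * 1E6 ), "[us]  ( add-on costs included )" )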
If one tried to split the work ( which may seem attractive ) so as to reduce the N-tuples problem into just a coordinated generation of (N-1)-tuples over the 4-CPU-cores, in a just-[CONCURRENT]-pipeline fashion, the expected acceleration ( from generating (N-1)-tuples instead of the full-scale N-tuples ) gets eaten by the immense add-on costs of managing the re-assembly of the partial results, which still arrive at a remarkable [SPACE]-size from that just-[CONCURRENT] pipeline. The memory-management costs of just re-joining those partial results are expensive to pay, while many more memory-bound processing tasks are still waiting to get through the artificially split-and-joined, 4-CPU-cores orchestrated work-flow.
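For completeness, a minimal sketch of what such a split-join orchestration might look like ( each worker pins a leading candidate and generates the (N-1)-tuples of the remaining tail; the chunking scheme and the final re-join below are illustrative assumptions, and that last list(...) re-assembly is exactly where the add-on [SPACE] and [TIME] costs pile up ):

import itertools, multiprocessing as mp

def tuples_with_fixed_head( task ):
    # all N-tuples that start with `head`, built as head + (N-1)-tuple of the tail
    head, tail, n_items = task
    return [ ( head, ) + rest
             for rest in itertools.combinations( tail, n_items - 1 ) ]

if __name__ == "__main__":
    candidates = list( range( 100 ) )
    N_ITEMS    = 3

    # one task per leading candidate; combinations are emitted in sorted order,
    # so each tail holds only the candidates that come *after* the head ( no duplicates )
    tasks = [ ( candidates[ i ], candidates[ i + 1 : ], N_ITEMS )
              for i in range( len( candidates ) ) ]

    with mp.Pool( 4 ) as pool:
        partials = pool.map( tuples_with_fixed_head, tasks )

    # the re-assembly step: every partial list gets pickled back and re-joined here,
    # paying the very [SPACE]- and [TIME]-costs discussed above
    all_triples = list( itertools.chain.from_iterable( partials ) )
    print( len( all_triples ) )                  # 161700, identical to the pure-[SERIAL] result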
A few hard facts, demonstrated above on N-tuples created from as few as 100-Candidates, are a great opportunity to test any better approach, if there is one, against the double trouble of this [SPACE]-and-[TIME]-bound, factorial-scaled school-book problem.
Best regards from me to your Computing Science professor for making you go into this corner of the Complexity-ZOO Hell ... and for making you get your hands dirty.
For the small sizes ( pairs, triples ), both the [SPACE] and the [TIME] costs shown above for the pure-[SERIAL] processing are so small, if not negligible, that they will never justify an attempt to spawn 4 new [CONCURRENT] processes ( at remarkable add-on costs ) to harness all 4-CPU-cores, and next to accommodate the additional, expensive add-on costs of re-assembling the results from the "distributed" processes back into the main python process.
The parallel attempt will never repay the lump sum of all these add-on costs.
Smart, vectorised code, like numpy, can "cooperate" with multi-core productivity in non-colliding operations, yet at close-to-zero add-on costs, because it does not instantiate separate processes to which the jobs must be transferred for "remote" execution. Instead it uses low-level, compiled functions that work on multiple cores without ever transferring any data to remote processes, letting the other CPU-cores smart-compute independently, right on multiple regions of the original data, and enjoying the smart-aligned, cache-related benefits of working inside pre-fetched, coherent memory blocks.
For the larger sizes, where the computing might somewhat enjoy a split of the work, the immense [SPACE]-related costs of the standard python code will soon devastate any heavily sub-linear speedup, because of the add-on costs, mostly of re-assembling the results back into the main python process, plus all 4 [CONCURRENT] work-units will devastatingly compete for the shared pool of resources ( the RAM ) during their split-computing "ride", again at an immense adverse cost in memory management ( not to mention the swap-originated system suffocation, which may easily hang up all the O/S efforts to fair-trade resources among all the asking processes and to obey all the system-designed priority schedules ).
For syntax, check this.
For ideas on performance tuning, this.
For benchmarking the add-on costs of transferring data to/from the spawned processes, this ( the benchmarking facts were surprisingly hated there, so do not get distracted by the hysteria of minus-votes; the benchmarking code for cases B, C and D is the important value to re-use ).
For even more details about the proper accounting of all the add-on costs of instantiating more processes, about the ( parameters-there + results-back ) communication costs, and about the impact of the atomicity-of-work on the finally, realistically achievable speedups, do not miss, and feel free to read about, the contemporary criticism of the overhead-naive Amdahl's law.
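As a minimal sketch of that overhead-aware re-formulation ( the exact formula below is a simplified model assumed for illustration, not a quote of any particular source ): with a parallelisable fraction p, N workers and the add-on overheads expressed as a fraction o of the original [SERIAL] run-time, the achievable speedup gets capped far below the naive, overhead-free prediction:

def speedup( p, N, o ):
    # p : parallelisable fraction of the original [SERIAL] work, 0.0 .. 1.0
    # N : number of workers ( CPU-cores / processes )
    # o : add-on overheads ( spawn + SER/DES + re-assembly ), as a fraction
    #     of the original [SERIAL] run-time -- the naive law assumes o == 0
    return 1.0 / ( ( 1.0 - p ) + p / N + o )

print( speedup( p = 0.95, N = 4, o = 0.00 ) )   # ~ 3.48  ( naive, overhead-free )
print( speedup( p = 0.95, N = 4, o = 0.50 ) )   # ~ 1.27  ( add-on costs included )
print( speedup( p = 0.95, N = 4, o = 2.00 ) )   # ~ 0.44  ( a net slow-down )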