I have workers and tasks to do:
workers = ['peter', 'paul', 'mary']
tasks = range(13)
Now I want to split the tasks into chunks or batches of work, so that each worker gets one batch and does about the same amount of work as everybody else. In real life I want to schedule batch jobs on a compute farm, where they are supposed to run in parallel. The actual scheduling and dispatch is done by a commercial-grade tool such as LSF or grid.
Some examples of what I would expect:
>>> distribute_work(['peter', 'paul', 'mary'], range(3))
[('peter', [0]), ('paul', [1]), ('mary', [2])]
>>> distribute_work(['peter', 'paul', 'mary'], range(6))
[('peter', [0, 3]), ('paul', [1, 4]), ('mary', [2, 5])]
>>> distribute_work(['peter', 'paul', 'mary'], range(5))
[('peter', [0, 3]), ('paul', [1, 4]), ('mary', [2])]
This question is very similar to the questions here, here, and here.
The difference is that I want these features, in order of precedence:
len, if possible no build-up of long data structures internally
Some side notes on requirements:
I have tried to wrap my head around itertools and this particular problem and came up with the following code to illustrate the question:
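from itertools import cycle, groupby

def distribute_work(workers, tasks):
    batches = range(len(workers))
    # Tag each task with a batch index, dealing them out round-robin.
    tagged = zip(cycle(batches), tasks)
    # Sort so that each batch index forms one contiguous run for groupby.
    by_batch = sorted(tagged, key=lambda t: t[0])
    return [(workers[k], [t[1] for t in group])
            for k, group in groupby(by_batch, key=lambda t: t[0])]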
This satisfies 4, but the sort very likely violates 1, and 2 and 3 are not even considered.
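To illustrate why the sort is there at all: groupby only merges adjacent items with equal keys, it does not collect equal keys from across the whole input. A small sketch of that behaviour on the same kind of (batch, task) pairs; the helper name group_pairs is made up for this illustration:

```python
from itertools import cycle, groupby

def group_pairs(pairs):
    """Group (batch, task) pairs by batch, in the order groupby sees them."""
    return [(k, [task for _, task in group])
            for k, group in groupby(pairs, key=lambda p: p[0])]

# Five tasks dealt round-robin onto three batch indices.
pairs = list(zip(cycle(range(3)), range(5)))

unsorted_groups = group_pairs(pairs)
sorted_groups = group_pairs(sorted(pairs, key=lambda p: p[0]))

print(unsorted_groups)  # batches 0 and 1 each appear as two separate groups
print(sorted_groups)    # one group per batch
```

Without the sort, every pair ends up in a group of its own, because consecutive pairs never share a batch index.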
Probably there's some easy solution to this, combining some stdlib components in a way I haven't thought of. But maybe not. Any takers?
I think you want to use multiprocessing.Pool.imap to handle your workers and allocate their jobs. I believe it does everything you want.
import multiprocessing

jobs = some_generator()                 # placeholder: any iterable or generator of jobs
pool = multiprocessing.Pool(3)          # set number of workers here
results = pool.imap(process_job, jobs)  # returns an iterator over the results
for r in results:                       # loop will block until results arrive
    do_something(r)
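A self-contained variant of the same idea might look like the sketch below. The names process_job and run_jobs and the squaring workload are stand-ins I made up; the pool_cls parameter is only there so that the thread-based multiprocessing.pool.ThreadPool, which offers the same imap interface, can be swapped in for a quick try-out.

```python
import multiprocessing
from multiprocessing.pool import ThreadPool

def process_job(job):
    """Hypothetical stand-in for the real work: square the job number."""
    return job * job

def run_jobs(jobs, n_workers=3, pool_cls=multiprocessing.Pool):
    """Feed jobs to a pool of n_workers and return results in input order."""
    with pool_cls(n_workers) as pool:
        # Consume the results inside the with-block: leaving it terminates the pool.
        return list(pool.imap(process_job, jobs))
```

Called from a script as run_jobs(n for n in range(13)), ideally under an if __name__ == "__main__": guard on platforms that spawn rather than fork, this should hand the jobs to three worker processes; run_jobs(range(13), pool_cls=ThreadPool) does the same with threads.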
If the order of the results doesn't matter for your application, you can also use imap_unordered.
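If what you need is the static split itself, to hand the batches to an external scheduler rather than to an in-process pool, a sort-free round-robin could be sketched as follows. This is only one possible reading of the requirements: it never asks for the length of tasks and accepts a generator, but it does still hold every batch in memory as a list.

```python
from itertools import cycle

def distribute_work(workers, tasks):
    """Deal tasks to workers round-robin; tasks may be any iterable."""
    batches = [(worker, []) for worker in workers]
    # zip stops as soon as tasks runs out, so cycle never loops forever.
    for (_, batch), task in zip(cycle(batches), tasks):
        batch.append(task)
    # Drop workers that received nothing, matching the groupby version.
    return [(worker, batch) for worker, batch in batches if batch]

print(distribute_work(['peter', 'paul', 'mary'], range(5)))
```

The result has the same shape as in the examples from the question, with each worker paired with its list of tasks.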