python concurrency parallel-processing multiprocessing scientific-computing

Harnessing the power of highly parallel computers with Python scientific code

I run into the following problem when writing scientific code with Python:

Usually you write the code iteratively, as a script that performs some computation.
Finally, it works; now you wish to run it with multiple inputs and parameters and find it takes too much time.
Recalling that you work for a fine academic institute and have access to ~100-CPU machines, you are puzzled about how to harness this power. You start by preparing small shell scripts that run the original code with different inputs, and you run them manually.

Being an engineer, I know all about the right architecture for this (work items queued, worker threads or processes, and work results queued and written to a persistent store), but I don't want to implement this myself. The most problematic issue is the need for reruns due to code changes or temporary system issues (e.g. out-of-memory).
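For concreteness, here is a minimal sketch of the architecture I mean. It uses threads only to keep the example short (CPU-bound Python code would need processes instead), and `compute`, `worker`, `run_all` and the one-JSON-object-per-line results file are hypothetical names chosen for illustration, not part of any existing framework.

```python
import json
import queue
import threading


def compute(item):
    # Hypothetical stand-in for the real scientific computation.
    return item * item


def worker(work_q, result_q):
    # Pull work items until a None sentinel arrives.
    while True:
        item = work_q.get()
        if item is None:
            break
        try:
            result_q.put({"input": item, "output": compute(item)})
        except Exception as exc:
            # Report the failure as a result instead of killing the worker.
            result_q.put({"input": item, "error": repr(exc)})


def run_all(items, results_path, n_workers=4):
    # items must be a list, because its length tells us how many results to expect.
    work_q, result_q = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(work_q, result_q))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:
        work_q.put(item)
    for _ in threads:
        work_q.put(None)  # one sentinel per worker
    # Persist each result as soon as it arrives, one JSON object per line.
    with open(results_path, "a") as out:
        for _ in range(len(items)):
            out.write(json.dumps(result_q.get()) + "\n")
            out.flush()
    for t in threads:
        t.join()
```

This covers the queueing part only; it does nothing about reruns or about spreading work over several machines, which is exactly the part I would rather not write myself.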

I would like to find a framework to which I provide the desired inputs (e.g. a file with one line per run), after which I can simply start multiple instances of a framework-provided agent that runs my code. If something goes wrong with a run (e.g. a temporary system issue or an exception thrown because of a bug), I will be able to delete the results and run some more agents. If I take too many resources, I will be able to kill some agents without fear of data inconsistency, and other agents will pick up the work items when they find the time.
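To illustrate the behaviour I am after, here is a rough sketch of such an agent built on nothing but a shared directory. It assumes a file system where exclusive file creation and renaming are atomic (this may not hold on every network file system), and `compute`, `run_agent` and the `claims/` and `results/` layout are hypothetical names invented for this example.

```python
import os
import socket


def compute(line):
    # Hypothetical stand-in for the real computation on one input line.
    return line.upper()


def run_agent(inputs_path, work_dir):
    """Make one pass over inputs_path, processing lines nobody has claimed or finished."""
    claims = os.path.join(work_dir, "claims")
    results = os.path.join(work_dir, "results")
    os.makedirs(claims, exist_ok=True)
    os.makedirs(results, exist_ok=True)
    with open(inputs_path) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    done = 0
    for idx, line in enumerate(lines):
        result_file = os.path.join(results, f"{idx}.txt")
        claim_file = os.path.join(claims, f"{idx}.claim")
        if os.path.exists(result_file):
            continue  # already finished by some agent
        try:
            # O_CREAT | O_EXCL fails if the file exists, so only one agent wins.
            fd = os.open(claim_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # another agent is working on this item
        with os.fdopen(fd, "w") as claim:
            claim.write(f"{socket.gethostname()}:{os.getpid()}\n")
        try:
            if os.path.exists(result_file):
                continue  # finished by another agent just before we claimed it
            output = compute(line)
            tmp = f"{result_file}.tmp{os.getpid()}"
            with open(tmp, "w") as out:
                out.write(output + "\n")
            os.replace(tmp, result_file)  # atomic: the result is complete or absent
            done += 1
        except Exception as exc:
            print(f"item {idx} failed: {exc!r}")  # leave it for a later rerun
        finally:
            os.remove(claim_file)  # release the claim, whether we succeeded or not
    return done
```

In this sketch, deleting a file under `results/` and starting another agent reruns that item. An agent that is killed outright leaves a stale claim file behind (it records the host and process id), which has to be removed by hand before the item is picked up again; a real framework would presumably handle that with timeouts or heartbeats.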

Is there an existing solution? Does anyone wish to share code that does just that? Thanks!

Solution

First of all, I would like to stress that the problem Uri described in his question is indeed faced by many people doing scientific computing. It may not be easy to see if you work with a mature code base that has a well-defined scope, where things do not change as fast as they do in scientific computing or data analysis. This page has an excellent description of why one would like to have a simple solution for parallelizing pieces of code.

So, this project is a very interesting attempt to solve the problem. I have not tried using it myself yet, but it looks very promising!