
LSF: Submit one Python script that uses the multiprocessing module *or* submit several "pre-split" scripts at once?


I have a single task to complete X number of times in Python, and I will be using LSF to speed it up. Is it better to submit a job containing several Python scripts that can run separately in parallel, or one Python script that uses the multiprocessing module?

My issue is that I don't trust LSF to know how to split up the Python code into several processes (I'm not sure how, or whether, LSF does this). However, I also don't want several Python scripts floating around, as that seems inefficient and disorganized.

The task at hand involves parsing six very large ASCII files and saving the output in a Python dict for later use. I want to parse the six files in parallel (they take about 3 minutes each). Does LSF allow Python to tell it something like "Hey, here's one script, but you're going to split it into these six processes"? Does LSF need Python to tell it that or does it already know how to do that?

Let me know if you need more info. I have trouble balancing between "just enough" and "too much" background.


Solution

  • One (very simplified) way to think of LSF is as a system that launches a process and tells that process how many cores (potentially on different hosts) have been allocated to it. LSF can't prevent your program from doing something stupid (for example, running multiple instances at the same time so that one instance overwrites another's output).

    Some common ways of using LSF:

    • Run 6 sequential jobs, each processing one file. These 6 can run in parallel. Add a dependent seventh job that runs after the previous 6 finish and combines their output into a single result.
    • Run a parallel job that is assigned 6 cores on a single host. The Python multiprocessing module fits in well here. The environment variable $LSB_MCPU_HOSTS lists each allocated host and the number of cores on it, so you know how big to make the pool.
    • Run a parallel job that is assigned 6 cores spread across multiple hosts. Here your process must be able to start itself on those other hosts (or use blaunch to help out).
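A minimal sketch of the single-host multiprocessing approach, assuming `LSB_MCPU_HOSTS` has its usual `host ncores [host ncores ...]` format; the parser is a toy, and generated temp files stand in for the real six ASCII files:

```python
import os
from multiprocessing import Pool

def ncores_from_lsf(default=1):
    """Sum the core counts from $LSB_MCPU_HOSTS ("hostA 6" or
    "hostA 4 hostB 2"); fall back to `default` when not under LSF."""
    fields = os.environ.get("LSB_MCPU_HOSTS", "").split()
    return sum(int(n) for n in fields[1::2]) if fields else default

def parse_file(path):
    """Toy parser: map line number -> stripped line for one file."""
    with open(path) as f:
        return {i: line.strip() for i, line in enumerate(f)}

if __name__ == "__main__":
    # Hypothetical stand-ins for the six large ASCII files.
    import pathlib, tempfile
    tmpdir = pathlib.Path(tempfile.mkdtemp())
    files = []
    for i in range(6):
        p = tmpdir / f"data{i}.txt"
        p.write_text("alpha\nbeta\n")
        files.append(str(p))

    # Size the pool from the LSF allocation and parse in parallel.
    with Pool(ncores_from_lsf(default=2)) as pool:
        results = pool.map(parse_file, files)

    # One combined dict, keyed by file path, for later use.
    combined = dict(zip(files, results))
```

Off-cluster the pool just falls back to the default size, so the same script runs locally and under `bsub -n 6`.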

    I'm not sure which of these 3 ways best fits your needs, but I hope the explanation helps you decide.
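    For the first option, the submissions can be scripted. This sketch only builds the `bsub` command lines rather than running them; the job names `parse0`..`parse5` and the scripts `parse_one.py` / `combine.py` are hypothetical, and it relies on LSF's `-J` job-name flag and `-w "done(...)"` dependency expressions (which accept a trailing wildcard on the job name):

```python
def lsf_commands(paths):
    """Build bsub command lines: one job per input file, plus a
    dependent combine job that waits for all of them via done(parse*)."""
    cmds = [
        ["bsub", "-J", f"parse{i}", "python", "parse_one.py", path]
        for i, path in enumerate(paths)
    ]
    # The combine job starts only after every parse job finishes.
    cmds.append(["bsub", "-w", "done(parse*)", "python", "combine.py"])
    return cmds
```

    On a cluster you would then submit each command, e.g. `for cmd in lsf_commands(files): subprocess.run(cmd, check=True)`.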