I have a single task to complete X number of times in Python, and I will be using LSF to speed that up. Is it better to submit a job containing several Python scripts which can be run separately in parallel, or one Python script that uses the multiprocessing module?
My issue is that I don't trust LSF to know how to split the Python code into several processes (I'm not sure how LSF does this). However, I also don't want several Python scripts floating around, as that seems inefficient and disorganized.
The task at hand involves parsing six very large ASCII files and saving the output in a Python dict for later use. I want to parse the six files in parallel (they take about 3 minutes each). Does LSF allow Python to tell it something like "Hey, here's one script, but you're going to split it into these six processes"? Does LSF need Python to tell it that or does it already know how to do that?
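Here's a simplified sketch of what I'm doing now, sequentially (the file names and the parsing itself are placeholders for the real logic):

```python
# Simplified, sequential version of the task.
# parse_file() and the file names are placeholders.

def parse_file(path):
    """Parse one large ASCII file into a dict (placeholder logic)."""
    result = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.partition(" ")
            result[key] = value.strip()
    return result

paths = ["data1.txt", "data2.txt", "data3.txt",
         "data4.txt", "data5.txt", "data6.txt"]

# Each file takes about 3 minutes, so this runs ~18 minutes serially.
parsed = {path: parse_file(path) for path in paths}
```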
Let me know if you need more info. I have trouble balancing between "just enough" and "too much" background.
One (very simplified) way to think of LSF is as a system that launches a process and lets the process know how many cores (potentially on different hosts) have been allocated to it. LSF can't prevent your program from doing something stupid (for example, if multiple instances of it run at the same time, one instance can overwrite the other's output).
Some common ways of using LSF:

1. Submit six separate jobs, one per input file. LSF dispatches them in parallel as cores become available; each job must write its output somewhere different so the results don't clobber each other.
2. Submit a job array (e.g. `bsub -J "parse[1-6]"`). Each element of the array runs the same script and reads `$LSB_JOBINDEX` from its environment to decide which file to parse.
3. Submit one parallel job that requests six cores (`bsub -n 6`) and use the multiprocessing module inside it. The environment variable `$LSB_MCPU_HOSTS` will tell you how many cores are assigned to the job, so you know how big to make the pool.

I'm not sure which of these 3 ways best fits your needs, but I hope the explanation helps you decide.
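If the third option fits, here's a minimal sketch of sizing a multiprocessing pool from the LSF allocation. It assumes the job was submitted with something like `bsub -n 6 -R "span[hosts=1]"` so all six cores land on the same host (the multiprocessing module can't reach cores on other hosts), and `parse_file()` is a placeholder for the real parsing logic:

```python
# Sketch: one LSF job, six cores, one multiprocessing pool.
# Assumes submission like: bsub -n 6 -R "span[hosts=1]" python parse_all.py
import os
from multiprocessing import Pool

def parse_file(path):
    """Parse one large ASCII file into a dict (placeholder logic)."""
    result = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.partition(" ")
            result[key] = value.strip()
    return result

def allocated_cores():
    """Sum the core counts in $LSB_MCPU_HOSTS.

    LSF sets it to alternating host names and counts,
    e.g. "hostA 4 hostB 2". Fall back to 1 outside LSF.
    """
    fields = os.environ.get("LSB_MCPU_HOSTS", "").split()
    return sum(int(n) for n in fields[1::2]) if fields else 1

if __name__ == "__main__":
    paths = ["data1.txt", "data2.txt", "data3.txt",
             "data4.txt", "data5.txt", "data6.txt"]
    with Pool(processes=allocated_cores()) as pool:
        parsed = dict(zip(paths, pool.map(parse_file, paths)))
```

Sizing the pool from the environment variable instead of hard-coding 6 means the same script still behaves sensibly if you later request a different number of cores.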