I was granted access to a high-performance computing system to run some machine learning experiments. The system has IBM LSF 10.1 installed, and I was instructed to submit new ML tasks to a queue with the bsub command. I use Python + Keras + TensorFlow for my tasks.

My typical workflow is as follows: I define the network architecture and training parameters in a Python script, train.py, commit it to a git repo, then run it. Then I make some changes in train.py, commit them, and run it again.

I've written the following bsub script:
#!/bin/bash
#
#BSUB -P "project"
#BSUB -q queue
#BSUB -n 1
#BSUB -o %J.log
#BSUB -e %J.err
#BSUB -cwd "/home/user/my_project/nntrain"
module load cuda9.0 cudnn_v7 nccl_2.1.15
source /home/user/my_python/bin/activate
export PYTHONPATH=/home/user/my_project/lib
python train.py 2>&1 | tee ${LSB_JOBID}_out.log
Now the question. I define a network and run bsub < batch_submit. The job is put in the queue and assigned an identifier, say 12345678. While it is still pending, waiting for a free node, I make some changes to train.py to create a new variant and submit it in the same way: bsub < batch_submit. Let the new job ID be 12345692; job 12345678 is still waiting.

Now I have two jobs pending. Which version of train.py will each of them run? Will it be the same for both?
Yes, it will. When you submit the job, bsub parses only the lines starting with #BSUB at the top of the script, in order to determine what resources your job requires and on which node(s) to run it best. Everything else in the script is interpreted only when the job stops pending and starts running: at that point bash reaches the line python train.py, loads the current version of train.py, and executes it.

That is, bsub does not "freeze" the environment in any way; when the job starts running, it runs the latest version of train.py. If you submit two jobs that both refer to the same .py file, they will both run the same Python script (the latest version).
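If you want each pending job to keep the exact code it was submitted with, one workaround (a sketch of my own, not an LSF feature) is to snapshot train.py under a unique name at submission time and point a copy of the batch script at the snapshot. The stand-in train.py and batch_submit created at the top are placeholders so the sketch runs anywhere; in practice you would use your real files:

```shell
#!/bin/bash
set -e

# Stand-in files so the sketch is self-contained; in practice these are
# your real train.py and batch_submit from above.
printf 'print("training")\n' > train.py
printf '#!/bin/bash\npython train.py\n' > batch_submit

# Snapshot the current code under a unique, timestamped name...
stamp=$(date +%s)
snapshot="train_${stamp}.py"
cp train.py "$snapshot"

# ...and generate a batch script that runs the snapshot, not train.py.
sed "s/train\.py/${snapshot}/" batch_submit > "batch_submit_${stamp}"

# bsub < "batch_submit_${stamp}"   # submit the frozen variant (on the cluster)
cat "batch_submit_${stamp}"
```

Later edits to train.py then cannot affect the already-submitted job, because it references the frozen copy.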
In case you're wondering how to run a thousand jobs with a thousand different settings, here is what I usually do:

- Make sure your .py script can either accept configuration parameters as command-line arguments or read its configuration from a file; do not rely on manually editing the script to change settings.

- Create a bsub template file that looks roughly like your bash script above, but contains at least one meta-variable specifying the parameters of the experiment. By "meta-variable" I mean a unique string that does not collide with anything else in your bash script, for example NAME_OF_THE_DATASET:
#!/bin/bash
#
#BSUB -P "project"
#BSUB -q queue
#BSUB -n 1
#BSUB -o %J.log
#BSUB -e %J.err
#BSUB -cwd "/home/user/project/nntrain"
module load cuda9.0 cudnn_v7 nccl_2.1.15
source /home/user/my_python/bin/activate
export PYTHONPATH=/home/user/my_project/lib
python train.py NAME_OF_THE_DATASET 2>&1 | tee ${LSB_JOBID}_out.log
- Create a separate bash script with a loop that plugs different values into the meta-variable (e.g. replacing NAME_OF_THE_DATASET with myDataset1.csv, ... , myDatasetN.csv using sed) and then submits each modified template with bsub.
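A minimal sketch of such a loop (the shortened batch_template here and the job_* file names are placeholders of my choosing; in practice you would use the full bsub template above):

```shell
#!/bin/bash
set -e

# Stand-in template; in practice this is the full bsub template above,
# with NAME_OF_THE_DATASET as the meta-variable.
cat > batch_template <<'EOF'
#!/bin/bash
#BSUB -q queue
python train.py NAME_OF_THE_DATASET
EOF

# Plug each dataset name into the template and submit the result.
for ds in myDataset1.csv myDataset2.csv myDataset3.csv; do
    sed "s/NAME_OF_THE_DATASET/${ds}/" batch_template > "job_${ds}"
    # bsub < "job_${ds}"          # uncomment on the cluster
done
ls job_*
```

Each generated job_* file is a complete, self-contained submission script, so you can inspect exactly what each job will run before submitting it.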
This may not be the simplest solution (you can probably get away with a simpler numbering scheme using the facilities of bsub itself), but I have found it very flexible: it works equally well with multiple meta-variables and all kinds of flags and settings, and it also lets you insert different preprocessing steps into the bsub template.