Tags: python, neural-network, lsf

Training a neural network in IBM Load Sharing Facility (LSF)


I was granted access to a high-performance computing system to conduct some experiments with machine learning.

This system has IBM LSF 10.1 installed. I was instructed to run the bsub command to submit a new ML task to a queue.

I use Python + Keras + TensorFlow for my tasks.

My typical workflow is as follows. I define the NN architecture and training parameters in a Python script, train.py, commit it to a git repo, then run it. Then I make some changes to train.py, commit them, and run it again.

I've developed the following bsub script:

#!/bin/bash
# 
#BSUB -P "project"
#BSUB -q queue
#BSUB -n 1
#BSUB -o %J.log
#BSUB -e %J.err
#BSUB -cwd "/home/user/my_project/nntrain"

module load cuda9.0 cudnn_v7 nccl_2.1.15
source /home/user/my_python/bin/activate
export PYTHONPATH=/home/user/my_project/lib

python train.py 2>&1 | tee ${LSB_JOBID}_out.log 

Now the question.

I defined a network, then ran bsub < batch_submit. The job was put in the queue and assigned an identifier, say 12345678.

While it is still pending, waiting for a free node, I make some changes to train.py to create a new variant and submit it again in the same manner: bsub < batch_submit

Let the new job ID be 12345692. The job 12345678 is still waiting.

Now I've got two jobs waiting for their nodes.

What about the script train.py?

Will it be the same for both of them?


Solution

  • Yes, it will. When you submit the job, bsub parses only the header lines starting with #BSUB to determine what resources your job requires and which node(s) are best suited to run it.

    All the other lines of the script, which do not start with #BSUB, are interpreted only when the job stops pending and starts running. At that point, bash will reach the line python train.py, and the Python interpreter will load the current version of train.py and execute it.

    That is, bsub does not "freeze" the environment in any way: when the job starts running, it runs whatever version of train.py is current at that moment. If you submit two jobs that both refer to the same .py file, they will both run the same Python script (the latest version).
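    If you ever need a job to run the version of train.py that existed at submission time, one workaround (a rough sketch; the snapshot helper below is hypothetical, and it assumes you submit from the job's working directory) is to copy the script before submitting and substitute the copy's name into the template:

      # Hypothetical submit helper: freeze the current train.py for this job.
      # Run it from the job's working directory (here /home/user/my_project/nntrain).
      SNAPSHOT="train_$(date +%Y%m%d_%H%M%S).py"
      cp train.py "$SNAPSHOT"
      # Substitute the snapshot name into the template and pipe the result
      # to bsub; piping works exactly like "bsub < batch_submit".
      sed "s/train\.py/$SNAPSHOT/" batch_submit | bsub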

    In case you're wondering how to run a thousand jobs with a thousand different settings, here is what I usually do:

    1. Make sure that your .py script can either accept command-line arguments with configuration parameters, or get its configuration from some file; do not rely on manually modifying the script to change settings.
    2. Create a bsub template file that looks approximately like your bash script above, but leaves at least one meta-variable to specify the parameters of the experiment. By "meta-variable" I mean a unique string that doesn't collide with anything else in your bash script, for example NAME_OF_THE_DATASET:

      #!/bin/bash
      # 
      #BSUB -P "project"
      #BSUB -q queue
      #BSUB -n 1
      #BSUB -o %J.log
      #BSUB -e %J.err
      #BSUB -cwd "/home/user/my_project/nntrain"
      
      module load cuda9.0 cudnn_v7 nccl_2.1.15
      source /home/user/my_python/bin/activate
      export PYTHONPATH=/home/user/my_project/lib
      
      python train.py NAME_OF_THE_DATASET 2>&1 | tee ${LSB_JOBID}_out.log 
      
    3. Create a separate bash script with a loop that plugs in different values for the meta-variable (e.g. by replacing NAME_OF_THE_DATASET with myDataset1.csv, ..., myDatasetN.csv using sed), and then submits each modified template with bsub (see the sketch after this list).
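    For concreteness, here is a minimal driver along the lines of step 3. The template file name batch_template and the dataset names are made up for the example; NAME_OF_THE_DATASET is the meta-variable from the template above, and train.py is assumed to read the dataset name from its first command-line argument (e.g. via sys.argv[1] or argparse, as in step 1):

      #!/bin/bash
      # Hypothetical driver: instantiate the template once per dataset
      # and pipe each fully-formed job script straight into bsub.
      for dataset in myDataset1.csv myDataset2.csv myDataset3.csv; do
          sed "s/NAME_OF_THE_DATASET/$dataset/g" batch_template | bsub
      done

    Each iteration submits an independent job with its own job ID, and the #BSUB header lines are parsed anew for every submission.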

    It might not be the simplest solution (one could probably get away with a simpler numbering scheme using the facilities of bsub itself), but I have found it very flexible: it works equally well with multiple meta-variables and all kinds of flags and settings, and it also lets you insert different preprocessing steps into the bsub template.