
Running tensorboard and python script at the same time


I want to submit an sbatch script. The main part is training a deep learning model but I also want to run tensorboard at the same time for logging.

This is my current script.slurm:

#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks-per-node=6
#SBATCH --mem-per-cpu=5GB
#SBATCH --gres=gpu:1

tensorboard --logdir:runs
python3 trainloop.py

This launches tensorboard, but the training script only starts after I close the tensorboard server. I changed it to

srun tensorboard --logdir:runs &
srun python3 trainloop.py

but now, for some reason, it loops, trying to launch tensorboard multiple times, and gives this error:

E1114 21:45:51.826188 47451355829184 program.py:298] TensorBoard could not bind to port 8872, it was already in use

What is the best approach to have tensorboard server running alongside my script?


Solution

  • Adding the ampersand (&) is the right solution, but you should not use srun here. srun starts as many tasks as were requested with --ntasks-per-node=6, so six instances of tensorboard --logdir:runs are launched on the same node. The first one binds the port and the others fail with the "already in use" error. The same applies to the second srun: it will start six instances of python3 trainloop.py, which is only what you want if that script uses MPI behind the scenes.

    So this

    tensorboard --logdir=runs &
    python3 trainloop.py
    

    should do what you want.
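If you would rather drive both processes from a single Python launcher instead of relying on the shell's &, the same pattern can be sketched with the standard subprocess module. This is only a rough sketch under assumptions: run_with_sidecar is a hypothetical helper name, not part of TensorBoard or Slurm, and the commands in the usage comment assume that tensorboard is on the PATH and that trainloop.py is in the working directory.

```python
import subprocess


def run_with_sidecar(sidecar_cmd, main_cmd):
    """Start sidecar_cmd in the background, run main_cmd to completion,
    then stop the sidecar. Returns the exit code of main_cmd.

    Hypothetical helper for illustration only.
    """
    # Popen returns immediately, much like appending '&' in the shell.
    sidecar = subprocess.Popen(sidecar_cmd)
    try:
        # run() blocks until the foreground command has finished.
        return subprocess.run(main_cmd).returncode
    finally:
        # Stop the background process if it is still alive, then reap it.
        if sidecar.poll() is None:
            sidecar.terminate()
        sidecar.wait()


# Example call (assumed commands, adjust to your setup):
# run_with_sidecar(["tensorboard", "--logdir", "runs"],
#                  ["python3", "trainloop.py"])
```

As with the plain shell version, the launcher should be started once from the batch script (python3 launcher.py, without srun), so that only one TensorBoard instance tries to bind the port.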