I want to submit an sbatch script. The main part is training a deep learning model, but I also want to run TensorBoard at the same time for logging.
This is my current script.slurm:
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks-per-node=6
#SBATCH --mem-per-cpu=5GB
#SBATCH --gres=gpu:1
tensorboard --logdir:runs
python3 trainloop.py
This launches TensorBoard, but the training script only starts after I close the TensorBoard server. I changed it to
srun tensorboard --logdir:runs &
srun python3 trainloop.py
but now, for some reason, it tries to launch TensorBoard multiple times and gives this error:
E1114 21:45:51.826188 47451355829184 program.py:298] TensorBoard could not bind to port 8872, it was already in use
What is the best approach to have tensorboard server running alongside my script?
Adding the ampersand (&) is the right solution, but you should not use srun here: srun starts as many tasks (i.e. as many instances of tensorboard --logdir:runs) as were requested with --ntasks-per-node=6, and all but the first instance fail with the "already in use" error. The same applies to the second srun: it will start 6 instances of python3 trainloop.py, which is not what you want unless that script uses MPI behind the scenes.
So this
tensorboard --logdir:runs &
python3 trainloop.py
should do what you want.
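If you also want TensorBoard to be shut down explicitly once training finishes, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: run_with_sidecar is a made-up function name (it is not part of Slurm or TensorBoard), the log directory is passed in the documented --logdir=runs form rather than the --logdir:runs spelling from the question, and the real commands are only launched when SLURM_JOB_ID is set, i.e. inside an actual job.

```shell
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks-per-node=6
#SBATCH --mem-per-cpu=5GB
#SBATCH --gres=gpu:1

# run_with_sidecar is a hypothetical helper, not part of Slurm or TensorBoard.
#   $1: command to keep running in the background (e.g. the TensorBoard server)
#   $2: main command to run in the foreground (e.g. the training script)
run_with_sidecar() {
    local sidecar_cmd=$1
    local main_cmd=$2
    local status=0

    # Start the sidecar in the background, without srun, so only one copy runs.
    # "exec" makes the recorded PID belong to the sidecar itself.
    bash -c "exec $sidecar_cmd" &
    local sidecar_pid=$!

    # Run the main command in the foreground and remember its exit status.
    bash -c "$main_cmd" || status=$?

    # Stop the sidecar once the main command has finished.
    kill "$sidecar_pid" 2>/dev/null || true
    wait "$sidecar_pid" 2>/dev/null || true

    return "$status"
}

# Only launch the real commands when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_with_sidecar "tensorboard --logdir=runs" "python3 trainloop.py"
fi
```

The job then ends with the exit status of the training script, and TensorBoard does not outlive it.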