
How do I get live usage statistics while running a slurm job


I am a SLURM newbie. I usually like to run jobs interactively, rather than using SBATCH. This is how I request resources -

srun --time=10:00:00 --nodes=1 --cpus-per-task=16 --mem=64G  --partition=gpu --gres=gpu:2 --pty /usr/bin/bash

In addition, I can find out the job ID for the allocated resources by running -

squeue -u <my_username>

I'd like to obtain live statistics of the GPU memory being consumed, the number of active CPUs, etc. Is there any way to do that?

I've already checked questions like this on SO. However, they don't have the answer to my question.

Please let me know if my question requires any further clarification.


Solution

  • You can use WandB (Weights & Biases), a tool primarily used for tracking machine learning training runs. It is mainly a Python library, though C++ ports exist as well.

    By default you get 23 system metrics, including GPU, CPU, elapsed time, disk usage, and RAM usage, with values updated every few seconds. Alongside these, you can log the value of any variable you choose (a minimal usage sketch is shown below).

    Here is one example: I ran 360 experiments on 20+ GPUs, and you can click on any experiment to see the values of your variables as well as the system usage, which is what the original question asked about.
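
    For reference, here is a minimal sketch of what this might look like in a Python script. The project name, metric name, and loop are hypothetical placeholders, not part of the original answer; it assumes wandb is installed (pip install wandb) and you have already logged in with wandb login.

    import time
    import wandb

    # Start a run; system metrics (GPU/CPU/RAM/disk) are collected automatically
    # in the background for as long as the run is active.
    # "slurm-monitoring" and "interactive-srun-job" are placeholder names.
    run = wandb.init(project="slurm-monitoring", name="interactive-srun-job")

    for step in range(100):
        value = 1.0 / (step + 1)                    # placeholder for your own computation
        wandb.log({"my_metric": value}, step=step)  # log any variable you care about
        time.sleep(1)

    run.finish()

    While the run is active, the system usage charts appear on the run's page in the WandB dashboard, alongside whatever variables you logged.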