linux, pytorch, parallel-processing, nvidia, huggingface-transformers

How to kill training on specific GPUs?


I am training a transformers model on 3 of 8 GPUs and want to kill the training on specific GPUs only (0, 6, 7). I tried the `top` command, but it only shows PIDs, and I don't know which GPU each PID belongs to. I don't want to run `kill -9` blindly, because I can't tell which GPU's process would stop; I want to stop the processes on GPUs 0, 6, and 7 and keep the others running.
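For reference, plain `nvidia-smi` prints a Processes table at the bottom that maps each GPU index to the PIDs running on it. Below is a minimal sketch of turning such a table into a PID-per-GPU map; the sample lines are made up for illustration (not from the asker's machine), and the column layout assumed is the usual `| GPU  GI  CI  PID  Type  Process name  Memory |` format of recent drivers:

```python
import re

# Made-up sample lines in the shape of nvidia-smi's Processes table.
sample = """\
|    0   N/A  N/A     12345      C   python3                        1024MiB |
|    6   N/A  N/A     12345      C   python3                        1024MiB |
|    7   N/A  N/A     12346      C   python3                        1024MiB |
"""

def gpu_to_pids(processes_table: str) -> dict:
    """Map GPU index -> list of PIDs from nvidia-smi's Processes table."""
    mapping = {}
    for line in processes_table.splitlines():
        # first integer: GPU index; fourth column: PID
        m = re.match(r"\|\s+(\d+)\s+\S+\s+\S+\s+(\d+)\s+", line)
        if m:
            gpu, pid = int(m.group(1)), int(m.group(2))
            mapping.setdefault(gpu, []).append(pid)
    return mapping

print(gpu_to_pids(sample))  # {0: [12345], 6: [12345], 7: [12346]}
```

With such a map, only the PIDs attached to GPUs 0, 6, and 7 need to be killed.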

I reproduced the problem with a small example:

from accelerate import Accelerator, notebook_launcher
from accelerate.utils import set_seed

def training_loop():
    set_seed(42)
    accelerator = Accelerator(mixed_precision="fp16")
    print("Hello There!")
    # main()

# Pass the function itself, not training_loop() — calling it here runs the
# loop in the current process instead of letting the launcher spawn workers.
notebook_launcher(training_loop, num_processes=2)

Launching the script from the terminal (the variable must be set on the same line, or exported, for it to apply to the `python3` process):

CUDA_VISIBLE_DEVICES=0,6,7 python3 AccelerateTrainer.py

After killing the training, I expect `nvidia-smi` to show 0% utilization on GPUs 0, 6, and 7.
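Worth noting: with `CUDA_VISIBLE_DEVICES=0,6,7`, the launched processes see only those three GPUs, renumbered as `cuda:0`, `cuda:1`, `cuda:2`. A small sketch of that renumbering (no GPU required — the mapping is just string handling; the value `0,6,7` is taken from the command above):

```python
import os

# Simulate the environment the launcher command sets.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,6,7"

# Inside the process, frameworks index the visible devices from 0,
# so logical cuda:i corresponds to physical GPU visible[i].
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
logical_to_physical = {f"cuda:{i}": int(g) for i, g in enumerate(visible)}
print(logical_to_physical)  # {'cuda:0': 0, 'cuda:1': 6, 'cuda:2': 7}
```

This is why `nvidia-smi` (which always uses physical indices) and the training logs (which use logical indices) can disagree about which GPU is which.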


Solution

  • I found this Linux command, which lists the PID, start time, and command line of every process for a given user:

        ps -eo pid,lstart,cmd -u user_name | grep -i python3

    Once I can tell which script is running on which GPUs, I kill the specific process:

        kill -9 <process_id>
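The `ps` output still leaves you matching processes to GPUs by eye. On Linux, `/proc/<pid>/environ` records the environment a process was started with, so you can find exactly the `python3` processes launched with `CUDA_VISIBLE_DEVICES=0,6,7` and kill only those. A sketch under that assumption (Linux-only; the helper names are my own, and the `os.kill` line is commented out so you can inspect the list first):

```python
import os
import signal  # for signal.SIGKILL once the kill line is uncommented

def cuda_visible_devices(pid: int):
    """Return the CUDA_VISIBLE_DEVICES a process was started with, or None."""
    try:
        with open(f"/proc/{pid}/environ", "rb") as f:
            entries = f.read().split(b"\0")
    except OSError:
        return None  # process exited, or we lack permission
    for entry in entries:
        if entry.startswith(b"CUDA_VISIBLE_DEVICES="):
            return entry.split(b"=", 1)[1].decode()
    return None

def pids_on_gpus(devices: str):
    """All PIDs whose startup environment had CUDA_VISIBLE_DEVICES=devices."""
    return [int(p) for p in os.listdir("/proc")
            if p.isdigit() and cuda_visible_devices(int(p)) == devices]

for pid in pids_on_gpus("0,6,7"):
    print("would kill", pid)
    # os.kill(pid, signal.SIGKILL)  # uncomment once the list looks right
```

One caveat: `/proc/<pid>/environ` reflects the environment at exec time, so it won't see variables a process set after starting — which is exactly what you want here, since the launcher sets the variable before `python3` starts.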