Tags: python, pytorch, distributed, optuna

Does optuna.integration.TorchDistributedTrial support multinode optimization?

I'm using Optuna on a SLURM cluster. Suppose I want to run a distributed hyperparameter optimization across two nodes with two GPUs each. Would submitting a script like pytorch_distributed_simple.py to multiple nodes yield the expected results?

I assume every node would be responsible for executing its own trials (i.e. no two nodes share a trial), and that every GPU on a node handles its own portion of the data, as determined by the sampler of torch.utils.data.DataLoader. Is this assumption correct, or are further changes needed beyond TorchDistributedTrial's requirement to pass None to the objective call on ranks other than 0?

I already tried the above, but I'm not sure how to verify that every node is responsible for distinct trials.
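
For reference, the per-rank pattern I'm following looks roughly like this (simplified from pytorch_distributed_simple.py; train_one_trial is a placeholder for the actual training and evaluation):

    import optuna
    import torch.distributed as dist
    from optuna.integration import TorchDistributedTrial

    def objective(single_trial):
        # Only rank 0 receives a real optuna.Trial; the other ranks pass None and
        # TorchDistributedTrial broadcasts the suggested values to them.
        trial = TorchDistributedTrial(single_trial)
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        # Each GPU trains on its own shard of the data via a DistributedSampler.
        return train_one_trial(lr)  # placeholder for the real training loop

    dist.init_process_group(backend="gloo")  # the stock example's backend
    if dist.get_rank() == 0:
        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=20)
    else:
        for _ in range(20):
            try:
                objective(None)
            except optuna.TrialPruned:
                pass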


Cross-posted to GitHub issues.


Solution

  • Apparently, Optuna does allow multiple Optuna processes to do distributed runs. Why wouldn't it :)

    Basically, run pytorch_distributed_simple.py on multiple nodes (I use SLURM for this). Every node is then responsible for its own trials, and each trial can use DDP across that node's GPUs.

    My method differs from the provided code in that I use SLURM (so different environment variables), I store the study information in SQLite, and I initialize the process group with the NCCL backend, which means a device has to be passed to TorchDistributedTrial. A rough sketch of the per-node script is below.


    Unrelated, but I also wanted MaxTrialsCallback() to take effect in every local process, not just on rank 0. To achieve this, I pass the callback to study.optimize() on rank 0 and perform its check explicitly in the local non-rank-0 processes after each objective call (second sketch below).
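
    A rough sketch of the per-node script under those assumptions (the SLURM variables assume one job per node, launched with srun and one task per GPU; the study name, storage path, and train_and_evaluate are placeholders for my actual setup):

        import os

        import optuna
        import torch
        import torch.distributed as dist
        from optuna.integration import TorchDistributedTrial

        N_TRIALS_PER_NODE = 20
        STORAGE = "sqlite:///optuna_study.db"  # shared filesystem path

        def objective(single_trial):
            local_rank = int(os.environ["SLURM_LOCALID"])
            device = torch.device("cuda", local_rank)
            # With the NCCL backend a device must be passed so TorchDistributedTrial
            # can broadcast the suggested values between ranks.
            trial = TorchDistributedTrial(single_trial, device=device)
            lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
            # ... build the model, wrap it in DDP, train with a DistributedSampler ...
            return train_and_evaluate(lr, device)  # placeholder for the real loop

        if __name__ == "__main__":
            # The job only spans this node's GPUs, so each node forms its own
            # process group and runs its own trials; rendezvous is local.
            os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
            os.environ.setdefault("MASTER_PORT", "29500")
            dist.init_process_group(
                backend="nccl",
                rank=int(os.environ["SLURM_PROCID"]),
                world_size=int(os.environ["SLURM_NTASKS"]),
            )

            if dist.get_rank() == 0:
                # Every node points at the same study in the same SQLite file, so
                # the sampler sees trials finished on the other nodes too.
                study = optuna.create_study(
                    study_name="distributed-example",
                    storage=STORAGE,
                    load_if_exists=True,
                    direction="maximize",
                )
                study.optimize(objective, n_trials=N_TRIALS_PER_NODE)
            else:
                for _ in range(N_TRIALS_PER_NODE):
                    try:
                        objective(None)
                    except optuna.TrialPruned:
                        pass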
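
    And the MaxTrialsCallback part. The non-rank-0 ranks are not inside an optimize() loop, so in this sketch I repeat the callback's stopping condition by hand (counting completed trials in the shared study) after each objective call; objective is the function from the sketch above:

        import optuna
        import torch.distributed as dist
        from optuna.study import MaxTrialsCallback
        from optuna.trial import TrialState

        N_MAX_TRIALS = 100  # total completed trials across all nodes
        STORAGE = "sqlite:///optuna_study.db"

        if dist.get_rank() == 0:
            study = optuna.create_study(
                study_name="distributed-example",
                storage=STORAGE,
                load_if_exists=True,
                direction="maximize",
            )
            # The callback calls study.stop() once the study holds N_MAX_TRIALS
            # completed trials, ending this rank's optimize() loop.
            study.optimize(
                objective,
                callbacks=[MaxTrialsCallback(N_MAX_TRIALS, states=(TrialState.COMPLETE,))],
            )
        else:
            study = optuna.load_study(study_name="distributed-example", storage=STORAGE)
            while True:
                try:
                    objective(None)
                except optuna.TrialPruned:
                    pass
                # Same condition MaxTrialsCallback checks, applied manually here.
                n_done = len(study.get_trials(deepcopy=False, states=(TrialState.COMPLETE,)))
                if n_done >= N_MAX_TRIALS:
                    break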