Tags: pytorch, pytorch-lightning, ray-tune

How to set up Ray Tune for distributed training using PyTorch Lightning


I have 16 CPUs and 1 GPU, and I want to split 2 concurrent trials across all my available resources.

Attempt number one:

from ray import tune
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
import ray.train.torch as ray_torch

# define search_space, run_config, train_func, scheduler, n_samples, etc.
...
...

CONCURRENT_TRIALS = True
scaling_config = ScalingConfig(
    num_workers=1 + CONCURRENT_TRIALS,
    use_gpu=True,
    resources_per_worker={"GPU": 0.5, "CPU": 8} if CONCURRENT_TRIALS else {"GPU": 1, "CPU": 15},
)

# Define a TorchTrainer without hyper-parameters for Tuner
trainer_args = dict(
    train_loop_per_worker=train_func,
    scaling_config=scaling_config,
    run_config=run_config,
)
if CONCURRENT_TRIALS:
    trainer_args["torch_config"] = ray_torch.TorchConfig(backend="gloo")

ray_trainer = TorchTrainer(
    **trainer_args,
)

tuner = tune.Tuner(
    ray_trainer,
    param_space={"train_loop_config": search_space},
    tune_config=tune.TuneConfig(
        metric="val_loss",
        mode="min",
        num_samples=n_samples,
        scheduler=scheduler,
        max_concurrent_trials=1 + CONCURRENT_TRIALS,
    ),
)

tuner.fit()

This results in the following:

WARNING insufficient_resources_manager.py:163 -- Ignore this message if the cluster is autoscaling. No trial is running and no new trial has been started within the last 60 seconds. This could be due to the cluster not having enough resources available. You asked for 17.0 CPUs and 1.0 GPUs per trial, but the cluster only has 16.0 CPUs and 1.0 GPUs available. Stop the tuning and adjust the required resources (e.g. via the `ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), or add more resources to your cluster.
Trial status: 2 PENDING
Logical resource usage: 0/16 CPUs, 0/1 GPUs

I assume this means 1 extra CPU per trial is reserved for the Ray Train coordinator (checkpointing, bookkeeping, etc.) on top of the workers' CPUs, hence 2 × 8 + 1 = 17.

So I change to 7 CPUs per worker:

scaling_config = ScalingConfig(
    num_workers=1 + CONCURRENT_TRIALS,
    use_gpu=True,
    resources_per_worker={"GPU": 0.5, "CPU": 7} if CONCURRENT_TRIALS else {"GPU": 1, "CPU": 15},
)

Then the second trial is perpetually stuck on PENDING, with the following logical resource usage:

Logical resource usage: 15.0/16 CPUs, 1.0/1 GPUs

I've also tried tune.with_resources on train_func (see the sketch below), both with and without the scaling_config changes, and I can't get Ray to simply split the resources evenly between two concurrent trials. Setting scaling_config = ScalingConfig(..., placement_strategy="SPREAD") doesn't work either. I have plenty of RAM and GPU memory left over while training a single trial, so I want to take advantage of Ray's parallel scheduling, but I can't figure this out.
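
Roughly, the tune.with_resources attempt looked like this (a sketch; the exact resource numbers and the use of the bare train_func as the trainable are assumptions):

# wrap the bare training function instead of a TorchTrainer and ask Tune
# to reserve a CPU/GPU bundle for each trial
trainable = tune.with_resources(train_func, {"cpu": 7, "gpu": 0.5})

tuner = tune.Tuner(
    trainable,
    param_space=search_space,
    tune_config=tune.TuneConfig(
        metric="val_loss",
        mode="min",
        num_samples=n_samples,
        scheduler=scheduler,
        max_concurrent_trials=2,
    ),
)
tuner.fit()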


Solution

  • The problem is ScalingConfig(num_workers=1 + CONCURRENT_TRIALS, ...). My assumption was that this forces Ray to use 2 workers in total, i.e., 2 concurrent trials. What the parameter really means is the number of distributed training workers per trial. So num_workers=2 gives every trial 2 workers, and with 0.5 GPU each, a single trial claims the entire GPU: that one trial gets distributed across its 2 workers while every other trial stays stuck at PENDING.

    Setting num_workers=1 while keeping max_concurrent_trials=2 fixes the problem; a sketch of the resulting configuration follows.
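
    A minimal sketch of the working setup, reusing the pieces defined above (the 7-CPU / 0.5-GPU split per worker is one reasonable choice that leaves 1 CPU per trial for the coordinator):

    scaling_config = ScalingConfig(
        num_workers=1,  # one training worker per trial
        use_gpu=True,
        resources_per_worker={"GPU": 0.5, "CPU": 7},  # each trial gets half the GPU and 7 CPUs
    )

    ray_trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=scaling_config,
        run_config=run_config,
    )

    tuner = tune.Tuner(
        ray_trainer,
        param_space={"train_loop_config": search_space},
        tune_config=tune.TuneConfig(
            metric="val_loss",
            mode="min",
            num_samples=n_samples,
            scheduler=scheduler,
            max_concurrent_trials=2,  # two trials now run side by side on 16 CPUs / 1 GPU
        ),
    )

    tuner.fit()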