
Concurrently testing several PyTorch models on a single GPU is slower than an iterative approach


I want to test several models on different datasets, and I want them to run concurrently on a single GPU.
The general pseudocode is below; you could run it with any dummy model or dataset. However, I notice this is often 20% slower than just using an iterative approach. What is wrong with my current solution?

from concurrent.futures import ProcessPoolExecutor, as_completed

import pandas as pd
import pytorch_lightning as pl
from torch.utils.data import DataLoader


def test_model(path):
    # Model, Dataset, and trainer_config are placeholders for your own classes/config.
    model = Model(path)
    data = DataLoader(Dataset(path))
    trainer = pl.Trainer(**trainer_config)
    # Return the test metrics so the parent process can collect them.
    return pd.DataFrame(trainer.test(model, dataloaders=data))


def main():
    full_results_df = pd.DataFrame()
    with ProcessPoolExecutor(max_workers=args.max_processes) as executor:
        futures = {executor.submit(test_model, path): path
                   for path in paths_to_models_to_test}
        for future in as_completed(futures):
            try:
                results_from_run = future.result()
                full_results_df = pd.concat([full_results_df, results_from_run],
                                            ignore_index=True)
            except Exception as e:
                print(f"An error occurred while processing {futures[future]}: {e}")

Solution

  • GPUs by default run one kernel at a time, with each kernel parallelized across as many of the GPU's cores as it can use. This means your multiple model instances are blocking each other, resulting in the slower execution.

    You can try scheduling the processes in a delayed fashion using joblib (see torchensemble for an example), or scheduling the models on separate PyTorch CUDA streams; rough sketches of both appear at the end of this answer.

    That said, neither approach will give you perfect parallelization.

    GPU kernels will still block each other if any single workload is enough to saturate GPU occupancy.

    There's also significant overhead on the CPU (scheduling CUDA kernels, data processing, moving data to the CPU) and I/O (reading multiple datasets from disk concurrently) that could prove rate-limiting regardless of GPU parallelization.

    Before trying the parallel approaches, you should try optimizing the iterative approach by increasing the batch size and/or pre-loading the data into memory (see the last sketch below). It may end up being the fastest and easiest approach.
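
A minimal sketch of the staggered-start idea with joblib, reusing the test_model and paths_to_models_to_test names from the question; the worker count and the five-second offset are arbitrary illustrative choices, not values taken from torchensemble:

import time

from joblib import Parallel, delayed


def test_model_staggered(path, index, stagger_seconds=5):
    # Offset each worker's start so CUDA context creation and dataset
    # loading don't all hit the GPU and the disk at the same moment.
    time.sleep(index * stagger_seconds)
    return test_model(path)


results = Parallel(n_jobs=4)(
    delayed(test_model_staggered)(path, i)
    for i, path in enumerate(paths_to_models_to_test)
)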
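And a minimal sketch of the CUDA-stream idea within a single process, assuming the models are already built and `batches` pairs each model with an input batch; work issued on different streams may only overlap while the GPU has spare capacity:

import torch

models = [m.cuda().eval() for m in models]
streams = [torch.cuda.Stream() for _ in models]

with torch.no_grad():
    for model, stream, batch in zip(models, streams, batches):
        # Kernels queued on different streams are allowed to overlap on the GPU.
        with torch.cuda.stream(stream):
            model(batch.cuda(non_blocking=True))

# Block until every stream has finished its queued work.
torch.cuda.synchronize()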
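For the iterative baseline, a minimal sketch of pre-loading a map-style dataset into memory and raising the batch size; it assumes the dataset returns (input, target) tensor pairs of uniform shape, and `run_test`, like the batch size of 256, is a hypothetical placeholder:

import torch
from torch.utils.data import DataLoader, TensorDataset


def preload(dataset):
    # Materialize the whole dataset as in-memory tensors once, up front,
    # so each test pass reads from RAM instead of disk.
    xs, ys = zip(*(dataset[i] for i in range(len(dataset))))
    return TensorDataset(torch.stack(xs), torch.stack(ys))


for path in paths_to_models_to_test:
    loader = DataLoader(preload(Dataset(path)), batch_size=256, pin_memory=True)
    results = run_test(Model(path), loader)  # hypothetical per-model test helper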