Tags: python, pytorch, multiprocessing, kaggle, multi-gpu

Weird PyTorch Multiprocessing Error Where Main Loop Is Not Defined In __main__ | Kaggle


The following PyTorch code for single-node multi-GPU training with DDP, taken from here:

https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu.py

gives the following error when run in a Kaggle environment with two T4 GPU accelerators:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/opt/conda/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'main' on <module '__main__' (built-in)>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/opt/conda/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'main' on <module '__main__' (built-in)>
---------------------------------------------------------------------------
ProcessExitedException                    Traceback (most recent call last)
Cell In[11], line 104
     95 if __name__ == "__main__":
     96 #     import argparse
     97 #     parser = argparse.ArgumentParser(description='simple distributed training job')
   (...)
    100 #     parser.add_argument('--batch_size', default=32, type=int, help='Input batch size on each device (default: 32)')
    101 #     args = parser.parse_args()
    103     world_size = torch.cuda.device_count()
--> 104     mp.spawn(main, args=(world_size, 5, 10, 32), nprocs=world_size)

Any information is appreciated.


Solution

  • The error happens because mp.spawn uses the "spawn" start method, which launches fresh Python processes and re-imports the __main__ module in each child. In a notebook, __main__ is a built-in module with no backing file, so the children cannot unpickle the reference to main, hence "Can't get attribute 'main' on <module '__main__' (built-in)>".

  • To make the DDP code work from a notebook, write it out as a script instead of executing it in the cell: put %%writefile ddp.py at the top of the cell containing the DDP code, then run the training from another cell with !python -W ignore ddp.py (see the sketch below).
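A minimal sketch of the two notebook cells, assuming the script follows the tutorial's structure (a main(rank, ...) entry point launched with mp.spawn); the toy model and dataset below are placeholders for illustration, not the tutorial's actual Trainer code:

%%writefile ddp.py
# Minimal tutorial-style DDP script; the main(rank, ...) signature matches the
# mp.spawn call from the question. Model/dataset are hypothetical placeholders.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def ddp_setup(rank, world_size):
    # One process per GPU; rendezvous over localhost for single-node training.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def main(rank, world_size, save_every, total_epochs, batch_size):
    ddp_setup(rank, world_size)
    # Toy data and model purely for illustration.
    dataset = TensorDataset(torch.randn(2048, 20), torch.randn(2048, 1))
    loader = DataLoader(dataset, batch_size=batch_size,
                        sampler=DistributedSampler(dataset))
    model = DDP(torch.nn.Linear(20, 1).to(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for epoch in range(total_epochs):
        loader.sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.to(rank), y.to(rank)
            optimizer.zero_grad()
            F.mse_loss(model(x), y).backward()
            optimizer.step()
        if rank == 0 and epoch % save_every == 0:
            torch.save(model.module.state_dict(), "checkpoint.pt")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size, 5, 10, 32), nprocs=world_size)

Then, in a separate cell:

!python -W ignore ddp.py

Because the code now lives in a real file and runs as a standalone script, the spawned worker processes can re-import it and resolve main, which is exactly what fails when the same code is run directly inside a notebook cell.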