Tags: pytorch, semantic-segmentation, detectron

PyTorch RuntimeError 'Default process group has not been initialized' when training with detectron2


Instructions To Reproduce the Issue:

  1. Full runnable code or full changes you made:

I tried to train the DeepLabV3+ architecture with a customized config that uses ResNet-18 (converted to .pkl from https://download.pytorch.org/models/resnet18-f37072fd.pth) as the backbone:

_BASE_: "detectron2/projects/DeepLab/configs/Cityscapes-SemanticSegmentation/deeplab_v3_plus_R_103_os16_mg124_poly_90k_bs16.yaml"
MODEL:
  WEIGHTS: "r18.pkl"
  BACKBONE:
    NAME: "build_resnet_backbone"
  RESNETS:
    DEPTH: 18
    RES2_OUT_CHANNELS: 64
    STEM_OUT_CHANNELS: 64
    RES5_DILATION: 1
    NUM_GROUPS: 1
  ROI_HEADS:
    NUM_CLASSES: 1

on the MoNuSeg 2020 dataset.
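The backbone conversion can be done with the tools/convert-torchvision-to-d2.py script shipped with detectron2. The sketch below only illustrates the .pkl container format that, to my understanding, that script produces; the torchvision-to-detectron2 key renaming (layer1 -> res2, downsample -> shortcut, BN weights -> conv.norm, ...) is omitted here and handled by the real script:

# Sketch of the .pkl repacking; key renaming omitted, see
# tools/convert-torchvision-to-d2.py in the detectron2 repo for the full script.
import pickle
import torch

state = torch.load("resnet18-f37072fd.pth", map_location="cpu")
weights = {k: v.numpy() for k, v in state.items()}  # keys still need renaming

with open("r18.pkl", "wb") as f:
    pickle.dump(
        {"model": weights, "__author__": "torchvision", "matching_heuristics": True},
        f,
    )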

  2. What exact command you ran:
import os

from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer
from detectron2.projects.deeplab import add_deeplab_config  # DeepLab project bundled with detectron2

cfg = get_cfg()
add_deeplab_config(cfg)
cfg.merge_from_file('/kaggle/input/deeplab-v3-plus-models/deeplab_v3_plus_R_18_os16_mg124_poly_90k_bs16.yaml')
cfg.DATASETS.TRAIN = ('monuseg_train',)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.SOLVER.IMS_PER_BATCH = 16
cfg.SOLVER.BASE_LR = 0.01
cfg.SOLVER.MAX_ITER = 300
cfg.SOLVER.LR_SCHEDULER_NAME = 'WarmupMultiStepLR'
cfg.SOLVER.STEPS = []
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

I also tried the Trainer class from the DeepLab project's train_net.py by using the following lines instead:

cfg.SOLVER.LR_SCHEDULER_NAME = 'WarmupPolyLR'
trainer = Trainer(cfg)
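For completeness: 'monuseg_train' has to be registered with detectron2's DatasetCatalog before the snippets above run. My actual registration code is not shown here; the sketch below only illustrates the usual semantic-segmentation registration pattern, and the directory names are placeholders, not my real Kaggle paths:

# Illustrative registration of a semantic-segmentation dataset; paths and the
# helper name are placeholders. Each record points at an image and its mask.
import os
from detectron2.data import DatasetCatalog, MetadataCatalog

def monuseg_train_dicts(img_dir="monuseg/train/images", mask_dir="monuseg/train/masks"):
    return [
        {
            "file_name": os.path.join(img_dir, fname),
            "sem_seg_file_name": os.path.join(mask_dir, fname),
        }
        for fname in sorted(os.listdir(img_dir))
    ]

DatasetCatalog.register("monuseg_train", monuseg_train_dicts)
MetadataCatalog.get("monuseg_train").set(
    stuff_classes=["nucleus"], ignore_label=255, evaluator_type="sem_seg"
)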
  3. Full logs or other relevant observations:
[11/20 16:16:06 d2.engine.defaults]: Model:
SemanticSegmentor(
  (backbone): ResNet(
    (stem): BasicStem(
      (conv1): Conv2d(
        3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
        (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (res2): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(
          64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(
          64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (res3): Sequential(
      (0): BasicBlock(
        (shortcut): Conv2d(
          64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False
          (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv1): Conv2d(
          64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(
          128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (res4): Sequential(
      (0): BasicBlock(
        (shortcut): Conv2d(
          128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False
          (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv1): Conv2d(
          128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(
          256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (res5): Sequential(
      (0): BasicBlock(
        (shortcut): Conv2d(
          256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False
          (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv1): Conv2d(
          256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(
          512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (conv2): Conv2d(
          512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
          (norm): SyncBatchNorm(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
  )
  (sem_seg_head): DeepLabV3PlusHead(
    (decoder): ModuleDict(
      (res2): ModuleDict(
        (project_conv): Conv2d(
          64, 48, kernel_size=(1, 1), stride=(1, 1), bias=False
          (norm): SyncBatchNorm(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
        (fuse_conv): Sequential(
          (0): Conv2d(
            304, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
            (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
          (1): Conv2d(
            256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
            (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
      )
      (res5): ModuleDict(
        (project_conv): ASPP(
          (convs): ModuleList(
            (0): Conv2d(
              512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
              (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
            (1): Conv2d(
              512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(6, 6), dilation=(6, 6), bias=False
              (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
            (2): Conv2d(
              512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(12, 12), dilation=(12, 12), bias=False
              (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
            (3): Conv2d(
              512, 256, kernel_size=(3, 3), stride=(1, 1), padding=(18, 18), dilation=(18, 18), bias=False
              (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
            (4): Sequential(
              (0): AvgPool2d(kernel_size=(16, 32), stride=1, padding=0)
              (1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
            )
          )
          (project): Conv2d(
            1280, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
            (norm): SyncBatchNorm(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          )
        )
        (fuse_conv): None
      )
    )
    (predictor): Conv2d(256, 19, kernel_size=(1, 1), stride=(1, 1))
    (loss): DeepLabCE(
      (criterion): CrossEntropyLoss()
    )
  )
)
[11/20 16:16:07 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [RandomCrop(crop_type='absolute', crop_size=[512, 1024]), ResizeShortestEdge(short_edge_length=(512, 768, 1024, 1280, 1536, 1792, 2048), max_size=4096, sample_style='choice'), RandomFlip()]
[11/20 16:16:07 d2.data.build]: Using training sampler TrainingSampler
[11/20 16:16:07 d2.data.common]: Serializing the dataset using: <class 'detectron2.data.common.NumpySerializedList'>
[11/20 16:16:07 d2.data.common]: Serializing 37 elements to byte tensors and concatenating them all ...
[11/20 16:16:07 d2.data.common]: Serialized dataset takes 0.01 MiB
[11/20 16:16:12 d2.checkpoint.c2_model_loading]: Following weights matched with submodule backbone:
| Names in Model    | Names in Checkpoint                                                               | Shapes                                    |
|:------------------|:----------------------------------------------------------------------------------|:------------------------------------------|
| res2.0.conv1.*    | res2.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)       |
| res2.0.conv2.*    | res2.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)       |
| res2.1.conv1.*    | res2.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)       |
| res2.1.conv2.*    | res2.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (64,) (64,) (64,) (64,) (64,64,3,3)       |
| res3.0.conv1.*    | res3.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,64,3,3)  |
| res3.0.conv2.*    | res3.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3) |
| res3.0.shortcut.* | res3.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (128,) (128,) (128,) (128,) (128,64,1,1)  |
| res3.1.conv1.*    | res3.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3) |
| res3.1.conv2.*    | res3.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (128,) (128,) (128,) (128,) (128,128,3,3) |
| res4.0.conv1.*    | res4.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,128,3,3) |
| res4.0.conv2.*    | res4.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3) |
| res4.0.shortcut.* | res4.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (256,) (256,) (256,) (256,) (256,128,1,1) |
| res4.1.conv1.*    | res4.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3) |
| res4.1.conv2.*    | res4.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (256,) (256,) (256,) (256,) (256,256,3,3) |
| res5.0.conv1.*    | res5.0.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,256,3,3) |
| res5.0.conv2.*    | res5.0.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,512,3,3) |
| res5.0.shortcut.* | res5.0.shortcut.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight} | (512,) (512,) (512,) (512,) (512,256,1,1) |
| res5.1.conv1.*    | res5.1.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,512,3,3) |
| res5.1.conv2.*    | res5.1.conv2.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}    | (512,) (512,) (512,) (512,) (512,512,3,3) |
| stem.conv1.*      | stem.conv1.{norm.bias,norm.running_mean,norm.running_var,norm.weight,weight}      | (64,) (64,) (64,) (64,) (64,3,7,7)        |
[11/20 16:16:14 d2.engine.train_loop]: Starting training from iteration 0
ERROR [11/20 16:16:22 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/opt/conda/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/opt/conda/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 274, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/detectron2/modeling/meta_arch/semantic_seg.py", line 108, in forward
    features = self.backbone(images.tensor)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/detectron2/modeling/backbone/resnet.py", line 445, in forward
    x = self.stem(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/detectron2/modeling/backbone/resnet.py", line 356, in forward
    x = self.conv1(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/detectron2/layers/wrappers.py", line 117, in forward
    x = self.norm(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 731, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 430, in _get_default_group
    "Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
[11/20 16:16:22 d2.engine.hooks]: Total training time: 0:00:08 (0:00:00 on hooks)
[11/20 16:16:22 d2.utils.events]:  iter: 0    lr: N/A  max_mem: 7348M

Environment:

  • Kaggle, tried with both available accelerators (GPU T4 x2 and GPU P100):
----------------------  -------------------------------------------------------------------------------
sys.platform            linux
Python                  3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
numpy                   1.21.6
detectron2              0.6 @/opt/conda/lib/python3.7/site-packages/detectron2
Compiler                GCC 9.4
CUDA compiler           CUDA 11.0
detectron2 arch flags   7.5
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.11.0 @/opt/conda/lib/python3.7/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0,1                 Tesla T4 (arch=7.5)
Driver version          470.82.01
CUDA_HOME               /usr/local/cuda
Pillow                  9.1.1
torchvision             0.12.0 @/opt/conda/lib/python3.7/site-packages/torchvision
torchvision arch flags  3.7, 6.0, 7.0, 7.5
fvcore                  0.1.5.post20220512
iopath                  0.1.9
cv2                     4.5.4
----------------------  -------------------------------------------------------------------------------
PyTorch built with:
  - GCC 9.4
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2022.1-Product Build 20220311 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.0
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_70,code=compute_70;-gencode;arch=compute_75,code=compute_75
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.0, CUDNN_VERSION=8.0.5, CXX_COMPILER=/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

Testing NCCL connectivity ... this should not hang.
NCCL succeeded.
  • Google Colab:
----------------------  ----------------------------------------------------------------
sys.platform            linux
Python                  3.7.15 (default, Oct 12 2022, 19:14:55) [GCC 7.5.0]
numpy                   1.21.6
detectron2              0.6 @/usr/local/lib/python3.7/dist-packages/detectron2
Compiler                GCC 7.5
CUDA compiler           CUDA 11.2
detectron2 arch flags   7.5
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.12.1+cu113 @/usr/local/lib/python3.7/dist-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   Tesla T4 (arch=7.5)
Driver version          460.32.03
CUDA_HOME               /usr/local/cuda
Pillow                  7.1.2
torchvision             0.13.1+cu113 @/usr/local/lib/python3.7/dist-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.5.post20220512
iopath                  0.1.9
cv2                     4.6.0
----------------------  ----------------------------------------------------------------
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

Solution

  • It was caused by the SyncBatchNorm layers. I already knew these can be problematic on a single GPU in PyTorch, but I somehow failed to spot them in the log!

    The workaround, as stated here (see the sketch after this list):

    1. Print the config.
    2. Find all keys that have a value of "SyncBN" or similar.
    3. Edit the config file or code to set these values to "BN" instead.
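    On my setup this amounted to changing the two NORM keys that the DeepLab project config sets to "SyncBN". The exact keys can differ between configs, so the lines below are a sketch based on what print(cfg) showed for my run, not a universal fix:

    # Replace SyncBN with plain BN so single-GPU training no longer needs
    # torch.distributed to be initialized. Run print(cfg) first and change
    # every key whose value is "SyncBN"; for my config these were the
    # backbone and semantic-segmentation-head norms.
    cfg.MODEL.RESNETS.NORM = "BN"
    cfg.MODEL.SEM_SEG_HEAD.NORM = "BN"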