Tags: tensorflow, keras, deep-learning, gpu, custom-training

OOM when allocating tensor with shape[1,48,48,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc


I'm trying to reproduce the training of Mask R-CNN from the following repository: https://github.com/maxkferg/metal-defect-detection

The code snippet for training is the following:

        # Training - Stage 1
        print("Training network heads")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE,
                    epochs=40,
                    layers='heads')

        # Training - Stage 2
        # Finetune layers from ResNet stage 4 and up
        print("Fine tune Resnet stage 4 and up")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE,
                    epochs=120,
                    layers='4+')

        # Training - Stage 3
        # Fine tune all layers
        print("Fine tune all layers")
        model.train(dataset_train, dataset_val,
                    learning_rate=config.LEARNING_RATE / 10,
                    epochs=160,
                    layers='all')

Stage 1 runs smoothly, but training fails in Stage 2 with the following output:

    2020-08-17 15:53:10.685456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 123 Chunks of size 2048 totalling 246.0KiB
    2020-08-17 15:53:10.685456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2816 totalling 2.8KiB
    2020-08-17 15:53:10.686456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 3072 totalling 18.0KiB
    2020-08-17 15:53:10.686456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 387 Chunks of size 4096 totalling 1.51MiB
    2020-08-17 15:53:10.687456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 6144 totalling 6.0KiB
    2020-08-17 15:53:10.687456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 6656 totalling 6.5KiB
    2020-08-17 15:53:10.688456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 60 Chunks of size 8192 totalling 480.0KiB
    2020-08-17 15:53:10.688456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 9216 totalling 18.0KiB
    2020-08-17 15:53:10.689456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 12 Chunks of size 12288 totalling 144.0KiB
    2020-08-17 15:53:10.689456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 16384 totalling 32.0KiB
    2020-08-17 15:53:10.690456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 21248 totalling 20.8KiB
    2020-08-17 15:53:10.691456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 24064 totalling 23.5KiB
    2020-08-17 15:53:10.691456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 5 Chunks of size 24576 totalling 120.0KiB
    2020-08-17 15:53:10.692456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 37632 totalling 36.8KiB
    2020-08-17 15:53:10.692456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 40960 totalling 40.0KiB
    2020-08-17 15:53:10.693456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 4 Chunks of size 49152 totalling 192.0KiB
    2020-08-17 15:53:10.693456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 65536 totalling 384.0KiB
    2020-08-17 15:53:10.694456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 81920 totalling 80.0KiB
    2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 90624 totalling 88.5KiB
    2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 131072 totalling 128.0KiB
    2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 3 Chunks of size 147456 totalling 432.0KiB
    2020-08-17 15:53:10.696456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 12 Chunks of size 262144 totalling 3.00MiB
    2020-08-17 15:53:10.696456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 327680 totalling 320.0KiB
    2020-08-17 15:53:10.697457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 11 Chunks of size 524288 totalling 5.50MiB
    2020-08-17 15:53:10.697457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 4 Chunks of size 589824 totalling 2.25MiB
    2020-08-17 15:53:10.698457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 194 Chunks of size 1048576 totalling 194.00MiB
    2020-08-17 15:53:10.699457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 17 Chunks of size 2097152 totalling 34.00MiB
    2020-08-17 15:53:10.699457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2211840 totalling 2.11MiB
    2020-08-17 15:53:10.700457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 146 Chunks of size 2359296 totalling 328.50MiB
    2020-08-17 15:53:10.701457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2360320 totalling 2.25MiB
    2020-08-17 15:53:10.701457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2621440 totalling 2.50MiB
    2020-08-17 15:53:10.702457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2698496 totalling 2.57MiB
    2020-08-17 15:53:10.702457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 3670016 totalling 3.50MiB
    2020-08-17 15:53:10.703457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 31 Chunks of size 4194304 totalling 124.00MiB
    2020-08-17 15:53:10.703457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 4718592 totalling 27.00MiB
    2020-08-17 15:53:10.704457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 5 Chunks of size 8388608 totalling 40.00MiB
    2020-08-17 15:53:10.705457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 25 Chunks of size 9437184 totalling 225.00MiB
    2020-08-17 15:53:10.705457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 9438208 totalling 18.00MiB
    2020-08-17 15:53:10.706457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 9441280 totalling 9.00MiB
    2020-08-17 15:53:10.706457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 16138752 totalling 15.39MiB
    2020-08-17 15:53:10.707457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 18874368 totalling 18.00MiB
    2020-08-17 15:53:10.707457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 37748736 totalling 36.00MiB
    2020-08-17 15:53:10.708457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 7 Chunks of size 51380224 totalling 343.00MiB
    2020-08-17 15:53:10.708457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:684] Sum Total of in-use chunks: 1.41GiB
    2020-08-17 15:53:10.709457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:686] Stats: Limit: 1613615104 InUse: 1510723072 MaxInUse: 1510723072 NumAllocs: 3860 MaxAllocSize: 119947776

The training is running on a Quadro K420 with 2 GB of VRAM. Is this only a problem of low GPU memory, or am I missing something? Is there a way to train with my equipment?


Solution

  • The problem is the GPU memory of your video card.

    In the first stage you were able to train smoothly because only the network "heads" were trained, which means far fewer trainable parameters and therefore much less memory needed for activations, gradients, and optimizer state.

    In the second stage you ran out of memory because unfreezing ResNet stage 4 and up trains many more layers, and their activations and gradients no longer fit on the card.

    I suggest using a video card with at least 8 GB of VRAM for computer-vision problems of this kind.

    Indeed, out-of-memory errors can sometimes be solved by reducing the batch size, but in your case the only viable solution is to opt for a bigger/better video card (see the sketch below for the configuration knobs that the batch-size option refers to).
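For completeness, here is a minimal sketch of where those memory-related knobs usually live in Matterport-style Mask R-CNN code, which this repository appears to follow. The attribute names (IMAGES_PER_GPU, IMAGE_MAX_DIM, TRAIN_ROIS_PER_IMAGE) are the ones from the standard Config class; check the repository's own config.py before relying on them, and in practice you would apply these overrides to the repository's own training config subclass rather than the base class. Note that if IMAGES_PER_GPU is already 1, there is little left to cut, which is exactly why a bigger card is the realistic fix.

    # Hypothetical low-memory configuration, assuming the standard
    # Matterport Mask R-CNN Config class used by this kind of repository.
    # Adjust the import to the actual layout (e.g. from mrcnn.config import Config).
    from config import Config


    class LowMemoryConfig(Config):
        """Trade accuracy for memory so training fits on a small GPU."""

        NAME = "defects_low_memory"

        # One image per GPU is the smallest possible batch size.
        GPU_COUNT = 1
        IMAGES_PER_GPU = 1

        # Smaller inputs shrink the intermediate feature maps
        # (like the 48x48x1024 tensor the allocator could not fit).
        # IMAGE_MAX_DIM must stay divisible by 64.
        IMAGE_MIN_DIM = 512
        IMAGE_MAX_DIM = 512

        # Fewer ROIs per image further reduces activation memory.
        TRAIN_ROIS_PER_IMAGE = 100


    # The config is then passed to the model exactly as in the repository's
    # training script, e.g.:
    # model = modellib.MaskRCNN(mode="training", config=LowMemoryConfig(),
    #                           model_dir=MODEL_DIR)

Even with all of these reduced, a 2 GB card leaves very little headroom once the ResNet backbone is unfrozen, so treat this as a way to confirm the diagnosis rather than a substitute for more VRAM.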