I'd like to train my network with Caffe, but unfortunately, when I try to run train.sh, the process aborts without showing any specific error message. I have already created my pre-trained weights, my model.prototxt, and the LMDB database, which I checked and which appears to be fine. Here is my console output (only the relevant parts, because of the character limit):
I0504 06:37:33.873118 50237 caffe.cpp:210] Use CPU.
I0504 06:37:33.874349 50237 solver.cpp:63] Initializing solver from parameters:
train_net: "example/MobileNetSSD_train.prototxt"
test_net: "example/MobileNetSSD_test.prototxt"
test_iter: 673
test_interval: 10000
base_lr: 0.0005
display: 10
max_iter: 120000
lr_policy: "multistep"
gamma: 0.5
weight_decay: 5e-05
snapshot: 1000
snapshot_prefix: "snapshot/mobilenet"
solver_mode: CPU
debug_info: false
train_state {
level: 0
stage: ""
}
snapshot_after_train: true
test_initialization: false
average_loss: 10
stepvalue: 20000
stepvalue: 40000
iter_size: 1
type: "RMSProp"
eval_type: "detection"
ap_version: "11point"
I0504 06:37:33.875725 50237 solver.cpp:96] Creating training net from train_net file: example/MobileNetSSD_train.prototxt
I0504 06:37:33.876616 50237 upgrade_proto.cpp:77] Attempting to upgrade batch norm layers using deprecated params: example/MobileNetSSD_train.prototxt
I0504 06:37:33.876662 50237 upgrade_proto.cpp:80] Successfully upgraded batch norm layers using deprecated params.
I0504 06:37:33.876909 50237 net.cpp:58] Initializing net from parameters:
name: "MobileNet-SSD"
state {
phase: TRAIN
level: 0
stage: ""
}
layer {
name: "data"
type: "AnnotatedData"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
scale: 0.007843
mirror: true
mean_value: 127.5
mean_value: 127.5
mean_value: 127.5
resize_param {
prob: 1
resize_mode: WARP
height: 300
width: 300
interp_mode: LINEAR
interp_mode: AREA
interp_mode: NEAREST
interp_mode: CUBIC
interp_mode: LANCZOS4
}
emit_constraint {
emit_type: CENTER
}
distort_param {
brightness_prob: 0.5
brightness_delta: 32
contrast_prob: 0.5
contrast_lower: 0.5
contrast_upper: 1.5
hue_prob: 0.5
hue_delta: 18
saturation_prob: 0.5
saturation_lower: 0.5
saturation_upper: 1.5
random_order_prob: 0
}
expand_param {
prob: 0.5
max_expand_ratio: 4
}
}
data_param {
source: "trainval_lmdb/"
batch_size: 24
backend: LMDB
}
annotated_data_param {
batch_sampler {
max_sample: 1
max_trials: 1
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.1
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.3
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.5
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.7
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
min_jaccard_overlap: 0.9
}
max_sample: 1
max_trials: 50
}
batch_sampler {
sampler {
min_scale: 0.3
max_scale: 1
min_aspect_ratio: 0.5
max_aspect_ratio: 2
}
sample_constraint {
max_jaccard_overlap: 1
}
max_sample: 1
max_trials: 50
}
label_map_file: "labelmap.prototxt"
}
}
layer {
name: "conv0"
type: "Convolution"
bottom: "data"
top: "conv0"
param {
lr_mult: 0.1
decay_mult: 0.1
}
convolution_param {
num_output: 32
bias_term: false
pad: 1
kernel_size: 3
stride: 2
weight_filler {
type: "msra"
}
}
}
layer {
name: "conv0/bn"
type: "BatchNorm"
bottom: "conv0"
top: "conv0"
}
layer {
name: "conv0/scale"
type: "Scale"
bottom: "conv0"
top: "conv0"
param {
lr_mult: 0.1
decay_mult: 0
}
param {
lr_mult: 0.2
decay_mult: 0
}
scale_param {
filler {
value: 1
}
bias_term: true
bias_filler {
value: 0
}
}
}
layer {
name: "conv0/relu"
type: "ReLU"
bottom: "conv0"
top: "conv0"
}
layer {
name: "conv1/dw"
type: "Convolution"
bottom: "conv0"
top: "conv1/dw"
param {
lr_mult: 0.1
decay_mult: 0.1
}
convolution_param {
num_output: 32
bias_term: false
pad: 1
kernel_size: 3
group: 32
weight_filler {
type: "msra"
}
engine: CAFFE
}
}
layer {
name: "conv1/dw/bn"
type: "BatchNorm"
bottom: "conv1/dw"
top: "conv1/dw"
}
layer {
name: "conv1/dw/scale"
type: "Scale"
bottom: "conv1/dw"
top: "conv1/dw"
param {
lr_mult: 0.1
decay_mult: 0
}
param {
lr_mult: 0.2
decay_mult: 0
}
scale_param {
filler {
value: 1
}
bias_term: true
bias_filler {
value: 0
}
}
}
[...]
layer {
name: "conv17_2/relu"
type: "ReLU"
bottom: "conv17_2"
top: "conv17_2"
}
layer {
name: "conv11_mbox_loc"
type: "Convolution"
bottom: "conv11"
top: "conv11_mbox_loc"
param {
lr_mult: 0.1
decay_mult: 0.1
}
param {
lr_mult: 0.2
decay_mult: 0
}
convolution_param {
num_output: 12
kernel_size: 1
weight_filler {
type: "msra"
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "conv11_mbox_loc_perm"
type: "Permute"
bottom: "conv11_mbox_loc"
top: "conv11_mbox_loc_perm"
permute_param {
order: 0
or
I0504 06:37:33.890111 50237 layer_factory.hpp:77] Creating layer data
I0504 06:37:33.890482 50237 net.cpp:100] Creating Layer data
I0504 06:37:33.890534 50237 net.cpp:408] data -> data
I0504 06:37:33.890727 50239 db_lmdb.cpp:35] Opened lmdb trainval_lmdb/
I0504 06:37:33.891376 50237 net.cpp:408] data -> label
I0504 06:37:33.895253 50237 annotated_data_layer.cpp:62] output data size: 24,3,300,300
I0504 06:37:33.895355 50237 net.cpp:150] Setting up data
I0504 06:37:33.895393 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.895494 50237 net.cpp:157] Top shape: 1 1 1 8 (8)
I0504 06:37:33.895525 50237 net.cpp:165] Memory required for data: 25920032
I0504 06:37:33.895558 50237 layer_factory.hpp:77] Creating layer data_data_0_split
I0504 06:37:33.895594 50237 net.cpp:100] Creating Layer data_data_0_split
I0504 06:37:33.895627 50237 net.cpp:434] data_data_0_split <- data
I0504 06:37:33.895660 50237 net.cpp:408] data_data_0_split -> data_data_0_split_0
I0504 06:37:33.895694 50237 net.cpp:408] data_data_0_split -> data_data_0_split_1
I0504 06:37:33.895726 50237 net.cpp:408] data_data_0_split -> data_data_0_split_2
I0504 06:37:33.895757 50237 net.cpp:408] data_data_0_split -> data_data_0_split_3
I0504 06:37:33.895817 50237 net.cpp:408] data_data_0_split -> data_data_0_split_4
I0504 06:37:33.895853 50237 net.cpp:408] data_data_0_split -> data_data_0_split_5
I0504 06:37:33.895884 50237 net.cpp:408] data_data_0_split -> data_data_0_split_6
I0504 06:37:33.895965 50237 net.cpp:150] Setting up data_data_0_split
I0504 06:37:33.896008 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896039 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896068 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896113 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896143 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896173 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896201 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896230 50237 net.cpp:165] Memory required for data: 207360032
I0504 06:37:33.896277 50237 layer_factory.hpp:77] Creating layer conv0
I0504 06:37:33.896404 50237 net.cpp:100] Creating Layer conv0
I0504 06:37:33.896438 50237 net.cpp:434] conv0 <- data_data_0_split_0
I0504 06:37:33.896469 50237 net.cpp:408] conv0 -> conv0
I0504 06:37:33.897195 50237 net.cpp:150] Setting up conv0
I0504 06:37:33.897239 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.897289 50237 net.cpp:165] Memory required for data: 276480032
I0504 06:37:33.897328 50237 layer_factory.hpp:77] Creating layer conv0/bn
I0504 06:37:33.897364 50237 net.cpp:100] Creating Layer conv0/bn
I0504 06:37:33.897394 50237 net.cpp:434] conv0/bn <- conv0
I0504 06:37:33.897423 50237 net.cpp:395] conv0/bn -> conv0 (in-place)
I0504 06:37:33.897517 50237 net.cpp:150] Setting up conv0/bn
I0504 06:37:33.897550 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.897580 50237 net.cpp:165] Memory required for data: 345600032
I0504 06:37:33.897611 50237 layer_factory.hpp:77] Creating layer conv0/scale
I0504 06:37:33.897644 50237 net.cpp:100] Creating Layer conv0/scale
I0504 06:37:33.897672 50237 net.cpp:434] conv0/scale <- conv0
I0504 06:37:33.897701 50237 net.cpp:395] conv0/scale -> conv0 (in-place)
I0504 06:37:33.898386 50237 layer_factory.hpp:77] Creating layer conv0/scale
I0504 06:37:33.898525 50237 net.cpp:150] Setting up conv0/scale
I0504 06:37:33.898561 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.898591 50237 net.cpp:165] Memory required for data: 414720032
I0504 06:37:33.898622 50237 layer_factory.hpp:77] Creating layer conv0/relu
I0504 06:37:33.898654 50237 net.cpp:100] Creating Layer conv0/relu
I0504 06:37:33.898684 50237 net.cpp:434] conv0/relu <- conv0
I0504 06:37:33.898712 50237 net.cpp:395] conv0/relu -> conv0 (in-place)
I0504 06:37:33.898746 50237 net.cpp:150] Setting up conv0/relu
I0504 06:37:33.898777 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.898805 50237 net.cpp:165] Memory required for data: 483840032
I0504 06:37:33.898833 50237 layer_factory.hpp:77] Creating layer conv1/dw
I0504 06:37:33.898864 50237 net.cpp:100] Creating Layer conv1/dw
I0504 06:37:33.898893 50237 net.cpp:434] conv1/dw <- conv0
I0504 06:37:33.898922 50237 net.cpp:408] conv1/dw -> conv1/dw
I0504 06:37:33.898962 50237 net.cpp:150] Setting up conv1/dw
I0504 06:37:33.898993 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.899021 50237 net.cpp:165] Memory required for data: 552960032
I0504 06:37:33.899050 50237 layer_factory.hpp:77] Creating layer conv1/dw/bn
[...]
I0504 06:37:33.985625 50237 layer_factory.hpp:77] Creating layer conv13/dw/scale
I0504 06:37:33.985718 50237 net.cpp:100] Creating Layer conv13/dw/scale
@ 0x7f192267c2c0 caffe::GenerateBatchSamples()
I0504 06:37:33.987087 50237 net.cpp:434] conv13/dw/scale <- conv13/dw
I0504 06:37:33.987202 50237 net.cpp:395] conv13/dw/scale -> conv13/dw (in-place)
I0504 06:37:33.987262 50237 layer_factory.hpp:77] Creating layer conv13/dw/scale
I0504 06:37:33.987337 50237 net.cpp:150] Setting up conv13/dw/scale
I0504 06:37:33.987366 50237 net.cpp:157] Top shape: 24 1024 10 10 (2457600)
I0504 06:37:33.987393 50237 net.cpp:165] Memory required for data: 3753455648
I0504 06:37:33.987419 50237 layer_factory.hpp:77] Creating layer conv13/dw/relu
I0504 06:37:33.987447 50237 net.cpp:100] Creating Layer conv13/dw/relu
I0504 06:37:33.987470 50237 net.cpp:434] conv13/dw/relu <- conv13/dw
I0504 06:37:33.987504 50237 net.cpp:395] conv13/dw/relu -> conv13/dw (in-place)
I0504 06:37:33.987534 50237 net.cpp:150] Setting up conv13/dw/relu
I0504 06:37:33.987557 50237 net.cpp:157] Top shape: 24 1024 10 10 (2457600)
I0504 06:37:33.987582 50237 net.cpp:165] Memory required for data: 3763286048
I0504 06:37:33.987607 50237 layer_factory.hpp:77] Creating layer conv13
I0504 06:37:33.987639 50237 net.cpp:100] Creating Layer conv13
I0504 06:37:33.987665 50237 net.cpp:434] conv13 <- conv13/dw
I0504 06:37:33.987691 50237 net.cpp:408] conv13 -> conv13
@ 0x7f19226dc732 caffe::AnnotatedDataLayer<>::load_batch()
@ 0x7f19226e000a caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
@ 0x7f191ec9fbcd (unknown)
@ 0x7f191c4326db start_thread
@ 0x7f19210eb88f clone
Aborted (core dumped)
I suspect it could be a memory problem, because it fails in the midst of building the conv layers (I am training on the CPU), but I already have my batch size down at 24. Does anyone know what exactly causes this problem and how to fix it? Thanks!
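(For reference, in case memory really were the culprit: the batch size is set in the data layer of MobileNetSSD_train.prototxt, as shown in the log above. Lowering it to rule out memory pressure would look like this; the value 8 is just an example:)

data_param {
  source: "trainval_lmdb/"
  batch_size: 8   # reduced from 24 to test for memory issues
  backend: LMDB
}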
After spending way too much time on this problem and trying endless solutions, I finally found what causes this issue. This error is particularly treacherous because, in most cases, it simply does not print an error message at all.
See the original thread here: https://github.com/weiliu89/caffe/issues/669#issuecomment-339542120
Before compiling, you must edit the source code slightly. Open caffe/src/caffe/util/math_functions.cpp; at line 247 you will find the following function, which you should edit to look like this:
template <typename Dtype>
void caffe_rng_uniform(const int n, Dtype a, Dtype b, Dtype* r) {
  CHECK_GE(n, 0);
  CHECK(r);
  if (a > b) {
    // Swap the bounds instead of failing the CHECK_LE below.
    Dtype c = a;
    a = b;
    b = c;
  }
  CHECK_LE(a, b);
  boost::uniform_real<Dtype> random_distribution(a, caffe_nextafter<Dtype>(b));
  boost::variate_generator<caffe::rng_t*, boost::uniform_real<Dtype> >
      variate_generator(caffe_rng(), random_distribution);
  for (int i = 0; i < n; ++i) {
    r[i] = variate_generator();
  }
}
Note that I just added an if statement (which swaps the variables a and b if a is larger than b) and removed the const qualifier from the Dtype a and Dtype b parameters. Then simply run:
make clean
make -j$(nproc)
make py -j$(nproc)
make test -j$(nproc)
make runtest -j$(nproc) # You should run the tests after compiling to make sure you don't run into any other unexpected error.
For me, this worked very well!
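For context, here is a minimal standalone sketch of what the patch does, using the C++ standard library instead of Boost. The function rng_uniform below is a hypothetical analogue I wrote for illustration, not Caffe's actual code: during SSD data augmentation, the batch sampler can end up calling the uniform RNG with reversed bounds (a > b), which trips the bounds check and aborts; swapping the bounds makes the call succeed instead.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical standalone analogue of the patched caffe_rng_uniform:
// if the caller passes reversed bounds (a > b), swap them instead of
// aborting on the bounds check.
std::vector<double> rng_uniform(int n, double a, double b) {
  if (a > b) std::swap(a, b);  // the guard the patch adds
  std::mt19937 gen(42);        // fixed seed, just for reproducibility here
  std::uniform_real_distribution<double> dist(a, b);
  std::vector<double> r(n);
  for (double& x : r) {
    x = dist(gen);
  }
  return r;
}
```

With the guard in place, a call like rng_uniform(100, 1.0, 0.3) returns samples in [0.3, 1.0] rather than crashing the data-loading thread.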