c++ out-of-memory caffe training-data mobilenet

Aborted Caffe training - no error message


I'd like to train my network with Caffe, but when I run train.sh the process is aborted without any specific error message. I have already created my pre-trained weights, my model.prototxt, and the LMDB database, which I checked and which looks fine. Here is my console output (only the interesting parts, because of the character limit):

I0504 06:37:33.873118 50237 caffe.cpp:210] Use CPU.
I0504 06:37:33.874349 50237 solver.cpp:63] Initializing solver from parameters: 
train_net: "example/MobileNetSSD_train.prototxt"
test_net: "example/MobileNetSSD_test.prototxt"
test_iter: 673
test_interval: 10000
base_lr: 0.0005
display: 10
max_iter: 120000
lr_policy: "multistep"
gamma: 0.5
weight_decay: 5e-05
snapshot: 1000
snapshot_prefix: "snapshot/mobilenet"
solver_mode: CPU
debug_info: false
train_state {
  level: 0
  stage: ""
}
snapshot_after_train: true
test_initialization: false
average_loss: 10
stepvalue: 20000
stepvalue: 40000
iter_size: 1
type: "RMSProp"
eval_type: "detection"
ap_version: "11point"
I0504 06:37:33.875725 50237 solver.cpp:96] Creating training net from train_net file: example/MobileNetSSD_train.prototxt
I0504 06:37:33.876616 50237 upgrade_proto.cpp:77] Attempting to upgrade batch norm layers using deprecated params: example/MobileNetSSD_train.prototxt
I0504 06:37:33.876662 50237 upgrade_proto.cpp:80] Successfully upgraded batch norm layers using deprecated params.
I0504 06:37:33.876909 50237 net.cpp:58] Initializing net from parameters: 
name: "MobileNet-SSD"
state {
  phase: TRAIN
  level: 0
  stage: ""
}
layer {
  name: "data"
  type: "AnnotatedData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    scale: 0.007843
    mirror: true
    mean_value: 127.5
    mean_value: 127.5
    mean_value: 127.5
    resize_param {
      prob: 1
      resize_mode: WARP
      height: 300
      width: 300
      interp_mode: LINEAR
      interp_mode: AREA
      interp_mode: NEAREST
      interp_mode: CUBIC
      interp_mode: LANCZOS4
    }
    emit_constraint {
      emit_type: CENTER
    }
    distort_param {
      brightness_prob: 0.5
      brightness_delta: 32
      contrast_prob: 0.5
      contrast_lower: 0.5
      contrast_upper: 1.5
      hue_prob: 0.5
      hue_delta: 18
      saturation_prob: 0.5
      saturation_lower: 0.5
      saturation_upper: 1.5
      random_order_prob: 0
    }
    expand_param {
      prob: 0.5
      max_expand_ratio: 4
    }
  }
  data_param {
    source: "trainval_lmdb/"
    batch_size: 24
    backend: LMDB
  }
  annotated_data_param {
    batch_sampler {
      max_sample: 1
      max_trials: 1
    }
    batch_sampler {
      sampler {
        min_scale: 0.3
        max_scale: 1
        min_aspect_ratio: 0.5
        max_aspect_ratio: 2
      }
      sample_constraint {
        min_jaccard_overlap: 0.1
      }
      max_sample: 1
      max_trials: 50
    }
    batch_sampler {
      sampler {
        min_scale: 0.3
        max_scale: 1
        min_aspect_ratio: 0.5
        max_aspect_ratio: 2
      }
      sample_constraint {
        min_jaccard_overlap: 0.3
      }
      max_sample: 1
      max_trials: 50
    }
    batch_sampler {
      sampler {
        min_scale: 0.3
        max_scale: 1
        min_aspect_ratio: 0.5
        max_aspect_ratio: 2
      }
      sample_constraint {
        min_jaccard_overlap: 0.5
      }
      max_sample: 1
      max_trials: 50
    }
    batch_sampler {
      sampler {
        min_scale: 0.3
        max_scale: 1
        min_aspect_ratio: 0.5
        max_aspect_ratio: 2
      }
      sample_constraint {
        min_jaccard_overlap: 0.7
      }
      max_sample: 1
      max_trials: 50
    }
    batch_sampler {
      sampler {
        min_scale: 0.3
        max_scale: 1
        min_aspect_ratio: 0.5
        max_aspect_ratio: 2
      }
      sample_constraint {
        min_jaccard_overlap: 0.9
      }
      max_sample: 1
      max_trials: 50
    }
    batch_sampler {
      sampler {
        min_scale: 0.3
        max_scale: 1
        min_aspect_ratio: 0.5
        max_aspect_ratio: 2
      }
      sample_constraint {
        max_jaccard_overlap: 1
      }
      max_sample: 1
      max_trials: 50
    }
    label_map_file: "labelmap.prototxt"
  }
}
layer {
  name: "conv0"
  type: "Convolution"
  bottom: "data"
  top: "conv0"
  param {
    lr_mult: 0.1
    decay_mult: 0.1
  }
  convolution_param {
    num_output: 32
    bias_term: false
    pad: 1
    kernel_size: 3
    stride: 2
    weight_filler {
      type: "msra"
    }
  }
}
layer {
  name: "conv0/bn"
  type: "BatchNorm"
  bottom: "conv0"
  top: "conv0"
}
layer {
  name: "conv0/scale"
  type: "Scale"
  bottom: "conv0"
  top: "conv0"
  param {
    lr_mult: 0.1
    decay_mult: 0
  }
  param {
    lr_mult: 0.2
    decay_mult: 0
  }
  scale_param {
    filler {
      value: 1
    }
    bias_term: true
    bias_filler {
      value: 0
    }
  }
}
layer {
  name: "conv0/relu"
  type: "ReLU"
  bottom: "conv0"
  top: "conv0"
}
layer {
  name: "conv1/dw"
  type: "Convolution"
  bottom: "conv0"
  top: "conv1/dw"
  param {
    lr_mult: 0.1
    decay_mult: 0.1
  }
  convolution_param {
    num_output: 32
    bias_term: false
    pad: 1
    kernel_size: 3
    group: 32
    weight_filler {
      type: "msra"
    }
    engine: CAFFE
  }
}
layer {
  name: "conv1/dw/bn"
  type: "BatchNorm"
  bottom: "conv1/dw"
  top: "conv1/dw"
}
layer {
  name: "conv1/dw/scale"
  type: "Scale"
  bottom: "conv1/dw"
  top: "conv1/dw"
  param {
    lr_mult: 0.1
    decay_mult: 0
  }
  param {
    lr_mult: 0.2
    decay_mult: 0
  }
  scale_param {
    filler {
      value: 1
    }
    bias_term: true
    bias_filler {
      value: 0
    }
  }
}

[...]

layer {
  name: "conv17_2/relu"
  type: "ReLU"
  bottom: "conv17_2"
  top: "conv17_2"
}
layer {
  name: "conv11_mbox_loc"
  type: "Convolution"
  bottom: "conv11"
  top: "conv11_mbox_loc"
  param {
    lr_mult: 0.1
    decay_mult: 0.1
  }
  param {
    lr_mult: 0.2
    decay_mult: 0
  }
  convolution_param {
    num_output: 12
    kernel_size: 1
    weight_filler {
      type: "msra"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "conv11_mbox_loc_perm"
  type: "Permute"
  bottom: "conv11_mbox_loc"
  top: "conv11_mbox_loc_perm"
  permute_param {
    order: 0
    or
I0504 06:37:33.890111 50237 layer_factory.hpp:77] Creating layer data
I0504 06:37:33.890482 50237 net.cpp:100] Creating Layer data
I0504 06:37:33.890534 50237 net.cpp:408] data -> data
I0504 06:37:33.890727 50239 db_lmdb.cpp:35] Opened lmdb trainval_lmdb/
I0504 06:37:33.891376 50237 net.cpp:408] data -> label
I0504 06:37:33.895253 50237 annotated_data_layer.cpp:62] output data size: 24,3,300,300
I0504 06:37:33.895355 50237 net.cpp:150] Setting up data
I0504 06:37:33.895393 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.895494 50237 net.cpp:157] Top shape: 1 1 1 8 (8)
I0504 06:37:33.895525 50237 net.cpp:165] Memory required for data: 25920032
I0504 06:37:33.895558 50237 layer_factory.hpp:77] Creating layer data_data_0_split
I0504 06:37:33.895594 50237 net.cpp:100] Creating Layer data_data_0_split
I0504 06:37:33.895627 50237 net.cpp:434] data_data_0_split <- data
I0504 06:37:33.895660 50237 net.cpp:408] data_data_0_split -> data_data_0_split_0
I0504 06:37:33.895694 50237 net.cpp:408] data_data_0_split -> data_data_0_split_1
I0504 06:37:33.895726 50237 net.cpp:408] data_data_0_split -> data_data_0_split_2
I0504 06:37:33.895757 50237 net.cpp:408] data_data_0_split -> data_data_0_split_3
I0504 06:37:33.895817 50237 net.cpp:408] data_data_0_split -> data_data_0_split_4
I0504 06:37:33.895853 50237 net.cpp:408] data_data_0_split -> data_data_0_split_5
I0504 06:37:33.895884 50237 net.cpp:408] data_data_0_split -> data_data_0_split_6
I0504 06:37:33.895965 50237 net.cpp:150] Setting up data_data_0_split
I0504 06:37:33.896008 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896039 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896068 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896113 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896143 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896173 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896201 50237 net.cpp:157] Top shape: 24 3 300 300 (6480000)
I0504 06:37:33.896230 50237 net.cpp:165] Memory required for data: 207360032
I0504 06:37:33.896277 50237 layer_factory.hpp:77] Creating layer conv0
I0504 06:37:33.896404 50237 net.cpp:100] Creating Layer conv0
I0504 06:37:33.896438 50237 net.cpp:434] conv0 <- data_data_0_split_0
I0504 06:37:33.896469 50237 net.cpp:408] conv0 -> conv0
I0504 06:37:33.897195 50237 net.cpp:150] Setting up conv0
I0504 06:37:33.897239 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.897289 50237 net.cpp:165] Memory required for data: 276480032
I0504 06:37:33.897328 50237 layer_factory.hpp:77] Creating layer conv0/bn
I0504 06:37:33.897364 50237 net.cpp:100] Creating Layer conv0/bn
I0504 06:37:33.897394 50237 net.cpp:434] conv0/bn <- conv0
I0504 06:37:33.897423 50237 net.cpp:395] conv0/bn -> conv0 (in-place)
I0504 06:37:33.897517 50237 net.cpp:150] Setting up conv0/bn
I0504 06:37:33.897550 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.897580 50237 net.cpp:165] Memory required for data: 345600032
I0504 06:37:33.897611 50237 layer_factory.hpp:77] Creating layer conv0/scale
I0504 06:37:33.897644 50237 net.cpp:100] Creating Layer conv0/scale
I0504 06:37:33.897672 50237 net.cpp:434] conv0/scale <- conv0
I0504 06:37:33.897701 50237 net.cpp:395] conv0/scale -> conv0 (in-place)
I0504 06:37:33.898386 50237 layer_factory.hpp:77] Creating layer conv0/scale
I0504 06:37:33.898525 50237 net.cpp:150] Setting up conv0/scale
I0504 06:37:33.898561 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.898591 50237 net.cpp:165] Memory required for data: 414720032
I0504 06:37:33.898622 50237 layer_factory.hpp:77] Creating layer conv0/relu
I0504 06:37:33.898654 50237 net.cpp:100] Creating Layer conv0/relu
I0504 06:37:33.898684 50237 net.cpp:434] conv0/relu <- conv0
I0504 06:37:33.898712 50237 net.cpp:395] conv0/relu -> conv0 (in-place)
I0504 06:37:33.898746 50237 net.cpp:150] Setting up conv0/relu
I0504 06:37:33.898777 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.898805 50237 net.cpp:165] Memory required for data: 483840032
I0504 06:37:33.898833 50237 layer_factory.hpp:77] Creating layer conv1/dw
I0504 06:37:33.898864 50237 net.cpp:100] Creating Layer conv1/dw
I0504 06:37:33.898893 50237 net.cpp:434] conv1/dw <- conv0
I0504 06:37:33.898922 50237 net.cpp:408] conv1/dw -> conv1/dw
I0504 06:37:33.898962 50237 net.cpp:150] Setting up conv1/dw
I0504 06:37:33.898993 50237 net.cpp:157] Top shape: 24 32 150 150 (17280000)
I0504 06:37:33.899021 50237 net.cpp:165] Memory required for data: 552960032
I0504 06:37:33.899050 50237 layer_factory.hpp:77] Creating layer conv1/dw/bn

[...]

I0504 06:37:33.985625 50237 layer_factory.hpp:77] Creating layer conv13/dw/scale
I0504 06:37:33.985718 50237 net.cpp:100] Creating Layer conv13/dw/scale
    @     0x7f192267c2c0  caffe::GenerateBatchSamples()
I0504 06:37:33.987087 50237 net.cpp:434] conv13/dw/scale <- conv13/dw
I0504 06:37:33.987202 50237 net.cpp:395] conv13/dw/scale -> conv13/dw (in-place)
I0504 06:37:33.987262 50237 layer_factory.hpp:77] Creating layer conv13/dw/scale
I0504 06:37:33.987337 50237 net.cpp:150] Setting up conv13/dw/scale
I0504 06:37:33.987366 50237 net.cpp:157] Top shape: 24 1024 10 10 (2457600)
I0504 06:37:33.987393 50237 net.cpp:165] Memory required for data: 3753455648
I0504 06:37:33.987419 50237 layer_factory.hpp:77] Creating layer conv13/dw/relu
I0504 06:37:33.987447 50237 net.cpp:100] Creating Layer conv13/dw/relu
I0504 06:37:33.987470 50237 net.cpp:434] conv13/dw/relu <- conv13/dw
I0504 06:37:33.987504 50237 net.cpp:395] conv13/dw/relu -> conv13/dw (in-place)
I0504 06:37:33.987534 50237 net.cpp:150] Setting up conv13/dw/relu
I0504 06:37:33.987557 50237 net.cpp:157] Top shape: 24 1024 10 10 (2457600)
I0504 06:37:33.987582 50237 net.cpp:165] Memory required for data: 3763286048
I0504 06:37:33.987607 50237 layer_factory.hpp:77] Creating layer conv13
I0504 06:37:33.987639 50237 net.cpp:100] Creating Layer conv13
I0504 06:37:33.987665 50237 net.cpp:434] conv13 <- conv13/dw
I0504 06:37:33.987691 50237 net.cpp:408] conv13 -> conv13
    @     0x7f19226dc732  caffe::AnnotatedDataLayer<>::load_batch()
    @     0x7f19226e000a  caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
    @     0x7f191ec9fbcd  (unknown)
    @     0x7f191c4326db  start_thread
    @     0x7f19210eb88f  clone
Aborted (core dumped)

I suspect a memory problem, because it fails in the midst of building the conv layers (I am training on CPU), but my batch size is already at 24. Does anyone know what exactly causes this problem and how to fix it? Thanks!


Solution

  • After spending far too much time on this problem and trying countless solutions, I finally found the cause. This error is particularly treacherous because in most cases it gives no error message at all.

    See the original thread here: https://github.com/weiliu89/caffe/issues/669#issuecomment-339542120

    Before compiling, you must edit the source code slightly. Go to caffe/src/caffe/util/math_functions.cpp; around line 247 you will find this function, which you should edit to look like this:

    template <typename Dtype>
    void caffe_rng_uniform(const int n, Dtype a, Dtype b, Dtype* r) {
      CHECK_GE(n, 0);
      CHECK(r);

      // Swap the bounds if they arrive reversed, so the check below cannot abort.
      if (a > b) {
        Dtype c = a;
        a = b;
        b = c;
      }
      CHECK_LE(a, b);
      // Draw n samples uniformly from [a, b].
      boost::uniform_real<Dtype> random_distribution(a, caffe_nextafter<Dtype>(b));
      boost::variate_generator<caffe::rng_t*, boost::uniform_real<Dtype> >
          variate_generator(caffe_rng(), random_distribution);
      for (int i = 0; i < n; ++i) {
        r[i] = variate_generator();
      }
    }
    

    Note that I only added an if statement (which swaps a and b if a is larger than b) and removed the const qualifier from Dtype a and Dtype b in the parameter list. Then simply rebuild:

    make clean
    make -j$(nproc)
    make py -j$(nproc)
    make test -j$(nproc)
    make runtest -j$(nproc) # You should run the tests after compiling to make sure you don't run into any other unexpected error.
    

    For me, this worked very well!