
When training with Caffe should file lists be sorted?


When training with Caffe without LMDB files, one has to provide listing files for the training and validation inputs. Typically these two listing files are called train.txt and val.txt. They have the same structure, like so:

/path/to/a/file.jpg 0
/path/to/another/file.jpg 0
...
/path/to/another/file.jpg M
/path/to/another/file.jpg M
...
/path/to/another/file.jpg N
/path/to/another/file.jpg N

for a set of N+1 categories.

The train.txt and val.txt are then referenced in train_val.prototxt in the stanzas for TRAIN phase and TEST phase respectively.

My question: should train.txt and val.txt be sorted by category number (i.e., by the numeric second field)?

Reason for asking: in the examples, the files are always sorted by category number. If I randomly shuffle train.txt and val.txt, training does not break: caffe.bin neither crashes nor reports warnings. On the other hand, I don't know whether Caffe reads train.txt and val.txt line by line or samples them randomly.


Solution

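  • Caffe's ImageData layer reads the listing file in line-by-line order by default, and can optionally shuffle it: https://github.com/BVLC/caffe/blob/2a1c552b66f026c7508d390b526f2495ed3be594/src/caffe/layers/image_data_layer.cpp#L51

    To enable shuffling, add shuffle: true to the image_data_param block of your ImageData layer (https://github.com/BVLC/caffe/blob/2a1c552b66f026c7508d390b526f2495ed3be594/src/caffe/proto/caffe.proto#L810). The default is shuffle: false, so a listing sorted by category is read in that order and consecutive batches will mostly contain a single category. For training you will therefore generally want either shuffle: true or a train.txt that has already been shuffled.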
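
As a rough sketch of what that could look like in train_val.prototxt, here is a TRAIN-phase ImageData layer with shuffling turned on. The layer name, the top names, the batch size and the source path are placeholders chosen for illustration, not values from the question. Only the shuffle field is the point here.

```protobuf
# Sketch only: name, tops, batch_size and source are illustrative placeholders.
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  image_data_param {
    source: "train.txt"   # listing file: one "<path> <label>" pair per line
    batch_size: 32        # assumed value; pick one that fits your GPU memory
    shuffle: true         # randomize the order of the listing (default: false)
  }
}
```

For the TEST-phase layer that reads val.txt, the order matters far less, since validation only aggregates metrics over the batches, so shuffle can usually be left at its default there.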