machine-learning caffe bus-error

What is causing Caffe to throw a Bus error?


Caffe has been crashing during an experiment I am running. The experiment involves training networks on different subsets of the same data using the AlexNet model. For each trial, I generate an LMDB for that particular subset of data and then modify my network .prototxt to match the parameters. For 40+ trials, I have had no issues. One particular trial, however, consistently crashes after 227 training iterations. The only error given is "Bus error (core dumped)". This happens regardless of whether I train on the GPU or the CPU.

Searching has turned up no one else reporting this error with Caffe. Apparently it is some sort of memory addressing error. I am using an NVIDIA DIGITS box with 64GB of RAM and 12GB of VRAM, and the system monitor shows that I am using nowhere near the system's full memory. I can provide my prototxt if it would be helpful. However, the dataset is too large to upload (>20GB).

I1128 12:50:01.558748 20000 solver.cpp:228] Iteration 227, loss = 5.8273
I1128 12:50:01.558786 20000 solver.cpp:244] Train net output #0: loss = 5.8273 (* 1 = 5.8273 loss)
I1128 12:50:01.558796 20000 sgd_solver.cpp:106] Iteration 227, lr = 0.001
Bus error (core dumped)

According to this question, bus errors are nonexistent on modern Intel machines, which is what I am using. What could be causing this problem?


Solution

  • I discovered the cause. I was generating the LMDB on a different computer and transferring it on a flash drive to the machine that runs Caffe. For some reason, copying the files to this flash drive led to the LMDB being truncated from ~20GB to 15GB, with no warning. I think Caffe crashed when it reached the unexpected end of the LMDB. That would be consistent with how LMDB reads data: it memory-maps the database file, and on Linux, touching a mapped page that lies past the end of the file raises SIGBUS. Retransferring the file and making sure it wasn't truncated solved the problem.
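One way to catch this kind of silent truncation before training is to compare the size and a checksum of the LMDB's data file on both machines. The following is only a minimal sketch, not part of the original fix: the helper names `file_fingerprint` and `same_file` and the example paths are hypothetical, and it assumes both copies can be read from the machine doing the comparison (otherwise, run `file_fingerprint` on each machine and compare the printed values by hand).

```python
import hashlib
import os


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for the file at `path`.

    The file is read in chunks so that a multi-gigabyte LMDB data file
    does not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


def same_file(src, dst):
    """Return True if both files have identical size and SHA-256 digest."""
    return file_fingerprint(src) == file_fingerprint(dst)


# Example usage (hypothetical paths; an LMDB directory stores its
# records in a file named data.mdb):
#
#   if not same_file("/media/usb/train_lmdb/data.mdb",
#                    "/data/train_lmdb/data.mdb"):
#       raise SystemExit("LMDB copy differs from the source; re-transfer it")
```

A size comparison alone would have caught the 20GB to 15GB truncation described above; the checksum additionally catches corruption that leaves the size unchanged.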