python, apache, deep-learning, mxnet

Training a neural network with Apache MXNet (Gluon) causes the program to crash


I am trying to train a convolutional neural network on a set of images I want to classify, using MXNet's Gluon API. However, the same network and code sometimes produce wildly different results for the same data, and on occasion the program simply crashes and refuses to run. My code is below, after some additional information about the data.

Additional information:

All images are 131 x 131 px, with 176 training images per class (2 classes) and 40 test images per class. I'm confused as to why the same program, run on the same data, sometimes produces output and other times simply crashes.

Imports

from __future__ import print_function
import mxnet as mx
import numpy as np
from mxnet import nd, autograd, gluon
import time
mx.random.seed(1)

Setting context

ctx = mx.cpu()

Defining callback transform function

def transform(data, label):
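    # Convert the HWC image to CHW float32 and scale pixel values to [0, 1]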
    return nd.transpose(data.astype(np.float32), (2, 0, 1))/255, label

Defining batch size and number of nodes in the output layer

batch_size = 5
num_outputs = 2

Load training and test data

train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.ImageFolderDataset("/somepath/train", 0, transform), batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.ImageFolderDataset("/somepath/test", 0, transform), batch_size, shuffle=False)

Define CNN using gluon.nn

neural_net = gluon.nn.Sequential()
num_fc = 512

with neural_net.name_scope():
    neural_net.add(gluon.nn.Conv2D(channels=20, kernel_size=5, activation='relu'))
    neural_net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
    neural_net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='relu'))
    neural_net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
    neural_net.add(gluon.nn.Flatten())
    neural_net.add(gluon.nn.Dense(num_fc, activation="relu"))
    neural_net.add(gluon.nn.Dense(num_outputs))

Initialize params, loss fn, and trainer object

neural_net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(neural_net.collect_params(), 'adadelta')

Training Loop

total_time = 0
for e in range(2):
    tick = time.time()
    for idx, (dpoint, label) in enumerate(train_data):
        data = dpoint.as_in_context(ctx)
        label = label.as_in_context(ctx)

        with autograd.record():
            output = neural_net(data)
            loss2 = cross_entropy(output, label)

        loss2.backward()    
        trainer.step(data.shape[0])
    tock = time.time()
    print("Epoch %s. Took %s seconds to train" %(e, tock-tick))
    total_time += tock-tick
print("Total training time: %s" %(total_time))

Measuring accuracy

acc = mx.metric.Accuracy()
for idx, (data, label) in enumerate(test_data):
    something = data.as_in_context(ctx)
    something_label = label.as_in_context(ctx)

    output2 = neural_net(something)
    predictions = nd.argmax(output2, axis=1)

    acc.update(something_label, predictions)  # Accuracy.update expects (labels, preds)
print(acc.get()[-1])

Solution

  • Your network is probably just taking a long time to compute the forward and backward passes over the data. I tracked the perceived unresponsiveness down to the acc.update call (a little later than the neural_net(...) call). Digging deeper into that function, the wait comes from nd.asnumpy resolving.

    The confusion stems from the fact that MXNet NDArray computations are asynchronous. All of the forward/backward operations in the training loop appear to complete instantly, but they are actually added to a queue for processing. It's only when data is brought back into the Python process (via nd.asnumpy) that you have to wait for the relevant operations to finish, and this happens for the first time in acc.update.

    Another way to benchmark specific code blocks is mx.nd.waitall(), which blocks until the computation queue is empty. Adding it to your training loop (see the sketch at the end of this answer) shows that each epoch takes much longer than it initially appears to.

    Using a GPU would likely help with this apparent unresponsiveness.
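
As a minimal sketch of that suggestion (reusing train_data, ctx, neural_net, cross_entropy, and trainer exactly as defined in the question), the training loop can be re-timed with mx.nd.waitall() so that each epoch's reported time includes the queued forward/backward work rather than just the time needed to enqueue it:

total_time = 0
for e in range(2):
    tick = time.time()
    for idx, (dpoint, label) in enumerate(train_data):
        data = dpoint.as_in_context(ctx)
        label = label.as_in_context(ctx)

        with autograd.record():
            output = neural_net(data)
            loss2 = cross_entropy(output, label)

        loss2.backward()
        trainer.step(data.shape[0])

    # Block until the asynchronous computation queue is empty, so the timer
    # measures the actual forward/backward work instead of just enqueueing it.
    mx.nd.waitall()
    tock = time.time()
    print("Epoch %s. Took %s seconds to train" % (e, tock - tick))
    total_time += tock - tick
print("Total training time: %s" % (total_time))

The same stall can be observed without waitall() by calling loss2.asnumpy() (or simply printing the loss), since bringing data back into the Python process forces the pending operations to finish first.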