I've implemented a neural network in MATLAB to get a better understanding of the topic. I wanted to run the code on my GPU, so I initialized every matrix with `gpuArray()`, but got no performance boost. Moreover, the GPU is sometimes slower than the CPU. I have already learned to use functions like `arrayfun`, `pagefun`, and so on.
In backprop I have a `for` loop that computes the delta error for every layer, backwards. However, each iteration needs the result of the previous one, and I have no idea how to express that with the `*fun()` functions.
My CPU is an i5-3570 and my GPU is a GTX 660 Ti. I already ran GPUBench in MATLAB; the GPU is x times faster than the CPU, so I think the mistake is in my code.
TL;DR: How do I improve this MATLAB code for GPU computing?
```matlab
delta_output = (predicted - NN.Y) .* NN.activationGradient(predicted);
delta_hidden(:, :, m) = (delta_output * NN.Theta_output) .* ...
                        NN.activationGradient(NN.a_hidden(:, :, m));

for i = m-1:-1:1
    delta_hidden(:, :, i) = (delta_hidden(:, 2:end, i+1) * ...
                             NN.Theta_hidden(:, :, i)) .* ...
                            NN.activationGradient(NN.a_hidden(:, :, i));
end
```
`predicted`, `NN.Y`, and `NN.Theta_*` are all `gpuArray`s. I also initialized `delta_*` as a `gpuArray`, but it doesn't make any difference.
The advantage of using the GPU for neural networks comes not from computing the updates for every layer at once - that's inherently serial, as you point out. It comes from being able to compute the update for the weights on thousands of neurons in each layer at once.
So I suspect that you simply do not have a large enough network to make using the GPU advantageous. What is the size of your weight matrix at each layer? If it doesn't contain at least 1000 elements, you're probably not going to see much advantage over the highly optimised, multi-core, intrinsically vectorised computation that your CPU is doing.
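To make that concrete, here is a minimal timing sketch (assuming the Parallel Computing Toolbox is installed, which provides `gpuArray` and `gputimeit`) that compares one layer-sized matrix product on the CPU and the GPU at two layer sizes. The sizes 64 and 4096 are just illustrative stand-ins for a small and a large layer; the exact crossover point depends on your hardware:

```matlab
% Compare CPU vs GPU time for one layer-sized matrix product.
% n = 64 stands in for a small layer, n = 4096 for a large one.
for n = [64 4096]
    A  = rand(n);   B  = rand(n);
    gA = gpuArray(A);  gB = gpuArray(B);

    tCpu = timeit(@() A * B);        % median CPU time
    tGpu = gputimeit(@() gA * gB);   % median GPU time, incl. synchronisation

    fprintf('n = %4d: CPU %.3g s, GPU %.3g s\n', n, tCpu, tGpu);
end
```

On hardware like a GTX 660 Ti you would typically expect the small case to favour the CPU (kernel-launch and synchronisation overhead dominates the tiny amount of arithmetic) and only the large case to favour the GPU.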