neural-network · backpropagation · gradient-descent

How can I add concurrency to neural network processing?


The basics of neural networks, as I understand them, are that there are several inputs, weights, and outputs. There can be hidden layers that add to the complexity of the whole thing.

If I have 100 inputs, 5 hidden layers and one output (yes or no), presumably there will be a LOT of connections. Somewhere on the order of 100^5. Doing backpropagation via gradient descent seems like it would take a VERY long time.

How can I set up backpropagation in a way that is parallel (concurrent), to take advantage of multicore processors (or multiple processors)?

This is a language-agnostic question because I am simply trying to understand the structure.


Solution

  • If you have 5 hidden layers (assuming 100 nodes each) you have roughly 5 * 100^2 weights (assuming the bias node is included in the 100 nodes), not 100^5, because there are only 100^2 weights between any two consecutive layers; a quick sanity check of this count is sketched after this answer.

    If you use gradient descent, you'll have to calculate the contribution of each training sample to the gradient, so a natural way of distributing this across cores would be to spread the training samples across the cores and sum their contributions to the gradient at the end (see the data-parallel sketch below).

    With backpropagation, you can use batch backpropagation: accumulate weight changes from several training samples before updating the weights (see e.g. https://stackoverflow.com/a/11415434/288875, and the batched sketch below).

    I would think that the first option is much more cache-friendly (updates need to be merged only once between processors in each step).
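
As a quick sanity check of the weight count above, a minimal Python snippet (the layer sizes are just the ones assumed in the answer, biases ignored):

```python
# Hypothetical layer sizes: 100 inputs, 5 hidden layers of 100 nodes each, 1 output.
layer_sizes = [100, 100, 100, 100, 100, 100, 1]

# One weight matrix between each pair of consecutive layers: fan_in * fan_out weights.
n_weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
print(n_weights)  # 50100 -- about 5 * 100^2, nowhere near 100^5
```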
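
To make the data-parallel option concrete, here is a minimal sketch, assuming Python with NumPy and a process pool; the tiny one-hidden-layer network, the squared-error loss, and names like `chunk_gradient` are illustrative, not from the linked answer. Each worker computes the gradient contribution of its share of the training samples, and the contributions are summed before a single weight update.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def chunk_gradient(args):
    """Summed gradient of the squared error for one chunk of training samples."""
    W1, W2, X, y = args
    h = np.tanh(X @ W1)                       # hidden activations
    out = h @ W2                              # linear output
    err = out - y                             # output error, shape (n, 1)
    gW2 = h.T @ err                           # backprop into hidden-to-output weights
    gW1 = X.T @ ((err @ W2.T) * (1 - h**2))   # backprop through tanh into input weights
    return gW1, gW2

def parallel_gradient(W1, W2, X, y, n_workers=4):
    # Spread the training samples across the workers ...
    X_chunks = np.array_split(X, n_workers)
    y_chunks = np.array_split(y, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        grads = list(ex.map(chunk_gradient,
                            [(W1, W2, xc, yc) for xc, yc in zip(X_chunks, y_chunks)]))
    # ... and sum the per-chunk contributions into one gradient at the end.
    gW1 = sum(g[0] for g in grads)
    gW2 = sum(g[1] for g in grads)
    return gW1, gW2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 100))        # 100 inputs, as in the question
    y = rng.normal(size=(10_000, 1))          # toy targets
    W1 = rng.normal(scale=0.1, size=(100, 100))
    W2 = rng.normal(scale=0.1, size=(100, 1))

    lr = 1e-3
    for step in range(10):                    # a few full-batch gradient-descent steps
        gW1, gW2 = parallel_gradient(W1, W2, X, y)
        W1 -= lr * gW1 / len(X)
        W2 -= lr * gW2 / len(X)
```

Processes are used here to sidestep Python's GIL; with threads, MPI, or another language the same split / map / sum structure applies.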
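
And a sketch of the batch-backpropagation variant mentioned above, reusing the hypothetical `chunk_gradient` from the previous snippet: weight changes are accumulated over a whole batch of samples and the weights are updated once per batch rather than after every sample.

```python
def train_batched(W1, W2, X, y, batch_size=32, lr=1e-3, epochs=5):
    """Mini-batch gradient descent: one weight update per batch of samples."""
    for _ in range(epochs):
        for start in range(0, len(X), batch_size):
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            # Accumulate the weight changes for the whole batch ...
            gW1, gW2 = chunk_gradient((W1, W2, xb, yb))
            # ... then apply them in a single update.
            W1 -= lr * gW1 / len(xb)
            W2 -= lr * gW2 / len(xb)
    return W1, W2
```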