Tags: python, tensorflow, machine-learning, neural-network, mxnet

Does distributed training produce NN that is average of NNs trained within each distributed node?


I'm currently sifting through a ton of material on distributed training for neural networks (training with backpropagation). The more I dig into this material, the more it appears to me that essentially every distributed neural network training algorithm is just a way to combine gradients produced by distributed nodes (typically by averaging), subject to constraints on the execution environment (i.e. network topology, node performance equality, ...).

All the ingenuity of the underlying algorithms is concentrated in exploiting assumptions about the execution environment's constraints, with the aim of reducing overall lag and thus the total time needed to complete training.

So if distributed training just combines gradients by averaging in some clever way, then the whole training process is (more or less) equivalent to averaging the networks produced by training within each distributed node.

If I'm right about what's described above, then I would like to try combining the weights produced by distributed nodes by hand.

So my question is: how do you produce an average of the weights of two or more neural networks using any mainstream technology such as TensorFlow / Caffe / MXNet / ...?

Thank you in advance

EDIT @Matias Valdenegro

Matias, I understand what you are saying: as soon as you apply a gradient, the next gradient changes, so the parallelization seems impossible because the old gradients have no relation to the newly updated weights. That is why real-world algorithms evaluate the gradients, average them, and then apply them.

Now if you just expand the parentheses in this operation, you will notice that the gradients can be applied locally. Essentially there is no difference between averaging the deltas (vectors) and averaging the NN states (points), because w0 - (d1 + d2)/2 = ((w0 - d1) + (w0 - d2))/2. Please refer to the worked example below:


Suppose that NN weights are a 2-D vector.

Initial state  = (0, 0)
Deltas 1       = (1, 1)
Deltas 2       = (1,-1)
-----------------------
Average deltas = (1, 1) * 0.5 + (1, -1) * 0.5 = (1, 0)
NN State       = (0, 0) - (1, 0) = (-1, 0)

Now the same result can be achieved if the gradients were applied locally on each node and the central node averaged the weights instead of the deltas:

--------- Central node 0 ---------
Initial state  = (0, 0)
----------------------------------

------------- Node 1 -------------
Deltas 1       = (1, 1)
State 1        = (0, 0) - (1,  1) = (-1, -1)
----------------------------------

------------- Node 2 -------------
Deltas 2       = (1,-1)
State 2        = (0, 0) - (1, -1) = (-1,  1)
----------------------------------

--------- Central node 0 ---------
Average state  = ((-1, -1) * 0.5 + (-1,  1) * 0.5) = (-1, 0)
----------------------------------

So the results are the same...
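
The same check can be written in a few lines of NumPy (a minimal sketch of the single-step case above, using the same made-up deltas):

import numpy as np

w0 = np.array([0.0, 0.0])     # shared initial state
d1 = np.array([1.0, 1.0])     # delta computed on node 1
d2 = np.array([1.0, -1.0])    # delta computed on node 2

# central node averages the deltas, then applies them once
w_from_deltas = w0 - 0.5 * (d1 + d2)

# each node applies its own delta locally; central node averages the states
w_from_states = 0.5 * ((w0 - d1) + (w0 - d2))

assert np.allclose(w_from_deltas, w_from_states)   # both give (-1, 0)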


Solution

  • The question in the title is different from the question in the body :) I'll answer both:

    Title question: "Does distributed training produce NN that is average of NNs trained within each distributed node?"

    No. In the context of model training with mini-batch SGD, distributed training usually refers to data-parallel distributed training, which distributes the computation of the gradients of a mini-batch of records over N workers and then produces an average gradient used to update the central model weights, in async or sync fashion. Historically, the averaging happened in a separate process called the parameter server (the historical default in MXNet and TensorFlow), but modern approaches use a more network-frugal, peer-to-peer, ring-style all-reduce, democratized by Uber's Horovod extension, initially developed for TensorFlow but now available for Keras, PyTorch and MXNet too.

    Note that model-parallel distributed training (hosting different pieces of a model on different devices) also exists, but data-parallel training is more common in practice, possibly because it is simpler to implement (distributing an average is easy) and because full models often fit comfortably in the memory of modern hardware. That said, model-parallel training is occasionally seen for very large models, such as Google's GNMT.
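
    For illustration, here is a minimal NumPy sketch of one synchronous data-parallel update of the kind described above (a toy least-squares problem; the data, sharding and learning rate are made up for the example):

    import numpy as np

    rng = np.random.default_rng(0)

    # toy linear-regression problem: loss = mean((X @ w - y)**2)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5])

    def grad(w, X_shard, y_shard):
        # gradient of the mean squared error on one worker's shard
        return 2.0 * X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)

    w = np.zeros(3)                              # central model weights
    lr = 0.1
    shards = np.array_split(np.arange(100), 4)   # indices for 4 "workers"

    for step in range(100):
        # each worker computes a gradient on its own shard of the batch
        grads = [grad(w, X[i], y[i]) for i in shards]
        # the "parameter server" averages the gradients and applies one
        # synchronous update to the central weights
        w -= lr * np.mean(grads, axis=0)

    With the ring-style all-reduce mentioned above, the np.mean step is replaced by a peer-to-peer reduction among the workers, but the arithmetic of the update is the same.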

    Body question: "How do you produce an average of the weights of two or more neural networks using any mainstream technology?"

    This depends on each framework's API; for example:

    In TensorFlow: Tensorflow - Averaging model weights from restored models

    In PyTorch: How to take the average of the weights of two networks?

    In MXNet (dummy code, assuming initialized Gluon nn.Sequential() models with identical architectures):

    # create Parameter dict storing model parameters
    p1 = net1.collect_params()
    p2 = net2.collect_params()
    p3 = net3.collect_params()
    
    # pair parameters by position (assumes identical architectures)
    # and write the element-wise average of net1 and net2 into net3
    for k1, k2, k3 in zip(p1, p2, p3):
        p3[k3].set_data(0.5 * (p1[k1].data() + p2[k2].data()))
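
    For comparison, a similar parameter-by-parameter average in PyTorch (a minimal sketch; the tiny make_net architecture is made up for the example):

    import torch
    import torch.nn as nn

    def make_net():
        return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

    net1, net2, net3 = make_net(), make_net(), make_net()

    # write the element-wise average of net1's and net2's weights into net3
    with torch.no_grad():
        for q1, q2, q3 in zip(net1.parameters(), net2.parameters(),
                              net3.parameters()):
            q3.copy_(0.5 * (q1 + q2))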