Tags: tensorflow, deep-learning, mnist

Batch training uses sum of updates? or average of updates?


I have a few questions about batch training of neural networks.

First, when we update the weights using batch training, the update is the gradient accumulated over the batch. In this case, is the update the sum of the gradients, or the average of the gradients?

If the answer is the sum of the gradients, the update will be much larger than in online training, because the per-example gradients are accumulated. In that case, I don't think the weights can be optimized well.

Otherwise, if the answer is the average of the gradients, then it seems very reasonable to expect the weights to be optimized well. However, in that case we would have to train for many more steps than online training, because the weights are updated only once per batch of data.
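To make the scale difference concrete, here is a small NumPy sketch with made-up per-example gradients for a single weight (not taken from any real model): summing the gradients makes the update grow roughly linearly with the batch size, while averaging keeps it on the scale of a single example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up per-example gradients of one weight, for a batch of 100 examples.
    per_example_grads = rng.normal(loc=0.1, scale=1.0, size=100)

    lr = 0.001
    update_from_sum = lr * per_example_grads.sum()    # grows with the batch size
    update_from_avg = lr * per_example_grads.mean()   # stays on the scale of one example

    print("sum-based update:", update_from_sum)
    print("avg-based update:", update_from_avg)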

Second, whatever the answer to the first question is, when I use the TensorFlow CNN sample code for MNIST shown below, it optimizes the weights so fast that the training accuracy exceeds 90% even at the second reported step.

=======================================================================

train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

for i in range(1000):
    batch = mnist.train.next_batch(100)   # mini-batch of 100 examples
    if i % 100 == 0:
        # Report training accuracy on the current batch every 100 steps.
        train_accuracy = sess.run(accuracy,
                                  feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g" % (i, train_accuracy))
    # One gradient-descent update computed over the whole mini-batch.
    sess.run(train_step, feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})

========================================================================

Please explain how TensorFlow can optimize the weights so quickly.


Solution

  • The answer to this question depends on your loss function.

    If loss_element is your loss function for one element of the batch, then the loss of your batch will be some function of all your individual losses.

    For example, if you choose to use tf.reduce_mean, then your loss is averaged over all the elements of your batch, and so is the gradient. If you use tf.reduce_sum, then your gradient will be the sum of the per-element gradients.
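    As a minimal sketch of this difference (using TensorFlow 2's eager API rather than the TF 1.x style in the question, with a made-up toy model), the gradient of a reduce_sum loss is exactly batch_size times the gradient of the corresponding reduce_mean loss:

        import tensorflow as tf

        # Toy linear model y = w * x with a squared-error loss per batch element.
        w = tf.Variable(1.0)
        x = tf.constant([1.0, 2.0, 3.0, 4.0])   # a "batch" of 4 inputs
        y = tf.constant([2.0, 4.0, 6.0, 8.0])   # targets

        with tf.GradientTape(persistent=True) as tape:
            loss_elements = tf.square(w * x - y)        # one loss value per element
            loss_mean = tf.reduce_mean(loss_elements)   # batch loss as an average
            loss_sum = tf.reduce_sum(loss_elements)     # batch loss as a sum

        grad_mean = tape.gradient(loss_mean, w)
        grad_sum = tape.gradient(loss_sum, w)
        del tape

        print(grad_mean.numpy())   # -15.0
        print(grad_sum.numpy())    # -60.0  (= batch size 4 times the mean gradient)

    With an averaged loss (which the MNIST tutorial's cross_entropy typically uses via tf.reduce_mean), the effective step size does not grow with the batch size, so a single learning rate works across batch sizes.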