machine-learning · neural-network · backpropagation · gradient-descent

Is the mini-batch gradient just the sum of online gradients?


I am adapting neural-network training code from online training to mini-batches. Is the mini-batch gradient for a weight (∂E/∂w) just the sum of the per-sample gradients over the mini-batch? Or is it some non-linear function of them because of the sigmoid output functions? Or is it the sum divided by some number to make it smaller?

Clarification: it is better to pose this question more specifically, in terms of the relationship between the full-batch gradient and the online (per-sample) gradient:

I am using neurons with a sigmoid activation function to classify points in a 2-D space. The architecture is 2 x 10 x 10 x 1. There are 2 output classes: some points are labelled 1 and the others 0. The error for one sample is half the square of (target - output). My question is: is the full-batch gradient equal to the sum of the per-sample gradients (holding the weights constant across the batch)?
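
For concreteness, here is a minimal numerical sketch of the check I have in mind (a tiny made-up 2 x 3 x 1 sigmoid network and random data, not my actual code): the full-batch gradient of the summed error, computed by central differences, matches the sum of the per-sample backprop gradients taken at the same fixed weights.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2, x):
    h = sigmoid(W1 @ x)      # hidden activations, shape (3,)
    y = sigmoid(W2 @ h)      # output, shape (1,)
    return h, y

def sample_grads(W1, W2, x, t):
    # Backprop gradients of E = 0.5 * (t - y)^2 for a single sample.
    h, y = forward(W1, W2, x)
    delta_out = (y - t) * y * (1.0 - y)             # error signal at the output unit
    dW2 = np.outer(delta_out, h)
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)  # backprop through the hidden sigmoids
    dW1 = np.outer(delta_hid, x)
    return dW1, dW2

# Made-up 2 -> 3 -> 1 network and a batch of 5 labelled points.
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))
X = rng.normal(size=(5, 2))
T = rng.integers(0, 2, size=5).astype(float)

# Sum of the per-sample ("online") gradients, weights held fixed.
sum_dW1 = np.zeros_like(W1)
for x, t in zip(X, T):
    dW1, _ = sample_grads(W1, W2, x, t)
    sum_dW1 += dW1

# Full-batch gradient of the *summed* error, by central differences.
def batch_error(W1, W2):
    return sum(0.5 * (t - forward(W1, W2, x)[1][0]) ** 2 for x, t in zip(X, T))

eps = 1e-6
num_dW1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_dW1[i, j] = (batch_error(Wp, W2) - batch_error(Wm, W2)) / (2 * eps)

print(np.allclose(sum_dW1, num_dW1, atol=1e-5))  # True: batch gradient == sum of per-sample gradients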


Solution

  • It depends a bit on your exact cost function, but since you are using online mode, the cost is additive over the training samples, so the most likely convention (without knowing the exact details) is to take the mean of the per-sample gradients. If you simply sum them instead, you get the same gradient up to a factor of the batch size, so you will need a correspondingly smaller learning rate (see the sketch below).
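
To illustrate the scaling point (stand-in gradient values, not your network): averaging the per-sample gradients with learning rate lr produces exactly the same weight update as summing them with learning rate lr / batch_size.

import numpy as np

rng = np.random.default_rng(1)
batch_size = 8
per_sample_grads = rng.normal(size=(batch_size, 4))  # stand-in per-sample dE/dw vectors
w = rng.normal(size=4)                               # stand-in weight vector
lr = 0.1

update_with_mean = w - lr * per_sample_grads.mean(axis=0)
update_with_sum = w - (lr / batch_size) * per_sample_grads.sum(axis=0)

print(np.allclose(update_with_mean, update_with_sum))  # True: identical updates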