Stochastic Gradient Descent algorithms with mini-batches usually take the mini-batch size or count as a parameter.
What I'm wondering is: do all of the mini-batches need to be exactly the same size?
Take, for example, the MNIST training data (60k training images) and a mini-batch size of 70.
If we go through the data in a simple loop, that produces 857 mini-batches of size 70 (as specified) and one mini-batch of size 10.
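Roughly, the loop I have in mind is this quick sanity check (plain Python, just to illustrate the split, not actual training code):

```python
# Check how 60k samples split into mini-batches of size 70.
n_samples, batch_size = 60_000, 70

sizes = [min(batch_size, n_samples - start)
         for start in range(0, n_samples, batch_size)]

print(len(sizes))       # 858 mini-batches in total
print(sizes.count(70))  # 857 full batches of size 70
print(sizes[-1])        # 10 -- the leftover batch
```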
Now, does it even matter that (using this approach) one mini-batch will be smaller than the others (worst case: a mini-batch of size 1)? Will this strongly affect the weights and biases that our network has learned over almost all of its training?
No, mini-batches do not have to be the same size. They are usually kept at a constant size for efficiency reasons (you do not have to reallocate memory or resize tensors). In practice you could even sample the batch size in each iteration.
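For illustration, here is a minimal NumPy sketch (a toy linear model on random stand-in data, not real MNIST or a neural network) showing that the shorter final batch goes through the same update with no special handling, as long as the gradient is averaged over the actual batch length:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60_000, 20))   # random stand-in data (20 features instead of MNIST's 784)
y = rng.normal(size=(60_000, 1))    # random stand-in targets, just for illustration
w = np.zeros((20, 1))

batch_size, lr = 70, 0.01
for start in range(0, len(X), batch_size):
    xb = X[start:start + batch_size]            # the last slice simply has 10 rows
    yb = y[start:start + batch_size]
    grad = xb.T @ (xb @ w - yb) / len(xb)       # averaging over len(xb) keeps the update scale consistent
    w -= lr * grad
```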
However, the batch size does make a difference. It is hard to say which is best, but using smaller or bigger batch sizes can lead to different solutions (and, always, to different convergence speed). This is an effect of dealing with more stochastic motion (small batches) versus smooth updates (good gradient estimators from large batches). In particular, sampling the batch size from some predefined distribution can be used to get both effects at the same time, although the time spent fitting this distribution might not be worth it.
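A rough sketch of that last idea (the size values and their probabilities are made up, only to show the mechanics of sampling the batch size each iteration):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [16, 64, 256]        # hypothetical predefined set of batch sizes
probs = [0.5, 0.3, 0.2]      # hypothetical probabilities for each size

for step in range(1000):
    b = rng.choice(sizes, p=probs)           # sample this iteration's batch size
    idx = rng.integers(0, 60_000, size=b)    # draw a mini-batch of that size
    # ... compute the gradient on the b sampled examples and update as usual ...
```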