Tags: machine-learning, computer-vision, neural-network, gradient-descent

Gradient Descent vs Stochastic Gradient Descent algorithms


I tried to train a feed-forward neural network on the MNIST handwritten digits dataset (which includes 60K training samples).

On every epoch I iterated over all the training samples, performing backpropagation for each individual sample. The runtime is, of course, far too long.

  • Is the algorithm I ran called Gradient Descent?

I read that for large datasets, using Stochastic Gradient Descent can improve the runtime dramatically.

  • What should I do in order to use Stochastic Gradient Descent? Should I just pick training samples at random, performing backpropagation on each randomly picked sample, instead of the full-pass epochs I currently use? (My current loop is sketched just below.)
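
Here is a minimal NumPy stand-in for my current loop, with a toy linear model in place of my actual network (the data, model, and learning rate are just illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data standing in for MNIST (the real set has 60K samples).
    X = rng.normal(size=(200, 5))
    y = (X @ rng.normal(size=5) > 0).astype(float)

    w = np.zeros(5)                           # toy linear model as the "network"
    learning_rate = 0.1
    for epoch in range(10):                   # every epoch...
        for xi, yi in zip(X, y):              # ...visits ALL samples in a fixed order
            pred = 1 / (1 + np.exp(-xi @ w))  # forward pass
            grad = (pred - yi) * xi           # backpropagation for this single sample
            w -= learning_rate * grad         # immediate weight update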

Solution

  • The new scenario you describe (performing backpropagation on each randomly picked sample) is one common "flavor" of Stochastic Gradient Descent, as described here: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent

    The 3 most common flavors according to that source are listed below (your flavor is C; a runnable sketch of it follows the list). Note that the loop you were already running, one weight update after every sample in a fixed order, matches flavor A/B minus the shuffling, so it is itself a form of Stochastic Gradient Descent rather than classical batch Gradient Descent, which would accumulate the gradient over all 60K samples before making a single update:

    A)

    randomly shuffle samples in the training set
    for one or more epochs, or until approx. cost minimum is reached:
        for training sample i:
            compute gradients and perform weight updates
    

    B)

    for one or more epochs, or until approx. cost minimum is reached:
        randomly shuffle samples in the training set
        for training sample i:
            compute gradients and perform weight updates
    

    C)

    for iterations t, or until approx. cost minimum is reached:
        draw random sample from the training set
        compute gradients and perform weight updates
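
    Not part of the original pseudocode, but here is a minimal runnable sketch of flavor C in plain NumPy, with logistic regression on synthetic data standing in for the network (the data, model, iteration count, and learning rate are all illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic binary-classification data standing in for MNIST.
    n_samples, n_features = 1000, 20
    X = rng.normal(size=(n_samples, n_features))
    y = (X @ rng.normal(size=n_features) > 0).astype(float)

    w = np.zeros(n_features)
    learning_rate = 0.1
    for t in range(10_000):                 # flavor C: fixed number of iterations
        i = rng.integers(n_samples)         # draw one random sample (with replacement)
        pred = 1 / (1 + np.exp(-X[i] @ w))  # forward pass
        grad = (pred - y[i]) * X[i]         # gradient of the log-loss for this sample
        w -= learning_rate * grad           # immediate weight update

    accuracy = np.mean((X @ w > 0) == y)
    print(f"training accuracy: {accuracy:.3f}")

    Flavors A and B need only a small change to this loop: shuffle the sample indices (once up front for A, once per epoch for B) and sweep over them in order instead of drawing with replacement.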