Tags: machine-learning, computer-vision, neural-network, gradient-descent

Gradient Descent vs Stochastic Gradient Descent algorithms


I tried to train a feed-forward neural network on the MNIST handwritten digits dataset (which includes 60K training samples).

On every epoch I iterated over all the training samples, performing backpropagation for each individual sample. The runtime is, of course, far too long.

  • Is the algorithm I ran called Gradient Descent?

I read that for large datasets, using Stochastic Gradient Descent can improve the runtime dramatically.

  • What should I do in order to use Stochastic Gradient Descent? Should I just pick training samples at random, performing backpropagation on each randomly picked sample, instead of the full-pass epochs I currently use? (My current loop is sketched just below.)
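
Here is a minimal NumPy stand-in for my current loop, with a toy linear model in place of my actual network (the data, model, and learning rate are just illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data standing in for MNIST (the real set has 60K samples).
    X = rng.normal(size=(200, 5))
    y = (X @ rng.normal(size=5) > 0).astype(float)

    w = np.zeros(5)                           # toy linear model as the "network"
    learning_rate = 0.1
    for epoch in range(10):                   # every epoch...
        for xi, yi in zip(X, y):              # ...visits ALL samples in a fixed order
            pred = 1 / (1 + np.exp(-xi @ w))  # forward pass
            grad = (pred - yi) * xi           # backpropagation for this single sample
            w -= learning_rate * grad         # immediate weight update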

Solution

  • The new scenario you describe (performing backpropagation on each randomly picked sample) is one common "flavor" of Stochastic Gradient Descent, as described here: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent

    The 3 most common flavors according to that source are listed below (your flavor is C; a runnable sketch of it follows the list). Note that the loop you were already running, one weight update after every sample in a fixed order, matches flavor A/B minus the shuffling, so it is itself a form of Stochastic Gradient Descent rather than classical batch Gradient Descent, which would accumulate the gradient over all 60K samples before making a single update:

    A)

    randomly shuffle samples in the training set
    for one or more epochs, or until approx. cost minimum is reached:
        for training sample i:
            compute gradients and perform weight updates
    

    B)

    for one or more epochs, or until approx. cost minimum is reached:
        randomly shuffle samples in the training set
        for training sample i:
            compute gradients and perform weight updates
    

    C)

    for iterations t, or until approx. cost minimum is reached:
        draw random sample from the training set
        compute gradients and perform weight updates
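
    Not part of the original pseudocode, but here is a minimal runnable sketch of flavor C in plain NumPy, with logistic regression on synthetic data standing in for the network (the data, model, iteration count, and learning rate are all illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic binary-classification data standing in for MNIST.
    n_samples, n_features = 1000, 20
    X = rng.normal(size=(n_samples, n_features))
    y = (X @ rng.normal(size=n_features) > 0).astype(float)

    w = np.zeros(n_features)
    learning_rate = 0.1
    for t in range(10_000):                 # flavor C: fixed number of iterations
        i = rng.integers(n_samples)         # draw one random sample (with replacement)
        pred = 1 / (1 + np.exp(-X[i] @ w))  # forward pass
        grad = (pred - y[i]) * X[i]         # gradient of the log-loss for this sample
        w -= learning_rate * grad           # immediate weight update

    accuracy = np.mean((X @ w > 0) == y)
    print(f"training accuracy: {accuracy:.3f}")

    Flavors A and B need only a small change to this loop: shuffle the sample indices (once up front for A, once per epoch for B) and sweep over them in order instead of drawing with replacement.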