I tried to train a feedforward neural network on the MNIST handwritten digits dataset (which includes 60K training samples).
On every epoch I iterated over all the training samples, performing backpropagation for each sample. The runtime is, of course, far too long.
I read that for large datasets, using Stochastic Gradient Descent (SGD) can dramatically improve the runtime.
The new scenario you describe (performing backpropagation on one randomly picked sample at a time) is one common "flavor" of Stochastic Gradient Descent, as described here: https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent
According to that document, the three most common flavors are the following (yours is C; a short code sketch of flavor C follows the list):
A)
randomly shuffle samples in the training set
for one or more epochs, or until approx. cost minimum is reached:
    for training sample i:
        compute gradients and perform weight updates
B)
for one or more epochs, or until approx. cost minimum is reached:
    randomly shuffle samples in the training set
    for training sample i:
        compute gradients and perform weight updates
C)
for iterations t, or until approx. cost minimum is reached:
    draw random sample from the training set
    compute gradients and perform weight updates
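
If a concrete reference helps, below is a minimal NumPy sketch of flavor C for a single-hidden-layer network. The layer sizes, learning rate, iteration count, and the placeholder X_train/y_train arrays are my own assumptions, not taken from your setup; swap in your real MNIST data and your own architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small random placeholder data so the sketch runs stand-alone; replace with your
# real (60000, 784) MNIST arrays. X is assumed scaled to [0, 1], y holds labels 0-9.
X_train = rng.random((1000, 784))
y_train = rng.integers(0, 10, 1000)

n_in, n_hidden, n_out = 784, 64, 10   # layer sizes: arbitrary placeholder choice
lr = 0.01                             # learning rate: arbitrary placeholder choice
W1 = rng.normal(0.0, 0.01, (n_hidden, n_in)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.01, (n_out, n_hidden)); b2 = np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for t in range(100_000):                        # "for iterations t, or until ..."
    i = rng.integers(len(X_train))              # draw one random training sample
    x = X_train[i]
    y = np.zeros(n_out); y[y_train[i]] = 1.0    # one-hot target

    # forward pass: one ReLU hidden layer, softmax output
    z1 = W1 @ x + b1
    a1 = np.maximum(z1, 0.0)
    p = softmax(W2 @ a1 + b2)

    # backpropagation for this single sample (cross-entropy loss)
    dz2 = p - y
    dW2 = np.outer(dz2, a1); db2 = dz2
    dz1 = (W2.T @ dz2) * (z1 > 0.0)
    dW1 = np.outer(dz1, x);  db1 = dz1

    # weight update immediately after each single-sample gradient
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```

Note that flavors A and B differ from C in that they still sweep through every training sample once per epoch (B reshuffles at the start of each epoch), so every sample is guaranteed to be visited; C only samples with replacement for a fixed number of iterations.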