In batch gradient descent the parameters are updated based on the total/average loss over all the points. In stochastic gradient descent (SGD) we update the parameters after every single point instead of after one full epoch. So, say the final point is an outlier: wouldn't that cause the whole fitted line to fluctuate drastically? How is it reliable, and how does it converge on a contour like this? [image: SGD contour]
While it is true that in its most pristine form SGD operates on just one sample point, in reality this is not the dominant practice. In practice we use a mini-batch of, say, 256, 128, or 64 samples, rather than the full batch containing all the samples in the database, which might well exceed 1 million samples. Operating on a mini-batch of 256 is clearly much faster than operating on 1 million points, and at the same time it curbs the variability caused by using just one sample point.
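To make that concrete, here is a minimal sketch of a mini-batch SGD loop on a toy linear-regression problem. The dataset, learning rate, and batch size below are illustrative choices of mine, not anything prescribed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data (illustrative stand-in for a large dataset).
X = rng.normal(size=(1_000_000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=1_000_000)

w = np.zeros(10)
lr = 1e-3
batch_size = 256  # far cheaper than the full 1M rows,
                  # far less noisy than a single sample

for epoch in range(2):
    order = rng.permutation(len(X))  # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Average MSE gradient over the mini-batch.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad
```

Averaging the gradient over 256 samples means one outlier in the batch contributes only 1/256 of the update direction.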
A second point is that there is no final point. One simply keeps iterating over the dataset. The learning rate for SGD is generally quite small, say 1e-3. So even if a sample point happens to be an outlier, its wrong gradient will be scaled by 1e-3, and hence SGD will not stray far from the correct trajectory. As it iterates over the upcoming sample points, which are not outliers, it will head back in the correct direction.
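A back-of-the-envelope illustration of that scaling, using made-up gradient values:

```python
lr = 1e-3
w = 0.50

outlier_grad = 40.0     # one unusually large gradient from an outlier sample
w -= lr * outlier_grad  # w moves by only 0.04, from 0.50 to 0.46

# Typical gradients from the ordinary samples that follow keep pulling
# the parameter along the correct trajectory.
for g in [1.2, 0.9, 1.1, 1.0]:
    w -= lr * g
print(w)
```

Even a gradient ~40x larger than usual shifts the parameter by just 0.04, and subsequent well-behaved samples keep steering the updates.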
So altogether, using a medium-sized mini-batch and a small learning rate helps SGD avoid digressing far from the correct trajectory.
Now, the word stochastic in SGD can also cover various other measures. For example, some practitioners use gradient clipping, i.e. they clamp the calculated gradient to a maximum value whenever it exceeds a chosen threshold. You can find more on gradient clipping in this post. This is just one trick among dozens of other techniques; if you are interested, you can read the source code of popular implementations of SGD in PyTorch or TensorFlow.
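As a rough sketch of what value-based clipping looks like in a PyTorch training step (the model, data, and clip threshold here are arbitrary placeholders of mine):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(256, 10)  # dummy mini-batch
y = torch.randn(256, 1)

opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Clamp every gradient element into [-1.0, 1.0] before the update,
# capping the damage a single outlier mini-batch can do.
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
opt.step()
```

PyTorch also offers `torch.nn.utils.clip_grad_norm_`, which rescales the whole gradient vector by its norm instead of clamping each element.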