Tags: algorithm, machine-learning, linear-regression, gradient-descent, stochastic-process

Stochastic gradient descent converges too smoothly


As part of my homework I was asked to implement stochastic gradient descent to solve a linear regression problem (even though I have only 200 training examples). My problem is that stochastic gradient descent converges too smoothly, almost exactly like batch gradient descent, which brings me to my question: why does it look so smooth, given that it is usually much noisier? Is it because I use it with only 200 examples?

Convergence plots:

[Plot: stochastic gradient descent]

[Plot: gradient descent]

MSE with weights from stochastic gradient descent: 2.78441258841

MSE with weights from gradient descent: 2.78412631451 (identical to MSE with weights from normal equation)

My code:

import numpy as np

def mserror(y, y_pred):
    # Mean squared error between the targets and the predictions.
    n = y.size
    diff = y - y_pred
    diff_squared = diff ** 2
    return float(np.sum(diff_squared)) / n


def linear_prediction(X, w):
    # Predictions of the linear model: X @ w.
    return np.dot(X, w)


def gradient_descent_step(X, y, w, eta):
    # One full-batch step: the gradient of the MSE is (2/n) * X^T (Xw - y).
    n = X.shape[0]
    grad = (2.0 / n) * np.sum(np.transpose(X) * (linear_prediction(X, w) - y), axis=1)
    return w - eta * grad


def stochastic_gradient_step(X, y, w, train_ind, eta):
    # Gradient estimate from a single training example
    # (kept scaled by 2/n, matching the batch version above).
    n = X.shape[0]
    grad = (2.0 / n) * X[train_ind] * (linear_prediction(X[train_ind], w) - y[train_ind])
    return w - eta * grad


def gradient_descent(X, y, w_init, eta, max_iter):
    # Full-batch gradient descent; records the MSE after every update.
    w = w_init
    errors = [mserror(y, linear_prediction(X, w))]

    for i in range(max_iter):
        w = gradient_descent_step(X, y, w, eta)
        errors.append(mserror(y, linear_prediction(X, w)))

    return w, errors


def stochastic_gradient_descent(X, y, w_init, eta, max_iter):
    # SGD: each iteration picks one random example and steps on its gradient.
    n = X.shape[0]
    w = w_init
    errors = [mserror(y, linear_prediction(X, w))]

    for i in range(max_iter):
        random_ind = np.random.randint(n)
        w = stochastic_gradient_step(X, y, w, random_ind, eta)
        errors.append(mserror(y, linear_prediction(X, w)))

    return w, errors
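
A minimal driver along these lines reproduces the behaviour; the synthetic data, eta, and iteration counts below are placeholder assumptions for illustration, not the actual homework data or settings:

import numpy as np

# Hypothetical setup: 200 examples, 3 features plus a bias column.
rng = np.random.RandomState(0)
n_samples, n_features = 200, 3
X = np.hstack([np.ones((n_samples, 1)), rng.randn(n_samples, n_features)])
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X.dot(true_w) + 0.5 * rng.randn(n_samples)

w_init = np.zeros(X.shape[1])
w_gd, errors_gd = gradient_descent(X, y, w_init, eta=0.01, max_iter=1000)
w_sgd, errors_sgd = stochastic_gradient_descent(X, y, w_init, eta=0.01, max_iter=100000)

# errors_gd / errors_sgd are the per-iteration MSE traces used for the convergence plots.
print("MSE, batch GD:", mserror(y, linear_prediction(X, w_gd)))
print("MSE, SGD:     ", mserror(y, linear_prediction(X, w_sgd)))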

Solution

  • There is nothing unusual about your graph. You should also note that your batch method takes fewer iterations to converge.

    You may be letting SGD plots from neural networks cloud your view of what SGD "should" look like. Most neural networks are far more complicated models (and harder to optimize) working on harder problems, which contributes to the "jaggedness" you might be expecting.

    Linear regression is a simple problem with a convex objective: the MSE is a quadratic bowl in the weights, so any step that lowers the error is guaranteed to be a step toward the single global optimum. That's far less complicated than a neural network, and part of why you see such a smooth error reduction. It's also why you see almost identical MSE values: both SGD and batch gradient descent converge to the same solution.

    If you want to force some non-smoothness, you can keep increasing the learning rate eta, but that's kind of a silly exercise (a quick way to try it is sketched below). Eventually you'll just reach a point where the iterates keep overshooting the solution and never converge.
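
    A rough sketch of both points, reusing X, y and the functions from the question (or the synthetic driver sketched there); the specific eta values here are arbitrary guesses:

    import numpy as np

    # Closed-form least-squares fit (normal equation) as a reference point:
    # for this convex problem, batch GD and SGD should both land here.
    w_ne, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("MSE, normal equation:", mserror(y, linear_prediction(X, w_ne)))

    # Progressively larger learning rates: the SGD error trace gets noisier,
    # and for a large enough eta it stops converging altogether.
    for eta in (0.01, 1.0, 10.0, 100.0):
        w, errs = stochastic_gradient_descent(X, y, np.zeros(X.shape[1]), eta, max_iter=5000)
        print("eta = %g -> final MSE = %g" % (eta, errs[-1]))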