Tags: python, numpy, linear-regression, least-squares, gradient-descent

LMS batch gradient descent with NumPy


I'm trying to write a very simple LMS batch gradient descent, but I believe I'm doing something wrong with the gradient. The order of magnitude of the gradient relative to the initial value is very different across the elements of theta, so either theta[2] doesn't move (e.g. with alpha = 1e-8) or theta[1] shoots off (e.g. with alpha = .01).

import numpy as np

y = np.array([[400], [330], [369], [232], [540]])
x = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x = np.concatenate((np.ones((5,1), dtype=int), x), axis=1)  # prepend a column of ones for the intercept
theta = np.array([[0.], [.1], [50.]])
alpha = .01

for i in range(1, 1000):
    h = np.dot(x, theta)                                               # predictions
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()  # batch gradient
    theta -= alpha * gradient
    print(((h - y)**2).sum(), theta.squeeze().tolist())

Solution

  • The algorithm as written is completely correct, but without feature scaling convergence will be extremely slow, because one feature dominates the gradient calculation.
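
    To see the imbalance concretely, here is a small sketch (not part of the original answer) that rebuilds the data from the question and prints the gradient at the initial theta:

    import numpy as np

    # Sketch: the gradient at the initial theta with the unscaled features.
    y = np.array([[400], [330], [369], [232], [540]])
    x = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]])
    x = np.concatenate((np.ones((5, 1), dtype=int), x), axis=1)
    theta = np.array([[0.], [.1], [50.]])

    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    print(gradient.squeeze())
    # the component for the large feature column is several orders of
    # magnitude bigger than the other two, so no single alpha suits all
    # three parameters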

    You can perform the scaling in various ways; for now, let us just scale the features by their L^1 norms, because it is simple. (Another common choice, standardizing each feature to zero mean and unit variance, is sketched at the end of this answer.)

    import numpy as np
    y = np.array([[400], [330], [369], [232], [540]])
    x_orig = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
    x_orig = np.concatenate((np.ones((5,1), dtype=int), x_orig), axis=1)
    x_norm = np.sum(x_orig, axis=0)   # column sums used as scaling factors
    x = x_orig / x_norm               # every column of x now sums to 1
    

    That is, the sum of every column in x is 1. If you want to retain your good guess at the correct parameters, that guess has to be scaled accordingly, so that np.dot(x, theta) still produces the same predictions as with the unscaled features.

    theta = (x_norm*[0., .1, 50.]).reshape(3, 1)
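
    As a quick sanity check (a sketch; theta_orig is just a name introduced here for the unscaled guess), the rescaled theta produces exactly the same predictions on the scaled features as the original guess does on the unscaled ones:

    theta_orig = np.array([[0.], [.1], [50.]])   # the unscaled guess
    print(np.allclose(np.dot(x, theta), np.dot(x_orig, theta_orig)))
    # prints True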
    

    With this, we may proceed as you did in your original post, where again you will have to play around with the learning rate until you find a sweet spot.

    alpha = .1
    for i in range(1, 100000):
        h = np.dot(x, theta)
        gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
        theta -= alpha * gradient
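
    To find that sweet spot it helps to watch the error as training runs. The following is only a sketch (theta_log and h_log are throwaway names introduced here, not part of the original answer) of the same loop with periodic logging of the squared error:

    theta_log = (x_norm * [0., .1, 50.]).reshape(3, 1)   # fresh copy of the scaled guess
    for i in range(1, 100000):
        h_log = np.dot(x, theta_log)
        theta_log -= alpha * np.sum((h_log - y) * x, axis=0, keepdims=True).transpose()
        if i % 10000 == 0:
            print(i, ((h_log - y) ** 2).sum())
    # a learning rate that is too large shows up as a growing or oscillating
    # error; one that is too small as an error that barely moves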
    

    Let's see what we get now that we've found something that seems to converge. Again, your parameters will have to be scaled back to relate to the original, unscaled features.

    print(((h - y)**2).sum(), theta.squeeze() / x_norm)
    # Prints 1444.14443271 [ -7.04344646e+01   6.38435468e-02   1.03435881e+02]
    

    At this point, let's cheat and check our results

    theta, error, _, _ = np.linalg.lstsq(x_orig, y, rcond=None)  # exact least-squares solution
    print(error, theta)
    # Prints [ 1444.1444327] [[ -7.04346018e+01]
    #                         [  6.38433756e-02]
    #                         [  1.03436047e+02]]
    

    A general introductory reference on feature scaling is this Stanford lecture.
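
    For completeness, here is a sketch of the standardization variant mentioned above (zero mean, unit variance per feature); x_raw and x_std are names introduced here for illustration, and the intercept column is added after standardizing, since a constant column has zero variance:

    import numpy as np

    x_raw = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
    x_std = (x_raw - x_raw.mean(axis=0)) / x_raw.std(axis=0)   # zero mean, unit variance per column
    x = np.concatenate((np.ones((5, 1)), x_std), axis=1)       # prepend the intercept column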