python, numpy, machine-learning, scikit-learn, linear-regression

Linear Regression loss value increasing after each iteration of gradient descent


I am trying to implement multivariate linear regression (gradient descent with an MSE cost function), but the loss value keeps increasing exponentially with every iteration of gradient descent and I'm unable to figure out why.

import numpy as np
from sklearn.datasets import load_boston


class LinearRegression:

    def __init__(self):
        self.X = None  # The feature vectors [shape = (m, n)]
        self.y = None  # The regression outputs [shape = (m, 1)]
        self.W = None  # The parameter vector `W` [shape = (n, 1)]
        self.bias = None  # The bias value `b`
        self.lr = None  # Learning Rate `alpha`
        self.m = None
        self.n = None
        self.epochs = None

    def fit(self, X: np.ndarray, y: np.ndarray, epochs: int = 100, lr: float = 0.001):
        self.X = X  # shape (m, n)
        self.m, self.n = X.shape
        assert y.size == self.m and y.shape[0] == self.m
        self.y = np.reshape(y, (-1, 1))  # reshape y from (m,) or (m, 1) to (m, 1)
        assert self.y.shape == (self.m, 1)
        self.W = np.random.random((self.n, 1)) * 1e-3  # shape (n, 1)
        self.bias = 0.0
        self.epochs = epochs
        self.lr = lr
        self.minimize()

    def minimize(self, verbose: bool = True):
        for num_epoch in range(self.epochs):
            predictions = np.dot(self.X, self.W)

            assert predictions.shape == (self.m, 1)
            grad_w = (1/self.m) * np.sum((predictions-self.y) * self.X, axis=0)[:, np.newaxis]
            self.W = self.W - self.lr * grad_w
            assert self.W.shape == grad_w.shape
            loss = (1 / 2 * self.m) * np.sum(np.square(predictions - self.y))

            if verbose:
                print(f'Epoch : {num_epoch+1}/{self.epochs} \t Loss : {loss.item()}')


linear_regression = LinearRegression()
x_train, y_train = load_boston(return_X_y=True)
linear_regression.fit(x_train, y_train, 10)

I'm using the Boston housing dataset from sklearn.

P.S. I'd like to know what's causing this issue, how to fix it, and whether or not my implementation is correct.

Thanks


Solution

  • The error is in the gradient. A divergence like that is not something you should see with an iterative shrinkage-thresholding algorithm (ISTA) style solver. Look at your gradient computation: X has shape (m, n) and W has shape (n, 1), so (predictions - y) has shape (m, 1), and then you multiply it by X on the left, i.e. (m, 1) by (m, n)? It is not clear what NumPy ends up computing there, but it is not what you want to compute:

    grad_w = (1/self.m) * np.sum((predictions-self.y) * self.X, axis=0)[:, np.newaxis]

    Here the code should be a bit different: multiply a (n, m) matrix by a (m, 1) vector so that you get a (n, 1) gradient, the same shape as W:

    grad_w = (1 / self.m) * np.dot(self.X.T, predictions - self.y)

    so that the derivation is correct.

    I am also not sure why you use np.dot (which is a good idea) for the prediction but not for the gradient.
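
    As a quick sanity check of the shapes, here is a throwaway sketch with made-up sizes (m = 5, n = 3); the names are illustrative only, not part of your class:

    import numpy as np

    m, n = 5, 3                                # made-up sizes for illustration
    X = np.random.randn(m, n)                  # feature matrix, shape (m, n)
    W = np.random.randn(n, 1)                  # weights, shape (n, 1)
    y = np.random.randn(m, 1)                  # targets, shape (m, 1)

    residual = np.dot(X, W) - y                # shape (m, 1)
    grad_w = (1 / m) * np.dot(X.T, residual)   # (n, m) @ (m, 1) -> (n, 1)
    assert grad_w.shape == W.shape             # gradient has the same shape as W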

    You also do not need so many reshapes:

    import numpy as np
    from sklearn.datasets import load_boston
    
    A, b = load_boston(return_X_y=True)
    n_samples = A.shape[0]
    n_features = A.shape[1]
    
    def grad_linreg(x):
        """Least-squares gradient"""
        grad = (1. / n_samples) * np.dot(A.T, np.dot(A, x) - b)
        return grad
    
    def loss_linreg(x):
        """Least-squares loss"""
        f = (1. / (2. * n_samples)) * np.sum((b - np.dot(A, x)) ** 2)
        return f
    

    And then you check that your gradient is good:

    from scipy.optimize import check_grad
    from numpy.random import randn
    
    # check the gradient at a few random points
    for _ in range(4):
        print(check_grad(loss_linreg, grad_linreg, randn(n_features)))
    
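    check_grad returns the 2-norm of the difference between the analytical gradient and a finite-difference approximation of it, so values close to zero mean grad_linreg is consistent with loss_linreg.
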

    You can then build the model on that; a minimal gradient-descent loop along those lines is sketched below. If you want to test this with ISTA/FISTA and logistic/linear regression and LASSO/ridge, here is a Jupyter notebook with the theory and a working example.
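
    As a rough illustration of that last point, here is a minimal plain gradient-descent loop built on loss_linreg and grad_linreg above (a sketch, not the notebook's code; the step size 1/L, with L the Lipschitz constant of the least-squares gradient, is assumed as a safe default, and the names L, step and x exist only for this sketch):

    # Lipschitz constant of grad_linreg: largest singular value of A squared, over n_samples
    L = np.linalg.norm(A, ord=2) ** 2 / n_samples
    step = 1.0 / L
    
    x = np.zeros(n_features)
    for epoch in range(100):
        x = x - step * grad_linreg(x)           # plain gradient-descent update
        print(f'Epoch {epoch + 1:3d}  loss = {loss_linreg(x):.4f}')

    With this step size the loss should decrease at every epoch instead of blowing up.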