I have to implement stochastic gradient descent using the Python numpy library. For that purpose, I'm given the following function definitions:
def compute_stoch_gradient(y, tx, w):
    """Compute a stochastic gradient for batch data."""

def stochastic_gradient_descent(
        y, tx, initial_w, batch_size, max_epochs, gamma):
    """Stochastic gradient descent algorithm."""
I'm also given the following helper function:
def batch_iter(y, tx, batch_size, num_batches=1, shuffle=True):
    """
    Generate a minibatch iterator for a dataset.
    Takes as input two iterables (here the output desired values 'y' and the input data 'tx').
    Outputs an iterator which gives mini-batches of `batch_size` matching elements from `y` and `tx`.
    Data can be randomly shuffled to avoid ordering in the original data messing with the randomness of the minibatches.
    Example of use:
    for minibatch_y, minibatch_tx in batch_iter(y, tx, 32):
        <DO-SOMETHING>
    """
    data_size = len(y)
    if shuffle:
        shuffle_indices = np.random.permutation(np.arange(data_size))
        shuffled_y = y[shuffle_indices]
        shuffled_tx = tx[shuffle_indices]
    else:
        shuffled_y = y
        shuffled_tx = tx
    for batch_num in range(num_batches):
        start_index = batch_num * batch_size
        end_index = min((batch_num + 1) * batch_size, data_size)
        if start_index != end_index:
            yield shuffled_y[start_index:end_index], shuffled_tx[start_index:end_index]
I implemented the following two functions:
def compute_stoch_gradient(y, tx, w):
    """Compute a stochastic gradient for batch data."""
    e = y - tx.dot(w)
    return (-1 / y.shape[0]) * tx.transpose().dot(e)

def stochastic_gradient_descent(y, tx, initial_w, batch_size, max_epochs, gamma):
    """Stochastic gradient descent algorithm."""
    ws = [initial_w]
    losses = []
    w = initial_w
    for n_iter in range(max_epochs):
        for minibatch_y, minibatch_x in batch_iter(y, tx, batch_size):
            w = ws[n_iter] - gamma * compute_stoch_gradient(minibatch_y, minibatch_x, ws[n_iter])
            ws.append(np.copy(w))
            loss = y - tx.dot(w)
            losses.append(loss)
    return losses, ws
I'm not sure whether the iteration should be done in range(max_epochs) or over a larger range. I say this because I read that an epoch is "each time we run through the entire data set", so I think an epoch consists of more than one iteration...
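For example, should one epoch instead loop over enough mini-batches to cover the whole dataset? Something like the following sketch of what I mean (I'm just assuming num_batches should be roughly len(y) / batch_size, and I track the MSE as the loss):

import numpy as np

def stochastic_gradient_descent(y, tx, initial_w, batch_size, max_epochs, gamma):
    """Sketch: one epoch = one full pass over the data, split into mini-batches."""
    w = initial_w
    ws, losses = [initial_w], []
    # number of mini-batches needed to cover the dataset once (my assumption)
    num_batches = int(np.ceil(len(y) / batch_size))
    for epoch in range(max_epochs):
        for minibatch_y, minibatch_tx in batch_iter(y, tx, batch_size, num_batches=num_batches):
            w = w - gamma * compute_stoch_gradient(minibatch_y, minibatch_tx, w)
            ws.append(np.copy(w))
            e = y - tx.dot(w)
            losses.append(e.dot(e) / (2 * len(y)))  # MSE on the full dataset
    return losses, ws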
In a typical implementation, mini-batch gradient descent with batch size B picks B data points from the dataset at random and updates the weights based on the gradient computed on that subset. This process is repeated many times until convergence or until some maximum number of iterations is reached. Mini-batch with B=1 is plain SGD, which can be noisy.
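To illustrate, here is a rough sketch of that loop reusing your compute_stoch_gradient and batch_iter (not the only correct way to structure it):

def minibatch_gd(y, tx, initial_w, batch_size, max_iters, gamma):
    """Sketch: each iteration draws a random mini-batch of size B and takes one step."""
    w = initial_w
    for n_iter in range(max_iters):
        # batch_iter with shuffle=True yields one random subset of `batch_size` points
        for minibatch_y, minibatch_tx in batch_iter(y, tx, batch_size, num_batches=1):
            grad = compute_stoch_gradient(minibatch_y, minibatch_tx, w)
            w = w - gamma * grad  # with batch_size=1 this is plain SGD
    return w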
Along with the above comments, you may want to play with the batch size and the learning rate (step size), since they have a significant impact on the convergence rate of stochastic and mini-batch gradient descent.
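For instance, you could sweep a few (batch_size, gamma) combinations with your own function and compare the final errors; the synthetic data and the grids below are just placeholders:

import numpy as np

np.random.seed(0)
tx = np.c_[np.ones(200), np.random.randn(200, 2)]            # toy design matrix
y = tx.dot(np.array([1.0, 2.0, -1.0])) + 0.1 * np.random.randn(200)
initial_w = np.zeros(3)

for batch_size in [1, 10, 50]:
    for gamma in [0.01, 0.1]:
        losses, ws = stochastic_gradient_descent(y, tx, initial_w, batch_size, 100, gamma)
        e = y - tx.dot(ws[-1])
        print("B=%d, gamma=%.2f -> final MSE %.4f" % (batch_size, gamma, e.dot(e) / (2 * len(y))))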
The following plots show the impact of these two parameters on the convergence rate of SGD with logistic regression, while doing sentiment analysis on an Amazon product review dataset (an assignment that appeared in the Coursera course Machine Learning: Classification by the University of Washington):
For more detailed information on this, you may refer to https://sandipanweb.wordpress.com/2017/03/31/online-learning-sentiment-analysis-with-logistic-regression-via-stochastic-gradient-ascent/