I am aware that SGD has been asked about before on SO, but I wanted an opinion on my code below:
import numpy as np
import matplotlib.pyplot as plt
# Generating data
m,n = 10000,4
x = np.random.normal(loc=0,scale=1,size=(m,4))
theta_0 = 2
theta = np.append([],[1,0.5,0.25,0.125]).reshape(n,1)
y = np.matmul(x,theta) + theta_0*np.ones(m).reshape((m,1)) + np.random.normal(loc=0,scale=0.25,size=(m,1))
# input features
x0 = np.ones([m,1])
X = np.append(x0,x,axis=1)
# defining the cost function
def compute_cost(X,y,theta_GD):
    return np.sum(np.power(y-np.matmul(np.transpose(theta_GD),X),2))/2
# initializations
theta_GD = np.append([theta_0],[theta]).reshape(n+1,1)
alp = 1e-5
num_iterations = 10000
# Batch Sum
def batch(i,j,theta_GD):
    batch_sum = 0
    for k in range(i,i+9):
        batch_sum += float((y[k]-np.transpose(theta_GD).dot(X[k]))*X[k][j])
    return batch_sum
# Gradient Step
def gradient_step(theta_current, X, y, alp, i):
    for j in range(0,n):
        theta_current[j] -= alp*batch(i,j,theta_current)/10
    theta_updated = theta_current
    return theta_updated
# gradient descent
cost_vec = []
for i in range(num_iterations):
    cost_vec.append(compute_cost(X[i], y[i], theta_GD))
    theta_GD = gradient_step(theta_GD, X, y, alp, i)
plt.plot(cost_vec)
plt.xlabel('iterations')
plt.ylabel('cost')
I was trying mini-batch GD with a batch size of 10, and I am getting extremely oscillatory behavior for the MSE. Where's the issue? Thanks.
P.S. I was following Andrew Ng's https://www.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent
This is a description of the underlying mathematical principle, not a code-based solution...
The cost function is highly nonlinear (np.power()) and recursive, and recursive nonlinear systems can oscillate (self-oscillation, https://en.wikipedia.org/wiki/Self-oscillation). In mathematics this is the subject of chaos theory / the theory of nonlinear dynamical systems ( https://pdfs.semanticscholar.org/8e0d/ee3c433b1806bfa0d98286836096f8c2681d.pdf ); cf. the logistic map ( https://en.wikipedia.org/wiki/Logistic_map ). The logistic map oscillates if the growth factor r exceeds a threshold; the growth factor is a measure of how much energy is in the system.
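To see the threshold behaviour concretely, here is a minimal sketch (my own illustration, not from the linked sources): for r = 2.9 the iterates settle on a single fixed point, while for r = 3.5 they keep cycling between several values.

def logistic_map_step(x, r):
    # one step of the logistic map: x -> r * x * (1 - x)
    return r * x * (1 - x)

for r in (2.9, 3.5):            # below / above the first bifurcation at r = 3
    x = 0.5
    for _ in range(1000):       # run past the transient
        x = logistic_map_step(x, r)
    tail = []
    for _ in range(4):          # record a few final iterates
        x = logistic_map_step(x, r)
        tail.append(round(x, 4))
    print("r =", r, "->", tail)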
In your code the critical parts are the cost function, the cost vector (that is, the history of the system), and the time steps:
def compute_cost(X,y,theta_GD):
    return np.sum(np.power(y-np.matmul(np.transpose(theta_GD),X),2))/2
cost_vec = []
for i in range(num_iterations):
    cost_vec.append(compute_cost(X[i], y[i], theta_GD))
    theta_GD = gradient_step(theta_GD, X, y, alp, i)
# Gradient Step
def gradient_step(theta_current, X, y, alp, i):
    for j in range(0,n):
        theta_current[j] -= alp*batch(i,j,theta_current)/10
    theta_updated = theta_current
    return theta_updated
If you compare this to an implementation of the logistic map, you can see the similarities:
from pylab import show, scatter, xlim, ylim
from random import randint

iter = 1000      # Number of iterations per point
seed = 0.5       # Seed value for x in (0, 1)
spacing = .0001  # Spacing between points on domain (r-axis)
res = 8          # Largest n-cycle visible

# Initialize r and x lists
rlist = []
xlist = []

def logisticmap(x, r):  # <-- the nonlinear function
    return x * r * (1 - x)

# Return nth iteration of logisticmap(x, r)
def iterate(n, x, r):
    for i in range(1, n):
        x = logisticmap(x, r)
    return x

# Generate list values -- iterate for each value of r
for r in [i * spacing for i in range(int(1/spacing), int(4/spacing))]:
    rlist.append(r)
    # xlist plays a role similar to cost_vec: the history of the system
    # (res//2 keeps the arguments of randint integers under Python 3)
    xlist.append(iterate(randint(iter - res//2, iter + res//2), seed, r))

scatter(rlist, xlist, s=.01)
xlim(0.9, 4.1)
ylim(-0.1, 1.1)
show()
Source of the code: https://www.reddit.com/r/learnpython/comments/zzh28/a_simple_python_implementation_of_the_logistic_map/
Based on this, you can try to modify your gradient step or your cost function by introducing a factor, similar to the growth factor in the logistic map, to reduce the intensity of the oscillation of the system:
def gradient_step(theta_current, X, y, alp, i):
    for j in range(0,n):
        # <-- introduce a factor here to keep the system under the oscillation threshold
        theta_current[j] -= alp*batch(i,j,theta_current)/10
    theta_updated = theta_current
    return theta_updated
or
def compute_cost(X,y,theta_GD):
    # <-- introduce a factor here to keep the system under the oscillation threshold
    return np.sum(np.power(y-np.matmul(np.transpose(theta_GD),X),2))/2
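One concrete (and purely illustrative) way to introduce such a factor, assuming the question's batch(), n, X and y are in scope, is to shrink the effective step size as the iteration index i grows; the decay coefficient below is a hypothetical choice, not taken from the question:

def gradient_step_damped(theta_current, X, y, alp, i, decay=1e-3):
    # decay is a hypothetical damping coefficient: the effective step size
    # alp_t shrinks as i grows, playing a role similar to keeping the
    # growth factor of the logistic map below its oscillation threshold
    alp_t = alp / (1.0 + decay * i)
    for j in range(0, n + 1):   # n + 1 parameters: bias plus n weights
        # batch() sums (y_k - theta^T x_k) * x_kj, which is minus the
        # gradient of the squared-error cost, so a descent step adds it
        theta_current[j] += alp_t * batch(i, j, theta_current) / 10
    return theta_current

This is just a learning-rate decay schedule; any factor that shrinks the effective step over time serves the same purpose here.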
If this does not work, maybe follow the suggestions in https://www.reddit.com/r/MachineLearning/comments/3y9gkj/how_can_i_avoid_oscillations_in_gradient_descent/ (time steps, ...).
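For completeness, one standard remedy that comes up in such discussions is classical momentum, which averages successive gradients and thereby damps oscillation. The sketch below again assumes the question's batch() and n are in scope; the velocity buffer and the beta coefficient are illustrative additions, not something from the linked thread:

velocity = np.zeros(n + 1)  # illustrative per-parameter velocity buffer

def gradient_step_momentum(theta_current, X, y, alp, i, beta=0.9):
    # classical momentum: v <- beta * v + g; theta <- theta - alp * v.
    # beta close to 1 averages many past gradients, which smooths oscillation.
    for j in range(0, n + 1):
        g = -batch(i, j, theta_current) / 10  # gradient component of the cost
        velocity[j] = beta * velocity[j] + g
        theta_current[j] -= alp * velocity[j]
    return theta_current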