python, batch-processing, gradient-descent, stochastic

SGD implementation Python


I am aware that SGD has been asked about before on SO, but I wanted an opinion on my code below:

import numpy as np
import matplotlib.pyplot as plt

# Generating data

m,n = 10000,4
x = np.random.normal(loc=0,scale=1,size=(m,4))
theta_0 = 2
theta = np.append([],[1,0.5,0.25,0.125]).reshape(n,1)
y = np.matmul(x,theta) + theta_0*np.ones(m).reshape((m,1)) + np.random.normal(loc=0,scale=0.25,size=(m,1))


# input features
x0 = np.ones([m,1])
X = np.append(x0,x,axis=1)

# defining the cost function
def compute_cost(X,y,theta_GD):
       return np.sum(np.power(y-np.matmul(np.transpose(theta_GD),X),2))/2


# initializations

theta_GD = np.append([theta_0],[theta]).reshape(n+1,1)
alp = 1e-5
num_iterations = 10000

# Batch Sum
def batch(i,j,theta_GD):
    batch_sum = 0
    for k in range(i,i+9):
        batch_sum += float((y[k]-np.transpose(theta_GD).dot(X[k]))*X[k][j])
    return batch_sum

# Gradient Step
def gradient_step(theta_current, X, y, alp,i):
    for j in range(0,n):
            theta_current[j]-= alp*batch(i,j,theta_current)/10

    theta_updated = theta_current

    return theta_updated

# gradient descent
cost_vec = []
for i in range(num_iterations):

    cost_vec.append(compute_cost(X[i], y[i], theta_GD))
    theta_GD = gradient_step(theta_GD, X, y, alp,i)


plt.plot(cost_vec)
plt.xlabel('iterations')
plt.ylabel('cost')

I was trying mini-batch GD with a batch size of 10, and I am getting extremely oscillatory behavior for the MSE. Where's the issue? Thanks.

P.S. I was following Andrew Ng's https://www.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent


Solution

  • This is a description of the underlying mathematical principle, not a code-based solution...

    The cost function is highly nonlinear (np.power()) and recursive, and recursive, nonlinear systems can oscillate (self-oscillation, https://en.wikipedia.org/wiki/Self-oscillation ). In mathematics this is the subject of chaos theory / the theory of nonlinear dynamical systems ( https://pdfs.semanticscholar.org/8e0d/ee3c433b1806bfa0d98286836096f8c2681d.pdf ), cf. the logistic map ( https://en.wikipedia.org/wiki/Logistic_map ). The logistic map oscillates once the growth factor r exceeds a threshold; the growth factor is a measure of how much energy is in the system.
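
    For intuition, here is a minimal sketch (not taken from the question's code) of that threshold: below r = 3 the iterates of the logistic map settle to a fixed point, just above it they oscillate between two values.

    # Minimal illustration of the oscillation threshold of the logistic map.
    def logistic_map(x, r):
        return r * x * (1 - x)

    for r in (2.8, 3.2):
        x = 0.5
        for _ in range(1000):            # discard the transient
            x = logistic_map(x, r)
        tail = []
        for _ in range(4):               # record the long-run behaviour
            x = logistic_map(x, r)
            tail.append(round(x, 4))
        print(r, tail)                   # r=2.8 -> a single value, r=3.2 -> a 2-cycle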

    In your code the critical parts are the cost function, the cost vector (that is, the history of the system), and the time steps:

    def compute_cost(X,y,theta_GD):
       return np.sum(np.power(y-np.matmul(np.transpose(theta_GD),X),2))/2
    
    cost_vec = []
    for i in range(num_iterations):
    
        cost_vec.append(compute_cost(X[i], y[i], theta_GD))
        theta_GD = gradient_step(theta_GD, X, y, alp,i)
    
    # Gradient Step
    def gradient_step(theta_current, X, y, alp,i):
        for j in range(0,n):
                theta_current[j]-= alp*batch(i,j,theta_current)/10
    
        theta_updated = theta_current
        return theta_updated
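
    A quick way to see how much of the jitter in cost_vec comes from evaluating the cost on a single example (X[i], y[i]) each iteration, rather than from the trajectory of theta_GD itself, is to also record the cost over the whole data set. A minimal sketch, reusing np, X, y and theta_GD as they are defined in the question:

    def compute_full_cost(X, y, theta_GD):
        # squared-error cost over the whole data set -- a much smoother view of
        # the same history than the per-example values stored in cost_vec
        residual = y - X.dot(theta_GD)            # shape (m, 1)
        return float(np.sum(residual ** 2)) / (2 * len(y))

    Appending compute_full_cost(X, y, theta_GD) to a second list inside the same loop and plotting both curves side by side makes the comparison straightforward.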
    

    If you compare the code from the question to an implementation of the logistic map, you can see the similarities:

    from pylab import show, scatter, xlim, ylim
    from random import randint
    
    iter = 1000         # Number of iterations per point
    seed = 0.5          # Seed value for x in (0, 1)
    spacing = .0001     # Spacing between points on domain (r-axis)
    res = 8             # Largest n-cycle visible
    
    # Initialize r and x lists
    rlist = []
    xlist = []
    
    def logisticmap(x, r):                        # <-- the nonlinear function
    
        return x * r * (1 - x)
    
    # Return nth iteration of logisticmap(x, r)
    def iterate(n, x, r):
    
        for i in range(1,n):
            x = logisticmap(x, r)
    
        return x
    
    # Generate list values -- iterate for each value of r
    for r in [i * spacing for i in range(int(1/spacing),int(4/spacing))]:
       rlist.append(r) 
       xlist.append(iterate(randint(iter - res//2, iter + res//2), seed, r))   # <-- similar to cost_vec, the history of the system
    
    scatter(rlist, xlist, s = .01)
    xlim(0.9, 4.1)
    ylim(-0.1,1.1)
    show()
    

    Source of the code: https://www.reddit.com/r/learnpython/comments/zzh28/a_simple_python_implementation_of_the_logistic_map/

    Based on this, you can try to modify your cost function by introducing a factor, similar to the growth factor in the logistic map, to reduce the intensity of the oscillation of the system (a concrete sketch of such a factor follows after the two snippets below):

    def gradient_step(theta_current, X, y, alp,i):
        for j in range(0,n):
                theta_current[j]-= alp*batch(i,j,theta_current)/10   # <-- introduce a factor somewhere to keep the system under the oscillation threshold
    
        theta_updated = theta_current
    
        return theta_updated     
    

    or

    def compute_cost(X,y,theta_GD):
       return np.sum(np.power(y-np.matmul(np.transpose(theta_GD),X),2))/2  # <-- introduce a factor somewhere to keep the system under the oscillation threshold
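
    One concrete way to read "introduce a factor" is to let the effective step size shrink with the iteration count, so the updates stay below the level at which the history of the system starts to cycle. A minimal sketch of such a damped gradient step, reusing batch() and n from the question; the name gradient_step_damped and the value of decay are illustrative choices of mine, not something prescribed above, and decay would need tuning:

    def gradient_step_damped(theta_current, X, y, alp, i, decay=1e-4):
        # the effective step size decays with the iteration count i, playing the
        # role of keeping the "growth factor" of the system below its threshold
        effective_alp = alp / (1.0 + decay * i)
        for j in range(0, n):
            theta_current[j] -= effective_alp * batch(i, j, theta_current) / 10
        return theta_current
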
    

    If this does not work, maybe follow the suggestions in https://www.reddit.com/r/MachineLearning/comments/3y9gkj/how_can_i_avoid_oscillations_in_gradient_descent/ (time steps, ...).