I am aware that SGD has been asked about before on SO, but I wanted an opinion on my code below:
import numpy as np
import matplotlib.pyplot as plt
# Generating data
m,n = 10000,4
x = np.random.normal(loc=0,scale=1,size=(m,4))
theta_0 = 2
theta = np.append([],[1,0.5,0.25,0.125]).reshape(n,1)
y = np.matmul(x,theta) + theta_0*np.ones(m).reshape((m,1)) + np.random.normal(loc=0,scale=0.25,size=(m,1))
# input features
x0 = np.ones([m,1])
X = np.append(x0,x,axis=1)
# defining the cost function
def compute_cost(X,y,theta_GD):
    return np.sum(np.power(y-np.matmul(np.transpose(theta_GD),X),2))/2
# initializations
theta_GD = np.append([theta_0],[theta]).reshape(n+1,1)
alp = 1e-5
num_iterations = 10000
# Batch Sum
def batch(i,j,theta_GD):
    batch_sum = 0
    for k in range(i,i+9):
        batch_sum += float((y[k]-np.transpose(theta_GD).dot(X[k]))*X[k][j])
    return batch_sum
# Gradient Step
def gradient_step(theta_current, X, y, alp, i):
    for j in range(0,n):
        theta_current[j] -= alp*batch(i,j,theta_current)/10
    theta_updated = theta_current
    return theta_updated
# gradient descent
cost_vec = []
for i in range(num_iterations):
    cost_vec.append(compute_cost(X[i], y[i], theta_GD))
    theta_GD = gradient_step(theta_GD, X, y, alp, i)
plt.plot(cost_vec)
plt.xlabel('iterations')
plt.ylabel('cost')
I was trying mini-batch GD with a batch size of 10, and I am getting extremely oscillatory behavior for the MSE. Where's the issue? Thanks.
P.S. I was following Andrew Ng's https://www.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent
This is a description of the underlying mathematical principle, not a code-based solution...
The cost function is highly nonlinear (np.power()) and recursive, and recursive nonlinear systems can oscillate (self-oscillation, https://en.wikipedia.org/wiki/Self-oscillation). In mathematics this is the subject of chaos theory / the theory of nonlinear dynamical systems ( https://pdfs.semanticscholar.org/8e0d/ee3c433b1806bfa0d98286836096f8c2681d.pdf ); cf. the logistic map ( https://en.wikipedia.org/wiki/Logistic_map ). The logistic map oscillates if the growth factor r exceeds a threshold; the growth factor is a measure of how much energy is in the system.
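To see the threshold behaviour concretely, here is a minimal sketch (my own illustration, not from the linked sources): for r = 2.9 the iterates settle on a single fixed point, while for r = 3.5 they keep cycling between several values.

def logistic_map_step(x, r):
    # one step of the logistic map: x -> r * x * (1 - x)
    return r * x * (1 - x)

for r in (2.9, 3.5):            # below / above the first bifurcation at r = 3
    x = 0.5
    for _ in range(1000):       # run past the transient
        x = logistic_map_step(x, r)
    tail = []
    for _ in range(4):          # record a few final iterates
        x = logistic_map_step(x, r)
        tail.append(round(x, 4))
    print("r =", r, "->", tail)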
In your code the critical parts are the cost function, the cost vector (that is, the history of the system), and the time steps:
def compute_cost(X,y,theta_GD):
    return np.sum(np.power(y-np.matmul(np.transpose(theta_GD),X),2))/2
cost_vec = []
for i in range(num_iterations):
    cost_vec.append(compute_cost(X[i], y[i], theta_GD))
    theta_GD = gradient_step(theta_GD, X, y, alp, i)
# Gradient Step
def gradient_step(theta_current, X, y, alp, i):
    for j in range(0,n):
        theta_current[j] -= alp*batch(i,j,theta_current)/10
    theta_updated = theta_current
    return theta_updated
If you compare this to an implementation of the logistic map, you can see the similarities:
from pylab import show, scatter, xlim, ylim
from random import randint

iter = 1000      # Number of iterations per point
seed = 0.5       # Seed value for x in (0, 1)
spacing = .0001  # Spacing between points on domain (r-axis)
res = 8          # Largest n-cycle visible

# Initialize r and x lists
rlist = []
xlist = []

def logisticmap(x, r):  # <-- the nonlinear function
    return x * r * (1 - x)

# Return nth iteration of logisticmap(x, r)
def iterate(n, x, r):
    for i in range(1, n):
        x = logisticmap(x, r)
    return x

# Generate list values -- iterate for each value of r
for r in [i * spacing for i in range(int(1/spacing), int(4/spacing))]:
    rlist.append(r)
    # xlist plays a role similar to cost_vec: the history of the system
    # (res//2 keeps the arguments of randint integers under Python 3)
    xlist.append(iterate(randint(iter - res//2, iter + res//2), seed, r))

scatter(rlist, xlist, s=.01)
xlim(0.9, 4.1)
ylim(-0.1, 1.1)
show()
Source of the code: https://www.reddit.com/r/learnpython/comments/zzh28/a_simple_python_implementation_of_the_logistic_map/
Based on this, you can try to modify your gradient step or your cost function by introducing a factor, similar to the growth factor in the logistic map, to reduce the intensity of the oscillation of the system:
def gradient_step(theta_current, X, y, alp, i):
    for j in range(0,n):
        # <-- introduce a factor here to keep the system under the oscillation threshold
        theta_current[j] -= alp*batch(i,j,theta_current)/10
    theta_updated = theta_current
    return theta_updated
or
def compute_cost(X,y,theta_GD):
    # <-- introduce a factor here to keep the system under the oscillation threshold
    return np.sum(np.power(y-np.matmul(np.transpose(theta_GD),X),2))/2
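One concrete (and purely illustrative) way to introduce such a factor, assuming the question's batch(), n, X and y are in scope, is to shrink the effective step size as the iteration index i grows; the decay coefficient below is a hypothetical choice, not taken from the question:

def gradient_step_damped(theta_current, X, y, alp, i, decay=1e-3):
    # decay is a hypothetical damping coefficient: the effective step size
    # alp_t shrinks as i grows, playing a role similar to keeping the
    # growth factor of the logistic map below its oscillation threshold
    alp_t = alp / (1.0 + decay * i)
    for j in range(0, n + 1):   # n + 1 parameters: bias plus n weights
        # batch() sums (y_k - theta^T x_k) * x_kj, which is minus the
        # gradient of the squared-error cost, so a descent step adds it
        theta_current[j] += alp_t * batch(i, j, theta_current) / 10
    return theta_current

This is just a learning-rate decay schedule; any factor that shrinks the effective step over time serves the same purpose here.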
If this does not work, maybe follow the suggestions in https://www.reddit.com/r/MachineLearning/comments/3y9gkj/how_can_i_avoid_oscillations_in_gradient_descent/ (time steps, ...).
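For completeness, one standard remedy that comes up in such discussions is classical momentum, which averages successive gradients and thereby damps oscillation. The sketch below again assumes the question's batch() and n are in scope; the velocity buffer and the beta coefficient are illustrative additions, not something from the linked thread:

velocity = np.zeros(n + 1)  # illustrative per-parameter velocity buffer

def gradient_step_momentum(theta_current, X, y, alp, i, beta=0.9):
    # classical momentum: v <- beta * v + g; theta <- theta - alp * v.
    # beta close to 1 averages many past gradients, which smooths oscillation.
    for j in range(0, n + 1):
        g = -batch(i, j, theta_current) / 10  # gradient component of the cost
        velocity[j] = beta * velocity[j] + g
        theta_current[j] -= alp * velocity[j]
    return theta_current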