python numpy machine-learning gradient-descent

Why does the cost function for my MLP flatten out?

I am very new to machine learning and am trying to implement an MLP. However, the cost function seems to get stuck in a local minimum before reaching the global minimum. I plotted the cost as a function of the iteration (including 0 on the y-axis so as not to be fooled by where the axis starts). Here is the code from my attempt:

import numpy as np

class NNet(object):
    def __init__(self, n_in, n_hidden, n_out):
        self.n_in = n_in
        self.n_hidden = n_hidden
        self.n_out = n_out

        self.W1 = np.random.randn(n_in, n_hidden)
        self.W2 = np.random.randn(n_hidden, n_out)

        self.b1 = np.random.randn(n_hidden,)
        self.b2 = np.random.randn(n_out,)

    def sigmoid(self, z):
        return 1/(1 + np.exp(-z))

    def sig_prime(self, z):
        return (np.exp(-z))/((1+np.exp(-z))**2)

    def propagate_forward(self, X):
        self.z1 = np.dot(self.W1.T, X) + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.W2.T, self.a1) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def cost(self, y, y_hat):
        return np.mean([np.sum((y[i] - y_hat[i])**2) for i in range(y.shape[0])])/2

    def cost_grad(self, X, y):
        y_hat = self.propagate_forward(X)

        d2 = np.multiply(self.sig_prime(self.z2), -(y - y_hat))
        gJ_W2 = np.matrix(np.multiply(self.a1.T, d2))

        d1 = np.dot(self.W2, d2)*self.sig_prime(self.z1)
        gJ_W1 = np.dot(np.matrix(X).T, np.matrix(d1))

        return [gJ_W1, d1, gJ_W2, d2]

m = 1000
n = 2  # each sample is a 2-D point [r*cos(theta), r*sin(theta)]

X = np.zeros((m, n))
y = np.zeros((m,1))

import random
import math

i = 0
for r, theta in zip(np.linspace(0, 5, num=m), np.linspace(0, 8 * math.pi, num=m)):
    r += random.random()
    X[i] = [r * math.cos(theta), r * math.sin(theta)]
    if i < 333:
        y[i] = 0
    elif i < 666:
        y[i] = 1
    else:
        y[i] = 2
    i += 1

nnet = NNet(n, 5, 1)
learning_rate = 0.2
improvement_threshold = 0.995
cost = np.inf

xs = []
ys = []

iter = 0
while cost > 0.2:
    cost = nnet.cost(y, [nnet.propagate_forward(x_train) for x_train in X])

    if iter % 100 == 0:
        xs.append(iter)
        ys.append(cost)
        print("Cost", cost)

    if iter >= 1000:
        print("Gradient descent is taking too long, giving up.")
        break

    cost_grads = [nnet.cost_grad(x_train, y_train) for x_train, y_train in zip(X, y)]
    gW1 = [grad[0] for grad in cost_grads]
    gb1 = [grad[1] for grad in cost_grads]
    gW2 = [grad[2] for grad in cost_grads]
    gb2 = [grad[3] for grad in cost_grads]

    nnet.W1 -= np.mean(gW1, axis=0)/2 * learning_rate
    nnet.b1 -= np.mean(gb1, axis=0)/2 * learning_rate
    nnet.W2 -= np.mean(gW2, axis=0).T/2 * learning_rate
    nnet.b2 -= np.mean(gb2, axis=0)/2 * learning_rate

    iter += 1

Why does the cost stop improving after a certain point? Also, any other tips would be highly appreciated.

The generated toy dataset looks like this

Solution

Your goal seems to be to predict which class in {0, 1, 2} each data point belongs to.

The output of your net is a sigmoid (sigm(x) lies strictly between 0 and 1) and you're training with mean squared error (MSE), so the model can never predict a value above 1. It is therefore always wrong when the class to predict is 2.
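To put a number on this, here is a rough sketch of the lowest cost a sigmoid output could even approach on these labels. It assumes the label layout from the question (333 zeros, 333 ones, 334 twos) and the same cost definition (mean squared error divided by 2); `best_possible` and `cost_floor` are names made up for this illustration.

```python
import numpy as np

# Labels laid out as in the question: 333 zeros, 333 ones, 334 twos.
y = np.concatenate([np.zeros(333), np.ones(333), np.full(334, 2.0)])

# A sigmoid output stays between 0 and 1, so the closest it can get to
# each target is the target clipped to [0, 1].
best_possible = np.clip(y, 0.0, 1.0)

# Same cost as in the question: mean squared error divided by 2.
cost_floor = np.mean((y - best_possible) ** 2) / 2
print(cost_floor)
```

This comes out at about 0.167, and that is only a limit: the sigmoid never actually reaches 0 or 1. With a stopping threshold of 0.2 there is very little room left, which would be consistent with the cost curve flattening.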

The cost probably flattens because your sigmoid unit saturates (when trying to predict 2), and the gradient of a saturated sigmoid is close to 0.
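A small sketch of what saturation means numerically. The derivative is written here as s * (1 - s), which is algebraically the same as the `sig_prime` in the question; the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sig_prime(z):
    # Derivative of the sigmoid, expressed through its own output.
    s = sigmoid(z)
    return s * (1.0 - s)

# As z grows, the output creeps toward 1 while the derivative collapses.
for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid(z), sig_prime(z))
```

The derivative peaks at 0.25 for z = 0 and shrinks rapidly as z grows. Since every weight update is multiplied by this factor, pushing the output toward 1 to chase a target of 2 leaves almost no gradient to learn from.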

For classification, neural nets normally end with a softmax layer and are trained using cross-entropy.
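As a minimal sketch of that setup (not a drop-in replacement for the code in the question), the output layer would produce three logits per sample, one per class, and the labels would be one-hot encoded. The helper names `softmax`, `one_hot` and `cross_entropy` and the example logits are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum first for numerical stability.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def one_hot(labels, n_classes):
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(n_classes)[labels]

def cross_entropy(probs, targets):
    # Mean over samples of -sum_k targets_k * log(probs_k).
    eps = 1e-12  # avoids log(0)
    return -np.mean(np.sum(targets * np.log(probs + eps), axis=-1))

# Two made-up samples with three output logits each.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])

probs = softmax(logits)
targets = one_hot(labels, 3)
loss = cross_entropy(probs, targets)

# Gradient of the mean loss with respect to the logits.
grad_logits = (probs - targets) / len(labels)
print(loss)
print(grad_logits)
```

A convenient property of this pairing is that the gradient with respect to the logits is just `probs - targets`, with no sigmoid-derivative factor that can vanish when a unit saturates.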

If you want to keep using MSE and sigmoid units for classification, you should consider predicting only two classes at a time, in a one-vs-one or one-vs-all fashion.
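A rough sketch of the one-vs-all variant: each class gets its own binary 0/1 target column, which a sigmoid output can actually reach. The `scores` array below is made-up data standing in for the outputs of three sigmoid units (or three separate binary nets); it is not produced by the code in the question.

```python
import numpy as np

labels = np.array([0, 1, 2, 2, 0])
n_classes = 3

# Column k is the binary target for "class k vs. everything else".
ova_targets = (labels[:, None] == np.arange(n_classes)).astype(float)
print(ova_targets)

# Hypothetical sigmoid outputs, one column per binary classifier.
scores = np.array([[0.9, 0.2, 0.1],
                   [0.3, 0.7, 0.2],
                   [0.1, 0.4, 0.8],
                   [0.2, 0.1, 0.6],
                   [0.6, 0.5, 0.1]])

# The predicted class is the classifier that responds most strongly.
predicted = np.argmax(scores, axis=1)
print(predicted)
```

With this encoding the network in the question would need `n_out = 3` instead of 1, and every target stays inside the range the sigmoid can produce.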

Anyway, if you only do two-class classification by rounding the output to 0 or 1, it seems to work. The cost decreases and the accuracy rises (quickly modified code):