python numpy machine-learning gradient-descent

Why does cost function for MLP flatten?


I am very new to machine learning and am trying to implement an MLP. However, the cost function seems to settle in a local minimum before reaching the global minimum. I plotted the cost as a function of iteration (including 0 on the y-axis so as not to be fooled by where the axis starts).

[Figure: Cost plot]

Here is the code from my attempt:

import numpy as np

class NNet(object):
    def __init__(self, n_in, n_hidden, n_out):
        self.n_in = n_in
        self.n_hidden = n_hidden
        self.n_out = n_out

        self.W1 = np.random.randn(n_in, n_hidden)
        self.W2 = np.random.randn(n_hidden, n_out)

        self.b1 = np.random.randn(n_hidden,)
        self.b2 = np.random.randn(n_out,)

    def sigmoid(self, z):
        return 1/(1 + np.exp(-z))

    def sig_prime(self, z):
        return (np.exp(-z))/((1+np.exp(-z))**2)

    def propagate_forward(self, X):
        self.z1 = np.dot(self.W1.T, X) + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.W2.T, self.a1) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def cost(self, y, y_hat):
        return np.mean([np.sum((y[i] - y_hat[i])**2) for i in range(y.shape[0])])/2

    def cost_grad(self, X, y):
        y_hat = self.propagate_forward(X)

        d2 = np.multiply(self.sig_prime(self.z2), -(y - y_hat))
        gJ_W2 = np.matrix(np.multiply(self.a1.T, d2))

        d1 = np.dot(self.W2, d2)*self.sig_prime(self.z1)
        gJ_W1 = np.dot(np.matrix(X).T, np.matrix(d1))

        return [gJ_W1, d1, gJ_W2, d2]

m = 1000
n = 2  # each sample has two features (the x and y coordinates assigned in the loop below)

X = np.zeros((m, n))
y = np.zeros((m,1))

import random
import math

i = 0
for r, theta in zip(np.linspace(0, 5, num=m), np.linspace(0, 8 * math.pi, num=m)):
    r += random.random()
    X[i] = [r * math.cos(theta), r * math.sin(theta)]
    if i < 333:
        y[i] = 0
    elif i < 666:
        y[i] = 1
    else:
        y[i] = 2
    i += 1

nnet = NNet(n, 5, 1)
learning_rate = 0.2
improvement_threshold = 0.995
cost = np.inf

xs = []
ys = []

iter = 0
while cost > 0.2:
    # This list comprehension is cut off in the source; "in X])" is the assumed completion.
    cost = nnet.cost(y, [nnet.propagate_forward(x_train) for x_train in X])

    if iter % 100 == 0:
        xs.append(iter)
        ys.append(cost)
        print("Cost", cost)

    if iter >= 1000:
        print("Gradient descent is taking too long, giving up.")
        break

    cost_grads = [nnet.cost_grad(x_train, y_train) for x_train, y_train in zip(X, y)]
    gW1 = [grad[0] for grad in cost_grads]
    gb1 = [grad[1] for grad in cost_grads]
    gW2 = [grad[2] for grad in cost_grads]
    gb2 = [grad[3] for grad in cost_grads]

    nnet.W1 -= np.mean(gW1, axis=0)/2 * learning_rate
    nnet.b1 -= np.mean(gb1, axis=0)/2 * learning_rate
    nnet.W2 -= np.mean(gW2, axis=0).T/2 * learning_rate
    nnet.b2 -= np.mean(gb2, axis=0)/2 * learning_rate

    iter += 1

Why is the cost not improving after a certain point? Also any other tips are highly appreciated.

The generated toy dataset looks like this:

[Figure: Data]


Solution

  • Your goal seems to be to predict which class in {0, 1, 2} each data point belongs to.

    The output of your net is a sigmoid, whose value always lies in (0, 1), and you're training with mean squared error (MSE), so the model cannot predict a value above 1. It is therefore always wrong when the class to predict is 2.

    • The cost probably flattens because your sigmoid unit saturates (when trying to predict 2), and the gradient of a saturated sigmoid approaches 0. [Figure: Sigmoid]
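To make the saturation argument concrete, here is a small sketch. It is not the asker's network: the helper names mirror the `sigmoid` and `sig_prime` methods from the question, and the sample inputs are arbitrary. It evaluates the sigmoid and its derivative at increasingly large inputs, then computes the lowest cost that an output confined to [0, 1] could approach, assuming the 333/333/334 label split and the cost definition from the question.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sig_prime(z):
    # Algebraically the same as exp(-z) / (1 + exp(-z))**2 from the question
    s = sigmoid(z)
    return s * (1.0 - s)

# Arbitrary pre-activation values, chosen only for illustration
z = np.array([0.0, 2.0, 5.0, 10.0, 20.0])
outputs = sigmoid(z)
grads = sig_prime(z)
for zi, a, g in zip(z, outputs, grads):
    print(f"z={zi:5.1f}  sigmoid={a:.6f}  derivative={g:.2e}")

# Labels laid out as in the question: 333 zeros, 333 ones, 334 twos
y = np.concatenate([np.zeros(333), np.ones(333), np.full(334, 2.0)])

# The best a sigmoid output could do is match 0 and 1 and sit at 1 for class 2
best_prediction = np.clip(y, 0.0, 1.0)
cost_floor = np.mean((y - best_prediction) ** 2) / 2
print("Cost floor for a [0, 1]-bounded output:", cost_floor)
```

The output stays below 1 while its derivative shrinks toward 0, so pushing harder toward the target 2 yields ever smaller updates. Under these assumptions the cost cannot drop below roughly 0.167, and the sigmoid only approaches that value asymptotically.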

    For classification, neural nets normally end with a softmax layer and are trained using cross-entropy.
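As a rough illustration of that setup, the following sketch computes a softmax output, the cross-entropy loss against one-hot targets, and the gradient with respect to the output-layer pre-activations. The function names (`softmax`, `one_hot`, `cross_entropy`) and the example `logits` are assumptions made for this example, not part of the question's code. Using it would mean giving the network three output units instead of one.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise maximum for numerical stability
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def one_hot(labels, n_classes):
    # Row i has a 1 in column labels[i] and 0 elsewhere
    encoded = np.zeros((labels.shape[0], n_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

def cross_entropy(probs, targets, eps=1e-12):
    # Mean over samples of -sum_k t_k * log(p_k); eps guards against log(0)
    return -np.mean(np.sum(targets * np.log(probs + eps), axis=1))

# Hypothetical output-layer pre-activations for 4 samples and 3 classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5, 0.3],
                   [-0.5, 0.2, 3.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 1, 2, 2])

probs = softmax(logits)
targets = one_hot(labels, 3)
loss = cross_entropy(probs, targets)

# For softmax followed by cross-entropy, the gradient of the mean loss
# with respect to the logits is (probs - targets) / n_samples
grad_logits = (probs - targets) / labels.shape[0]

print("probabilities:\n", probs)
print("loss:", loss)
```

Unlike the sigmoid-plus-MSE gradient, `grad_logits` contains no sigmoid-derivative factor, so it does not vanish while a prediction is confidently wrong.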

    If you want to keep using MSE and sigmoid units for classification, you should consider predicting only two classes at a time, in a one-vs-one or one-vs-all fashion.
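A minimal sketch of the one-vs-all variant follows, assuming one binary sigmoid classifier is trained per class and the class with the highest output wins. The names `binary_targets` and `predict_one_vs_all` and the example `scores` are hypothetical. The labels use the 333/333/334 split from the question.

```python
import numpy as np

# Labels 0, 1, 2 laid out as in the question
y = np.concatenate([np.zeros(333), np.ones(333), np.full(334, 2.0)]).astype(int)
classes = np.unique(y)

# One binary target vector per class: 1 where the sample belongs to it, else 0.
# Each vector stays inside the sigmoid's range, so MSE can actually reach it.
binary_targets = {int(c): (y == c).astype(float) for c in classes}

def predict_one_vs_all(scores):
    """scores has shape (n_samples, n_classes): one sigmoid output per classifier."""
    return np.argmax(scores, axis=1)

# Hypothetical outputs of the three binary classifiers for 3 samples
scores = np.array([[0.9, 0.2, 0.1],
                   [0.3, 0.6, 0.4],
                   [0.1, 0.3, 0.8]])
predicted = predict_one_vs_all(scores)
print(predicted)
```

Each classifier could be an instance of the question's `NNet` trained on `binary_targets[c]`, which is one possible design rather than the only one.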

    Anyway, if you only do two-class classification by rounding the output to 0 or 1, it seems to work. The cost decreases and the accuracy rises (quickly modified code):

    [Figures: Cost, Accuracy]