Tags: python, machine-learning, neural-network, logistic-regression, gradient-descent

Gradient checking works for binary classification but fails for multi-class


I've built a logistic regression model for binary classification on the Iris dataset (just two labels). The model achieved good performance on all metrics and also passed the gradient check as described by Andrew Ng. But when I change the output activation from sigmoid to softmax and make the model suitable for multi-class classification, it fails the gradient check, even though the performance metrics are still good.

The same pattern holds for a deep neural network: my NumPy implementation passes the gradient check for binary classification but fails for multi-class.

Logistic Regression (Binary):

I chose a row-major layout for my features (rows = samples, columns = features) rather than a column-major one, just to make the code intuitive to understand and debug.

Dimensions: X = (100, 4); after adding the bias column, X_b = (100, 5) and Weights = (5, 1); y = (100, 1)
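
To make the row-major layout concrete, here is a shapes-only sketch (variable names are illustrative, not from the script below):

    import numpy as np

    X = np.zeros((100, 4))             # 100 samples as rows, 4 features as columns
    X_b = np.c_[np.ones((100, 1)), X]  # (100, 5) after prepending the bias column
    W = np.zeros((5, 1))               # 4 feature weights + 1 bias weight
    logits = X_b @ W                   # (100, 1): one logit per sample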

Algorithm Implementation Code (binary):

import numpy as np

from sklearn.datasets import load_iris, load_digits
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import log_loss
from keras.losses import CategoricalCrossentropy
from scipy.special import softmax


def sigmoid(x):
    # Equivalent to 1 / (1 + exp(-x))
    return np.exp(x) / (1 + np.exp(x))


dataset = load_iris()
lb = LabelBinarizer()  # not used for binary classification

X = dataset.data
y = dataset.target

# Keep only the first two classes (first 100 rows) for binary classification
data = np.concatenate((X[:100], y[:100].reshape(-1, 1)), axis=1)
np.random.shuffle(data)

X_train = data[:, :-1]
X_b = np.c_[np.ones((X_train.shape[0], 1)), X_train]  # prepend a bias column

y_train = data[:, -1].reshape(-1, 1)

num_unique_labels = len(np.unique(y_train))

# (5, 1) weight vector (bias included), scaled random initialization
Weights = np.random.randn(X_train.shape[1] + 1, num_unique_labels - 1) * np.sqrt(1. / (X_train.shape[1] + 1))

m = X_b.shape[0]

yhat = sigmoid(np.dot(X_b, Weights))
loss = log_loss(y_train, yhat)

error = yhat - y_train

# Analytic gradient of the mean binary cross-entropy w.r.t. the weights
gradient = (1. / m) * (X_b.T.dot(error))
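
For reference, the closed-form gradient used above follows from differentiating the mean binary cross-entropy through the sigmoid; the sigmoid's derivative cancels against the loss term, leaving only the prediction error:

    \frac{\partial J}{\partial W} = \frac{1}{m} X_b^\top (\hat{y} - y)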

Gradient Checking (binary):

grad = gradient.reshape(-1, 1)
Weights_delta = Weights.reshape(-1, 1)
num_params = Weights_delta.shape[0]

JP = np.zeros((num_params, 1))
JM = np.zeros((num_params, 1))
J_app = np.zeros((num_params, 1))

ep = 1e-7

for i in range(num_params):
    # Nudge one parameter up by ep and recompute the loss
    Weights_add = np.copy(Weights_delta)
    Weights_add[i] = Weights_add[i] + ep
    Z_add = sigmoid(np.dot(X_b, Weights_add.reshape(X_train.shape[1] + 1, num_unique_labels - 1)))
    JP[i] = log_loss(y_train, Z_add)

    # Nudge the same parameter down by ep and recompute the loss
    Weights_sub = np.copy(Weights_delta)
    Weights_sub[i] = Weights_sub[i] - ep
    Z_sub = sigmoid(np.dot(X_b, Weights_sub.reshape(X_train.shape[1] + 1, num_unique_labels - 1)))
    JM[i] = log_loss(y_train, Z_sub)

    # Central-difference approximation of the i-th partial derivative
    J_app[i] = (JP[i] - JM[i]) / (2 * ep)

num = np.linalg.norm(grad - J_app)

denom = np.linalg.norm(grad) + np.linalg.norm(J_app)

num/denom
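
In formula form, the loop computes a two-sided (central) finite-difference estimate of each partial derivative, and the final line measures the normalized distance between the analytic and approximate gradients:

    g_{\text{approx},i} = \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2\epsilon},
    \qquad
    \text{diff} = \frac{\lVert g - g_{\text{approx}} \rVert_2}{\lVert g \rVert_2 + \lVert g_{\text{approx}} \rVert_2}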

This yields num/denom = 8.244172628899919e-10, which confirms that the gradient calculation is correct. For the multi-class version, I used the same gradient calculation as above but changed the output activation to softmax (also taken from scipy), with axis = 1 so that the softmax is computed per sample, since mine is a row-major implementation.

Algorithm Implementation Code (multi-class):

*Dimensions: X = (150, 4); after adding the bias column, X_b = (150, 5) and Weights = (5, 3); y = (150, 3)*

import numpy as np

from sklearn.datasets import load_iris, load_digits
from sklearn.preprocessing import LabelBinarizer
from keras.losses import CategoricalCrossentropy
from scipy.special import softmax

CCE = CategoricalCrossentropy()


dataset = load_iris()
lb = LabelBinarizer()

X = dataset.data
y = dataset.target

lb.fit(y)

data = np.concatenate((X, y.reshape(-1, 1)), axis=1)
np.random.shuffle(data)

X_train = data[:, :-1]
X_b = np.c_[np.ones((X_train.shape[0], 1)), X_train]  # prepend a bias column

# One-hot encode the labels: (150,) -> (150, 3)
y_train = lb.transform(data[:, -1]).reshape(-1, 3)

num_unique_labels = len(np.unique(y))

# (5, 3) weight matrix (bias included), scaled random initialization
Weights = np.random.randn(X_train.shape[1] + 1, num_unique_labels) * np.sqrt(1. / (X_train.shape[1] + 1))

m = X_b.shape[0]

# Row-wise softmax: each row of yhat sums to 1
yhat = softmax(np.dot(X_b, Weights), axis=1)
cce_loss = CCE(y_train, yhat).numpy()

error = yhat - y_train

# Same gradient expression as in the binary case
gradient = (1. / m) * (X_b.T.dot(error))
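
Reusing the binary error term here is deliberate: for softmax outputs with one-hot targets, the categorical cross-entropy has the same closed-form gradient as the sigmoid/binary case:

    \frac{\partial J}{\partial W} = \frac{1}{m} X_b^\top (\hat{y} - y)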

Gradient Checking (multi-class):

grad = gradient.reshape(-1, 1)
Weights_delta = Weights.reshape(-1, 1)
num_params = Weights_delta.shape[0]

JP = np.zeros((num_params, 1))
JM = np.zeros((num_params, 1))
J_app = np.zeros((num_params, 1))

ep = 1e-7

for i in range(num_params):
    # Nudge one parameter up by ep and recompute the loss
    Weights_add = np.copy(Weights_delta)
    Weights_add[i] = Weights_add[i] + ep
    Z_add = softmax(np.dot(X_b, Weights_add.reshape(X_train.shape[1] + 1, num_unique_labels)), axis=1)
    JP[i] = CCE(y_train, Z_add).numpy()

    # Nudge the same parameter down by ep and recompute the loss
    Weights_sub = np.copy(Weights_delta)
    Weights_sub[i] = Weights_sub[i] - ep
    Z_sub = softmax(np.dot(X_b, Weights_sub.reshape(X_train.shape[1] + 1, num_unique_labels)), axis=1)
    JM[i] = CCE(y_train, Z_sub).numpy()

    # Central-difference approximation of the i-th partial derivative
    J_app[i] = (JP[i] - JM[i]) / (2 * ep)

num = np.linalg.norm(grad - J_app)

denom = np.linalg.norm(grad) + np.linalg.norm(J_app)

num/denom

This resulted in a value of 0.3345, which is clearly an unacceptable difference. It got me wondering whether I could trust my gradient-checking code for the binary case in the first place. I've also tested this logistic regression code (with the same gradient calculation) on the digits dataset, and the performance was again really good (>95% accuracy, precision, recall). What really puzzles me is that the model fails the gradient check even though its performance is good. The same applies to my neural network, as mentioned earlier (passes for binary, fails for multi-class).

I even tried the code Andrew Ng offers as part of his Coursera course: even that code passes for binary and fails for multi-class. I can't figure out where my code has a bug, and if it does have a minor bug, how could it pass in the binary case?

I looked at these SO posts, but I feel they address a different issue than mine:

  1. Gradient checking in backpropogation
  2. Checking the gradients when doing ...
  3. problem with ann back-propagation ..

Here's what I'm looking for:

  1. Suggestions/corrections on whether my gradient calculation and gradient-checking code for the binary case are accurate.

  2. Suggestions/general directions on where I could be going wrong with the multi-class implementation.

What will you get: (:P)

The gratitude of a 20-something tech guy who believes every documentation page is poorly written :)

Update: Corrected some typos and added more lines of code as suggested by Alex. I also realized that my approximate gradient values (J_app) in the multi-class case are pretty large (around 1e+2), whereas my analytic gradients (gradient), because they are multiplied by the factor 1./m, come out around 1e-1 to 1e-2.

This difference in scale between the approximate and the analytic gradients explains why the final value came out so large (0.3345). What I haven't been able to figure out is how to fix this seemingly obvious bug.
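
As a general debugging aid (my own addition, not part of the original scripts), comparing the two gradients elementwise separates a scaling bug from a wrong formula: a near-constant ratio points to a missing or extra factor such as 1./m, while entry-by-entry disagreement points to the gradient expression itself:

    # Diagnostic sketch: inspect the elementwise ratio of analytic to approximate gradients
    ratio = grad / J_app
    print(ratio.min(), ratio.mean(), ratio.max())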


Solution

  • All your computations seem to be correct. The reason the gradient check is failing is that CategoricalCrossentropy from keras runs in single precision by default, so the small weight perturbations don't produce enough precision in the resulting loss differences. Add the following lines at the beginning of your script and you will get num/denom usually around 1e-9:

    import keras
    keras.backend.set_floatx('float64')
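
Alternatively (a suggestion beyond the accepted answer), you can sidestep the framework's float32 default entirely by computing the categorical cross-entropy in plain NumPy, which works in float64; the clipping constant below is an arbitrary safeguard against log(0):

    import numpy as np

    def cce_float64(y_true, y_pred, eps=1e-12):
        # Clip predictions away from 0 and 1 so log() stays finite
        y_pred = np.clip(y_pred, eps, 1. - eps)
        # Mean over samples of the per-sample cross-entropy
        return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))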