I am trying to solve a multiclass classification
problem (3 labels) with softmax regression.
This is my first rough implementation with gradient descent and backpropagation (no regularization and no advanced optimization algorithm), containing only one layer.
Also, when the learning rate is large (>0.003) the cost becomes NaN;
when I decrease the learning rate, the cost function works fine.
Can anyone explain what I'm doing wrong?
import numpy as np

# X is (13,177) dimensional
# y is (3,177) dimensional with label 0/1
m = X.shape[1]                              # 177 training examples
W = np.random.randn(3, X.shape[0]) * 0.01   # weight matrix, (3,13)
b = 0                                       # scalar bias
cost = 0
alpha = 0.0001  # seems too small to me but for bigger values cost becomes NaN

for i in range(100):
    # forward pass: linear scores, then softmax over the 3 classes
    Z = np.dot(W, X) + b
    t = np.exp(Z)
    add = np.sum(t, axis=0)
    A = t / add

    # cross-entropy cost
    loss = -np.multiply(y, np.log(A))
    cost += np.sum(loss) / m
    print('cost after iteration', i + 1, 'is', cost)

    # backward pass and gradient descent update
    dZ = A - y
    dW = np.dot(dZ, X.T) / m
    db = np.sum(dZ) / m
    W = W - alpha * dW
    b = b - alpha * db
This is what I get:
cost after iteration 1 is 6.661713420377916
cost after iteration 2 is 23.58974203186562
cost after iteration 3 is 52.75811642877174
... cost keeps increasing up to iteration 100 ...
cost after iteration 99 is 1413.555298639879
cost after iteration 100 is 1429.6533630169406
Well, after some time I figured it out.
First of all, the cost was increasing because of this line:
cost += np.sum(loss)/m
The plus sign is not needed here: it keeps adding the cost computed in every previous iteration to the current one, which is not what we want. That kind of accumulation is only needed in mini-batch gradient descent, where you sum the cost over the mini-batches to report one cost per epoch.
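For reference, a minimal sketch of the two patterns (the function names here are just illustrative, not from my code):

import numpy as np

# (full-)batch gradient descent: compute the cost of this iteration only,
# so no accumulation with '+=' is needed
def iteration_cost(A, y, m):
    return np.sum(-y * np.log(A)) / m

# mini-batch gradient descent: here you do accumulate, summing the cost of
# each mini-batch and reporting the average once per epoch
def epoch_cost(batch_costs):
    # batch_costs: list with the cost of every mini-batch in one epoch
    return sum(batch_costs) / len(batch_costs)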
Secondly, the learning rate was too big for this problem, which is why the cost kept overshooting the minimum and eventually became NaN.
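Here is a quick sketch (toy numbers, not from my actual run) of how that NaN shows up: once one of the scores in Z grows large enough, np.exp overflows to inf, the softmax division produces nan, and that nan poisons the cost.

import numpy as np

z = np.array([800.0, 1.0, 2.0])   # one score has blown up after a big update
t = np.exp(z)                     # exp(800) overflows to inf (RuntimeWarning)
a = t / np.sum(t)                 # inf / inf -> nan for the overflowed entry
print(a)                          # [nan  0.  0.] -> nan then propagates into the cost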
I also looked at my data and found that my features had very different ranges (one went from -1 to 1, another from -5000 to 5000), which was preventing the algorithm from using a larger learning rate.
So I applied feature scaling:
var = np.var(X, axis=1, keepdims=True)  # per-feature variance, shape (13,1)
X = X / var
Now the learning rate can be much bigger (it works up to about 0.001).
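Putting both fixes together, a minimal corrected version of the loop could look like this (same X and y shapes as above; the learning rate of 0.001 and 100 iterations are just what I assume would work):

import numpy as np

# X is (13,177), y is (3,177) with labels 0/1, as in the question
var = np.var(X, axis=1, keepdims=True)      # per-feature variance, (13,1)
X = X / var                                 # feature scaling

m = X.shape[1]
W = np.random.randn(3, X.shape[0]) * 0.01
b = 0
alpha = 0.001                               # possible now that features are scaled

for i in range(100):
    Z = np.dot(W, X) + b
    t = np.exp(Z)
    A = t / np.sum(t, axis=0)

    cost = np.sum(-y * np.log(A)) / m       # '=' instead of '+=': cost of this iteration only
    print('cost after iteration', i + 1, 'is', cost)

    dZ = A - y
    dW = np.dot(dZ, X.T) / m
    db = np.sum(dZ) / m
    W = W - alpha * dW
    b = b - alpha * db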