machine-learning deep-learning gradient-descent calculus softmax

Understanding softmax classifier

I am trying to understand a simple implementation of Softmax classifier from this link - CS231n - Convolutional Neural Networks for Visual Recognition. Here they implemented a simple softmax classifier. In the example of Softmax Classifier on the link, there are random 300 points on a 2D space and a label associated with them. The softmax classifier will learn which point belong to which class.

Here is the full code of the softmax classifier. Or you can see the link I have provided.

# initialize parameters randomly
W = 0.01 * np.random.randn(D,K)
b = np.zeros((1,K))

# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength

# gradient descent loop
num_examples = X.shape[0]
for i in xrange(200):

   # evaluate class scores, [N x K]
   scores = np.dot(X, W) + b 

   # compute the class probabilities
   exp_scores = np.exp(scores)
   probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]

   # compute the loss: average cross-entropy loss and regularization
   corect_logprobs = -np.log(probs[range(num_examples),y])
   data_loss = np.sum(corect_logprobs)/num_examples
   reg_loss = 0.5*reg*np.sum(W*W)
   loss = data_loss + reg_loss
   if i % 10 == 0:
   print "iteration %d: loss %f" % (i, loss)

   # compute the gradient on scores
   dscores = probs
   dscores[range(num_examples),y] -= 1
   dscores /= num_examples

   # backpropate the gradient to the parameters (W,b)
   dW = np.dot(X.T, dscores)
   db = np.sum(dscores, axis=0, keepdims=True)

   dW += reg*W # regularization gradient

   # perform a parameter update
   W += -step_size * dW
   b += -step_size * db

I cant understand how they computed the gradient here. I assume that they computed the gradient here -

   dW = np.dot(X.T, dscores)
   db = np.sum(dscores, axis=0, keepdims=True)
   dW += reg*W # regularization gradient

But How? I mean Why gradient of dW is np.dot(X.T, dscores)? And Why the gradient of db is np.sum(dscores, axis=0, keepdims=True)?? So how they computed the gradient on weight and bias? Also why they computed the regularization gradient?

I am just starting to learn about convolutional neural networks and deep learning. And I heard that CS231n - Convolutional Neural Networks for Visual Recognition is a good starting place for that. I did not know where to place deep learning related post. So, i placed them on stackoverflow. If there is any place to post questions related to deep learning please let me know.

Solution

The gradients start being computed here:

# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples

First, this sets dscores equal to the probabilities computed by the softmax function. Then, it subtracts 1 from the probabilities computed for the correct classes in the second line, and then it divides by the number of training samples in the third line.

Why does it subtract 1? Because you want the probabilities of the correct labels to be 1, ideally. So it subtracts what it should predict from what it actually predicts: if it predicts something close to 1, the subtraction will be a large negative number (close to zero), so the gradient will be small, because you're close to a solution. Otherwise, it will be a small negative number (far from zero), so the gradient will be bigger, and you'll take larger steps towards the solution.

Your activation function is simply w*x + b. Its derivative with respect to w is x, which is why dW is the dot product between x and the gradient of the scores / output layer.

The derivative of w*x + b with respect to b is 1, which is why you simply sum dscores when backpropagating.