tensorflow machine-learning deep-learning calculus

Backpropagation of the softmax function in TensorFlow

I'm trying to understand how backpropagation through tf.nn.softmax() works in TensorFlow so that I can use it in my project. To do that, I implemented the following simple network to check whether the derivatives TensorFlow computes for the softmax layer match the ones I derived mathematically.

import tensorflow as tf  # TensorFlow 1.x API (placeholders and sessions)

x=tf.placeholder(tf.float32,[5])
y_true = tf.placeholder(tf.float32,[5])

w=tf.Variable(tf.zeros([5]))

logits = tf.multiply(x,w)

y = tf.nn.softmax(logits)

loss = tf.pow(y - y_true,2)

cost = tf.reduce_mean(loss)

train_x = [1.0,2.0,3.0,4.0,5.0]
train_y = [3.0,4.0,5.0,6.0,7.0]

sess = tf.Session()
sess.run(tf.global_variables_initializer())  # replaces the deprecated tf.initialize_all_variables()

# Print the current value of every layer for the training example.
def get_val():
    print('LOSS  : ', sess.run(loss,feed_dict={x:train_x,y_true:train_y}))
    print('COST  : ', sess.run(cost,feed_dict={x:train_x,y_true:train_y}))
    print('Y     : ', sess.run(y,feed_dict={x:train_x,y_true:train_y}))
    print('LOGITS: ', sess.run(logits,feed_dict={x:train_x,y_true:train_y}))
    print('W     : ', sess.run(w,feed_dict={x:train_x,y_true:train_y}))

# before training
get_val()

# plain gradient descent optimizer used to update the weights
optimizer=tf.train.GradientDescentOptimizer(learning_rate=1).minimize(cost)

# run a single training step
sess.run(optimizer,feed_dict={x:train_x,y_true:train_y})

# after training
get_val()

Here are the values printed by the get_val() function.

**Before Training**
LOSS  :  [ 7.8399997, 14.44,      23.04,      33.640003,  46.24     ]
COST  :  25.040003
Y     :  [0.2, 0.2, 0.2, 0.2, 0.2]
LOGITS:  [0., 0., 0., 0., 0.]
W     :  [0., 0., 0., 0., 0.]

**After Training**
LOSS  :  [ 8.916067, 15.904554, 24.835724, 35.293324, 37.2296  ]
COST  :  24.435854
Y     :  [0.01402173, 0.01194853, 0.01645466, 0.0591815,  0.8983936 ]
LOGITS:  [-0.16000001, -0.32000008,  0.,         1.2800003,   3.9999998 ]
W     :  [-0.16000001, -0.16000004,  0.,         0.32000008,  0.79999995]
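The printed values can be cross-checked without TensorFlow. The following is a minimal sketch in plain Python, assuming the same forward pass as the network above (element-wise product, softmax, squared error, mean). The helper names `softmax` and `forward` are introduced here for illustration only, and the "after training" weights are copied from the printed W, rounded.

```python
import math

def softmax(z):
    # Subtract the maximum for numerical stability; the result is unchanged.
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, w, y_true):
    # Same steps as the graph: logits -> softmax -> squared error -> mean.
    logits = [xi * wi for xi, wi in zip(x, w)]
    y = softmax(logits)
    loss = [(yi - ti) ** 2 for yi, ti in zip(y, y_true)]
    cost = sum(loss) / len(loss)
    return logits, y, loss, cost

train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.0, 4.0, 5.0, 6.0, 7.0]

# Before training: every weight is zero, so every logit is zero.
logits0, y0, loss0, cost0 = forward(train_x, [0.0] * 5, train_y)

# After training: weights taken from the printed W (rounded).
w1 = [-0.16, -0.16, 0.0, 0.32, 0.8]
logits1, y1, loss1, cost1 = forward(train_x, w1, train_y)

print('Y before:', y0, 'COST before:', cost0)
print('Y after :', y1, 'COST after :', cost1)
```

Both sets of numbers agree with the TensorFlow output to within rounding, so the forward pass is not where the discrepancy comes from.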

y_true = train_y
m = 5
alpha = 1 # learning rate
x = train_x

Using this function with the values above, I calculated the weights after the first training step.

These are the weight values I got using this function: [-0.1792, -0.4864, -0.9216, -1.4848, -2.176 ]

But they do not match the weight values I got after training the TensorFlow network, which are: [-0.16000001, -0.16000004, 0., 0.32000008, 0.79999995]

Can anyone explain why my function did not give the weight values I expected?

Solution
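
A sketch of the derivation, assuming the network from the question: logits $z_j = w_j x_j$, outputs $y = \mathrm{softmax}(z)$, and cost equal to the mean squared error over $m = 5$ outputs. The symbols $z_j$, $C$ and $\delta_{ij}$ (the Kronecker delta) are notation introduced here; $y^{\text{true}}$, $m$, $\alpha$ and $x$ are as defined in the question. The key point is that every softmax output depends on every logit, so the chain rule has to sum over all outputs.

```latex
\begin{align}
C &= \frac{1}{m}\sum_{i=1}^{m}\bigl(y_i - y^{\text{true}}_i\bigr)^2,
\qquad
\frac{\partial C}{\partial y_i} = \frac{2}{m}\bigl(y_i - y^{\text{true}}_i\bigr) \\
\frac{\partial y_i}{\partial z_j} &= y_i\,(\delta_{ij} - y_j) \\
\frac{\partial C}{\partial z_j}
  &= \sum_{i=1}^{m}\frac{\partial C}{\partial y_i}\,\frac{\partial y_i}{\partial z_j}
   = y_j\Bigl(\frac{\partial C}{\partial y_j} - \sum_{i=1}^{m}\frac{\partial C}{\partial y_i}\,y_i\Bigr) \\
\frac{\partial C}{\partial w_j}
  &= x_j\,\frac{\partial C}{\partial z_j}
   = \frac{2}{m}\,x_j\,y_j\Bigl[\bigl(y_j - y^{\text{true}}_j\bigr)
     - \sum_{i=1}^{m}\bigl(y_i - y^{\text{true}}_i\bigr)\,y_i\Bigr] \\
w_j &\leftarrow w_j - \alpha\,\frac{\partial C}{\partial w_j}
\end{align}
```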

The equation above is the derived expression for the derivatives of the cost with respect to the weights. The weights can then be updated accordingly using the gradient descent optimizer.
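
As a numerical check, here is a minimal sketch in plain Python that computes the weight gradient in two ways: with the full softmax Jacobian (including the cross terms between outputs), and with only its diagonal term y_j(1 - y_j). The function `grad_w` and its `full_jacobian` flag are hypothetical names introduced for this illustration. Treating the diagonal-only version as the formula used in the question is an assumption, since that formula is not shown.

```python
import math

def softmax(z):
    # Subtract the maximum for numerical stability; the result is unchanged.
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def grad_w(x, w, y_true, full_jacobian=True):
    """Gradient of the mean squared error cost with respect to w."""
    m = len(x)
    y = softmax([xi * wi for xi, wi in zip(x, w)])
    # dC/dy_i = 2 * (y_i - y_true_i) / m
    dy = [2.0 * (yi - ti) / m for yi, ti in zip(y, y_true)]
    if full_jacobian:
        # dC/dz_j = y_j * (dC/dy_j - sum_i dC/dy_i * y_i)
        s = sum(di * yi for di, yi in zip(dy, y))
        dz = [yi * (di - s) for di, yi in zip(dy, y)]
    else:
        # Diagonal term only: ignores how z_j affects the other outputs.
        dz = [di * yi * (1.0 - yi) for di, yi in zip(dy, y)]
    # z_j = w_j * x_j, so dC/dw_j = x_j * dC/dz_j
    return [dzi * xi for dzi, xi in zip(dz, x)]

train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.0, 4.0, 5.0, 6.0, 7.0]
alpha = 1.0          # learning rate, as in the question
w0 = [0.0] * 5       # initial weights

grad_full = grad_w(train_x, w0, train_y, full_jacobian=True)
grad_diag = grad_w(train_x, w0, train_y, full_jacobian=False)

# One gradient descent step with the full-Jacobian gradient.
w_full = [wi - alpha * gi for wi, gi in zip(w0, grad_full)]

print('full Jacobian, updated w:', w_full)
print('diagonal only, gradient :', grad_diag)
```

With the full Jacobian, the updated weights come out at approximately [-0.16, -0.16, 0, 0.32, 0.8], which agrees with TensorFlow. The diagonal-only gradient comes out at approximately [-0.1792, -0.4864, -0.9216, -1.4848, -2.176], the same numbers quoted in the question, which suggests the hand-derived function left out the cross terms of the softmax derivative.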