Let's say I have two matrices, tf_t (shape: 5x3) and tf_b (shape: 3x3). I compute y_tf = tf.matmul(tf_t, tf_b) and then the gradient of y_tf with respect to tf_t using the tf.GradientTape API:
import tensorflow as tf

mat = [[0.8363, 0.4719, 0.9783],
       [0.3379, 0.6548, 0.3835],
       [0.7846, 0.9173, 0.2393],
       [0.5418, 0.3875, 0.4276],
       [0.0948, 0.2637, 0.8039]]
another_mat = [[ 0.43842274, -0.53439844, -0.07710262],
               [ 1.5658046 , -0.1012345 , -0.2744976 ],
               [ 1.4204658 ,  1.2609464 , -0.43640924]]
tf_t = tf.Variable(tf.convert_to_tensor(mat))
tf_b = tf.Variable(tf.convert_to_tensor(another_mat))

# persistent=True so the tape can be queried more than once below
with tf.GradientTape(persistent=True) as tape:
    tape.watch(tf_t)  # redundant for a Variable, but harmless
    y_tf = tf.matmul(tf_t, tf_b)
    y_t0 = y_tf[0, 0]

dy_dx = tape.gradient(y_tf, tf_t)
print(dy_dx)
I am getting the matrix below as dy_dx:
tf.Tensor(
[[-0.17307831 1.1900724 2.245003 ]
[-0.17307831 1.1900724 2.245003 ]
[-0.17307831 1.1900724 2.245003 ]
[-0.17307831 1.1900724 2.245003 ]
[-0.17307831 1.1900724 2.245003 ]], shape=(5, 3), dtype=float32)
The above matrix does not look right to me, because for the element y_tf[0,0] we have:
Note: y_tf[0,0] = tf_t[0,0]*tf_b[0,0] + tf_t[0,1]*tf_b[1,0] + tf_t[0,2]*tf_b[2,0]
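As a quick sanity check (assuming the snippet above has been run), that element can be recomputed by hand:

# recompute y_tf[0, 0] directly from the definition of matrix multiplication
manual_y00 = sum(mat[0][k] * another_mat[k][0] for k in range(3))
print(manual_y00, float(y_tf[0, 0]))  # the two values should match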
If I perform
tape.gradient(y_t0, tf_t)
I get the following matrix:
tf.Tensor(
[[0.43842274 1.5658046 1.4204658 ]
[0. 0. 0. ]
[0. 0. 0. ]
[0. 0. 0. ]
[0. 0. 0. ]], shape=(5, 3), dtype=float32)
The 1st row above is the 1st column of matrix tf_b, which makes sense given how matrix multiplication works, and if I were to sum up those numbers I would get 3.424693.
However, the dy_dx result has -0.17307831 as its first element dy_dx[0,0], which is the sum of the 1st row of tf_b (i.e. sum(tf_b[0,:]))!
So can anyone please explain how the gradient of y_tf[0,0] with respect to tf_t is reduced to -0.17307831 and not 3.424693?
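For reference, the two candidate values can be computed directly (again assuming the snippet above has been run):

print(float(tf.reduce_sum(tf_b[:, 0])))  # sum of 1st column of tf_b, approx.  3.424693
print(float(tf.reduce_sum(tf_b[0, :])))  # sum of 1st row of tf_b,   approx. -0.173078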
The question could appear similar to this one, but the answer I'm looking for is not addressed there with a clear picture.
The key notion to understand here is that tape.gradient (like tf.gradients) computes the gradient of the sum of the output(s) with respect to the input(s). That is, dy_dx represents the scale by which the sum of all elements of y_tf changes as each element of tf_t changes.
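You can confirm this by comparing dy_dx against the gradient of an explicit tf.reduce_sum of the output; a minimal check, assuming the variables from the question's snippet are still in scope (tape2 is just a fresh tape introduced for this check):

with tf.GradientTape() as tape2:
    y_sum = tf.reduce_sum(tf.matmul(tf_t, tf_b))  # scalar: sum of all elements of y_tf
grad_of_sum = tape2.gradient(y_sum, tf_t)
print(tf.reduce_all(tf.abs(grad_of_sum - dy_dx) < 1e-6))  # tf.Tensor(True, ...)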
So, if you take tf_t[0, 0], that value is used to compute y_tf[0, 0], y_tf[0, 1] and y_tf[0, 2], in each case with coefficients tf_b[0, 0], tf_b[0, 1] and tf_b[0, 2]. So, if I increased tf_t[0, 0] by one, the sum of y_tf would increase by tf_b[0, 0] + tf_b[0, 1] + tf_b[0, 2], which is the value of dy_dx[0, 0]. Continuing with the same reasoning, each value tf_t[i, j] is in fact multiplied by all the values in tf_b[j, :], so dy_dx is a repetition of the sum of the rows of tf_b.
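In other words, every row of dy_dx is just the vector of row-sums of tf_b; a small check, assuming the question's variables:

row_sums = tf.reduce_sum(tf_b, axis=1)         # sum of each row of tf_b, shape (3,)
expected = tf.tile(row_sums[None, :], [5, 1])  # repeated for each of the 5 rows of tf_t
print(tf.reduce_all(tf.abs(expected - dy_dx) < 1e-6))  # tf.Tensor(True, ...)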
When you compute the gradient of y_t0 with respect to tf_t, the "sum of the output" is just y_t0 itself (a scalar), so a change in tf_t[0, 0] changes the result by a factor of tf_b[0, 0], and that is the value of the gradient in that case.
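If what you are actually after is the full set of per-element derivatives (each y_tf[i, k] with respect to each tf_t[m, n], with no summing), you can use the tape's jacobian method instead of gradient; a sketch, using a fresh tape tape3:

with tf.GradientTape() as tape3:
    y_again = tf.matmul(tf_t, tf_b)
jac = tape3.jacobian(y_again, tf_t)  # shape (5, 3, 5, 3)
print(jac[0, 0])  # d y_tf[0, 0] / d tf_t -- matches tape.gradient(y_t0, tf_t) above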