Quick question as I'm kind of confused here.
Let's say we have a simple graph:
import tensorflow as tf

a = tf.Variable(tf.truncated_normal(shape=[200, 1], mean=0., stddev=.5))
b = tf.Variable(tf.truncated_normal(shape=[200, 100], mean=0., stddev=.5))
add = a+b
add
<tf.Tensor 'add:0' shape=(200, 100) dtype=float32> #shape is because of broadcasting
So I've got a node that takes in 2 tensors and produces 1 tensor as output. Let's now run tf.gradients on it:
tf.gradients(add, [a, b])
[<tf.Tensor 'gradients/add_grad/Reshape:0' shape=(200, 1) dtype=float32>,
<tf.Tensor 'gradients/add_grad/Reshape_1:0' shape=(200, 100) dtype=float32>]
So we get gradients exactly in the shape of the input tensors. But... why? It's not as if there's a single scalar with respect to which we could take the partial derivatives. Shouldn't the gradients map every single value of the input tensors to every single value of the output tensor, effectively giving a 200x1x200x100 gradient tensor for input a?
This is just a simple example where every element of the output tensor depends only on one value from tensor b and one row from tensor a. However, if we did something more complicated, like running a Gaussian blur over a tensor, then the gradients would surely have to be bigger than just the input tensor.
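For example, just to illustrate what I mean, taking the gradient of a single output element already gives a pair of input-shaped tensors, so the full Jacobian would seemingly need one such pair per output element (exact op names aside, these are the shapes I'd expect):

tf.gradients(add[0, 0], [a, b])
# two tensors, shapes (200, 1) and (200, 100), almost all zeros,
# with a single 1. at a[0, 0] and at b[0, 0] once evaluated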
What am I getting wrong here?
By default, tf.gradients takes the gradient of the scalar you get by summing all elements of all the tensors you pass in as outputs (ys). In other words, tf.gradients(add, [a, b]) computes d(sum(add))/da and d(sum(add))/db, which is why each returned gradient has exactly the shape of the corresponding input. The full Jacobian you describe is never materialized; tf.gradients computes a vector-Jacobian product, and the default vector is all ones.
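You can check this numerically. Below is a rough graph-mode sketch along the lines of your snippet; the reduce_sum comparison and the session boilerplate are my additions:

import tensorflow as tf

a = tf.Variable(tf.truncated_normal(shape=[200, 1], mean=0., stddev=.5))
b = tf.Variable(tf.truncated_normal(shape=[200, 100], mean=0., stddev=.5))
add = a + b

# These two calls build the same gradients: tf.gradients(ys, xs)
# differentiates the scalar sum of all elements of ys.
grads = tf.gradients(add, [a, b])
grads_of_sum = tf.gradients(tf.reduce_sum(add), [a, b])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    (da, db), (da_sum, db_sum) = sess.run([grads, grads_of_sum])

print(da[0])      # [100.]  -- each a[i, 0] feeds 100 output elements
print(db[0, :3])  # [1. 1. 1.]  -- each b[i, j] feeds exactly one output element
print((da == da_sum).all(), (db == db_sum).all())  # True True

If you want a different weighting of the output elements, pass the grad_ys argument. And if you really need the full 200x1x200x100 Jacobian, you have to assemble it yourself, e.g. by calling tf.gradients once per output element (or once per one-hot grad_ys vector).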