Tags: python, tensorflow, gradient

Where in TensorFlow's gradients code is the sum over the elements of y performed?


I am trying to hack tf.gradients in TensorFlow so that, for a tensor y of shape (M, N) and a tensor x of shape (Q, P), it gives a gradient tensor of shape (M, N, Q, P), as one would naturally expect.

As pointed out in multiple questions on this site*, what one actually gets is a tensor of shape (Q, P), which is the gradient of the sum of the elements of y. What I can't figure out, looking through the TensorFlow code, is where that sum over the elements of y is made. Is it at the beginning or at the end? Could someone help me pinpoint the lines of code where that is done?

*
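
For concreteness, here is a minimal sketch of the behaviour described above (my own example, assuming TensorFlow 1.x-style graph mode via tf.compat.v1): tf.gradients(y, x) silently differentiates the sum of the elements of y, so the result has x's shape (Q, P) rather than (M, N, Q, P).

```python
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

x = tf.placeholder(tf.float32, shape=(4, 2))          # shape (Q, P)
y = tf.matmul(x, tf.transpose(x))                     # shape (M, N) = (4, 4)

grad = tf.gradients(y, x)[0]                          # shape (4, 2), not (4, 4, 4, 2)
grad_of_sum = tf.gradients(tf.reduce_sum(y), x)[0]    # differentiate sum(y) explicitly

with tf.Session() as sess:
    x_val = np.arange(8, dtype=np.float32).reshape(4, 2)
    g, gs = sess.run([grad, grad_of_sum], feed_dict={x: x_val})
    print(g.shape)               # (4, 2)
    print(np.allclose(g, gs))    # True: the sum over y's elements is implicit
```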


Solution

  • I've answered it here, but I'm guessing it's not very useful, because you can't use this knowledge to differentiate with respect to a non-scalar y. The scalar-y assumption is central to the design of the reverse AD algorithm, and there is no single place you can modify to support non-scalar ys. Since this confusion keeps coming up, let me go into a bit more detail about why it's non-trivial:

    First of all, how reverse AD works -- suppose we have a function f that is a composition of component functions f_i, each of which takes a vector of length n and produces a vector of length n.

    Its derivative can be expressed as a sequence of matrix multiplications: when differentiating, function composition becomes matrix multiplication of the corresponding component-function Jacobians. For f = f_k ∘ ... ∘ f_2 ∘ f_1, the whole expression is

        df/dx = J(f_k) · J(f_{k-1}) · ... · J(f_2) · J(f_1)

    where J(f_i) is the n×n Jacobian of the component function f_i.

    Note that this involves matrix-matrix products, which prove too expensive for neural networks. For example, AlexNet has about 8k activations at its convnet->fc transition layer, and doing matrix multiplies where each matrix is 8k x 8k would take too long. The trick that makes it efficient is to assume that the last function in the chain produces a scalar. Then its Jacobian is a (row) vector, and the whole thing can be rewritten in terms of vector-matrix multiplies instead of matrix-matrix multiplies.

    This product can be computed efficiently by multiplying left to right, starting from the scalar end, so that every step is a 1×n-by-n×n vector-matrix multiply instead of an n×n-by-n×n matrix-matrix multiply.
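
    As a rough sketch of this trick (plain NumPy here, not TensorFlow internals), multiplying from the scalar end keeps every intermediate result a row vector:

    ```python
    import numpy as np

    n = 300
    # Jacobian of the final, scalar-valued component: a 1-by-n row vector.
    v = np.random.randn(1, n)
    # Jacobians of the earlier components, in the order they appear in the product.
    jacobians = [np.random.randn(n, n) for _ in range(4)]

    acc = v
    for jac in jacobians:
        acc = acc @ jac          # (1, n) @ (n, n) -> (1, n): O(n^2) per step

    # Reference: form the n-by-n matrix products first (O(n^3) per step), then apply v.
    reference = v @ np.linalg.multi_dot(jacobians)
    print(np.allclose(acc, reference))   # True, but far cheaper to compute
    ```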

    You can make it even more efficient by never forming those n×n derivative matrices in the first place, and instead associating each component function with an op that computes the vector-Jacobian product implicitly. That's what TensorFlow's tf.RegisterGradient mechanism does. Here's an illustration of the "grad" function associated with a component function.

    [figure: the registered "grad" function of a component function, computing the vector-Jacobian product]
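
    As a concrete (and simplified) sketch of such a grad function, here is a toy tf.RegisterGradient registration of my own, assuming TF1-style graph mode via tf.compat.v1; the incoming `grad` argument plays the role of the vector, and the returned value is the vector-Jacobian product:

    ```python
    import tensorflow.compat.v1 as tf
    tf.disable_eager_execution()

    @tf.RegisterGradient("CustomSquareGrad")
    def _custom_square_grad(op, grad):
        # `grad` is the backpropagated value (the "vector" in the VJP).
        # For y = x**2 the Jacobian is diag(2x), so the product is elementwise.
        x = op.inputs[0]
        return grad * 2.0 * x

    g = tf.Graph()
    with g.as_default():
        x = tf.constant([1.0, 2.0, 3.0])
        # Route the gradient of tf.square through the registered grad function.
        with g.gradient_override_map({"Square": "CustomSquareGrad"}):
            y = tf.square(x)
        dydx = tf.gradients(tf.reduce_sum(y), x)[0]

    with tf.Session(graph=g) as sess:
        print(sess.run(dydx))   # [2. 4. 6.]
    ```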

    Now, this is all for vector-valued functions -- what if your functions are matrix-valued? This is a typical situation in neural networks. For example, in a layer that does a matrix multiply, the matrix you multiply by is an unknown, and it is matrix-valued. In that case, the last derivative has rank 2, and the remaining derivatives have rank 3.

    Now, to apply the chain rule you'd have to deal with extra notation, because the "×" in the chain rule now means matrix multiplication generalized to rank-3 tensors (a tensor contraction).

    However, note that we never have to do that multiplication explicitly, since we are using a grad operator. So in practice this operator takes rank-2 values and produces rank-2 values.

    [figure: the grad op in the matrix-valued case, taking a rank-2 value and producing a rank-2 value]
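
    As a concrete sketch of such a rank-2-in, rank-2-out grad op (plain NumPy, my own illustration), here is the usual gradient of a matrix-multiply layer Y = X @ W:

    ```python
    import numpy as np

    def matmul_grad(x, w, grad_y):
        """Grad op for Y = X @ W: takes the rank-2 upstream gradient dL/dY
        and returns rank-2 gradients, never forming the rank-3/rank-4 Jacobians."""
        grad_x = grad_y @ w.T      # dL/dX, same shape as X
        grad_w = x.T @ grad_y      # dL/dW, same shape as W
        return grad_x, grad_w

    x = np.random.randn(4, 3)
    w = np.random.randn(3, 5)
    grad_y = np.ones((4, 5))       # upstream gradient of the scalar loss sum(Y)
    gx, gw = matmul_grad(x, w, grad_y)
    print(gx.shape, gw.shape)      # (4, 3) (3, 5)
    ```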

    So in all of this there's an assumption that the final target is a scalar, which is what allows fully connected layers to be differentiated by passing matrices around.

    If you want to extend this to support a non-scalar y, you would need to modify the reverse AD algorithm to propagate more information. For instance, for fully connected feed-forward nets you would propagate rank-3 tensors around instead of matrices.
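
    For illustration only (a NumPy sketch of mine, not something tf.gradients produces), here is the kind of higher-rank object you would have to propagate: the full Jacobian dY/dW for Y = X @ W has shape Y.shape + W.shape, and contracting it with an upstream gradient recovers the usual rank-2 grad from the sketch above.

    ```python
    import numpy as np

    M, Q, P = 4, 3, 5
    X = np.random.randn(M, Q)
    W = np.random.randn(Q, P)
    Y = X @ W                                    # shape (M, P)

    # dY[m, p] / dW[q, r] = X[m, q] * (p == r): a rank-4 tensor of shape (M, P, Q, P).
    J = np.einsum('mq,pr->mpqr', X, np.eye(P))
    print(J.shape)                               # (4, 5, 3, 5)

    # Contracting J with an upstream gradient dL/dY gives back the usual rank-2 grad.
    grad_Y = np.random.randn(M, P)
    grad_W_from_J = np.einsum('mp,mpqr->qr', grad_Y, J)
    print(np.allclose(grad_W_from_J, X.T @ grad_Y))   # True
    ```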