Tags: python, tensorflow, keras, gradienttape

GradientTape with Keras returns 0


I've tried using GradientTape with a Keras model (simplified) as follows:

import tensorflow as tf
tf.enable_eager_execution()

input_ = tf.keras.layers.Input(shape=(28, 28))
flat = tf.keras.layers.Flatten()(input_)
output = tf.keras.layers.Dense(10, activation='softmax')(flat)
model = tf.keras.Model(input_, output)
model.compile(loss='categorical_crossentropy', optimizer='sgd')

import numpy as np
inp = tf.Variable(np.random.random((1,28,28)), dtype=tf.float32, name='input')
target = tf.constant([[1,0,0,0,0,0,0,0,0,0]], dtype=tf.float32)
with tf.GradientTape(persistent=True) as g:
    g.watch(inp)
    result = model(inp, training=False)

print(tf.reduce_max(tf.abs(g.gradient(result, inp))))

But for some random values of inp, the gradient is zero everywhere, and for the rest, the gradient magnitude is really small (<1e-7).

I've also tried this with a 3-layer MLP trained on MNIST and the results are the same, but a 1-layer linear model with no activation works (sketch below).
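
For reference, this is roughly the 1-layer linear model I mean (the names here are just for illustration, and it reuses inp from the snippet above):

input_lin = tf.keras.layers.Input(shape=(28, 28))
flat_lin = tf.keras.layers.Flatten()(input_lin)
output_lin = tf.keras.layers.Dense(10, activation=None)(flat_lin)  # no softmax
linear_model = tf.keras.Model(input_lin, output_lin)

with tf.GradientTape() as g:
    g.watch(inp)
    result = linear_model(inp, training=False)

print(tf.reduce_max(tf.abs(g.gradient(result, inp))))  # clearly > 0 here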

What's going on here?


Solution

  • You are computing gradients of the softmax output layer -- since the softmax outputs always sum to 1, it makes sense that the gradients (which, in the multi-output case, are summed/averaged over the output dimensions AFAIK) must be 0 -- the output of the layer as a whole cannot change. The cases where you get small values > 0 are numerical hiccups, I presume.
    When you remove the activation function, this constraint no longer holds and the activations can become larger (meaning gradients with magnitude > 0).
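
    To see this concretely, you can compare the gradient of the summed softmax outputs with the gradient of a single class probability. A minimal sketch, assuming the model and inp from your code are still in scope:

    with tf.GradientTape(persistent=True) as tape:
        tape.watch(inp)
        probs = model(inp, training=False)
        total = tf.reduce_sum(probs)    # always 1 by construction, so it cannot change with inp
        single = probs[:, 0]            # probability of the first class

    print(tf.reduce_max(tf.abs(tape.gradient(total, inp))))   # ~0 (numerical noise only)
    print(tf.reduce_max(tf.abs(tape.gradient(single, inp))))  # > 0 -- a meaningful gradient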

    Are you trying to use gradient descent to construct inputs that result in a very large probability for a certain class (if not, disregard this...)? @jdehesa already included a way to do this via the loss function. Note that you can do it via the softmax as well, like so:

    import tensorflow as tf
    tf.enable_eager_execution()
    
    input_ = tf.keras.layers.Input(shape=(28, 28))
    flat = tf.keras.layers.Flatten()(input_)
    output = tf.keras.layers.Dense(10, activation='softmax')(flat)
    model = tf.keras.Model(input_, output)
    model.compile(loss='categorical_crossentropy', optimizer='sgd')
    
    import numpy as np
    inp = tf.Variable(np.random.random((1,28,28)), dtype=tf.float32, name='input')   
    with tf.GradientTape(persistent=True) as g:
        g.watch(inp)
        result = model(inp, training=False)[:, 0]  # index the class inside the tape context
    
    print(tf.reduce_max(tf.abs(g.gradient(result, inp))))
    

    Note that I grab only the results in column 0, corresponding to the first class (I removed target because it's not used). This will compute gradients only for the softmax value for this class, which are meaningful.
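
    For completeness, here is a minimal sketch of the loss-based variant mentioned above (this is my assumption of the general idea -- the exact loss in the referenced answer may differ): take gradients of the crossentropy loss with respect to the input, then descend that gradient to make the target class more likely.

    target = tf.constant([[1,0,0,0,0,0,0,0,0,0]], dtype=tf.float32)  # one-hot for class 0
    with tf.GradientTape() as g:
        g.watch(inp)
        result = model(inp, training=False)
        # crossentropy against the one-hot target, computed inside the tape
        loss = tf.keras.losses.categorical_crossentropy(target, result)

    print(tf.reduce_max(tf.abs(g.gradient(loss, inp))))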

    Some caveats:

    • It's important to do the indexing inside the gradient tape context manager! If you do it outside (e.g. in the line where you call g.gradient), it will not work (no gradients).
    • You can also use gradients of the logits (pre-softmax values) instead. This is different, because softmax probabilities can be increased by making other classes less likely, whereas logits can only be increased by increasing the "score" for the class in question (see the sketch below).
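
    For the logits variant from the second caveat, one option is to rebuild the same architecture with the softmax split out into its own layer, so the pre-softmax scores are exposed as a model of their own. A sketch, reusing inp from above (the layer/model names are made up for illustration):

    input_ = tf.keras.layers.Input(shape=(28, 28))
    flat = tf.keras.layers.Flatten()(input_)
    logits = tf.keras.layers.Dense(10, activation=None)(flat)  # pre-softmax scores
    probs = tf.keras.layers.Softmax()(logits)                  # same probabilities as before
    prob_model = tf.keras.Model(input_, probs)    # equivalent to the original model
    logit_model = tf.keras.Model(input_, logits)  # use this one for logit gradients

    with tf.GradientTape() as g:
        g.watch(inp)
        class0_logit = logit_model(inp, training=False)[:, 0]  # logit of the first class

    print(tf.reduce_max(tf.abs(g.gradient(class0_logit, inp))))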