TensorFlow: use a different expression for the forward and backward pass

I have a TensorFlow expression where I want to use a different expression depending on whether I'm computing the forward or the backward (gradient) pass. Specifically, during the backward pass I want to ignore the effect of some random noise that is added to the network in the forward pass.

Here's a simplified example:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32)
y = x**2
u = tf.random_uniform(tf.shape(x), minval=0.9, maxval=1.1)
yu = y * u
z = tf.sqrt(yu)
g = tf.gradients(z, x)[0]

with tf.Session() as sess:
    yv, yuv, zv, gv = sess.run([y,yu,z,g], {x: [-2, -1, 1]})

print(yv)
print(yuv)
print(zv)
print(gv)

which outputs something like

[4. 1. 1.]
[4.1626534 0.9370764 1.0806011]
[2.0402582  0.96802706 1.0395197 ]
[-1.0201291  -0.96802706  1.0395197 ]

The last values here are the derivative of z with respect to x. I would like them to not include the multiplicative noise term u, i.e. they should consistently be [-1, -1, 1] for these input values of x.

Is there a way to do such a thing only using Python? I know I can make a custom operator in C and define a custom gradient for it, but I'd like to avoid this if possible.

Also, I'm hoping to use this as part of a Keras layer, so a Keras-based solution would be an alternative (i.e. if one could define a different expression for the forwards and backwards pass through a Keras layer). This does mean that just defining a second expression z2 = tf.sqrt(y) and calling gradients on that isn't a solution for me, though, because I don't know how I would put that in Keras (since in Keras, it will be part of a very long computational graph).

Solution

The short answer is that Sergey Ioffe's trick, which you mentioned above, only works if it is applied at the very end of the graph, right before the gradient computation.

I am assuming that you tried the following, which will not work:

yu_fixed = tf.stop_gradient(yu - y) + y
z = tf.sqrt(yu_fixed)

This still produces gradients that depend on the random noise.

To see why, let's follow along the gradient computation. Let's use s as shorthand for tf.stop_gradient. The way this works is that when TensorFlow needs to compute s(expr), it just returns expr, but when it needs to compute the gradient of s(expr), it returns 0.

We want the gradient of z = sqrt(s(yu - y) + y). Because $\frac{\partial \sqrt{f(x)}}{\partial x} = \frac{1}{2\sqrt{f(x)}} \frac{\partial f(x)}{\partial x}$, the gradient of z is a product of two factors: the derivative of the argument, in which the s() term correctly vanishes, and $\frac{1}{2\sqrt{f(x)}}$, which contains the forward value of s() itself. That second factor is evaluated at s(yu - y) + y = yu, so it is not zeroed out, and the computed derivative of z still depends on the noisy value yu. This is why the above attempt still has randomness in its gradient.
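Spelled out for this example, as a sketch that writes $yu = x^2 u$ and assumes $x \neq 0$, the chain rule gives:

$$\frac{\partial z}{\partial x} = \frac{1}{2\sqrt{s(yu - y) + y}} \left( \frac{\partial s(yu - y)}{\partial x} + \frac{\partial y}{\partial x} \right) = \frac{1}{2\sqrt{yu}} \, (0 + 2x) = \frac{x}{|x| \sqrt{u}}$$

So for x = -2 the computed gradient is $-1/\sqrt{u}$ rather than -1: the noise u has moved into the denominator instead of disappearing.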

As far as I can see, the only way to work around this is to apply Ioffe's trick as the last stage before the call to tf.gradients. In other words, if you do something like the following, you will get the expected result:

import tensorflow as tf

x = tf.placeholder(tf.float32)
y = x**2
u = tf.random_uniform(tf.shape(x), minval=0.9, maxval=1.1)
yu = y * u
z = tf.sqrt(yu)
# Forward value equals z (noisy); the gradient flows only through tf.sqrt(y).
z_fixed = tf.stop_gradient(z - tf.sqrt(y)) + tf.sqrt(y)
g = tf.gradients(z_fixed, x)[0]

with tf.Session() as sess:
    yv, yuv, zv, gv = sess.run([y,yu,z_fixed,g], {x: [-2, -1, 1]})

print(yv)
print(yuv)
print(zv)
print(gv)

Output:

[ 4.  1.  1.]
[ 3.65438652  1.07519293  0.94398856]
[ 1.91164494  1.03691506  0.97159076]
[-1. -1.  1.]
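The difference between the two placements of the trick can also be checked without TensorFlow. The following is only a minimal sketch under stated assumptions: it uses a toy forward-mode dual-number class instead of TensorFlow's reverse-mode graph (for a single scalar input both give the same derivative), and the names Dual, stop_gradient, early_trick and late_trick are made up for this illustration, not part of any library.

```python
import math


class Dual:
    """A value together with its derivative with respect to x (toy autodiff)."""

    def __init__(self, val, grad=0.0):
        self.val = val
        self.grad = grad

    @staticmethod
    def lift(other):
        # Plain numbers are constants, so their derivative is zero.
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.val - other.val, self.grad - other.grad)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule.
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)


def sqrt(a):
    root = math.sqrt(a.val)
    # d sqrt(f) = f' / (2 sqrt(f)), evaluated at the forward value of f.
    return Dual(root, a.grad / (2.0 * root))


def stop_gradient(a):
    # Hypothetical stand-in for tf.stop_gradient: same value, zero derivative.
    return Dual(a.val, 0.0)


def early_trick(x_val, u_val):
    """Trick applied to yu, before the sqrt (the variant that does not work)."""
    x = Dual(x_val, 1.0)
    y = x * x
    yu = y * u_val
    return sqrt(stop_gradient(yu - y) + y)


def late_trick(x_val, u_val):
    """Trick applied to z, as the last step (the variant that works)."""
    x = Dual(x_val, 1.0)
    y = x * x
    z = sqrt(y * u_val)
    clean = sqrt(y)
    return stop_gradient(z - clean) + clean


if __name__ == "__main__":
    for name, fn in [("early", early_trick), ("late", late_trick)]:
        out = fn(-2.0, 1.04)
        print(name, "value:", out.val, "gradient:", out.grad)
```

With x = -2 and a fixed noise value u = 1.04, both variants return the same noisy forward value, but only late_trick has a gradient of -1; early_trick yields roughly -0.98, which is $-1/\sqrt{u}$.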