Understanding tensorboard: why 12 tensors sent to optimizer?

So I made the simplest model I could (a perceptron/autoencoder) which (aside from input generation) is the following:

N = 64 * 64 * 3

def main():
    x = tf.placeholder(tf.float32, shape=(None, 64, 64, 3), name="x")

    with tf.name_scope("perceptron"):
        W = tf.Variable(tf.random_normal([N, N], stddev=1), name="W")
        b = tf.Variable(tf.random_normal([], stddev=1), name="b")
        y = tf.add(tf.matmul( tf.reshape(x, [-1,N]), W), b, name="y")
        act = tf.nn.sigmoid(y, name="sigmoid")
        yhat = tf.reshape(act, [-1, 64, 64, 3], name="yhat")

    with tf.name_scope("mse"):
        sq_error = tf.reduce_mean(np.square(x - yhat), axis=1)
        cost = tf.reduce_mean( sq_error, name="cost" )
        tf.summary.scalar("cost", cost)

    with tf.name_scope("conv_opt"): #Should just be called 'opt' here
        training_op = tf.train.AdamOptimizer(0.005).minimize(cost, name="train_op")

    with tf.device("/gpu:0"):
        config = tf.ConfigProto(allow_soft_placement=True)
        config.gpu_options.allow_growth = True
        sess = tf.Session(config=config)
        sess.run(tf.global_variables_initializer())

        logdir = "log_directory"
        if os.path.exists(logdir):
            shutil.rmtree(logdir)
        os.makedirs(logdir)

        input_gen = input.input_generator_factory(...)
        input_gen.initialize((64,64,3), 512)

        merged = tf.summary.merge_all()
        train_writer = tf.summary.FileWriter(logdir, sess.graph)

        for i in range(10):
            batch = input_gen.next_train_batch()
            summary,_ = sess.run([merged, training_op], feed_dict={x : batch})
            train_writer.add_summary(summary, i)
            print("Iteration %d completed" % (i))

if __name__ == "__main__":
    main()

This produces the following tensorboard graph. Anyway I presume the thick arrow from 'perception' to 'conv_opt' (which should probably just be called 'opt', sorry) corresponds to the back-propagating error signal, (whereas ?x64x64x3 arrows correspond to inference). But why 12 tensors? I don't see where that number comes from. I would have expected fewer, corresponding to really just to W, and b. Can someone please explain what's going on?

Solution

I think the reason is that when you add the tf.train.AdamOptimizer(0.005).minimize(cost) op, it is implicitly assumed that you optimize over all trainable variables (because you didn't specify otherwise). Therefore, you need to know the values of these variables and of all the intermediate tensors which take part in the calculation of cost, including the gradients (which are tensors too and are implicitly added to the computational graph). Now lets count the variables and tensors from perceptron:

W
b
tf.reshape(x, [-1,N])
tf.matmul( ..., W)
its gradient with respect to the first argument.
its gradient with respect to the second argument.
tf.add(..., b, name="y")
its gradient with respect to the first argument.
its gradient with respect to the second argument.
tf.nn.sigmoid(y, name="sigmoid")
its gradient.
tf.reshape(act, [-1, 64, 64, 3], name="yhat")

I'm not actually 100% sure that this is how the accounting is done, but you get the idea of where the number 12 could have come from.

Just as an exercise, we can see that this type of accounting also explains where the number 9 comes from in your chart:

x - yhat
its gradient with respect to the first argument
its gradient with respect to the second argument
np.square(...)
its gradient
tf.reduce_mean(..., axis=1)
its gradient
tf.reduce_mean( sq_error, name="cost" )
its gradient