Tags: python, tensorflow, tf-slim

Why does `optimizer.minimize()` not return loss with `tf.slim.learning.train()`?


I'm using tf-slim to fine-tune a network, VGG16. I'd like to manually manipulate the gradients by applying a different learning rate to the last layer, but when I try opt.minimize(), or tf.gradients() together with opt.apply_gradients(), I get a None value for the loss in the summary reporting.
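For reference, this is the sort of gradient manipulation I mean (a sketch only; the vgg_16/fc8 scope for the final layer and the 10x multiplier are illustrative, and total_loss / global_step come from the model setup):

optimizer = tf.train.GradientDescentOptimizer(learning_rate=.001)
grads_and_vars = optimizer.compute_gradients(total_loss)
# scale gradients for the final layer; leave all other gradients untouched
scaled = [(g * 10.0 if g is not None and v.op.name.startswith('vgg_16/fc8') else g, v)
          for g, v in grads_and_vars]
train_op = optimizer.apply_gradients(scaled, global_step=global_step)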

Why does this code path for train_op work:

optimizer = tf.train.GradientDescentOptimizer( learning_rate=.001 )
train_op = slim.learning.create_train_op(total_loss, optimizer,
                                        global_step=global_step)

slim.learning.train(train_op, log_dir, 
                    init_fn=init_fn,
                    global_step=global_step,
                    number_of_steps=25,
                    save_summaries_secs=300,
                    save_interval_secs=600                       
                   )

But manually creating the train_op fails with the exception below (the loss slim reports comes back as None):

trainable = tf.trainable_variables()
optimizer = tf.train.GradientDescentOptimizer(learning_rate=.001)
train_op = optimizer.minimize( total_loss, global_step=global_step )


# exception: appears that loss is None
--- Logging error ---
Traceback (most recent call last):
...
  File "/anaconda/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 755, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/anaconda/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 506, in train_step
    np_global_step, total_loss, time_elapsed)
  File "/anaconda/anaconda3/lib/python3.6/logging/__init__.py", line 338, in getMessage
    msg = msg % self.args
TypeError: must be real number, not NoneType
...
Message: 'global step %d: loss = %.4f (%.3f sec/step)'
Arguments: (29, None, 51.91366386413574)

What am I doing wrong here?


Solution

  • The issue is that, despite the name create_train_op(), slim returns a different type than the usual definition of a train_op, which is what you have created in the second case with the "non-slim" call:

    optimizer.minimize( total_loss, global_step=global_step )
    

    Try this, for example:

    optimizer = tf.train.GradientDescentOptimizer( learning_rate=.001 )
    train_op_no_slim = optimizer.minimize(total_loss)
    train_op = slim.learning.create_train_op(total_loss, optimizer)
    print(train_op_no_slim)
    print(train_op) 
    

    For the first print statement, I get the usual TensorFlow train op, a NoOp:

    name: "GradientDescent_2"
    op: "NoOp"
    input: "^GradientDescent_2/update_layer1/weight1/ApplyGradientDescent"
    input: "^GradientDescent_2/update_layer1/bias1/ApplyGradientDescent"
    input: "^GradientDescent_2/update_layer2/weight2/ApplyGradientDescent"
    input: "^GradientDescent_2/update_layer2/bias2/ApplyGradientDescent"
    input: "^GradientDescent_2/update_layer3/weight3/ApplyGradientDescent"
    input: "^GradientDescent_2/update_layer3/bias3/ApplyGradientDescent"
    

    For the second print statement, I get:

    Tensor("train_op_1/control_dependency:0", shape=(), dtype=float32)
    

    In short, slim.learning.create_train_op() does not have the same return type as optimizer.minimize(): minimize() returns a NoOp operation, while create_train_op() returns a Tensor that evaluates to the current loss.
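    This difference is exactly what trips the logging in your traceback: slim's training loop logs the result of sess.run(train_op), and running a NoOp returns None, which the '%.4f' format then rejects. A quick check, continuing from the snippet above:

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(train_op_no_slim))  # None: running an Operation yields no value
        print(sess.run(train_op))          # a float: the loss after the update step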

    To fix this: defining train_op directly takes you out of standard slim territory. I suggest embracing that and driving the directly defined train_op in the non-slim fashion, using sess.run() or train_op.run() as in a typical (non-slim) TensorFlow training loop, fetching total_loss alongside the train op so you can report it yourself.
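    A minimal sketch of that loop (assuming total_loss, global_step, init_fn, and log_dir are defined as in your question; the step count and logging format are illustrative):

    optimizer = tf.train.GradientDescentOptimizer(learning_rate=.001)
    train_op = optimizer.minimize(total_loss, global_step=global_step)
    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        init_fn(sess)  # restore the pretrained vgg16 weights, as in the slim version
        for step in range(25):
            # fetch the loss alongside the train op so you can report it yourself
            _, loss_val = sess.run([train_op, total_loss])
            print('global step %d: loss = %.4f' % (step, loss_val))
        saver.save(sess, log_dir + '/model.ckpt', global_step=global_step)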