I'm manually collecting gradient statistics of a multi-task model, whose graph looks schematically like this:
input -> [body_var1 ... body_varN] --> [task1_var1 ... task1_varM] <-- loss_1
                                   \-> [task2_var1 ... task2_varM] <-- loss_2
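For concreteness, here is a minimal sketch of how such a graph might be wired in TF 1.x; the layer sizes, variable-scope names and loss functions are illustrative assumptions, not the actual model, but the later snippets refer to these scope names:

import tensorflow as tf  # TF 1.x

# Hypothetical input and per-task labels (shapes are made up for the example)
inputs = tf.placeholder(tf.float32, [None, 128], name='input')
labels_1 = tf.placeholder(tf.float32, [None, 10], name='labels_1')
labels_2 = tf.placeholder(tf.float32, [None, 10], name='labels_2')

# Shared body: its variables play the role of body_var1 ... body_varN
with tf.variable_scope('body'):
    hidden = tf.layers.dense(inputs, 64, activation=tf.nn.relu)

# Task-specific heads: task1_var1 ... task1_varM and task2_var1 ... task2_varM
with tf.variable_scope('task1'):
    logits_1 = tf.layers.dense(hidden, 10)
with tf.variable_scope('task2'):
    logits_2 = tf.layers.dense(hidden, 10)

loss_1 = tf.losses.softmax_cross_entropy(labels_1, logits_1)
loss_2 = tf.losses.softmax_cross_entropy(labels_2, logits_2)
losses = {1: loss_1, 2: loss_2}  # keyed by task number, so losses[1] is task 1's loss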
I'm defining a separate optimizer for each loss as follows (the actual code is much more complicated; the following is simplified for this question):
# for simplicity, just demonstrate the case with the 1st task
task_index = 1

# here we define the optimizer (create an instance in the graph)
loss = losses[task_index]
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)  # learning rate is required
grads_and_vars = optimizer.compute_gradients(loss)

# now let's see what it returns
for g, v in grads_and_vars:
    print('  grad:', g, ', var:', v)
So, the code above clearly creates a separate optimizer for the branch of task 1 only, then builds the gradient-computation ops with optimizer.compute_gradients(loss) and prints the variables the gradients will be applied to.
Expected results:
grad: body_var1_grad, var: body_var1     # \
...                                      #  --> body vars and gradients
grad: body_varN_grad, var: body_varN     # /
grad: task1_var1_grad, var: task1_var1   # \
...                                      #  --> task 1 vars and gradients
grad: task1_varM_grad, var: task1_varM   # /
So I'm expecting the optimizer to contain gradient-computation ops only for the branch it was applied to (i.e. the branch of the 1st task).
Actual results:
grad: body_var1_grad, var: body_var1     # \
...                                      #  --> body vars and gradients
grad: body_varN_grad, var: body_varN     # /
grad: task1_var1_grad, var: task1_var1   # \
...                                      #  --> task 1 vars and gradients
grad: task1_varM_grad, var: task1_varM   # /
grad: None, var: task2_var1              # \
...                                      #  --> task 2 vars, with None gradients
grad: None, var: task2_varM              # /
So it looks like optimizer.compute_gradients(loss) captures not only the sub-graph that outputs to loss (which can be extracted with tf.graph_util.extract_sub_graph), but also every other trainable variable in the graph, without creating a gradient for it (so the returned gradient is None).
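A quick way to see the same effect directly, assuming the 'task2' variable-scope name from the sketch above, is to ask tf.gradients() for the gradient of the first task's loss with respect to the task-2 variables; it returns None for variables the loss does not depend on:

# Variables of the second task's head (assumes the 'task2' scope from the sketch)
task2_vars = tf.trainable_variables(scope='task2')

# tf.gradients() yields None for variables that loss_1 does not depend on
print(tf.gradients(losses[1], task2_vars))   # -> [None, None]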
Question: is such behavior normal?
Yes, it is, because compute_gradients() computes the gradients of loss with respect to the list of tf.Variable objects passed in the var_list parameter. If var_list is not provided, the function calculates the gradients with respect to all variables in the GraphKeys.TRAINABLE_VARIABLES collection. Also, if loss does not depend on certain variables, the gradients of loss with respect to those variables are not defined, i.e. None is returned. Based on the code you provided, this seems to be the case.
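As a side note, if you keep the default behaviour and just want to collect gradient statistics, a common pattern is to drop the pairs whose gradient is None before processing them:

# Keep only the (gradient, variable) pairs that actually have a gradient
grads_and_vars = [(g, v) for g, v in grads_and_vars if g is not None]
for g, v in grads_and_vars:
    print('  grad:', g, ', var:', v)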
If you want the optimizer to calculate gradients with respect to certain variables only, you should make a list of such variables and pass it to the var_list parameter of compute_gradients().
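For example, to restrict the computation to the shared body and the first task's head (again assuming the 'body' and 'task1' variable-scope names from the sketch in the question), something along these lines should work:

# Collect only the variables belonging to the shared body and task 1
var_list = (tf.trainable_variables(scope='body') +
            tf.trainable_variables(scope='task1'))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
grads_and_vars = optimizer.compute_gradients(losses[1], var_list=var_list)

# Every returned pair now has a real gradient; task-2 variables are not listed at all
for g, v in grads_and_vars:
    print('  grad:', g, ', var:', v)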