I am training a neural network with sigmoid layers stacked one on top of the other. I have labels associated with each layer, and I would like to alternate between training towards minimizing the loss on the first layer and minimizing the loss on the second layer. I expect that the result I get on the first layer would not change regardless of whether I also train the second layer or not. However, I do see a significant difference. What am I missing?
Here is the code:
import numpy as np
import cntk as C

dim = Xtrain.shape[1]
output_dim = Ytrain.shape[1]
categories_dim = Ctrain.shape[1]
features = C.input_variable(dim, np.float32)
label = C.input_variable(output_dim, np.float32)
categories = C.input_variable(categories_dim, np.float32)
# Prediction layer parameters
b = C.parameter(shape=(output_dim))
w = C.parameter(shape=(dim, output_dim))
# Adversarial head parameters, stacked on top of the prediction layer
adv_w = C.parameter(shape=(output_dim, categories_dim))
adv_b = C.parameter(shape=(categories_dim))
pred_parameters = (w, b)
adv_parameters = (adv_w, adv_b)
# The first layer z predicts the label; the second layer predicts the categories from z
z = C.tanh(C.times(features, w) + b)
adverse = C.tanh(C.times(z, adv_w) + adv_b)
pred_loss = C.cross_entropy_with_softmax(z, label)
pred_error = C.classification_error(z, label)
adv_loss = C.cross_entropy_with_softmax(adverse, categories)
adv_error = C.classification_error(adverse, categories)
pred_learning_rate = 0.5
pred_lr_schedule = C.learning_rate_schedule(pred_learning_rate, C.UnitType.minibatch)
pred_learner = C.adam(pred_parameters, pred_lr_schedule, C.momentum_as_time_constant_schedule(0.9))
pred_trainer = C.Trainer(adverse, (pred_loss, pred_error), [pred_learner])
adv_learning_rate = 0.5
adv_lr_schedule = C.learning_rate_schedule(adv_learning_rate, C.UnitType.minibatch)
adv_learner = C.adam(adverse.parameters, adv_lr_schedule, C.momentum_as_time_constant_schedule(0.9))
adv_trainer = C.Trainer(adverse, (adv_loss, adv_error), [adv_learner])
minibatch_size = 50
num_of_epocs = 40
# Run the trainer and perform model training
training_progress_output_freq = 50
def permute(x, y, c):
    # Shuffle all three arrays with the same random permutation
    rr = np.arange(x.shape[0])
    np.random.shuffle(rr)
    x = x[rr, :]
    y = y[rr, :]
    c = c[rr, :]
    return (x, y, c)
for e in range(0, num_of_epocs):
    (x, y, c) = permute(Xtrain, Ytrain, Ctrain)
    for i in range(0, x.shape[0], minibatch_size):
        m_features = x[i:min(i + minibatch_size, x.shape[0]), :]
        m_labels = y[i:min(i + minibatch_size, x.shape[0]), :]
        m_cat = c[i:min(i + minibatch_size, x.shape[0]), :]
        # Alternate: even epochs train the prediction layer, odd epochs train the adversarial head
        if e % 2 == 0:
            pred_trainer.train_minibatch({features: m_features, label: m_labels, categories: m_cat})
        else:
            adv_trainer.train_minibatch({features: m_features, label: m_labels, categories: m_cat})
I am surprised that if I comment out the last two lines (the else: adv_trainer.train_minibatch(...) branch), the train and test error of z in predicting the label changes. Since adv_trainer is supposed to modify only adv_w and adv_b, which are not used in computing z or its loss, I can't see why that should happen. I appreciate the assistance.
You shouldn’t do
adv_learner = C.adam(adverse.parameters, adv_lr_schedule, C.momentum_as_time_constant_schedule(0.9))
but:
adv_learner = C.adam(adv_parameters, adv_lr_schedule, C.momentum_schedule(0.9))
adverse.parameters contains all the parameters of the network, including w and b, and you don't want that: with it, adv_trainer also updates the prediction layer, which is exactly why the error of z changes when the adversarial trainer runs.
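If you want to check which parameters each learner will actually update, you can inspect them before constructing the trainers. A minimal sketch, assuming the variable names from the code above:

# Every parameter reachable from the graph rooted at adverse,
# i.e. w, b, adv_w and adv_b:
print([p.shape for p in adverse.parameters])

# Only the adversarial head's parameters:
print([p.shape for p in adv_parameters])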
On a different note, you will want to replace momentum_as_time_constant_schedule with momentum_schedule. The former takes as its parameter the number of samples after which the contribution of the gradient will have decayed by exp(-1), so passing 0.9 to it gives you practically no momentum rather than a momentum of 0.9.
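To make the difference concrete, here is a small illustrative sketch of the conversion between the two conventions, assuming the minibatch size of 50 from the question and the per-sample decay described above:

import math

minibatch_size = 50
momentum = 0.9  # the per-minibatch momentum that was actually intended

# With a time constant tau, the per-sample decay is exp(-1 / tau), so a
# per-minibatch momentum of 0.9 corresponds to tau = -minibatch_size / ln(0.9),
# roughly 474 samples.
tau = -minibatch_size / math.log(momentum)
print(tau)  # ~474.6

# Passing 0.9 as the time constant instead means the gradient's contribution
# decays by exp(-1) after less than one sample, i.e. effectively no momentum.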