Tags: python, tensorflow, keras, backpropagation

Why are these models different? Does Keras normalize gradients?


I was experimenting with some random normalization schemes in my models and I found something quite strange.

For models that produce exactly the same output, but are built so their weights are larger (with compensating scaling elsewhere), the training speed is dramatically different.

Why does this happen? If the weights are larger, the gradients should be larger too, and in the end the training speed should be the same. What hidden magic is Keras or TensorFlow doing that causes this difference?

Details:

The models

Here I define three models. Two of them have their intermediate tensors divided by a value and, to compensate, their weights multiplied by the same value.

Considering that each neuron's output is a variation of w1*i1 + w2*i2 + w3*i3 ..., if I multiply all weights by a value and divide all inputs by the same value, the result is exactly the same. (where: w = weight ; i = input)

I removed the biases from all layers so they don't influence the results.
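As a quick sanity check of that claim (with made-up toy numbers, not the actual model weights), the compensation can be verified directly:

import numpy as np

# toy example: multiplying the weights by 3 and dividing the inputs by 3
# leaves the neuron's pre-activation output unchanged
w = np.array([0.2, -0.5, 0.7])
i = np.array([1.0, 2.0, 3.0])

print(np.dot(w, i))             # original: w1*i1 + w2*i2 + w3*i3
print(np.dot(3.0 * w, i / 3.0)) # compensated: identical up to float rounding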

Model 1 - Unchanged model

from keras.layers import Input, Conv2D, Lambda, Activation
from keras.models import Model

# smallSide and bigSide are defined elsewhere in the original code
inp1 = Input((smallSide,bigSide,3))
out1 = Conv2D(200,3,activation='tanh', use_bias=False, padding = 'same', name="conv1")(inp1)
out1 = Conv2D(1,3,activation='sigmoid', use_bias=False, padding = 'same', name="conv2")(out1)
model1 = Model(inp1,out1)

Model 2 - Before the activation, divide the outputs by the number of input channels

inp2 = Input((smallSide,bigSide,3))

out2 = Conv2D(200,3,activation='linear', use_bias=False, padding = 'same', name="conv1a")(inp2)
out2 = Lambda(lambda x: x/3.)(out2)
out2 = Activation('tanh')(out2)

out2 = Conv2D(1,3,activation='linear', use_bias=False, padding = 'same', name="conv2a")(out2)
out2 = Lambda(lambda x: x/200.)(out2)
out2 = Activation('sigmoid')(out2)

model2 = Model(inp2,out2)

Model 3 (should be the same as 2, in a different order)

inp3 = Input((smallSide,bigSide,3))
out3 = Lambda(lambda x: x/3.)(inp3)

out3 = Conv2D(200,3,activation='tanh', use_bias=False, padding = 'same', name="conv1b")(out3)
out3 = Lambda(lambda x: x/200.)(out3)

out3 = Conv2D(1,3,activation='sigmoid', use_bias=False, padding = 'same', name="conv2b")(out3)

model3 = Model(inp3,out3)

Compiling and adjusting the weights

Compiling, same configs for all models:

model1.compile(optimizer='adam', loss='binary_crossentropy')
model2.compile(optimizer='adam', loss='binary_crossentropy')
model3.compile(optimizer='adam', loss='binary_crossentropy')

Here, I transfer the weights from model 1 to the others, applying the appropriate multiplication factor to compensate for the divided outputs:

model2.get_layer('conv1a').set_weights([3 * model1.get_layer('conv1').get_weights()[0]])
model2.get_layer('conv2a').set_weights([200 * model1.get_layer('conv2').get_weights()[0]])

model3.get_layer('conv1b').set_weights([3 * model1.get_layer('conv1').get_weights()[0]])
model3.get_layer('conv2b').set_weights([200 * model1.get_layer('conv2').get_weights()[0]])

Verifying equality

Here, I test the outputs of each model to see that they're equal:

y1 = model1.predict(X[:10])
y2 = model2.predict(X[:10])
y3 = model3.predict(X[:10])

inspectValues(y1-y2) #this is a custom function that prints min, max and mean
inspectValues(y1-y3)
inspectValues(y2-y3)
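(inspectValues is the question's own helper; a minimal sketch of what it presumably does, judging by the fields it prints below:)

import numpy as np

def inspectValues(arr):
    # hypothetical reimplementation: just print shape, min, max and mean
    print("inspecting values:")
    print("    shape:", arr.shape)
    print("    min:", arr.min())
    print("    max:", arr.max())
    print("    mean:", arr.mean())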

The outputs are:

inspecting values:
    shape: (10, 64, 96, 1)
    min: -1.19209e-07
    max: 1.19209e-07
    mean: -2.00477e-09
inspecting values:
    shape: (10, 64, 96, 1)
    min: -1.19209e-07
    max: 5.96046e-08
    mean: -2.35159e-09
inspecting values:
    shape: (10, 64, 96, 1)
    min: -1.19209e-07
    max: 1.19209e-07
    mean: -3.46821e-10

We can see that the values are virtually identical: the differences are on the order of float32 machine epsilon (about 1.19e-07), which is negligible given that the output range is 0 to 1.

Training differences

Here I briefly train the three models, and there is a significant, reproducible difference: model 1 is always way ahead of the others. Why does this happen?

for epoch in range(20):
    print("\n\n\nfitting model 3")
    model3.fit(X,Y,epochs=2)
    print("\n\n\nfitting model 1")
    model1.fit(X,Y,epochs=2)
    print("\n\n\nfitting model 2")
    model2.fit(X,Y,epochs=2)

Outputs:

fitting model 3
Epoch 1/2
5088/5088 [==============================] - 302s 59ms/step - loss: 0.1057
Epoch 2/2
5088/5088 [==============================] - 300s 59ms/step - loss: 0.0260

fitting model 1
Epoch 1/2
5088/5088 [==============================] - 284s 56ms/step - loss: 0.0280
Epoch 2/2
5088/5088 [==============================] - 282s 55ms/step - loss: 0.0111

fitting model 2
Epoch 1/2
5088/5088 [==============================] - 296s 58ms/step - loss: 0.1059
Epoch 2/2
5088/5088 [==============================] - 296s 58ms/step - loss: 0.0260

fitting model 3
Epoch 1/2
5088/5088 [==============================] - 300s 59ms/step - loss: 0.0187
Epoch 2/2
5088/5088 [==============================] - 301s 59ms/step - loss: 0.0155

fitting model 1
Epoch 1/2
5088/5088 [==============================] - 281s 55ms/step - loss: 0.0110
Epoch 2/2
5088/5088 [==============================] - 283s 56ms/step - loss: 0.0105

fitting model 2
Epoch 1/2
5088/5088 [==============================] - 294s 58ms/step - loss: 0.0187
Epoch 2/2

Solution

  • You are wrong in assuming that the gradients won't change.

Assume this simplified model for the last layer: a single neuron, no activation. In the first case, the output is

y = w * h
    

    where h is the output of the previous layer. We have dy/dw = h.

    Now let's introduce a scaling factor λ,

y = λ * w * h
    

Now the derivative of the output is dy/dw = λ * h. It does not matter that the value of w itself is scaled by 1/λ.

    To get the same gradient magnitude, you would actually need to scale the output of the previous layer h by a factor of 1/λ. But since you are preserving the scale of the output, this does not happen.
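A quick numerical check of this argument, written with tf.GradientTape (this assumes TensorFlow 2.x, which the original question may not have used; it is only a sketch of the derivation above, not the questioner's code):

import tensorflow as tf

h = tf.constant(2.0)               # output of the previous layer
lam = 1.0 / 200.0                  # scaling factor λ, as in model 2's last layer

w = tf.Variable(0.5)               # original weight
w_scaled = tf.Variable(0.5 / lam)  # weight multiplied by 1/λ = 200

with tf.GradientTape() as tape:
    y = w * h                      # unscaled layer: y = w * h
grad_plain = tape.gradient(y, w)   # = h = 2.0

with tf.GradientTape() as tape:
    y2 = lam * w_scaled * h        # scaled layer: y = λ * w * h
grad_scaled = tape.gradient(y2, w_scaled)  # = λ * h = 0.01

print(float(grad_plain), float(grad_scaled))  # same forward output, very different gradients

Both versions produce the same output y = 1.0, yet the gradient with respect to the scaled weight is 200 times smaller, while the weight itself is 200 times larger, so each optimizer step changes the scaled model proportionally far less.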