python-3.x keras deep-learning tensorflow2.0 gradient-descent

Can't combine gradients from multiple loss functions of a multi-output Keras model

I am doing time-series forecasting in Keras with a CNN on the EHR dataset. The goal is to predict both which molecule to give to the patient and the time until the patient's next visit. I have to implement a bi-objective gradient descent based on this paper. The algorithm to implement is here (end of page 7, beginning of page 8):

The model I chose is this one:

It takes time series of length 3 as input (corresponding to 3 consecutive visits for a client) and has 2 outputs:

the ATC code (the code of the molecule to predict)
the time to wait until the next visit (in categories of months: 0, 1, 2, 3, and 4 for >= 4)

Both outputs use the SparseCategoricalCrossentropy loss function.

When I start to implement the first operation, gs - gl, I get this error:

Some values in my gradients are None and I don't know why. My optimizer is defined as follows: optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3) when compiling my model.

Also, when I try some operations on the gradients to see how things work, I have another problem: only one output is taken into account, which will be a problem later because I have to consider each loss function separately:

With this code, I get this warning: WARNING:tensorflow:Gradients do not exist for variables ['outputWaitTime/kernel:0', 'outputWaitTime/bias:0'] when minimizing the loss.

EPOCHS = 1

for epoch in range(EPOCHS):
    with tf.GradientTape() as ATCTape, tf.GradientTape() as WTTape:
        predictions = model(xTrain,training=False)
        ATCLoss = loss(yTrain[:,:,0],predictions[ATC_CODE])
        WTLoss = loss(yTrain[:,:,1],predictions[WAIT_TIME])

    ATCGrads = ATCTape.gradient(ATCLoss, model.trainable_variables)
    WTGrads  = WTTape.gradient(WTLoss,model.trainable_variables)
    grads = ATCGrads + WTGrads

    model.optimizer.apply_gradients(zip(grads, model.trainable_variables))

With this code it works, but both losses are combined into one, whereas I need to consider the two losses separately:

EPOCHS = 1

for epoch in range(EPOCHS):
    with tf.GradientTape() as tape:
        predictions = model(xTrain,training=False)
        ATCLoss = loss(yTrain[:,:,0],predictions[ATC_CODE])
        WTLoss = loss(yTrain[:,:,1],predictions[WAIT_TIME])
        lossValue = ATCLoss + WTLoss

    grads = tape.gradient(lossValue, model.trainable_variables)

    model.optimizer.apply_gradients(zip(grads, model.trainable_variables))

I need help understanding why I have all of these problems.

The notebook containing all the code is here: https://colab.research.google.com/drive/1b6UorAAEddNKFQCxaK1Wsuj09U645KhU?usp=sharing

The implementation begins in the Model Creation part.

Solution

You get None in ATCGrads and WTGrads because each loss depends on only one of the two outputs, outputATC or outputWaitTime. If an output is not used to compute a loss, that loss has no gradient with respect to that output layer's variables, so tape.gradient() returns None for them. This is also why you get WARNING:tensorflow:Gradients do not exist for variables ['outputWaitTime/kernel:0', 'outputWaitTime/bias:0'] when minimizing the loss: those gradients do not exist for that loss. If you combine the losses into one, both outputs are used to compute it, so there is no warning.
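To see this on a small case, here is a minimal sketch that does not use your model. The three variables (shared, head_atc, head_wait) are hypothetical stand-ins for a shared layer and the two output layers, and the losses are made up. Each loss only reaches one head, so its gradient for the other head comes back as None.

```python
import tensorflow as tf

# Hypothetical stand-ins for a shared layer and two output heads.
shared = tf.Variable(2.0, name="shared")
head_atc = tf.Variable(3.0, name="head_atc")
head_wait = tf.Variable(5.0, name="head_wait")
variables = [shared, head_atc, head_wait]

x = tf.constant(1.5)

# persistent=True allows calling tape.gradient() once per loss.
with tf.GradientTape(persistent=True) as tape:
    hidden = shared * x
    atc_loss = (head_atc * hidden) ** 2    # uses shared and head_atc only
    wait_loss = (head_wait * hidden) ** 2  # uses shared and head_wait only

atc_grads = tape.gradient(atc_loss, variables)
wait_grads = tape.gradient(wait_loss, variables)
del tape  # release the resources held by the persistent tape

# True marks a variable the loss does not depend on.
atc_none = [g is None for g in atc_grads]
wait_none = [g is None for g in wait_grads]
print(atc_none)   # [False, False, True]
print(wait_none)  # [False, True, False]
```

The shared variable gets a gradient from both losses, while each head only gets one from its own loss.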

So if you want to do an element-wise subtraction of the two lists, first convert each None to 0. Note that you cannot use tf.math.subtract(gs, gl) directly, because it requires the shapes of all elements in each list to match, and the lists here hold tensors of different shapes. Instead:

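import tensorflow as tf

# Example gradient lists; None marks a variable the loss does not depend on.
gs = [tf.constant([1., 2.]), tf.constant(3.), None]
gl = [tf.constant([3., 4.]), None, tf.constant(4.)]

# Replace None with 0. so the subtraction is defined for every pair.
to_zero = lambda i: 0. if i is None else i
gs = list(map(to_zero, gs))
gl = list(map(to_zero, gl))

# Subtract pair by pair; Python lists do not support "-" directly.
sub = [s_i - l_i for s_i, l_i in zip(gs, gl)]
print(sub)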

Output:

[<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-2., -2.], dtype=float32)>, 
<tf.Tensor: shape=(), dtype=float32, numpy=3.0>, 
<tf.Tensor: shape=(), dtype=float32, numpy=-4.0>]
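As an alternative to mapping None to 0. by hand, tape.gradient() accepts an unconnected_gradients argument, and passing tf.UnconnectedGradients.ZERO makes it return zero tensors instead of None. The sketch below assumes toy variables (shared, head_s, head_l) and made-up losses, not the ones from your notebook.

```python
import tensorflow as tf

# Toy variables: one shared by both losses, one specific to each loss.
shared = tf.Variable([1.0, 2.0])
head_s = tf.Variable(3.0)
head_l = tf.Variable(4.0)
variables = [shared, head_s, head_l]

with tf.GradientTape(persistent=True) as tape:
    loss_s = head_s * tf.reduce_sum(shared)       # does not use head_l
    loss_l = head_l * tf.reduce_sum(shared ** 2)  # does not use head_s

# ZERO returns a zero tensor (same shape as the variable) instead of None.
zero = tf.UnconnectedGradients.ZERO
gs = tape.gradient(loss_s, variables, unconnected_gradients=zero)
gl = tape.gradient(loss_l, variables, unconnected_gradients=zero)
del tape

# No None left, so the lists can be subtracted pair by pair.
diff = [s - l for s, l in zip(gs, gl)]
print([d.numpy().tolist() for d in diff])  # [[-5.0, -13.0], 3.0, -5.0]
```

This keeps the gradient lists aligned with the variable list, which is what apply_gradients(zip(grads, variables)) expects.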

Also note that tape.gradient() returns a list or nested structure of Tensors (or IndexedSlices, or None), one for each element in sources, with the same structure as sources. Adding two Python lists, [1, 2] + [3, 4], does not give [4, 6] as it would with NumPy arrays; it concatenates them into [1, 2, 3, 4]. So grads = ATCGrads + WTGrads in your first snippet produces a list twice as long as model.trainable_variables instead of summing the gradients.
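If the intent in your first snippet was to sum the two gradient lists, one possible sketch is the helper below. combine_gradients is a hypothetical name, and plain floats stand in for tensors so the example runs without TensorFlow; the same logic should apply to tensors, since they support +.

```python
def combine_gradients(grads_a, grads_b):
    """Sum two gradient lists pair by pair, treating None as no contribution."""
    combined = []
    for g_a, g_b in zip(grads_a, grads_b):
        if g_a is None:
            combined.append(g_b)   # only the second loss contributes (or neither)
        elif g_b is None:
            combined.append(g_a)   # only the first loss contributes
        else:
            combined.append(g_a + g_b)
    return combined


# Plain floats stand in for gradient tensors (illustrative values only).
atc_grads = [1.0, 2.0, None]
wt_grads = [3.0, None, 4.0]

concatenated = atc_grads + wt_grads            # list concatenation, 6 entries
summed = combine_gradients(atc_grads, wt_grads)

print(len(concatenated))  # 6
print(summed)             # [4.0, 2.0, 4.0]
```

The summed list has one entry per variable, so it can be zipped with model.trainable_variables. If both entries of a pair are None, the result stays None for that variable.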