Tags: python, tensorflow, tensorflow-probability

Fail to run a probabilistic TensorFlow model


I built a test TensorFlow LSTM model with two heads and two outputs, where one of the outputs is probabilistic. That model works fine. I did the same work but added more layers, following the same procedure... but this one fails with this error:

2023-07-15 09:18:24.407504: W tensorflow/core/common_runtime/bfc_allocator.cc:491] ***********________***********________************______________________________________************
2023-07-15 09:18:24.408219: E tensorflow/stream_executor/dnn.cc:868] OOM when allocating tensor with shape[2866176000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2023-07-15 09:18:24.409011: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at cudnn_rnn_ops.cc:1564 : INTERNAL: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 240, 240, 1, 120, 19904, 240] 
Traceback (most recent call last):
  File "E:\Anaconda3\envs\tf2.7_bigData\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "E:\Anaconda3\envs\tf2.7_bigData\lib\site-packages\tensorflow\python\eager\execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer "Extracteur_feature2" "                 f"(type LSTM).

{{function_node __wrapped__CudnnRNN_device_/job:localhost/replica:0/task:0/device:GPU:0}} Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 240, 240, 1, 120, 19904, 240]  [Op:CudnnRNN]

Call arguments received by layer "Extracteur_feature2" "                 f"(type LSTM):
  • inputs=tf.Tensor(shape=(19904, 120, 240), dtype=float32)
  • mask=None
  • training=False
  • initial_state=None

Process finished with exit code 1
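For what it's worth, the call arguments at the bottom of the traceback show all 19904 sequences (shape 19904 × 120 × 240, training=False) reaching the LSTM as a single batch, and the tensor that fails to allocate (shape[2866176000], float32) is roughly 11 GB, so the allocation scales with that batch size. A hypothetical sketch of what bounding the batch would look like (the model and input names are placeholders, and as the accepted solution below shows, upgrading TF is what ultimately resolved it):

# Hypothetical sketch: run prediction in bounded batches instead of
# pushing all 19904 sequences through the LSTM at once.
preds = model.predict([timeseries_data, attribs_data], batch_size=256)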

The model is built like this:

import tensorflow as tf
import tensorflow_probability as tfp


def build_model(num_timesteps_in, nb_features, nb_attributs, nb_lstm_units, probalistic_model=True):
    """
    Build a model with TensorFlow.
    :param num_timesteps_in: how many days of observations, including the day to forecast, are used as input
    :param nb_features: number of features used as input (excludes the inflows if not assimilated)
    :param nb_attributs: number of physiographic attributes used as input
    :param nb_lstm_units: number of neurons per layer
    :return: the model, plus the optimizer and loss used to launch the training
    """

    # allocate GPU memory on demand instead of reserving it all up front
    gpu_devices = tf.config.experimental.list_physical_devices("GPU")
    for device in gpu_devices:
        tf.config.experimental.set_memory_growth(device, True)


    def negative_loglikelihood(targets, estimated_distribution):
        return -estimated_distribution.log_prob(targets)

    tfd = tfp.distributions

    timeseries_input = tf.keras.Input(shape=(num_timesteps_in, nb_features))
    attrib_input = tf.keras.Input(shape=(nb_attributs,))

    xy = tf.keras.layers.LSTM(nb_lstm_units,  # activation='softsign'
                              kernel_initializer=tf.keras.initializers.glorot_uniform(),
                              return_sequences=True, stateful=False,
                              name='Extracteur_feature1')(timeseries_input)

    xy = tf.keras.layers.Dropout(0.2)(xy)

    xy = tf.keras.layers.LSTM(nb_lstm_units,  # activation='softsign'
                              kernel_initializer=tf.keras.initializers.glorot_uniform(),
                              return_sequences=True, stateful=False,
                              name='Extracteur_feature2')(xy)

    xy = tf.keras.layers.Dropout(0.2)(xy)

    xy = tf.keras.layers.LSTM(nb_lstm_units,  # activation='softsign'
                              kernel_initializer=tf.keras.initializers.glorot_uniform(),
                              return_sequences=False, stateful=False,
                              name='Extracteur_feature3')(xy)

    xy = tf.keras.layers.Dropout(0.2)(xy)

    allin_input = tf.keras.layers.Concatenate(axis=1, name='merged_head')([xy, attrib_input])

    allin_input = tf.keras.layers.Dense(nb_attributs, activation='softsign',
                                        kernel_initializer=tf.keras.initializers.he_uniform(),
                                        name='Dense111')(allin_input)

    allin_input = tf.keras.layers.Dropout(0.2)(allin_input)

    allin_input = tf.keras.layers.Dense(nb_attributs, activation='softsign',
                                        kernel_initializer=tf.keras.initializers.he_uniform(),
                                        name='Dense222')(allin_input)

    outputs = tf.keras.layers.Dropout(0.2)(allin_input)
    if probalistic_model:
        ################### probability block ##########################
        prevision = tf.keras.layers.Dense(1, activation='linear', name='deterministe_1')(outputs)
        probabilist = tf.keras.layers.Dense(2, activation='linear', name='probabilist_2')(outputs)

        probabilist = tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t[..., :1],
                                                                         scale=1e-3 + tf.math.softplus(
                                                                             0.05 * t[..., 1:])),
                                                    name='normal_dist')(probabilist)  # note this
        # 1e-3 to avoid numerical problems
        # 0.05 is not entirely clear; it possibly helps speed up the optimization and avoid local minima...
        # https://github.com/tensorflow/probability/issues/703

        ################### end of probability block ##########################

        model = tf.keras.Model(inputs=[timeseries_input, attrib_input], outputs=[prevision, probabilist])

        model.summary()
        # with adam [.001 to .0005], optimal results AND speed
        optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
        loss = {'deterministe_1': 'mse', 'normal_dist': negative_loglikelihood}

        model.compile(optimizer=optimizer, loss=loss,
                      loss_weights=[1, 1])

    else:
        outputs = tf.keras.layers.Dense(1, activation='linear', name='deterministe')(outputs)

        model = tf.keras.Model(inputs=[timeseries_input, attrib_input], outputs=outputs)

        model.summary()

        optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # with adam [.001 to .0005], optimal results AND speed
        loss = 'mse'
        model.compile(optimizer=optimizer, loss=loss)

    return model, optimizer, loss
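For context, the builder is called like this; the 120 time steps, 240 features, and 240 LSTM units below mirror the shapes in the error log, while the number of attributes is a placeholder:

# Sketch of the call; 120/240/240 mirror the shapes in the error log,
# nb_attributs=10 is hypothetical.
model, optimizer, loss = build_model(num_timesteps_in=120,
                                     nb_features=240,
                                     nb_attributs=10,
                                     nb_lstm_units=240,
                                     probalistic_model=True)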

I do the training in several steps, so I reload the best iteration and re-compile it, because the custom loss function causes problems if it is done any other way. The loss function and optimizer are defined earlier as copies of the ones used in build_model.

def negative_loglikelihood(targets, estimated_distribution):
    return -estimated_distribution.log_prob(targets)

# same optimizer and loss as in build_model
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss_fct = {'deterministe_1': 'mse', 'normal_dist': negative_loglikelihood}

model = tf.keras.models.load_model('path_to_model/model.h5',
                                   compile=False)
model.compile(optimizer=optimizer, loss=loss_fct)
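For completeness, the standard Keras route would be to resolve the custom loss at load time through custom_objects (sketch below); I have not verified that it behaves with the DistributionLambda layer, which is why I reload with compile=False and recompile instead.

# Hypothetical alternative: let Keras resolve the custom loss at load time,
# so no manual recompile is needed.
model = tf.keras.models.load_model(
    'path_to_model/model.h5',
    custom_objects={'negative_loglikelihood': negative_loglikelihood})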

I don't understand this error. This model was tested many times without tf.probability and works fine (the LSTM input shapes are OK...). What is new is the added second tf.probability output (which works fine in a simpler version) and reloading with compile=False then recompiling (which also works fine with the simpler model).

I have been working on this problem for 3 weeks and I'm out of ideas to try...

TensorFlow 2.10, tensorflow-probability 0.14.0, Windows/Anaconda


Solution

  • I finally succeeded by updating to tf==2.13 and updating tfp accordingly.
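  • In practice that meant something like pip install --upgrade "tensorflow==2.13.*" "tensorflow-probability==0.21.*" (the 0.21 pin is an assumption: each tensorflow-probability release is built against one TensorFlow release, and 0.21 is the one paired with TF 2.13), followed by a quick sanity check:

import tensorflow as tf
import tensorflow_probability as tfp

# Versions below are the assumed working pair after the upgrade.
print(tf.__version__)   # expect 2.13.x
print(tfp.__version__)  # expect 0.21.x (the release paired with TF 2.13)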