Tags: python, tensorflow, keras, tf.keras, mlflow

How do I track loss at each epoch using mlflow/tensorflow?


I want to use mlflow to track the development of a TensorFlow model. How do I log the loss at each epoch? I have written the following code:

mlflow.set_tracking_uri(tracking_uri)

mlflow.set_experiment("/deep_learning")
with mlflow.start_run():
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("learning_rate", learning_rate)
    mlflow.log_param("epochs", epochs)
    mlflow.log_param("Optimizer", opt)
    mlflow.log_metric("train_loss", train_loss)
    mlflow.log_metric("val_loss", val_loss)
    mlflow.log_metric("test_loss", test_loss)
    mlflow.log_metric("test_mse", test_mse)
    mlflow.log_artifacts("./model")

If I change the train_loss and val_loss to

train_loss = history.history['loss']
val_loss = history.history['val_loss']

I get the following error:

mlflow.exceptions.MlflowException: Got invalid value [12.041399002075195] for metric 'train_loss' (timestamp=1649783654667). Please specify value as a valid double (64-bit floating point)

How do I save the loss and the val_loss at all epochs, so I can visualise a learning curve within mlflow?


Solution

  • As you can read here, you can use mlflow.tensorflow.autolog(), which (from the docs):

    Enables (or disables) and configures autologging from Keras to MLflow. Autologging captures the following information:

    fit() or fit_generator() parameters; optimizer name; learning rate; epsilon ...

    For example:

    # !pip install mlflow
    import tensorflow as tf
    import mlflow
    import numpy as np

    # Dummy data: 100 samples, 100 features, 10 classes
    X_train = np.random.rand(100, 100)
    y_train = np.random.randint(0, 10, 100)

    # Simple classifier; softmax gives a probability distribution over the 10 classes,
    # matching SparseCategoricalCrossentropy(from_logits=False)
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(100,)))
    model.add(tf.keras.layers.Dense(256, activation='relu'))
    model.add(tf.keras.layers.Dropout(rate=.4))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))
    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  optimizer='Adam',
                  metrics=['accuracy'])
    model.summary()

    # Enable autologging before calling fit() so parameters and per-epoch metrics are captured
    mlflow.tensorflow.autolog()
    history = model.fit(X_train, y_train, epochs=100, batch_size=50)
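
    Note that Keras only reports val_loss (and autolog only records it per epoch) if fit() is given validation data. A minimal sketch, assuming a plain validation_split is acceptable for your data:

    # Assumption: a 20% hold-out split is fine; Keras then reports val_loss each
    # epoch and autolog records it alongside the training loss in MLflow.
    mlflow.tensorflow.autolog()
    history = model.fit(X_train, y_train,
                        validation_split=0.2,
                        epochs=100, batch_size=50)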
    

    Or, as you mention in the comment, you can use mlflow.set_tracking_uri() like below:

    mlflow.set_tracking_uri('http://127.0.0.1:5000')  # point MLflow at the local tracking server
    tracking_uri = mlflow.get_tracking_uri()
    with mlflow.start_run(run_name='PARENT_RUN') as parent_run:
        batch_size = 50
        # fit() runs inside the active run, so autologged metrics land in PARENT_RUN
        history = model.fit(X_train, y_train, epochs=2, batch_size=batch_size)
        mlflow.log_param("batch_size", batch_size)
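
    Alternatively, if you want to keep your manual logging: the error in the question occurs because history.history['loss'] is a list with one value per epoch, while mlflow.log_metric() expects a single float. You can log each epoch's value with the step argument so MLflow draws the learning curve. A minimal sketch (validation_split here is only an assumption so that val_loss exists):

    with mlflow.start_run():
        history = model.fit(X_train, y_train, validation_split=0.2,
                            epochs=epochs, batch_size=batch_size)
        # log one value per epoch; step lets MLflow plot the metric over epochs
        for epoch, (loss, val) in enumerate(zip(history.history['loss'],
                                                history.history['val_loss'])):
            mlflow.log_metric("train_loss", loss, step=epoch)
            mlflow.log_metric("val_loss", val, step=epoch)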
    

    To view the results, start the MLflow UI:

    !mlflow ui
    

    Output:

    [....] [...] [INFO] Starting gunicorn 20.1.0
    [....] [...] [INFO] Listening at: http://127.0.0.1:5000 (****)
    [....] [...] [INFO] Using worker: sync
    [....] [...] [INFO] Booting worker with pid: ****
    

    (Screenshots: the MLflow UI showing the logged run and its metrics.)