Tags: databricks, mlflow, netflix-metaflow

Combination of Metaflow and MLflow within Databricks


I need to write a script in a Databricks notebook that combines Metaflow and MLflow.

This is the script:

import mlflow
from metaflow import FlowSpec, step, Parameter
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris


class TrainFlow(FlowSpec):

    @step
    def start(self):
        iris = load_iris()
        iris_df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])

        X_train, X_test, y_train, y_test = train_test_split(iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']], iris_df['target'])

        # Create a model
        model = Ridge(alpha=0.1)

        # Train the model on the training data
        model.fit(X_train, y_train)

        # Make predictions on the testing data
        y_pred = model.predict(X_test)

        # Evaluate the model on the testing data
        accuracy = model.score(X_test, y_test)

        self.next(self.end)

    @step
    def end(self):
        print('End of flow')

if __name__ == "__main__":
    TrainFlow()

I execute this script using this command within a Databricks-Notebook cell:

%env USERNAME='xyz'
!python /dbfs/FileStore/xxx/metaflow_mlflow_workflow.py --no-pylint run

This script runs fine.

Now, I add MLflow to the script:

import mlflow
from metaflow import FlowSpec, step, Parameter
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris


class TrainFlow(FlowSpec):

    @step
    def start(self):
        iris = load_iris()
        iris_df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])

        X_train, X_test, y_train, y_test = train_test_split(iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']], iris_df['target'])

        # Create a model
        model = Ridge(alpha=0.1)

        # Train the model on the training data
        model.fit(X_train, y_train)

        # Make predictions on the testing data
        y_pred = model.predict(X_test)

        # Evaluate the model on the testing data
        accuracy = model.score(X_test, y_test)
        
        # Set the experiment name
        experiment_name = "Iris Classification"

        # Log the metrics and model using MLflow
        with mlflow.start_run(run_name = experiment_name):
        
            mlflow.log_metric("accuracy_mean", 0.1)
            mlflow.log_metric("accuracy_std", 0.2)

            # Log the model's hyperparameters
            mlflow.log_param("random_state", 0.3)
            mlflow.log_param("n_estimators", 0.4)
            mlflow.log_param("eval_metric", 0.5)
            mlflow.log_param("k_fold", 0.6)

        self.next(self.end)

    @step
    def end(self):
        print('End of flow')

if __name__ == "__main__":
    TrainFlow()

As before, I execute this script using this command within a Databricks-Notebook cell:

%env USERNAME='xyz'
!python /dbfs/FileStore/xxx/metaflow_mlflow_workflow.py --no-pylint run

Unfortunately, the script crashes and I get this error:

env: USERNAME='xyz'
Metaflow 2.8.0 executing TrainFlow for user:'xyz'
Validating your flow...
    The graph looks good!
2023-04-06 07:50:51.288 Workflow starting (run-id 1680767451283182):
2023-04-06 07:50:51.302 [1680767451283182/start/1 (pid 2012)] Task is starting.
2023-04-06 07:50:53.940 [1680767451283182/start/1 (pid 2012)] <flow TrainFlow step start> failed:
2023-04-06 07:50:53.945 [1680767451283182/start/1 (pid 2012)] Internal error
2023-04-06 07:50:53.946 [1680767451283182/start/1 (pid 2012)] Traceback (most recent call last):
2023-04-06 07:50:53.946 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/metaflow/cli.py", line 1172, in main
2023-04-06 07:50:53.946 [1680767451283182/start/1 (pid 2012)] start(auto_envvar_prefix="METAFLOW", obj=state)
2023-04-06 07:50:53.946 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/metaflow/_vendor/click/core.py", line 829, in __call__
2023-04-06 07:50:53.946 [1680767451283182/start/1 (pid 2012)] return self.main(*args, **kwargs)
2023-04-06 07:50:54.223 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/metaflow/_vendor/click/core.py", line 782, in main
2023-04-06 07:50:54.224 [1680767451283182/start/1 (pid 2012)] rv = self.invoke(ctx)
2023-04-06 07:50:54.224 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/metaflow/_vendor/click/core.py", line 1259, in invoke
2023-04-06 07:50:54.224 [1680767451283182/start/1 (pid 2012)] return _process_result(sub_ctx.command.invoke(sub_ctx))
2023-04-06 07:50:54.224 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/metaflow/_vendor/click/core.py", line 1066, in invoke
2023-04-06 07:50:54.224 [1680767451283182/start/1 (pid 2012)] return ctx.invoke(self.callback, **ctx.params)
2023-04-06 07:50:54.224 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/metaflow/_vendor/click/core.py", line 610, in invoke
2023-04-06 07:50:54.224 [1680767451283182/start/1 (pid 2012)] return callback(*args, **kwargs)
2023-04-06 07:50:54.224 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/metaflow/_vendor/click/decorators.py", line 21, in new_func
2023-04-06 07:50:54.224 [1680767451283182/start/1 (pid 2012)] return f(get_current_context(), *args, **kwargs)
2023-04-06 07:50:54.224 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/metaflow/cli.py", line 581, in step
2023-04-06 07:50:54.224 [1680767451283182/start/1 (pid 2012)] task.run_step(
2023-04-06 07:50:54.224 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/metaflow/task.py", line 586, in run_step
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] self._exec_step_function(step_func)
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/metaflow/task.py", line 60, in _exec_step_function
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] step_function()
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] File "/dbfs/FileStore/xxx/metaflow_mlflow_workflow.py", line 35, in start
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] with mlflow.start_run(run_name = experiment_name):
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/mlflow/tracking/fluent.py", line 350, in start_run
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] active_run_obj = client.create_run(
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/mlflow/tracking/client.py", line 275, in create_run
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] return self._tracking_client.create_run(experiment_id, start_time, tags, run_name)
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 131, in create_run
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] return self.store.create_run(
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 175, in create_run
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] response_proto = self._call_endpoint(CreateRun, req_body)
2023-04-06 07:50:54.225 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 56, in _call_endpoint
2023-04-06 07:50:54.226 [1680767451283182/start/1 (pid 2012)] return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
2023-04-06 07:50:54.226 [1680767451283182/start/1 (pid 2012)] File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/mlflow/utils/databricks_utils.py", line 413, in get_databricks_host_creds
2023-04-06 07:50:54.226 [1680767451283182/start/1 (pid 2012)] config = provider.get_config()
2023-04-06 07:50:54.226 [1680767451283182/start/1 (pid 2012)] File "/databricks/python/lib/python3.9/site-packages/databricks_cli/configure/provider.py", line 134, in get_config
2023-04-06 07:50:54.226 [1680767451283182/start/1 (pid 2012)] raise InvalidConfigurationError.for_profile(None)
2023-04-06 07:50:54.226 [1680767451283182/start/1 (pid 2012)] databricks_cli.utils.InvalidConfigurationError: You haven't configured the CLI yet! Please configure by entering `/dbfs/FileStore/xxx/metaflow_mlflow_workflow.py configure`
2023-04-06 07:50:54.226 [1680767451283182/start/1 (pid 2012)] 
2023-04-06 07:50:54.226 [1680767451283182/start/1 (pid 2012)] Task failed.
2023-04-06 07:50:54.227 Workflow failed.
2023-04-06 07:50:54.227 Terminating 0 active tasks...
2023-04-06 07:50:54.227 Flushing logs...
    Step failure:
    Step start (task-id 1) failed.

Apparently, I am doing something wrong. How can I combine Metaflow and MLflow so that the flow runs from a Databricks notebook cell?


Solution

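  • MLflow is imported in both versions, so the import itself is not the problem. The failure happens when the run is created: the traceback ends in get_databricks_host_creds, which means MLflow is targeting the Databricks tracking server but finds no Databricks credentials in the process that executes the flow.

    Did you configure Databricks credentials before running the second flow? The flow is launched with !python, which starts a separate process. That process most likely does not share the notebook's authentication context, so MLflow falls back to the Databricks CLI configuration (the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, or a ~/.databrickscfg profile) and raises InvalidConfigurationError when neither is present.

    If not, the Databricks documentation on configuring authentication may be helpful.

    See also the source of the MLflow function that raises the error, get_databricks_host_creds in mlflow/utils/databricks_utils.py.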
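
    One possible way to hand credentials to the child process is sketched below. It is a minimal sketch, not a confirmed fix: the helper names build_flow_env and run_flow are hypothetical, and it assumes you already have a workspace URL and a personal access token available in the notebook. DATABRICKS_HOST, DATABRICKS_TOKEN and MLFLOW_TRACKING_URI are the environment variables that the Databricks CLI configuration and MLflow read.

```python
import os
import subprocess
import sys


def build_flow_env(host, token, base_env=None):
    """Return a copy of the environment with Databricks credentials added.

    Hypothetical helper: `host` is the workspace URL and `token` a personal
    access token; both must be supplied by the caller.
    """
    if not host or not token:
        raise ValueError("Both a workspace URL and a token are required")
    # Copy so that the notebook's own environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["DATABRICKS_HOST"] = host.rstrip("/")
    env["DATABRICKS_TOKEN"] = token
    # Point MLflow at the Databricks tracking server explicitly.
    env["MLFLOW_TRACKING_URI"] = "databricks"
    return env


def run_flow(script_path, host, token):
    """Launch the Metaflow script in a child process that can see the credentials."""
    return subprocess.run(
        [sys.executable, script_path, "--no-pylint", "run"],
        env=build_flow_env(host, token),
        check=False,
    )
```

    In the notebook, this would replace the !python line with a call such as run_flow("/dbfs/FileStore/xxx/metaflow_mlflow_workflow.py", host, token), where host and token come from your own configuration (for example a secret read with dbutils.secrets.get). Variables set earlier with %env, such as USERNAME, are inherited because the helper copies os.environ. Depending on the workspace, the flow may also need to select an experiment with mlflow.set_experiment before calling mlflow.start_run, since a child process has no notebook experiment to default to.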