Tags: python-3.x, machine-learning, databricks, mlflow, netflix-metaflow

Databricks MLflow and Metaflow integration


I am using Metaflow to orchestrate the training pipeline of a machine learning model, and I want to combine it with Databricks MLflow for experiment tracking. The Metaflow step is pasted below, followed by the Metaflow validation output and the full traceback. MLFLOW_TRACKING_URI is set to "databricks". The run ends with this error: RuntimeError: Failed to connect to MLflow server databricks.

What am I missing in the configuration? I am using a Databricks cluster with runtime version 16.0 ML (includes Apache Spark 3.5.0, Scala 2.12). What is the best way to integrate Databricks MLflow with Metaflow?

    @step
    def start(self):
        """Start and prepare the Training pipeline."""
        import mlflow

        self.mlflow_tracking_uri = os.getenv("MLFLOW_TRACKING_URI")

        logging.info("MLFLOW_TRACKING_URI: %s", self.mlflow_tracking_uri)
        mlflow.set_tracking_uri(self.mlflow_tracking_uri)

        self.mode = "production" if current.is_production else "development"
        logging.info("Running flow in %s mode.", self.mode)
        logging.info("The metaflow id is %s ", current.run_id)

        self.data = self.load_dataset()

        try:
            # Let's start a new MLFlow run to track everything that happens during the
            # execution of this flow. We want to set the name of the MLFlow
            # experiment to the Metaflow run identifier so we can easily
            # recognize which experiment corresponds with each run.
            run = mlflow.start_run(run_name="current.run_id")
            self.mlflow_run_id = run.info.run_id
        except Exception as e:
            message = f"Failed to connect to MLflow server {self.mlflow_tracking_uri}."
            raise RuntimeError(message) from e

        # This is the configuration we'll use to train the model. We want to set it up
        # at this point so we can reuse it later throughout the flow.
        self.training_parameters = {
            "epochs": TRAINING_EPOCHS,
            "batch_size": TRAINING_BATCH_SIZE,
        }

        # Now that everything is set up, we want to run a cross-validation process
        # to evaluate the model and train a final model on the entire dataset. Since
        # these two steps are independent, we can run them in parallel.
        self.next(self.cross_validation, self.transform)




%sh
python3 ml.school/pipelines/training.py --environment=pypi run
Metaflow 2.12.39 executing Training for user:test
Project: penguins, Branch: test
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2024-12-17 09:23:59.113 Bootstrapping virtual environment(s) ...
2024-12-17 09:23:59.691 Virtual environment(s) bootstrapped!
Including file ml.school/data/penguins.csv of size 13KB 
2024-12-17 09:24:03.265 Workflow starting (run-id 1734427440104542):
2024-12-17 09:24:13.977 [1734427440104542/start/1 (pid 10535)] Task is starting.
2024-12-17 09:24:19.944 [1734427440104542/start/1 (pid 10535)] 2024-12-17 09:24:19,944 [INFO] MLFLOW_TRACKING_URI: databricks
2024-12-17 09:24:19.944 [1734427440104542/start/1 (pid 10535)] 2024-12-17 09:24:19,944 [INFO] Running flow in development mode.
2024-12-17 09:24:20.068 [1734427440104542/start/1 (pid 10535)] 2024-12-17 09:24:19,944 [INFO] The metaflow id is 1734427440104542
2024-12-17 09:24:20.068 [1734427440104542/start/1 (pid 10535)] 2024-12-17 09:24:20,068 [INFO] Loaded dataset with 344 samples
2024-12-17 09:24:23.140 [1734427440104542/start/1 (pid 10535)] <flow Training step start> failed:
2024-12-17 09:24:29.517 [1734427440104542/start/1 (pid 10535)] Internal error
2024-12-17 09:24:29.525 [1734427440104542/start/1 (pid 10535)] Traceback (most recent call last):
2024-12-17 09:24:29.525 [1734427440104542/start/1 (pid 10535)] File "/Workspace/Users/test/ML-End-to-End/ml.school/pipelines/training.py", line 97, in start
2024-12-17 09:24:29.525 [1734427440104542/start/1 (pid 10535)] run = mlflow.start_run(run_name="current.run_id")
2024-12-17 09:24:29.525 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.525 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/tracking/fluent.py", line 418, in start_run
2024-12-17 09:24:29.525 [1734427440104542/start/1 (pid 10535)] active_run_obj = client.create_run(
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/tracking/client.py", line 393, in create_run
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] return self._tracking_client.create_run(experiment_id, start_time, tags, run_name)
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/tracking/_tracking_service/client.py", line 168, in create_run
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] return self.store.create_run(
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/store/tracking/rest_store.py", line 209, in create_run
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] response_proto = self._call_endpoint(CreateRun, req_body)
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/store/tracking/rest_store.py", line 82, in _call_endpoint
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/utils/rest_utils.py", line 370, in call_endpoint
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] response = verify_rest_response(response, endpoint)
2024-12-17 09:24:29.526 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] File "/root/micromamba/envs/metaflow/linux-64/f1e54cd83bbd25f/lib/python3.12/site-packages/mlflow/utils/rest_utils.py", line 240, in verify_rest_response
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] raise RestException(json.loads(response.text))
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] mlflow.exceptions.RestException: RESOURCE_DOES_NOT_EXIST: No experiment was found. If using the Python fluent API, you can set an active experiment under which to create runs by calling mlflow.set_experiment("experiment_name") at the start of your program.
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] 
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] The above exception was the direct cause of the following exception:
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] 
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] Traceback (most recent call last):
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/cli.py", line 554, in main
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] start(auto_envvar_prefix="METAFLOW", obj=state)
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/_vendor/click/core.py", line 829, in __call__
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] return self.main(args, kwargs)
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/_vendor/click/core.py", line 782, in main
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] rv = self.invoke(ctx)
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/cli_components/utils.py", line 69, in invoke
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.527 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/_vendor/click/core.py", line 1066, in invoke
2024-12-17 09:24:29.528 [1734427440104542/start/1 (pid 10535)] return ctx.invoke(self.callback, ctx.params)
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/_vendor/click/core.py", line 610, in invoke
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] return callback(args, kwargs)
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/_vendor/click/decorators.py", line 21, in new_func
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] return f(get_current_context(), args, kwargs)
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/cli_components/step_cmd.py", line 178, in step
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] task.run_step(
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/task.py", line 653, in run_step
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] self._exec_step_function(step_func)
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] File "/tmp/tmpked1dr6v/metaflow/task.py", line 62, in _exec_step_function
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] step_function()
2024-12-17 09:24:29.896 [1734427440104542/start/1 (pid 10535)] File "/Workspace/Users/test/ML-End-to-End/ml.school/pipelines/training.py", line 101, in start
2024-12-17 09:24:29.897 [1734427440104542/start/1 (pid 10535)] raise RuntimeError(message) from e
2024-12-17 09:24:29.897 [1734427440104542/start/1 (pid 10535)] RuntimeError: Failed to connect to MLflow server databricks.
2024-12-17 09:24:29.897 [1734427440104542/start/1 (pid 10535)] 
2024-12-17 09:24:30.134 [1734427440104542/start/1 (pid 10535)] Task failed.
2024-12-17 09:24:30.372 Workflow failed.
2024-12-17 09:24:30.372 Terminating 0 active tasks...
2024-12-17 09:24:30.372 Flushing logs...
    Step failure:
    Step start (task-id 1) failed.

Solution

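  • The traceback already shows the real cause. The underlying exception is RESOURCE_DOES_NOT_EXIST: No experiment was found, and the RuntimeError about a failed connection is only the wrapper raised by the except block in the start step. Set an active experiment before starting the run: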

    mlflow.set_experiment("experiment_name")
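    To make the ordering explicit, here is a minimal sketch of how the start step could select the experiment before creating the run. The helper name start_tracked_run and the experiment path in the usage note are assumptions for illustration, not part of the original flow. The mlflow module is passed in as an argument only so the sketch can be exercised without a tracking server. Note also that the original call passes the string literal "current.run_id" as the run name, so the value is passed here instead.

```python
def start_tracked_run(mlflow_api, experiment_name, run_name):
    """Select the experiment, start a run, and return the MLflow run id.

    `mlflow_api` is expected to be the imported `mlflow` module.
    `start_tracked_run` is a hypothetical helper, not an MLflow function.
    """
    # set_experiment has to run before start_run. It creates the experiment
    # if it does not exist yet and makes it the active one.
    mlflow_api.set_experiment(experiment_name)

    # Pass the actual identifier, not the literal string "current.run_id".
    run = mlflow_api.start_run(run_name=str(run_name))
    return run.info.run_id
```

    Inside the step this would be called as self.mlflow_run_id = start_tracked_run(mlflow, "/Users/<your-user>/penguins", current.run_id). On Databricks the experiment name is expected to be an absolute workspace path, so the path shown is a placeholder to replace with your own. It may also help to include the original exception text in the RuntimeError message, since "Failed to connect" hides causes like this one.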