I would like to register a model to MLflow on Databricks from outside of the Databricks environment.
I have seen the following URL but was looking for some more detailed steps:
First inside a terminal I have run:
pip3 install mlflow
# I would like to use sklearn, so install it
# (the PyPI package is scikit-learn; the bare "sklearn" package is a deprecated alias)
pip3 install scikit-learn
I then exported the following environment variables:
export MLFLOW_TRACKING_URI=databricks
export DATABRICKS_HOST="https://mydatabricks-host"
# this token was created in the Databricks UI
# User Settings >> Developer >> Access Token
export DATABRICKS_TOKEN="mytoken"
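Before running anything, a small helper can confirm that all three variables are actually visible to Python. This is a minimal sketch of my own; the variable names match the exports above:

```python
import os

def missing_databricks_vars(env=None):
    """Return the names of any unset variables MLflow needs to reach Databricks."""
    if env is None:
        env = os.environ
    required = ["MLFLOW_TRACKING_URI", "DATABRICKS_HOST", "DATABRICKS_TOKEN"]
    return [name for name in required if not env.get(name)]

# An empty environment is missing all three:
print(missing_databricks_vars({}))
```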
I found the sklearn example and saved it to a file myscript.py
from pprint import pprint
import numpy as np
from sklearn.linear_model import LinearRegression
import mlflow
from mlflow import MlflowClient
def fetch_logged_data(run_id):
    client = MlflowClient()
    data = client.get_run(run_id).data
    # drop MLflow's internal tags
    tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
    artifacts = [f.path for f in client.list_artifacts(run_id, "model")]
    return data.params, data.metrics, tags, artifacts
# enable autologging
mlflow.sklearn.autolog()
# prepare training data
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
# train a model
model = LinearRegression()
with mlflow.start_run() as run:
    model.fit(X, y)

# fetch logged data
params, metrics, tags, artifacts = fetch_logged_data(run.info.run_id)
pprint(params)
pprint(metrics)
pprint(tags)
pprint(artifacts)
When I run this in the terminal with:
python3 myscript.py
Note that the script needs to run in the same terminal session where you set the MLFLOW_TRACKING_URI, DATABRICKS_HOST, and DATABRICKS_TOKEN variables.
This returned an error:
mlflow.exceptions.RestException: RESOURCE_DOES_NOT_EXIST: No experiment was found. If using the Python fluent API, you can set an active experiment under which to create runs by calling mlflow.set_experiment("experiment_name") at the start of your program.
So I added some code to create the experiment:
from pprint import pprint
import numpy as np
from sklearn.linear_model import LinearRegression
import mlflow
from mlflow import MlflowClient
########################################
# code to create the experiment
########################################
experiment_name = "my_experiment"
my_username = "me@emailaddress.com"
experiment_path = f"/Users/{my_username}/{experiment_name}"
try:
    mlflow.create_experiment(experiment_path)
except mlflow.exceptions.MlflowException:
    # ignore the error if the experiment already exists
    pass
mlflow.set_experiment(experiment_path)
########################################
def fetch_logged_data(run_id):
    client = MlflowClient()
    data = client.get_run(run_id).data
    # drop MLflow's internal tags
    tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
    artifacts = [f.path for f in client.list_artifacts(run_id, "model")]
    return data.params, data.metrics, tags, artifacts
# enable autologging
mlflow.sklearn.autolog()
# prepare training data
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
# train a model
model = LinearRegression()
with mlflow.start_run() as run:
    model.fit(X, y)

# fetch logged data
params, metrics, tags, artifacts = fetch_logged_data(run.info.run_id)
pprint(params)
pprint(metrics)
pprint(tags)
pprint(artifacts)
This worked as expected:
{'copy_X': 'True',
'fit_intercept': 'True',
'n_jobs': 'None',
'positive': 'False'}
{'training_mean_absolute_error': 2.220446049250313e-16,
'training_mean_squared_error': 1.9721522630525295e-31,
'training_r2_score': 1.0,
'training_root_mean_squared_error': 4.440892098500626e-16,
'training_score': 1.0}
{'estimator_class': 'sklearn.linear_model._base.LinearRegression',
'estimator_name': 'LinearRegression'}
['model/MLmodel',
'model/conda.yaml',
'model/model.pkl',
'model/python_env.yaml',
'model/requirements.txt']
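Finally, since my original goal was to register the model (not just log it), the logged artifact can be registered from the same environment with mlflow.register_model. This is a sketch: my_registered_model is a placeholder name, and on workspaces where the registry is backed by Unity Catalog the name may instead need the catalog.schema.model form:

```python
def runs_model_uri(run_id, artifact_path="model"):
    """Build the runs:/ URI pointing at the model logged under a run."""
    return f"runs:/{run_id}/{artifact_path}"

def register_run_model(run_id, name):
    """Register the run's logged model in the model registry under `name`.

    Assumes the same MLFLOW_TRACKING_URI / DATABRICKS_* variables as above.
    """
    import mlflow  # local import keeps the URI helper usable on its own
    return mlflow.register_model(runs_model_uri(run_id), name)

# Usage, e.g. in myscript.py after the run finishes (placeholder model name):
# register_run_model(run.info.run_id, "my_registered_model")
```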