I would like to register a model to MLflow on Databricks from outside of the Databricks environment.
I have seen the following URL but was looking for some more detailed steps:
First inside a terminal I have run:
pip3 install mlflow
# I would like to use sklearn, so install it
# (the PyPI package is scikit-learn; the bare "sklearn" package is a deprecated alias)
pip3 install scikit-learn
I then exported the following environment variables:
export MLFLOW_TRACKING_URI=databricks
export DATABRICKS_HOST="https://mydatabricks-host"
# this token was created in the Databricks UI
# User Settings >> Developer >> Access Token
export DATABRICKS_TOKEN="mytoken"
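Before running anything, a small helper can confirm that all three variables are actually visible to Python. This is a minimal sketch of my own; the variable names match the exports above:

```python
import os

def missing_databricks_vars(env=None):
    """Return the names of any unset variables MLflow needs to reach Databricks."""
    if env is None:
        env = os.environ
    required = ["MLFLOW_TRACKING_URI", "DATABRICKS_HOST", "DATABRICKS_TOKEN"]
    return [name for name in required if not env.get(name)]

# An empty environment is missing all three:
print(missing_databricks_vars({}))
```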
I found the sklearn example and saved it to a file myscript.py
from pprint import pprint
import numpy as np
from sklearn.linear_model import LinearRegression
import mlflow
from mlflow import MlflowClient
def fetch_logged_data(run_id):
    client = MlflowClient()
    data = client.get_run(run_id).data
    # drop MLflow's internal tags
    tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
    artifacts = [f.path for f in client.list_artifacts(run_id, "model")]
    return data.params, data.metrics, tags, artifacts
# enable autologging
mlflow.sklearn.autolog()
# prepare training data
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
# train a model
model = LinearRegression()
with mlflow.start_run() as run:
    model.fit(X, y)

# fetch logged data
params, metrics, tags, artifacts = fetch_logged_data(run.info.run_id)
pprint(params)
pprint(metrics)
pprint(tags)
pprint(artifacts)
When I run this in the terminal with:
python3 myscript.py
Note that the script needs to run in the same terminal session where you set the MLFLOW_TRACKING_URI, DATABRICKS_HOST, and DATABRICKS_TOKEN variables.
This returned an error:
mlflow.exceptions.RestException: RESOURCE_DOES_NOT_EXIST: No experiment was found. If using the Python fluent API, you can set an active experiment under which to create runs by calling mlflow.set_experiment("experiment_name") at the start of your program.
So I added some code to create the experiment:
from pprint import pprint
import numpy as np
from sklearn.linear_model import LinearRegression
import mlflow
from mlflow import MlflowClient
########################################
# code to create the experiment
########################################
experiment_name = "my_experiment"
my_username = "me@emailaddress.com"
experiment_path = f"/Users/{my_username}/{experiment_name}"
try:
    mlflow.create_experiment(experiment_path)
except mlflow.exceptions.MlflowException:
    # ignore the error if the experiment already exists
    pass
mlflow.set_experiment(experiment_path)
########################################
def fetch_logged_data(run_id):
    client = MlflowClient()
    data = client.get_run(run_id).data
    # drop MLflow's internal tags
    tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
    artifacts = [f.path for f in client.list_artifacts(run_id, "model")]
    return data.params, data.metrics, tags, artifacts
# enable autologging
mlflow.sklearn.autolog()
# prepare training data
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
# train a model
model = LinearRegression()
with mlflow.start_run() as run:
    model.fit(X, y)

# fetch logged data
params, metrics, tags, artifacts = fetch_logged_data(run.info.run_id)
pprint(params)
pprint(metrics)
pprint(tags)
pprint(artifacts)
This worked as expected:
{'copy_X': 'True',
'fit_intercept': 'True',
'n_jobs': 'None',
'positive': 'False'}
{'training_mean_absolute_error': 2.220446049250313e-16,
'training_mean_squared_error': 1.9721522630525295e-31,
'training_r2_score': 1.0,
'training_root_mean_squared_error': 4.440892098500626e-16,
'training_score': 1.0}
{'estimator_class': 'sklearn.linear_model._base.LinearRegression',
'estimator_name': 'LinearRegression'}
['model/MLmodel',
'model/conda.yaml',
'model/model.pkl',
'model/python_env.yaml',
'model/requirements.txt']
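Finally, since my original goal was to register the model (not just log it), the logged artifact can be registered from the same environment with mlflow.register_model. This is a sketch: my_registered_model is a placeholder name, and on workspaces where the registry is backed by Unity Catalog the name may instead need the catalog.schema.model form:

```python
def runs_model_uri(run_id, artifact_path="model"):
    """Build the runs:/ URI pointing at the model logged under a run."""
    return f"runs:/{run_id}/{artifact_path}"

def register_run_model(run_id, name):
    """Register the run's logged model in the model registry under `name`.

    Assumes the same MLFLOW_TRACKING_URI / DATABRICKS_* variables as above.
    """
    import mlflow  # local import keeps the URI helper usable on its own
    return mlflow.register_model(runs_model_uri(run_id), name)

# Usage, e.g. in myscript.py after the run finishes (placeholder model name):
# register_run_model(run.info.run_id, "my_registered_model")
```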