How to register a model in the Databricks MLflow registry from outside the Databricks environment?


I would like to register a model with MLflow on Databricks from outside the Databricks environment.

I have seen the following URL but was looking for some more detailed steps:

First, in a terminal, I ran:

pip3 install mlflow

# I would like to use sklearn, so install it
# (the PyPI package is named scikit-learn; "pip install sklearn" is deprecated)
pip3 install scikit-learn

I then set the following environment variables:

export MLFLOW_TRACKING_URI=databricks
export DATABRICKS_HOST="https://mydatabricks-host"

# this token was created in the Databricks UI
# User Settings >> Developer >> Access Token
export DATABRICKS_TOKEN="mytoken" 

Solution

  • I found the sklearn example and saved it to a file myscript.py

    from pprint import pprint
    import numpy as np
    from sklearn.linear_model import LinearRegression
    import mlflow
    from mlflow import MlflowClient
    
    def fetch_logged_data(run_id):
        client = MlflowClient()
        data = client.get_run(run_id).data
        tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
        artifacts = [f.path for f in client.list_artifacts(run_id, "model")]
        return data.params, data.metrics, tags, artifacts
    
    # enable autologging
    mlflow.sklearn.autolog()
    
    # prepare training data
    X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    y = np.dot(X, np.array([1, 2])) + 3
    
    # train a model
    model = LinearRegression()
    with mlflow.start_run() as run:
        model.fit(X, y)
    
    # fetch logged data
    params, metrics, tags, artifacts = fetch_logged_data(run.info.run_id)
    
    pprint(params)
    pprint(metrics)
    pprint(tags)
    pprint(artifacts)
    

    I ran this in the terminal with:

    python3 myscript.py

    Note that the script needs to run in the same terminal session where you set the variables for MLFLOW_TRACKING_URI, DATABRICKS_HOST and DATABRICKS_TOKEN.
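
A quick way to catch a missing variable before MLflow tries to connect is to check the environment at the top of the script. This is only a minimal sketch: the helper name `missing_databricks_vars` is made up for illustration, and the variable list simply mirrors the three exports above.

```python
import os

# The three variables exported in the terminal session above.
REQUIRED_VARS = ("MLFLOW_TRACKING_URI", "DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_databricks_vars(environ=None):
    """Return the names of required variables that are unset or empty.

    `environ` defaults to os.environ; passing a plain dict makes the
    helper easy to try out without touching the real environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_databricks_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All Databricks/MLflow variables are set.")
```

If the check fails, the variables were probably exported in a different terminal session from the one running the script.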

    This returned an error:

    mlflow.exceptions.RestException: RESOURCE_DOES_NOT_EXIST: No experiment was found. If using the Python fluent API, you can set an active experiment under which to create runs by calling mlflow.set_experiment("experiment_name") at the start of your program.
    

    So I added some code to create the experiment:

    from pprint import pprint
    import numpy as np
    from sklearn.linear_model import LinearRegression
    import mlflow
    from mlflow import MlflowClient
    
    ########################################
    # code to create the experiment
    ########################################
    
    experiment_name = "my_experiment"
    my_username = "me@emailaddress.com"
    
    experiment_path = f"/Users/{my_username}/{experiment_name}"
    
    try:
        mlflow.create_experiment(experiment_path)
    except mlflow.exceptions.MlflowException:
        # most likely the experiment already exists;
        # set_experiment below selects it either way
        pass
    
    mlflow.set_experiment(experiment_path)
    ########################################
    
    def fetch_logged_data(run_id):
        client = MlflowClient()
        data = client.get_run(run_id).data
        tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
        artifacts = [f.path for f in client.list_artifacts(run_id, "model")]
        return data.params, data.metrics, tags, artifacts
    
    
    # enable autologging
    mlflow.sklearn.autolog()
    
    # prepare training data
    X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
    y = np.dot(X, np.array([1, 2])) + 3
    
    # train a model
    model = LinearRegression()
    with mlflow.start_run() as run:
        model.fit(X, y)
    
    # fetch logged data
    params, metrics, tags, artifacts = fetch_logged_data(run.info.run_id)
    
    pprint(params)
    pprint(metrics)
    pprint(tags)
    pprint(artifacts)
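
As far as I can tell, the experiment block can be shortened, because `mlflow.set_experiment` creates the experiment when no experiment with that name exists. A minimal sketch of that variant follows. The helper names `workspace_experiment_path` and `activate_experiment` are hypothetical, and the `/Users/<username>/<name>` layout is the same assumption the script above makes.

```python
def workspace_experiment_path(username, experiment_name):
    """Build the workspace path used as the experiment name on Databricks."""
    return f"/Users/{username}/{experiment_name}"


def activate_experiment(username, experiment_name):
    """Select the experiment, letting MLflow create it if it is missing."""
    # Imported lazily so the path helper can be used without mlflow installed.
    import mlflow

    path = workspace_experiment_path(username, experiment_name)
    # set_experiment creates the experiment when the name does not exist yet.
    return mlflow.set_experiment(path)
```

Calling `activate_experiment("me@emailaddress.com", "my_experiment")` before `mlflow.start_run()` would then replace the try/except block.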
    

    This worked as expected:

    {'copy_X': 'True',
     'fit_intercept': 'True',
     'n_jobs': 'None',
     'positive': 'False'}
    {'training_mean_absolute_error': 2.220446049250313e-16,
     'training_mean_squared_error': 1.9721522630525295e-31,
     'training_r2_score': 1.0,
     'training_root_mean_squared_error': 4.440892098500626e-16,
     'training_score': 1.0}
    {'estimator_class': 'sklearn.linear_model._base.LinearRegression',
     'estimator_name': 'LinearRegression'}
    ['model/MLmodel',
     'model/conda.yaml',
     'model/model.pkl',
     'model/python_env.yaml',
     'model/requirements.txt']
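
The script so far only logs the model to a run; it does not yet register it. A minimal sketch of the registration step is below, assuming the model sits under the `model` artifact path (as the output above shows) and that the workspace model registry is the target. The helper names and the registered model name `my_model` are made up for illustration. For a Unity Catalog registry, the registry URI would instead be `databricks-uc`, with a three-level `catalog.schema.model` name.

```python
def model_uri_for_run(run_id, artifact_path="model"):
    """Build the runs:/ URI that points at a model logged in a run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_logged_model(run_id, registered_name, registry_uri="databricks"):
    """Register the model logged in `run_id` under `registered_name`.

    Assumes MLFLOW_TRACKING_URI, DATABRICKS_HOST and DATABRICKS_TOKEN are
    set as described above.
    """
    # Imported lazily so the URI helper can be used without mlflow installed.
    import mlflow

    mlflow.set_registry_uri(registry_uri)
    # register_model creates the registered model if needed and adds a new version.
    return mlflow.register_model(model_uri_for_run(run_id), registered_name)
```

In the script above this could be called after training as `register_logged_model(run.info.run_id, "my_model")`. I have not verified how this behaves on every MLflow version or workspace configuration.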