Tags: machine-learning, databricks, workflow, mlflow

How can I start using MLFlow on Databricks with an existing trained model?


I have an existing model that was trained on Azure. I want to fully integrate it and start using it on Databricks. What's the best way to do this? How can I load the model into the Databricks model workflow? I have the model in a pickle file.

I have read almost all of the Databricks documentation, but 99% of it covers new models trained on Databricks and almost never importing existing models.


Solution

  • Since MLFlow has a standardized model storage format, you just need to bring the model files over and start using them with the MLFlow package. In addition, you can register the model in the workspace's model registry using mlflow.register_model() and then use it from there. These are the steps:

    1. On the AzureML side, I assume that you have an MLFlow model saved to disk (using mlflow.sklearn.save_model(), mlflow.sklearn.autolog(), or some other mlflow.<flavor>). That should give you a folder that contains an MLmodel file and, depending on the flavor of the model, a few more files -- like the below:
    mlflow-model
    ├── MLmodel
    ├── conda.yaml
    ├── model.pkl
    └── requirements.txt
    

    Note: You can download the model from the AzureML Workspace using the v2 CLI like so: az ml model download --name <model_name> --version <model_version>
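    If you only have a bare pickle file (as in the question) rather than an MLFlow model folder, you can produce that folder yourself. A minimal sketch, assuming the pickle holds a scikit-learn estimator and lives at a hypothetical model.pkl:

    import pickle
    import mlflow.sklearn
    
    # Load the plain pickle file (the path is a placeholder -- adjust to yours).
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)
    
    # Writes the folder layout shown above (MLmodel, conda.yaml,
    # model.pkl, requirements.txt) to ./mlflow-model.
    mlflow.sklearn.save_model(model, "mlflow-model")
    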

    2. Open a Databricks Notebook and make sure it has mlflow installed:
    %pip install mlflow
    
    3. Upload the MLFlow model folder to the DBFS storage attached to the cluster (see the note below for a CLI option).
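    Note: as a sketch (the DBFS path is an assumption matching the next step), you can copy the folder from your local machine with the Databricks CLI like so: databricks fs cp -r ./mlflow-model dbfs:/FileStore/shared_uploads/mlflow-model/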

    4. In the Notebook, register the model using MLFlow (adjust the dbfs: path to the location the model was uploaded to).

    import mlflow
    
    # register_model returns a ModelVersion object; its .version is used below.
    model_version = mlflow.register_model("dbfs:/FileStore/shared_uploads/mlflow-model/", "AzureMLModel")
    
    

    Now your model is registered in the workspace's model registry just like any model created from a Databricks session, and you can load it from the registry like so:

    model = mlflow.pyfunc.load_model(f"models:/AzureMLModel/{model_version.version}")
    
    input_example = {
        "sepal_length": [5.1, 4.8],
        "sepal_width": [3.5, 4.4],
        "petal_length": [1.4, 2.0],
        "petal_width": [0.2, 0.1]
    }
    model.predict(input_example)
    

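    Depending on your MLflow version, pyfunc's predict may expect a pandas DataFrame rather than a plain dict; wrapping the dict is a safe sketch:

    import pandas as pd
    
    # Older MLflow versions only accept a pandas DataFrame here.
    model.predict(pd.DataFrame(input_example))
    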

    Or use the model as a spark_udf:

    import pandas as pd
    
    # Wrap the registered model as a Spark UDF; predictions come back as strings.
    model_udf = mlflow.pyfunc.spark_udf(spark=spark, model_uri=f"models:/AzureMLModel/{model_version.version}", result_type='string')
    spark_df = spark.createDataFrame(pd.DataFrame(input_example))
    # With a saved model signature, the UDF picks up the input columns automatically.
    spark_df = spark_df.withColumn('foo', model_udf())
    display(spark_df)
    

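    Note that the no-argument call model_udf() relies on the model having a saved signature to resolve the input columns. If yours was saved without one, pass the feature columns explicitly -- a sketch:

    from pyspark.sql.functions import struct
    
    # Alternative to the no-argument call: name the input columns yourself.
    spark_df = spark_df.withColumn('foo', model_udf(struct(*spark_df.columns)))
    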

    Note that I am using mlflow.pyfunc to load the model since every MLFlow model needs to support the pyfunc flavor. That way, you don't need to worry about the native flavor of the model.
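    If you do need the native object (for example, to inspect scikit-learn attributes), you can load with the native flavor instead -- a sketch assuming the model was saved with the sklearn flavor:

    import mlflow.sklearn
    
    # Returns the underlying scikit-learn estimator rather than the
    # generic pyfunc wrapper (assumes the sklearn flavor).
    sk_model = mlflow.sklearn.load_model(f"models:/AzureMLModel/{model_version.version}")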