kedro

Kedro deployment to Databricks


Maybe I misunderstand the purpose of packaging, but it doesn't seem very helpful for creating an artifact for production deployment because it only packages code. It leaves out the conf, data, and other directories that make the Kedro project reproducible.

I understand that I can use the Docker or Airflow plugins for deployment, but what about deploying to Databricks? Do you have any advice here?

I was thinking about making a wheel that could be installed on the cluster, but I would need to package the conf first. Another option is to just sync a git workspace to the cluster and run Kedro via a notebook.

Any thoughts on a best practice?


Solution

  • If you are not using Docker and are deploying Kedro directly on a Databricks cluster, this is how we have been deploying Kedro to Databricks.

    1. The CI/CD pipeline builds the project with kedro package, which creates a wheel file.

    2. Upload the dist and conf directories to DBFS, or use an Azure Blob file copy if you are on Azure Databricks.

    This will upload everything to Databricks on every git push.
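    For illustration, here is a minimal sketch of such a CI step, assuming the Databricks CLI is installed and authenticated on the CI runner; the branch name and target path are placeholders that mirror the path used in the init script below:

    # illustrative CI step: build the wheel and copy dist/ and conf/ to DBFS
    import subprocess

    branch = "master"  # normally injected by the CI system

    # kedro package builds the project wheel into dist/
    subprocess.run(["kedro", "package"], check=True)

    target = f"dbfs:/project_name/build/cicd/{branch}"
    for local, remote in [("dist", f"{target}/dist"), ("conf", f"{target}/conf")]:
        subprocess.run(
            ["databricks", "fs", "cp", "--recursive", "--overwrite", local, remote],
            check=True,
        )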

    Then you can have a notebook with the following:

    1. You can have an init script in Databricks, something like:
    # project-specific helpers from the packaged project (cargoai)
    from cargoai import run
    from cargoai.pipeline import create_pipeline
    
    # the branch to run is passed in as a notebook widget (see the %run call below)
    branch = dbutils.widgets.get("branch")
    
    # point Kedro at the conf that CI/CD uploaded for this branch
    conf = run.get_config(
        project_path=f"/dbfs/project_name/build/cicd/{branch}"
    )
    catalog = run.create_catalog(config=conf)
    pipeline = create_pipeline()
    
    

    Here conf, catalog, and pipeline will be available.

    2. Call this init script when you want to run a branch, or the master branch in production, like:
      %run "/Projects/InitialSetup/load_pipeline" $branch="master"

    3. For development and testing, you can restrict the run to specific nodes, for example by tag:
      pipeline = pipeline.only_nodes_with_tags(*tags)

    4. Then run a full or a partial pipeline with just SequentialRunner().run(pipeline, catalog), as sketched below.
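    Putting the last two steps together, a minimal sketch of a run cell, assuming the init notebook above has already made conf, catalog, and pipeline available; the tag name is illustrative:

    # pipeline and catalog come from the %run of the init notebook above
    from kedro.runner import SequentialRunner

    # optionally narrow the run to tagged nodes during development/testing
    partial_pipeline = pipeline.only_nodes_with_tags("preprocessing")

    # execute the partial (or full) pipeline against the catalog
    SequentialRunner().run(partial_pipeline, catalog)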

    In production, this notebook can be scheduled as a Databricks job. If you are on Azure Databricks, you can use Azure Data Factory to schedule and run it.
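    If you are not using Data Factory, one option is to create the schedule through the Databricks Jobs REST API. A minimal sketch follows, where the workspace URL, token, cluster id, notebook path, and cron expression are all placeholders:

    # illustrative only: create a scheduled job that runs the production notebook
    # via the Jobs REST API 2.0; every identifier below is a placeholder
    import requests

    host = "https://<workspace-url>"
    token = "<personal-access-token>"

    job_spec = {
        "name": "kedro-production-run",
        "existing_cluster_id": "<cluster-id>",
        "notebook_task": {
            "notebook_path": "/Projects/Production/run_pipeline",
            "base_parameters": {"branch": "master"},
        },
        # run every night at 02:00 UTC
        "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
    }

    response = requests.post(
        f"{host}/api/2.0/jobs/create",
        headers={"Authorization": f"Bearer {token}"},
        json=job_spec,
    )
    response.raise_for_status()
    print("Created job:", response.json()["job_id"])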