Maybe I misunderstand the purpose of packaging, but it doesn't seem too helpful for creating a production deployment artifact because it only packages code. It leaves out the `conf`, `data`, and other directories that make the Kedro project reproducible.
I understand that I can use the Docker or Airflow plugins for deployment, but what about deploying to Databricks? Do you have any advice here?
I was thinking about building a wheel that could be installed on the cluster, but I would need to package `conf` first. Another option is to sync a Git workspace to the cluster and run Kedro via a notebook.
Any thoughts on a best practice?
If you are not using Docker and are deploying Kedro directly on a Databricks cluster, this is how we have been deploying Kedro to Databricks:
1. The CI/CD pipeline builds the project with `kedro package`, which creates a wheel file.
2. Upload `dist/` and `conf/` to DBFS or via Azure Blob file copy (if using Azure Databricks).
3. This uploads everything to Databricks on every git push.
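The upload step above can be sketched as follows. This is a hedged sketch only: the project name, DBFS target path, and the `databricks fs cp --recursive` invocation of the Databricks CLI are assumptions about a typical setup, not the exact CI configuration described here.

```python
# Sketch of the commands a CI job would run after a git push.
# Assumptions: the Databricks CLI is installed and configured, and
# artifacts live under dbfs:/<project>/build/cicd/<branch>.

def upload_commands(branch: str, project: str = "project_name") -> list[list[str]]:
    """Return the shell commands to build the wheel and copy dist/ and conf/ to DBFS."""
    target = f"dbfs:/{project}/build/cicd/{branch}"
    return [
        ["kedro", "package"],  # builds the wheel into dist/
        ["databricks", "fs", "cp", "--recursive", "dist/", f"{target}/dist/"],
        ["databricks", "fs", "cp", "--recursive", "conf/", f"{target}/conf/"],
    ]

for cmd in upload_commands("master"):
    print(" ".join(cmd))
```

In a real pipeline each command would be executed (e.g. via `subprocess.run(cmd, check=True)`), with the branch name taken from the CI environment.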
Then you can have a notebook with the following:
```python
from cargoai import run
from cargoai.pipeline import create_pipeline

branch = dbutils.widgets.get("branch")
conf = run.get_config(project_path=f"/dbfs/project_name/build/cicd/{branch}")
catalog = run.create_catalog(config=conf)
pipeline = create_pipeline()
```
Here `conf`, `catalog`, and `pipeline` will be available.
Call this init script when you want to run a branch or the master branch in production, like: `%run "/Projects/InitialSetup/load_pipeline" $branch="master"`
For development and testing, you can run specific nodes with `pipeline = pipeline.only_nodes_with_tags(*tags)`.
Then run a full or partial pipeline with just `SequentialRunner().run(pipeline, catalog)`.
In production, this notebook can be scheduled by Databricks. If you are on Azure Databricks, you can use Azure Data Factory to schedule and run it.
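For scheduling directly in Databricks, a job definition along these lines could work. This is a hedged sketch in the Databricks Jobs API 2.0 format; the cluster spec, notebook path, base parameter, and cron expression are all placeholders, not values from the setup above:

```json
{
  "name": "kedro-production-run",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2
  },
  "notebook_task": {
    "notebook_path": "/Projects/Production/run_pipeline",
    "base_parameters": { "branch": "master" }
  },
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  }
}
```

The `base_parameters` map is what populates the `branch` widget read by `dbutils.widgets.get("branch")` in the notebook.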