pyspark · databricks · kedro

How would one use the Databricks Delta Lake format with Kedro?


We are using Kedro in our project. Normally, one can define datasets like this:

client_table:
  type: spark.SparkDataSet
  filepath: ${base_path_spark}/${env}/client_table
  file_format: parquet
  save_args:
    mode: overwrite

Now we're running on Databricks, which offers many optimisations such as autoOptimizeShuffle. We are considering making use of these to handle our 15 TB+ datasets.
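For illustration, one way these Databricks settings could be switched on from a Kedro project is a hook that tweaks the active SparkSession. This is only a sketch: the spark.databricks.* keys come from the Databricks documentation and should be checked against your runtime version, and the after_context_created hook assumes a reasonably recent Kedro release.

from kedro.framework.hooks import hook_impl
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # On Databricks a SparkSession already exists; getOrCreate() picks it up.
        spark = SparkSession.builder.getOrCreate()

        # Auto-optimised shuffle (the "autoOptimizeShuffle" mentioned above).
        spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")

        # Delta optimised writes and auto compaction, aimed at very large tables.
        spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
        spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

Such a hook class would then be registered via the HOOKS tuple in settings.py.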

However, it's not clear to me how to use Kedro with the Databricks Delta Lake solution.


Solution

  • Kedro now has a native Delta dataset (spark.DeltaTableDataSet); see the docs here: https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#spark-and-delta-lake-interaction

    The example below pairs it with a spark.SparkDataSet via transcoding (the @spark / @delta suffixes), so the same files can be saved as a Spark DataFrame in delta format and loaded back as a DeltaTable:

    temperature:
      type: spark.SparkDataSet
      filepath: data/01_raw/data.csv
      file_format: "csv"
      load_args:
        header: True
        inferSchema: True
      save_args:
        sep: '|'
        header: True
    
    weather@spark:
      type: spark.SparkDataSet
      filepath: s3a://my_bucket/03_primary/weather
      file_format: "delta"
      load_args:
        # versionAsOf is a Delta time-travel option, so it is a read (load) argument
        versionAsOf: 0
      save_args:
        mode: "overwrite"
    
    weather@delta:
      type: spark.DeltaTableDataSet
      filepath: s3a://my_bucket/03_primary/weather
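
With this catalog, a node writes its Spark DataFrame through weather@spark (saved in delta format), while weather@delta loads the same path back as a DeltaTable so another node can run in-place operations such as an upsert. A minimal sketch of such a node follows; the function, dataset and column names are made up for illustration:

    # "new_weather" stands for any Spark DataFrame dataset; "weather@delta" is the
    # DeltaTableDataSet defined above. Column names are illustrative only.
    from delta.tables import DeltaTable
    from pyspark.sql import DataFrame


    def upsert_weather(new_weather: DataFrame, weather_table: DeltaTable) -> None:
        (
            weather_table.alias("existing")
            .merge(
                new_weather.alias("incoming"),
                "existing.station_id = incoming.station_id",
            )
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

The node returns nothing because the merge writes straight to the underlying Delta files; in the pipeline it would be wired up with inputs=["new_weather", "weather@delta"] and outputs=None.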