Tags: pyspark, databricks, delta-lake, spark-notebook

Databricks Delta files: adding a new partition causes old ones to become unreadable


I have a notebook that I am using to do a history load, loading six months of data at a time, starting with 2018-10-01. My Delta table is partitioned by calendar_date.

After the initial load, I am able to read the Delta table and look at the data just fine.

But after the second load, for dates 2019-01-01 to 2019-06-30, the previous partitions no longer load normally using the delta format.

Reading my source Delta table like this throws an error saying

file doesn't exist

game_refined_start = (
    spark.read.format("delta").load("s3://game_events/refined/game_session_start/calendar_date=2018-10-04/")
)

However, reading it like below works just fine. Any idea what could be wrong?

# disable the check that prevents reading a Delta table as plain parquet
spark.conf.set("spark.databricks.delta.formatCheck.enabled", "false")
game_refined_start = (
    spark.read.format("parquet").load("s3://game_events/refined/game_session_start/calendar_date=2018-10-04/")
)

Solution

  • If overwrite mode is used, it completely replaces the previous data. You can still see the old data via parquet because Delta doesn't remove old versions immediately (whereas if you overwrite with plain parquet, the data is removed immediately). The two write modes are sketched after the recovery snippet below.

    To fix the problem, you need to use append mode. If you need to get the previous data back, you can read the specific version from the table and append it. Something like this:

    path = "s3://game_events/refined/game_session_start/"
    # DESCRIBE HISTORY lists versions newest-first, so the second row is the
    # version just before the overwrite
    v = spark.sql(f"DESCRIBE HISTORY delta.`{path}` LIMIT 2")
    version = v.take(2)[1][0]
    # read the pre-overwrite snapshot and append it back to the table
    df = spark.read.format("delta").option("versionAsOf", version).load(path)
    df.write.format("delta").mode("append").save(path)
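
    For the write side, here is a minimal sketch contrasting the two modes; df_new is a hypothetical dataframe holding the next six-month slice of events (only the path is taken from the question):

    # df_new is an assumed dataframe with the next six months of data
    path = "s3://game_events/refined/game_session_start/"

    # The failure mode: overwrite replaces ALL existing partitions, so the
    # 2018 data vanishes from the current table version.
    # df_new.write.format("delta").mode("overwrite").partitionBy("calendar_date").save(path)

    # The fix: append keeps the existing partitions and adds the new ones.
    df_new.write.format("delta").mode("append").partitionBy("calendar_date").save(path)

    Note that recovering via versionAsOf (and the parquet workaround above) only works while the old data files still exist; once VACUUM runs past its retention period, those previous versions are gone for good.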