pyspark, schema, azure-databricks, delta-lake

Schema Evolution with overwrite in Delta tables


I am trying to handle schema changes (a new column type and a dropped column) in Delta tables. As per the documentation:


(df3.write
  .format("delta")
  .mode("overwrite")
  .option("mergeSchema", "true")
  .save(deltapath))

This way I lose the old data. Is there any way to keep the old data before overwriting with the new schema, apart from taking a backup? I have posted the same question in the Databricks forum; link for your reference: https://community.databricks.com/s/question/0D58Y00009WjZllSAF/what-is-the-best-way-to-take-care-of-drop-and-rename-a-column-in-schema-evaluation

One way is to create a backup of the table before overwriting it, but I want to know what the best way is.
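
For reference, the backup I have in mind is just a snapshot copy along these lines (a rough sketch; the `_backup` suffix is a name I made up, and a plain re-save keeps only the current data, not the table's transaction history):

  backup_path = deltapath + "_backup"  # hypothetical backup location

  (spark.read.format("delta").load(deltapath)
    .write.format("delta")
    .mode("errorifexists")  # fail rather than clobber an existing backup
    .save(backup_path))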


Solution

  • The command you shared won't delete the existing data; it remains queryable via Delta time travel, as the steps below show (a PySpark equivalent is sketched after them).

    For example:

    1. Create a table with 5 rows

      spark.range(5).write.format("delta").save("/tmp/table")

    2. Overwrite with a new column & mergeSchema option

      from pyspark.sql.functions import lit

      (spark.range(5)
        .withColumn("data", lit("data"))
        .write.format("delta")
        .mode("overwrite")
        .option("mergeSchema", "true")
        .save("/tmp/table"))
      
    3. Query original data

      SELECT * FROM delta.`/tmp/table` VERSION AS OF 0

      +---+
      | id|
      +---+
      |  0|
      |  1|
      |  2|
      |  3|
      |  4|
      +---+
      
    4. Query new/current state

      SELECT * FROM delta.`/tmp/table` VERSION AS OF 1

      +---+----+
      | id|data|
      +---+----+
      |  0|data|
      |  1|data|
      |  2|data|
      |  3|data|
      |  4|data|
      +---+----+
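
    Since the question is tagged pyspark, the same time travel is available from Python via the versionAsOf reader option, and you can list a table's versions through its history. A minimal sketch, assuming the delta-spark package (or a Databricks runtime) so that DeltaTable is importable:

      from delta.tables import DeltaTable

      # Read the pre-overwrite version of the table
      df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/table")
      df_v0.show()

      # List the versions recorded in the transaction log
      DeltaTable.forPath(spark, "/tmp/table").history().select("version", "operation").show()

    One caveat: old versions only remain queryable until VACUUM removes their files (the default retention threshold is 7 days), so time travel is not a substitute for a long-term backup.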