
Do upserts on Delta simply duplicate data?


I'm fairly new to Delta and the lakehouse architecture on Databricks. I have some questions based on the following actions:

  • I import some parquet files
  • Convert them to Delta (creating 1 snappy.parquet file)
  • Delete one random row (creating 1 new snappy.parquet file)
  • I check the contents of both snappy files (version 0 of the Delta table, and version 1), and they both contain all of the data, each with its specific differences.

Does this mean Delta simply duplicates data for every new version?

How is this scalable? Or am I missing something?
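
To make the scenario concrete, here is a rough PySpark sketch of the steps described above. The table path (/tmp/demo/...), the column name id, and the deleted value are placeholders, not from the original post; the delta-spark Python package is assumed to be available, as it is on Databricks.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
path = "/tmp/demo/delta_table"  # placeholder path, not from the original post

# 1-2. Read the imported parquet files and write them out as a Delta table (version 0)
df = spark.read.parquet("/tmp/demo/source_parquet")
df.write.format("delta").save(path)

# 3. Delete one row (this creates version 1 and a new snappy.parquet file)
DeltaTable.forPath(spark, path).delete("id = 42")  # "id" and 42 are placeholders

# 4. Both versions remain readable; each is a complete view of the table at that point
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v1 = spark.read.format("delta").option("versionAsOf", 1).load(path)
print(v0.count(), v1.count())
```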


Solution

  • Yes, that's how Delta Lake works - when you modify data, it doesn't write only the delta. It takes the original file affected by the change, applies the changes, and writes the result back as a new file. But take into account that not all data is duplicated - only the data in the files that contain the affected rows. For example, suppose you have 3 data files and you change some rows that sit in the 2nd file. Delta will create a new file number 4 that contains the changed rows plus the rest of the data from file 2, so you will have the following versions (a sketch for verifying this is shown after the list):

    • Version 0: files 1, 2 & 3
    • Version 1: files 1, 3 & 4
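
One way to see this behaviour for yourself is to compare the data files each table version actually reads. A minimal sketch, assuming the same placeholder table path as above; DESCRIBE HISTORY is a standard Delta Lake command and inputFiles() is a standard Spark DataFrame method.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/tmp/demo/delta_table"  # placeholder path, not from the original post

# One history entry per table version (WRITE, DELETE, ...)
spark.sql(f"DESCRIBE HISTORY delta.`{path}`").select("version", "operation").show()

# List the parquet files backing each version: only files containing the
# affected rows are replaced; untouched files are shared between versions.
for v in (0, 1):
    files = spark.read.format("delta").option("versionAsOf", v).load(path).inputFiles()
    print(f"version {v}:")
    for f in files:
        print(" ", f)
```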