Tags: databricks, delta-lake, transaction-log

Understanding delta lake reconciliation after concurrent writes


This doc explains atomicity in Delta Lake using transaction logs. I am curious about the section in the image below.


Specifically, point number 4:

It checks to see whether any new commits have been made to the table, and updates the table silently to reflect those changes, then simply retries User 2’s commit on the newly updated table (without any data processing), successfully committing 000002.json.
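The retry described in point 4 can be sketched as an optimistic-commit loop. This is a toy simulation, not the real Delta Lake implementation; all names here (`TransactionLog`, `try_commit`, `commit_append`) are made up for illustration:

```python
# Simulation of Delta Lake's optimistic commit retry for appends.
# All names are illustrative, not the actual Delta Lake API.

class TransactionLog:
    def __init__(self):
        self.commits = []  # commits[n] plays the role of 00000n.json

    def try_commit(self, expected_version, actions):
        # Atomic "put-if-absent": succeeds only if nobody else has
        # already written a commit with this version number.
        if len(self.commits) != expected_version:
            return False
        self.commits.append(actions)
        return True

def commit_append(log, actions):
    # Optimistic loop: on a version clash, re-read the log and retry
    # at the new version. Appends need no data reprocessing, so the
    # retry is just writing the next numbered log entry.
    while True:
        version = len(log.commits)
        if log.try_commit(version, actions):
            return version

log = TransactionLog()
print(commit_append(log, {"add": "user1-file"}))  # 0 -> 000000.json
print(commit_append(log, {"add": "user2-file"}))  # 1 -> 000001.json
```

If both users raced to write version 1, one `try_commit` would fail, and that writer would silently retry at version 2, which is exactly the behavior the quoted paragraph describes for appends.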

As an example

  • User 1 increments the count column of the row with id 1 by 100 (initial value 0).
  • User 2 increments the count column of the row with id 1 by 2.
  • The two updates happen concurrently.

What would be the final value of the count column for the row with id 1, and how?


Solution

  • The provided documentation holds only when you append to the table: in that case the writer can simply retry writing the transaction log entry. In your case, however, you have concurrent updates of existing data, which are handled differently: the two updates will conflict, one of them will throw an exception and be rejected, and which one wins isn't deterministic - it depends on how fast each client attempts to commit its version of the data. This case is described in detail in the concurrency-control documentation.

    Please note that a conflict may happen not only when you update the same row, but even when you update different rows. You may avoid some conflicts by partitioning the table, but it still depends on the update logic, etc.
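A sketch of why concurrent updates conflict while appends do not: an update rewrites existing data files, so before committing it must check whether any commit that landed after its read touched the same files. This is again an illustrative simulation (in real Delta Lake the check raises a `ConcurrentModificationException` subclass), with made-up names:

```python
# Illustrative conflict check: a commit conflicts with any commit that
# landed after the version it read from and touched the same files.

def has_conflict(log_commits, read_version, touched_files):
    # Scan only the commits made after the version we read at.
    for commit in log_commits[read_version:]:
        if commit["touched"] & touched_files:
            return True
    return False

commits = []

# User 1: reads at version 0, rewrites file-A (count += 100), commits.
commits.append({"touched": {"file-A"}})

# User 2: also read at version 0 and rewrote file-A (count += 2).
# Before committing, it sees User 1's commit touching the same file:
print(has_conflict(commits, 0, {"file-A"}))  # True -> exception, rejected

# An append touches no existing files, so the retry always succeeds:
print(has_conflict(commits, 0, set()))  # False -> retry commits cleanly
```

Under this model the final count in the example would be 100 or 2, not 102: the losing client gets an exception and must re-read the table and reapply its increment itself if it wants both updates reflected.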