Search code examples
delta-lakedata-lakeapache-hudilakefsdata-lakehouse

lakeFS, Hudi, Delta Lake merge and merge conflicts


I'm reading documentation about lakeFS and right now don't clearly understand what is a merge or even merge conflict in terms of lakeFS.

Let's say I use Apache Hudi for ACID support over a single table. I'd like to introduce multi-table ACID support and for this purpose would like to use lakeFS together with Hudi.

If I understand everything correctly, lakeFS is a data agnostic solution and knows nothing about the data itself. lakeFS only establishes boundaries (version control) and moderates somehow the concurent access to the data..

So the reasonable question is - if lakeFS is data agnostic, how it supports merge operation? What merge itself means in terms of lakeFS? And is it possible to have a merge conflict there?


Solution

  • You do understand everything correctly. You could see in the branching model page that lakeFS is currently data agnostic and relies simply on the hierarchical directory structure. A conflict would occur when two branches update the same file. This behavior fits most data engineers CI/CD use cases.

    In case you are working with Delta Lake and made changes to the same table from two different branches, there will still be a conflict because the two branches changed the log file. In order to resolve the conflict you would need to forgo one of the change sets. Admittedly this is not the best user experience and it's currently being worked on. You could read more about it on the roadmap documentation.