Tags: architecture, bigdata, etl, event-sourcing, data-lake

Exceptions to the Data Lake immutability rule


Data Lake should be immutable:

It is important that all data put in the lake should have a clear provenance in place and time. Every data item should have a clear trace to what system it came from and when the data was produced. The data lake thus contains a historical record. This might come from feeding Domain Events into the lake, a natural fit with Event Sourced systems. But it could also come from systems doing a regular dump of current state into the lake - an approach that's valuable when the source system doesn't have any temporal capabilities but you want a temporal analysis of its data. A consequence of this is that data put into the lake is immutable, an observation once stated cannot be removed (although it may be refuted later), you should also expect ContradictoryObservations.

Are there any exceptions to this rule, where it may be considered good practice to overwrite data in a Data Lake? I suppose not, but some teammates have a different understanding.

I think that data provenance and traceability are needed in the case of a cumulative algorithm, to be able to reproduce the final state. But what if the final state isn't dependent on previous results? Is someone right if they say that Data Lake immutability (event sourcing) is needed only for cumulative algorithms?

For example, say you have a daily full-load ingestion of tables A and B, and afterwards you calculate table C. If the user is interested only in the latest result of C, are there any reasons to keep the history (event sourcing based on date partitioning) of A, B, and C?

Another concern may be ACID compliance: a file may end up corrupted or partially written. But suppose we're discussing the case where the latest state of A and B can be easily restored from the source systems.


Solution

  • Are there any exceptions to this rule, where it may be considered good practice to overwrite data in a Data Lake?

    The good practice is not to overwrite data in the data lake. If some event was generated with an error or a bug, new events that compensate the previous one should be produced. That way, the data lake records the whole event history, including compensating events and eventual reprocessings (the first sketch below shows this compensating-event pattern).

  • I think that data provenance and traceability are needed in the case of a cumulative algorithm, to be able to reproduce the final state. But what if the final state isn't dependent on previous results? Is someone right if they say that Data Lake immutability (event sourcing) is needed only for cumulative algorithms?

    The data lake is the final destination for all relevant events. Not all events need to be recorded in the data lake; usually we distinguish between operational/communication events and business events. The business events recorded in the data lake can be used for reprocessing or for new features that depend on the event history. Isolated events that do not depend on the event history can also be produced and added to the history. Consequently, the final state does not violate the principle of immutability: for a set of immutable events contiguous in time, we can always produce a final state. So the answer is: not only for cumulative algorithms (the second sketch below derives both a cumulative and a "latest wins" final state from the same immutable history).

  • For example, say you have a daily full-load ingestion of tables A and B, and afterwards you calculate table C. If the user is interested only in the latest result of C, are there any reasons to keep the history (event sourcing based on date partitioning) of A, B, and C?

    The starting event of an event history cannot be reproduced; only after the first event can we think about the final state. In this particular case, the A and B tuples and aggregations should not be considered events, but rather the calculation function's input. That input should be recorded in the data lake as a business event: the event X (calculation input) ultimately produces the event Y. If the event X is not recorded in the event history, then Y should be considered the starting event (the third sketch below records the daily calculation inputs as date partitions so that C can be reproduced for any date).
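
As a minimal sketch of the compensating-event idea above, here is one way it could look in Python, assuming a plain in-memory append-only store; `Event`, `EventStore`, and the `compensates` field are illustrative names, not a real library API:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    entity_id: str
    amount: float                      # illustrative payload
    recorded_at: datetime
    compensates: Optional[str] = None  # id of the event this one corrects


class EventStore:
    """Append-only store: events are never updated or deleted."""

    def __init__(self) -> None:
        self._log: list[Event] = []

    def append(self, event: Event) -> None:
        self._log.append(event)        # the only write operation allowed

    def history(self, entity_id: str) -> list[Event]:
        return [e for e in self._log if e.entity_id == entity_id]


store = EventStore()
store.append(Event("e1", "order-1", 100.0, datetime.now(timezone.utc)))

# A bug produced a wrong amount; instead of overwriting e1, a compensating
# event is appended, so the full history (including the mistake) is kept.
store.append(Event("e2", "order-1", 90.0, datetime.now(timezone.utc),
                   compensates="e1"))

for e in store.history("order-1"):
    print(e.event_id, e.amount, e.compensates)
```

The point of the sketch is the design choice: `append` is the only write path, so corrections extend the history rather than rewriting it.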
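
The claim that immutability is not only for cumulative algorithms can be illustrated with a second small sketch; the event shape and the two projection functions below are assumptions made for the example, not prescribed structures:

```python
from functools import reduce

history = [  # append-only: corrections are new events, never overwrites
    {"ts": "2024-01-01", "value": 100},
    {"ts": "2024-01-02", "value": 150},
    {"ts": "2024-01-03", "value": 130},
]


def cumulative_state(events):
    """Final state that depends on every previous event (a running sum)."""
    return reduce(lambda acc, e: acc + e["value"], events, 0)


def latest_state(events):
    """Final state that ignores earlier events ("latest wins")."""
    return max(events, key=lambda e: e["ts"])["value"]


# Both projections are derived from the same immutable history.
print(cumulative_state(history))  # 380
print(latest_state(history))      # 130
```

Both projections read the same history, so keeping it immutable costs the "latest wins" consumer nothing while preserving the option to reprocess later.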
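
Finally, a sketch of the A/B/C scenario under the assumption of a filesystem-style lake with date partitioning; `LAKE`, `ingest_full_load`, `calculate_c`, and `run_pipeline` are hypothetical names, and the calculation itself is a placeholder:

```python
import json
from pathlib import Path

LAKE = Path("lake")  # hypothetical lake root (relative path for the sketch)


def ingest_full_load(table: str, rows: list, load_date: str) -> Path:
    """Write a daily full load into its own date partition.

    A previous date's partition is never overwritten, so the table's
    history is preserved as an append-only series of snapshots.
    """
    path = LAKE / table / f"load_date={load_date}" / "part-000.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(rows))
    return path


def calculate_c(a_rows: list, b_rows: list) -> list:
    """Placeholder for the real A + B -> C calculation."""
    return [{"a": ra, "b": rb} for ra, rb in zip(a_rows, b_rows)]


def run_pipeline(load_date: str, a_rows: list, b_rows: list) -> None:
    # Record the calculation input (event X) as date partitions ...
    ingest_full_load("A", a_rows, load_date)
    ingest_full_load("B", b_rows, load_date)
    # ... so the output C (event Y) for this date can always be reproduced
    # from recorded inputs instead of re-querying the source systems.
    ingest_full_load("C", calculate_c(a_rows, b_rows), load_date)


run_pipeline("2024-01-01", [{"id": 1, "x": 10}], [{"id": 1, "y": 20}])
```

If only the latest C matters, the trade-off is storage cost versus the ability to rebuild C for any past date without going back to the source systems.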