Tags: apache-spark, apache-kafka, data-warehouse, databricks, delta-lake

Apache Spark + Delta Lake concepts


I have several questions about Spark + Delta.

1) Databricks proposes 3 layers (bronze, silver, gold), but which layer is recommended for Machine Learning, and why? I assume they intend the data to be clean and ready in the gold layer.

2) If we abstract the concepts of these 3 layers, can we think of the bronze layer as a data lake, the silver layer as databases, and the gold layer as a data warehouse, in terms of functionality?

3) Is the Delta architecture a commercial term, an evolution of the Kappa architecture, or a new architecture trend like the Lambda and Kappa architectures? What are the differences between (Delta + Lambda architecture) and the Kappa architecture?

4) In many cases Delta + Spark scales far better than most databases, usually at much lower cost, and with the right tuning we can get almost 2x faster query results. I know it is quite complicated to compare the current trending data warehouses against a Feature/Agg Data Store, but I would like to know how I can make this comparison.

5) I used to use Kafka, Kinesis, or Event Hubs for stream processing, and my question is: what kinds of problems can happen if we replace these tools with a Delta Lake table? (I already know that everything depends on many things, but I would like a general picture.)


Solution

  • 1) Leave it up to your data scientists. They should be comfortable working in the silver and gold regions; some more advanced data scientists will want to go back to the raw data and parse out additional information that may not have been included in the silver/gold tables.

    2) Bronze = raw data in its native format or in Delta Lake format. Silver = sanitized and cleaned data in Delta Lake. Gold = data that is accessed via Delta Lake or pushed to a data warehouse, depending on business requirements.

    3) The Delta architecture is a simplified version of the lambda architecture. At this point "Delta architecture" is a commercial term; we'll see if that changes in the future.

    4) Delta Lake + Spark is among the most scalable data storage mechanisms at a reasonable price. You're welcome to test the performance against your own business requirements. For storage, Delta Lake will be far cheaper than any data warehouse; your requirements around data access and latency will be the larger question.
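One pragmatic way to run that test is to time the same query on each system and compare medians. A minimal, engine-agnostic harness (the `run_query` callable is a placeholder for your own Spark or warehouse client call, not a specific API):

```python
import time
from statistics import median
from typing import Callable, List


def time_query(run_query: Callable[[], object], runs: int = 5) -> float:
    """Execute the same query several times and return the median latency in seconds.

    Using the median rather than the mean reduces the impact of cold caches
    and one-off cluster hiccups.
    """
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # e.g. spark.sql("...").collect() or a warehouse client call
        timings.append(time.perf_counter() - start)
    return median(timings)
```

Usage would look like `time_query(lambda: spark.sql(q).collect())` versus the equivalent call against the warehouse, run with identical data and a representative query mix.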

    5) Kafka, Kinesis, and Event Hubs are sources for getting data from the edge into the data lake. A Delta table can act as both a source and a sink for a streaming application, and there are actually very few problems with using Delta as a source. The Delta Lake source lives on blob storage, so we get around many infrastructure problems, but we take on the consistency issues of the blob storage. Delta Lake as a source for streaming jobs is far more scalable than Kafka/Kinesis/Event Hubs, but you still need those tools to get data from the edge into the Delta lake.