Tags: java, scala, apache-spark, parquet, delta-lake

Using Spark 3.2 to ingest IoT data into delta lake continuously


Is it possible to use org.apache.spark.sql.delta.sources.DeltaDataSource directly to ingest data continuously in append mode?

Or is there another, more suitable approach? My concern is latency and scalability: the data acquisition frequency can reach 30 kHz per vibration sensor, there are several sensors, and I need to record the raw data in Delta Lake for FFT and wavelet analysis, among other things.

In my architecture, data ingestion runs continuously in one Spark application, while the analyses are performed by a separate, independent Spark application using on-demand queries.

If there is no solution for Delta Lake, a solution based on Apache Parquet would also work, since it is possible to create Delta Lake datasets from data stored in Parquet.
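For reference, that last step can be an in-place conversion of an existing Parquet directory into a Delta table via DeltaTable.convertToDelta. A minimal sketch, assuming Delta Lake on Spark 3.2; the path is a placeholder:

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.SparkSession

    object ConvertParquetToDelta {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("convert-parquet-to-delta")
          // Delta Lake needs its SQL extension and catalog registered on Spark 3.2
          .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
          .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
          .getOrCreate()

        // In-place conversion: a transaction log is generated for the existing
        // Parquet files, turning the directory into a Delta table.
        DeltaTable.convertToDelta(spark, "parquet.`/data/raw/vibration`")
      }
    }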


Solution

  • Yes, it's possible, and it works well. Delta has several advantages for a streaming architecture:

    • you don't have the "small files" / file-listing problem that often arises with streaming workloads - with Parquet or another file-based source you have to list all data files to find the new ones, whereas Delta records every data file in its transaction log
    • your consumers don't see partial writes because Delta provides transactional capabilities
    • streaming workloads are natively supported by Delta
    • you can perform DELETE/UPDATE/MERGE even alongside streaming workloads - that's impossible with plain Parquet (see the sketch after this list)
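
    To illustrate that last point, here is a minimal sketch of a transactional DELETE against the ingested table; the table path, column name, and value are hypothetical:

        import io.delta.tables.DeltaTable
        import org.apache.spark.sql.SparkSession

        object DeltaMaintenance {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("delta-maintenance")
              .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
              .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
              .getOrCreate()

            // Table path is a placeholder for the continuously ingested Delta table
            val table = DeltaTable.forPath(spark, "/data/delta/vibration_raw")

            // Transactional delete: concurrent streaming writers and readers
            // keep working against consistent snapshots of the table.
            table.delete("sensor_id = 'S-042'")
          }
        }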

    P.S. You can simply use .format("delta") instead of the full class name.
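
    Putting it together, a minimal append-mode ingestion sketch using Structured Streaming with the Delta sink; the Kafka source, topic name, trigger interval, and paths are placeholders:

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.streaming.Trigger

        object VibrationIngest {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("vibration-ingest")
              .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
              .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
              .getOrCreate()

            // Source is a placeholder: here, a Kafka topic carrying raw sensor readings
            val readings = spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "vibration-raw")
              .load()

            // Continuous append into a Delta table; each micro-batch is committed
            // atomically to the transaction log, so readers never see partial writes.
            val query = readings.writeStream
              .format("delta")                 // shorthand for the Delta data source
              .outputMode("append")
              .option("checkpointLocation", "/data/checkpoints/vibration-raw")
              .trigger(Trigger.ProcessingTime("10 seconds"))
              .start("/data/delta/vibration_raw")

            query.awaitTermination()
          }
        }

    The independent analysis application can then query the same path on demand with spark.read.format("delta").load("/data/delta/vibration_raw").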