Tags: apache-nifi, azure-data-factory, databricks, data-ingestion

Databricks Ingest use cases


I've just found a new Databricks feature called Databricks Data Ingestion. There is very little material about it at this point.

When should I use Databricks Data Ingestion instead of existing mature tools like Azure Data Factory (ADF) or Apache NiFi?

Both ADF and NiFi can ingest into ADLS/S3, and AFAIK ADLS/S3 can be mapped to Databricks DBFS without copying any data, and Parquet files can easily be converted into Delta format. So what is the benefit, or what are the use cases, of the new tool? What am I missing?


Solution

  • There are three items in the blog post.

    1. Auto Loader
    2. COPY INTO
    3. Data Ingestion from 3rd party sources

    Auto Loader and COPY INTO simplify state management of your data ingestion pipeline. By state management I mean tracking which files or events have been ingested and processed. With NiFi, Airflow, or ADF, you need a separate state store to track which files have or have not been ingested. ETL systems often 'move' ingested files to another folder; this is still state management. Others might track each file in a database or a NoSQL data store.

    Before Auto Loader or COPY INTO, you would have to (a sketch of this hand-rolled bookkeeping follows):

    1. Detect files in the landing zone
    2. Compare each file with the files already ingested
    3. Present the new files for processing
    4. Track which files you've ingested
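    As a rough illustration of that bookkeeping (not from the blog post; all paths are made-up placeholders), a hand-rolled ingest loop might look like this:

    ```python
    # Illustrative sketch of hand-rolled ingest state management -- the kind of
    # bookkeeping Auto Loader / COPY INTO take off your hands. All paths are
    # hypothetical placeholders.
    import os

    LANDING_ZONE = "/mnt/landing/events/"
    INGESTED_LOG = "/mnt/state/ingested_files.txt"   # separate state store

    def already_ingested():
        """Load the set of file names ingested in earlier runs."""
        if not os.path.exists(INGESTED_LOG):
            return set()
        with open(INGESTED_LOG) as f:
            return set(line.strip() for line in f)

    def ingest_new_files(process):
        done = already_ingested()
        for name in sorted(os.listdir(LANDING_ZONE)):    # 1. detect files in the landing zone
            if name in done:                             # 2. compare with files already ingested
                continue
            process(os.path.join(LANDING_ZONE, name))    # 3. present the file for processing
            with open(INGESTED_LOG, "a") as f:           # 4. track which files were ingested
                f.write(name + "\n")
    ```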

    If these steps get out of sync or fall behind, a file might be ingested and processed twice, or lost entirely. Moving files also has a cost in complexity.

    With Auto Loader or COPY INTO, you can set up a streaming or incremental data ingest in one statement. Set an archive policy on the landing zone (say 7 days or 48 hours) and the landing zone clears itself automatically. Your code and architecture are greatly simplified; see the sketch below.
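    A minimal Auto Loader sketch in PySpark might look like the following. The landing path, schema/checkpoint locations, and target table name are assumptions for illustration, and the `cloudFiles` source only exists on Databricks:

    ```python
    # Minimal Auto Loader sketch (PySpark on Databricks). Paths and the table
    # name are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()   # on Databricks, `spark` already exists

    # Incrementally pick up new files from the landing zone; Auto Loader keeps
    # track of which files have already been ingested.
    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")   # format of the landing files
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events_schema")
          .load("/mnt/landing/events/"))

    # Write to a Delta table; the checkpoint location stores the ingest state.
    (df.writeStream
       .option("checkpointLocation", "/mnt/checkpoints/events_ingest")
       .trigger(availableNow=True)               # run-when-triggered (newer runtimes); drop for a continuous stream
       .toTable("bronze.events"))
    ```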

    Auto Loader (for streams) and COPY INTO (for recurring batch jobs) let Databricks track and manage that state for you under the covers. For Auto Loader, Databricks can also set up the notification infrastructure (SNS and SQS on AWS), which greatly reduces the latency of your streaming data ingest. A COPY INTO sketch follows.
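    For the batch side, a sketch of a COPY INTO call is below, issued from Python via spark.sql so both examples stay in one language; the table and path are again placeholders. The statement can simply be re-run on every schedule, and files already loaded are skipped:

    ```python
    # COPY INTO sketch: idempotent batch ingest into a Delta table. Re-running it
    # skips files that were already loaded. Assumes the `spark` session from the
    # previous sketch (or the one Databricks provides); table and path are hypothetical.
    spark.sql("""
        COPY INTO bronze.events
        FROM '/mnt/landing/events/'
        FILEFORMAT = PARQUET
        COPY_OPTIONS ('mergeSchema' = 'true')
    """)
    ```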

    The third item in the blog post is the announcement of partnerships with well-established data ingestion companies that offer a wide array of out-of-the-box connectors to your enterprise data. These companies work with Delta Lake.

    You would still use NiFi, Airflow, StreamSets, etc. to acquire the data from source systems. Those tools would now only trigger the COPY INTO command as needed for batch / micro-batch ingest, while Auto Loader either runs continuously or runs when triggered.