Tags: azure-databricks, azure-eventhub, delta-live-tables

Azure Event Hub: ensuring data is read only once, with failure handling


Hey folks, I am working on a use case where I implement incremental updates to Delta tables through Event Hubs in Azure. Event Hubs and Delta Live Tables came up as the necessary building blocks. At the start I have an HVR agent that fetches a continuous stream of data from various data sources. The Event Hub receives this data and lands it into Delta Live Tables, and further into Delta tables that act as the source for downstream pipelines.

Below are the scenarios that need to be covered:

  1. Read newly landed data only once, even if there are server outages.
  2. In case of any failure, resume reading from the last successful state.
  3. Recover past data from the initial point.

Could you please help me resolve these scenarios?


Solution

  • Yes, Delta Live Tables (DLT) fulfills these requirements. For streaming live tables, DLT uses Spark Structured Streaming under the hood, which guarantees:

    • When everything works fine, data is read exactly once. Structured Streaming tracks consumed offsets in a checkpoint (in DLT this happens automatically; a sketch follows this list).
    • In case of a failure during data processing, DLT resumes from the offsets stored in the checkpoint at the last successful processing.
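
    To make the checkpointing concrete, here is a rough sketch of what DLT manages for you, written as a plain Structured Streaming query. All paths and names are hypothetical, `spark` is the session predefined in a Databricks notebook, and the Event Hubs authentication options are omitted here (they appear in the full example at the end of this answer):

```python
# Hypothetical sketch of offset tracking in plain Structured Streaming.
# In DLT the checkpoint location is created and managed automatically.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "myeventhubns.servicebus.windows.net:9093")
    .option("subscribe", "my-event-hub")  # an Event Hub shows up as a Kafka topic
    # ... SASL/SSL auth options omitted; see the full example below ...
    .load()
)

query = (
    stream.writeStream
    .format("delta")
    # Consumed offsets are committed here after each successful micro-batch;
    # on restart the query resumes from the last committed offsets instead
    # of re-reading data, which yields the read-once behavior.
    .option("checkpointLocation", "/tmp/checkpoints/eventhub_raw")
    .start("/tmp/tables/eventhub_raw")
)
```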

    The 3rd requirement isn't very clear - if it's about consuming data from the beginning of the topic, then yes, it's possible (within the Event Hub's retention period).
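
    As a hedged sketch of that: when a streaming query starts with a fresh checkpoint, the startingOffsets option controls where consumption begins, and "earliest" replays everything still retained in the Event Hub (the streaming default is "latest"):

```python
# Hypothetical: replay retained history when no checkpoint exists yet.
# In DLT, a "full refresh" of the pipeline resets the checkpoint, so the
# table is rebuilt starting from these offsets.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "myeventhubns.servicebus.windows.net:9093")
    .option("subscribe", "my-event-hub")
    .option("startingOffsets", "earliest")  # streaming default is "latest"
    .load()
)
```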

    Please note that you can't use the Event Hubs Spark connector directly, as DLT right now doesn't allow installation of external JARs, but you can achieve the same with the built-in Kafka connector that is part of the DLT runtime, since Event Hubs exposes a Kafka-compatible endpoint. This answer shows how to do that.
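
    Putting it together, a minimal sketch of a DLT table reading an Event Hub through the built-in Kafka connector could look roughly like this. The namespace, hub name, and secret scope are hypothetical; it assumes the standard Event Hubs Kafka endpoint (port 9093, SASL PLAIN, username "$ConnectionString", the connection string as password) and the kafkashaded class prefix used on the Databricks runtime:

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical names -- replace with your own namespace, hub, and secret scope.
EH_NAMESPACE = "myeventhubns"
EH_NAME = "my-event-hub"
EH_CONN_STR = dbutils.secrets.get(scope="my-scope", key="eh-connection-string")

# Event Hubs' Kafka endpoint authenticates via SASL PLAIN: the literal user
# "$ConnectionString" with the connection string itself as the password.
EH_SASL = (
    "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
    f'username="$ConnectionString" password="{EH_CONN_STR}";'
)

@dlt.table(
    name="eventhub_raw",
    comment="Raw events landed from Azure Event Hubs via the Kafka endpoint."
)
def eventhub_raw():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", f"{EH_NAMESPACE}.servicebus.windows.net:9093")
        .option("subscribe", EH_NAME)
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.sasl.jaas.config", EH_SASL)
        .option("startingOffsets", "earliest")  # replay retained history on first run
        .load()
        # The Kafka value is binary; cast to string for downstream parsing.
        .select(col("value").cast("string").alias("body"), col("timestamp"))
    )
```

    DLT maintains a checkpoint per streaming table, so a failed or restarted pipeline update resumes from the last committed offsets (scenarios 1 and 2), while a full refresh resets that state and re-reads from startingOffsets, i.e. the earliest retained events (scenario 3).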