Tags: azure, azure-data-factory, azure-data-lake, databricks

Azure Databricks to Event Hub


I am very new to Databricks, so please pardon me. Here is my requirement:

  1. I have data stored in Azure Data Lake
  2. As per the requirement, we can only access the data via an Azure Databricks notebook
  3. We have to pull the data from certain tables, join it with other tables, and aggregate it
  4. Send the data to an Event Hub

How can I perform this activity? I assume there is no one-shot process. I was planning to create a notebook and run it via Azure Data Factory, pump the data into Blob storage, and then send it to the Event Hub using .NET. But from Azure Data Factory we can only run the Azure Databricks notebook, not store its output anywhere.


Solution

  • Azure Databricks does support Azure Event Hubs as a source and sink. Understand Structured Streaming - it is a stream processing engine in Apache Spark (available in Azure Databricks as well).

    Create a notebook to do all your transformations (join, aggregation, ...) - assuming you are doing a batch write to Azure Event Hubs.
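
    As a rough sketch, the transformation part could look something like the following (the table names and columns are hypothetical placeholders - replace them with your own):

    import org.apache.spark.sql.functions._

    // Hypothetical tables registered in the metastore (or read them directly from Data Lake paths).
    val orders    = spark.table("sales.orders")
    val customers = spark.table("sales.customers")

    val aggregated = orders
      .join(customers, Seq("customerId"))
      .groupBy("customerId")
      .agg(sum("amount").as("totalAmount"))

    // The Event Hubs sink expects the payload in a column named "body" (string or binary).
    val df = aggregated.select(to_json(struct(col("customerId"), col("totalAmount"))).as("body"))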

    Using Scala

    import org.apache.spark.eventhubs._

    // Connection string of the target Event Hub (including the EntityPath).
    val connectionString = "Valid EventHubs connection string."
    val ehWriteConf = EventHubsConf(connectionString)

    // Batch write: each row's "body" column is sent as one event.
    df.select("body")
      .write
      .format("eventhubs")
      .options(ehWriteConf.toMap)
      .save()
    

    Replace .write with .writeStream if your queries are streaming.
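
    For example, a streaming write could look like this sketch (the checkpoint path is just a placeholder):

    df.select("body")
      .writeStream
      .format("eventhubs")
      .options(ehWriteConf.toMap)
      .option("checkpointLocation", "/tmp/checkpoints/eventhubs-sink")
      .start()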

    Using PySpark

    # Event Hubs configuration is a plain dict in PySpark.
    # Note: newer versions of the connector may require the connection string to be encrypted via
    # sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString).
    connectionString = "Valid EventHubs connection string."
    ehWriteConf = {'eventhubs.connectionString': connectionString}

    ds = df \
      .select("body") \
      .writeStream \
      .format("eventhubs") \
      .options(**ehWriteConf) \
      .option("checkpointLocation", "/tmp/checkpoints/eventhubs-sink") \
      .start()
    

    One more thing to consider when working with Azure Event Hubs is partitioning - it is optional; you can send just the body alone, in which case events are distributed across partitions in a round-robin fashion.
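
    If you need related events to land in the same partition, the DataFrame you write can also carry a "partitionKey" column. A small sketch (the deviceId column is hypothetical):

    import org.apache.spark.sql.functions.col

    // Optional: a "partitionKey" column routes events with the same key to the same partition.
    df.withColumn("partitionKey", col("deviceId"))
      .select("body", "partitionKey")
      .write
      .format("eventhubs")
      .options(ehWriteConf.toMap)
      .save()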

    More information is in the Azure Event Hubs Spark connector documentation (the azure-event-hubs-spark project), which also includes a PySpark version of the guide.