Tags: azure, azure-data-factory, azure-data-lake, databricks

Azure Databricks to Event Hub


I am very new to Databricks, so please pardon me. Here is my requirement:

  1. I have data stored in Azure Data Lake
  2. As per the requirement, we can only access the data via an Azure Databricks notebook
  3. We have to pull the data from certain tables, join it with other tables, and aggregate it
  4. Send the data to an Event Hub

How can I perform this activity? I assume there is no one-shot process. I was planning to create a notebook and run it via Azure Data Factory, pump the data into Blob storage, and then send it to the Event Hub using .NET. But from Azure Data Factory we can only run the Azure Databricks notebook, not store its output anywhere.


Solution

  • Azure Databricks does support Azure Event Hubs as a source and sink. Understand Structured Streaming - it is a stream processing engine in Apache Spark (available in Azure Databricks as well).

    Create a notebook to do all your transformations (join, aggregation, ...) - assuming you are doing a batch write to Azure Event Hubs.
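
    As a rough sketch, the transformation part could look something like the following (the table names and columns are hypothetical placeholders - replace them with your own):

    import org.apache.spark.sql.functions._

    // Hypothetical tables registered in the metastore (or read them directly from Data Lake paths).
    val orders    = spark.table("sales.orders")
    val customers = spark.table("sales.customers")

    val aggregated = orders
      .join(customers, Seq("customerId"))
      .groupBy("customerId")
      .agg(sum("amount").as("totalAmount"))

    // The Event Hubs sink expects the payload in a column named "body" (string or binary).
    val df = aggregated.select(to_json(struct(col("customerId"), col("totalAmount"))).as("body"))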

    Using Scala

    import org.apache.spark.eventhubs._

    // Connection string of the target Event Hub (including the EntityPath).
    val connectionString = "Valid EventHubs connection string."
    val ehWriteConf = EventHubsConf(connectionString)

    // Batch write: each row's "body" column is sent as one event.
    df.select("body")
      .write
      .format("eventhubs")
      .options(ehWriteConf.toMap)
      .save()
    

    Replace .write with .writeStream if your queries are streaming.
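
    For example, a streaming write could look like this sketch (the checkpoint path is just a placeholder):

    df.select("body")
      .writeStream
      .format("eventhubs")
      .options(ehWriteConf.toMap)
      .option("checkpointLocation", "/tmp/checkpoints/eventhubs-sink")
      .start()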

    Using PySpark

    # Event Hubs configuration is a plain dict in PySpark.
    # Note: newer versions of the connector may require the connection string to be encrypted via
    # sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString).
    connectionString = "Valid EventHubs connection string."
    ehWriteConf = {'eventhubs.connectionString': connectionString}

    ds = df \
      .select("body") \
      .writeStream \
      .format("eventhubs") \
      .options(**ehWriteConf) \
      .option("checkpointLocation", "/tmp/checkpoints/eventhubs-sink") \
      .start()
    

    One more thing to consider when working with Azure Event Hubs is partitioning - it is optional; you can send just the body alone, in which case events are distributed across partitions in a round-robin fashion.
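
    If you need related events to land in the same partition, the DataFrame you write can also carry a "partitionKey" column. A small sketch (the deviceId column is hypothetical):

    import org.apache.spark.sql.functions.col

    // Optional: a "partitionKey" column routes events with the same key to the same partition.
    df.withColumn("partitionKey", col("deviceId"))
      .select("body", "partitionKey")
      .write
      .format("eventhubs")
      .options(ehWriteConf.toMap)
      .save()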

    More information is in the Azure Event Hubs Spark connector documentation (the azure-event-hubs-spark project), which also includes a PySpark version of the guide.