pyspark, azure-eventhub, azure-databricks

Event Hub to Databricks error: stream is terminated?


I've been trying to set up a proof of concept where Azure Databricks reads data from my Event Hub using the following code:

connectionString = "Endpoint=sb://mystream.servicebus.windows.net/;EntityPath=theeventhub;SharedAccessKeyName=event_test;SharedAccessKey=mykeygoeshere12345"

ehConf = {
  'eventhubs.connectionString' : connectionString
}

df = spark \
  .readStream \
  .format("eventhubs") \
  .options(**ehConf) \
  .load()

readEventStream = df.withColumn("body", df["body"].cast("string"))
display(readEventStream)

I'm using the azure_eventhubs_spark_2_11_2_3_6.jar package as recommended here. I've also tried the latest version, but I keep getting the message:

ERROR : Some streams terminated before this command could finish!

I've used Databricks Runtime 6.1 and rolled back to 5.3, but I can't get it up and running. I have a Python script that sends data to the Event Hub, but I can't see anything coming out of it. Is it the package, or is it something else I'm doing wrong?

Update: I was loading the library from a JAR file that I had downloaded. I deleted that and installed the library from the Maven repo instead. After that, it worked in testing.
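
A possible explanation, which is an assumption and not something confirmed in this thread: a library installed by Maven coordinate can have its dependencies resolved along with it, whereas a single uploaded JAR carries only its own classes. As a rough sketch, the coordinate that corresponds to the JAR name in the question can be derived like this. The helper name maven_coordinate_from_jar is made up for illustration, and the file-name pattern it expects is inferred from the one JAR name above.

```python
import re

# Group ID as it appears in the connector's Maven coordinate.
GROUP_ID = "com.microsoft.azure"


def maven_coordinate_from_jar(jar_name):
    """Turn a name like azure_eventhubs_spark_2_11_2_3_6.jar into a Maven coordinate.

    Assumes the pattern azure_eventhubs_spark_<scala major>_<scala minor>_<x>_<y>_<z>.jar,
    which is a guess based on a single file name.
    """
    match = re.fullmatch(
        r"azure_eventhubs_spark_(\d+)_(\d+)_(\d+)_(\d+)_(\d+)\.jar", jar_name
    )
    if match is None:
        raise ValueError("unexpected JAR name: " + jar_name)
    scala_major, scala_minor, *version = match.groups()
    return "{}:azure-eventhubs-spark_{}.{}:{}".format(
        GROUP_ID, scala_major, scala_minor, ".".join(version)
    )


print(maven_coordinate_from_jar("azure_eventhubs_spark_2_11_2_3_6.jar"))
# prints com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6
```

The Scala part of the artifact name (2.11 here) should match the Scala version of the cluster's runtime.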


Solution

  • It works perfectly with the below configuration:

    Databricks Runtime: 5.5 LTS (includes Apache Spark 2.4.3, Scala 2.11)

    Azure EventHub library: com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.13

    With the above configuration, I was able to stream data from Azure Event Hubs.

    Reference: Integrating Apache Spark with Azure Event Hubs

    Hope this helps.
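
If the stream still terminates with this configuration, the "Some streams terminated before this command could finish!" message does not by itself say why. A minimal sketch for getting at the underlying error, assuming the stream is started with writeStream instead of display() so that a StreamingQuery handle is available. The helper name summarize_stream_failure and the query name eventhub_debug are hypothetical; isActive, name and exception() are members of PySpark's StreamingQuery.

```python
def summarize_stream_failure(query):
    """Describe the state of a streaming query in one line.

    `query` is expected to behave like pyspark.sql.streaming.StreamingQuery:
    it needs `name`, `isActive` and `exception()`.
    """
    if query.isActive:
        return "query '{}' is still running".format(query.name)
    error = query.exception()  # None if the query stopped without failing
    if error is None:
        return "query '{}' stopped without an error".format(query.name)
    return "query '{}' failed: {}".format(query.name, error)


# Usage in a Databricks notebook (not run here, needs a cluster and the connector):
#
# query = (readEventStream.writeStream
#          .format("memory")             # in-memory sink, for debugging only
#          .queryName("eventhub_debug")  # hypothetical name, required by the memory sink
#          .start())
# query.awaitTermination(30)             # wait up to 30 seconds
# print(summarize_stream_failure(query))
```

A missing class or a bad connection string would normally show up in the exception text, which narrows down whether the library or the configuration is at fault.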