Tags: apache-spark, apache-kafka, azure-databricks, spark-structured-streaming, azure-eventhub

Reading from Azure Event Hub with the Kafka driver doesn't seem to get any data


I'm running the following code in an Azure Databricks Python notebook:

TOPIC = "myeventhub"
BOOTSTRAP_SERVERS = "myeventhubns.servicebus.windows.net:9093"
EH_SASL = "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://myeventhubns.servicebus.windows.net/;SharedAccessKeyName=MyKeyName;SharedAccessKey=myaccesskey;\";"

df = spark.readStream \
    .format("kafka") \
    .option("subscribe", TOPIC) \
    .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS) \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.request.timeout.ms", "60000") \
    .option("kafka.session.timeout.ms", "60000") \
    .option("failOnDataLoss", "false") \
    .option("startingOffsets", "earliest") \
    .load()

df_write = df.writeStream \
    .outputMode("append") \
    .format("console") \
    .start() \
    .awaitTermination()

This shows no output in the notebook. How can I debug the problem?


Solution

  • If you use .format("console"), the output won't appear in the notebook; it goes to the driver and executor logs. This is a difference between plain Apache Spark and Databricks.

    If you want to see the data, just use the display function:

    display(df)
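
    display is Databricks-specific and starts the streaming query for you. If you prefer to keep an explicit writeStream while debugging, one option (a minimal sketch, not part of the original answer) is to decode the binary key/value columns and write to an in-memory table that you can query from the notebook; the query name eh_debug below is just illustrative:

    # Assumes `df` from the question is already defined.
    # Kafka rows arrive with binary key/value columns, so cast them to strings first.
    decoded = df.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "partition",
        "offset",
    )

    # Write to an in-memory table instead of the console sink.
    debug_query = (decoded.writeStream
        .outputMode("append")
        .format("memory")
        .queryName("eh_debug")   # illustrative name
        .start())

    # After the query has run for a while, check whether any rows have arrived:
    spark.sql("SELECT count(*) FROM eh_debug").show()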