Search code examples
spark-streamingazure-eventhub

Pyspark: authenticate via Service Principal to Event Hub


I have a service principal which has permissions to read topics on the Eventhub. I want to read topics from it and tried the following:

# Event Hub connection string using Service Principal (SAS)
eventhub_conn_str = f"Endpoint=sb://{namespace}/;SharedAccessKeyName={client_id};SharedAccessKey={client_secret}"


# Setting up the Azure Event Hubs Spark connector
event_hub_config = {
    'eventhubs.connectionString':eventhub_conn_str,
    'eventhubs.consumerGroup': consumer_group,
    'eventhubs.maxEventsPerTrigger': '1000'
}


# Read from Event Hub using Spark Structured Streaming
raw_events = (
    spark.readStream
    .format("eventhubs")
    .option("startingPosition", "@latest")
    .options(**event_hub_config)
    .load()
)

However, I always get the following error:

java.lang.IllegalArgumentException: Input byte array has wrong 4-byte ending unit

It seems like my authentication fails (i get the same error when using any password). How can I authenticate with client_id and client_secret?


Solution

  • The connection string you are using for Event hub connection is for the shared key authentication and you are passing Service principal client id and secret in that connection string which is incompatible, and it is not able to detect it properly may be the reason of error.

    You need to create a callback class extends from org.apache.spark.eventhubs.utils.AadAuthenticationCallback to use Service Principal with Secret to Authorize EventHub from PySpark.

    To connect into the Azure EventHubs connection, you must still utilize Java-based code, as stated in the connector documentation.

    For more details refer this stack solution