apache-spark, avro, azure-data-lake, azure-databricks

Reading AVRO from Azure Datalake in Databricks


I am trying to read Event Hub data in Avro format, but I am having issues loading it into a DataFrame in Databricks.

Here's the code I am using. Please let me know if I am doing anything wrong.

path='/mnt/datastore/origin/zone=raw/subject=customer_events/source=EventHub/ver=1.0/*.avro'

df = spark.read.format("com.databricks.spark.avro") \
    .load(path)

Error

IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI:

I tried some code to get rid of the error, but now I am getting a syntax error:

import org.apache.spark.sql.SparkSession
SparkSession spark = SparkSession
                     .builder()
                   .config("spark.sql.warehouse.dir","/mnt/datastore/origin/zone=raw/subject=customer_events/source=EventHub/ver=1.0/")
                   .getOrCreate()



SyntaxError: invalid syntax
File "<command-265213674761208>", line 2
SparkSession spark = SparkSession

Solution

  • Relative path in absolute URI

    You need to specify the protocol (the URI scheme) explicitly rather than using a bare /mnt path.

    For example, use wasb://some/path/ if you are reading from Azure Blob Storage.

    You can also drop *.avro, since the Avro reader should already pick up all Avro files in the path.

    https://docs.databricks.com/data/data-sources/read-avro.html#python-api

    And if you want to read from Event Hubs directly, that exposes a Kafka API rather than a file path, AFAIK.
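
The two path suggestions above (add an explicit scheme, drop the *.avro glob) can be sketched in Python roughly as follows. This is only an illustration under assumptions: `with_scheme` is a hypothetical helper, not part of Spark or Databricks; the `dbfs:` default assumes the data sits under a DBFS mount, so swap in whatever scheme matches your storage; and the short `"avro"` format name assumes a runtime with the built-in Avro source, as in the linked docs.

```python
import posixpath


def with_scheme(path, scheme="dbfs"):
    """Hypothetical helper: add an explicit scheme and drop a trailing file glob."""
    directory, last = posixpath.split(path)
    if "*" in last:
        # Keep only the directory; the Avro reader is pointed at the folder itself.
        path = directory + "/"
    if "://" in path or path.startswith(scheme + ":"):
        # The path already carries a scheme, so leave it untouched.
        return path
    return f"{scheme}:{path}"


path = with_scheme(
    "/mnt/datastore/origin/zone=raw/subject=customer_events/source=EventHub/ver=1.0/*.avro"
)

# `spark` is predefined in a Databricks notebook; the guard only lets this
# snippet run outside one without raising a NameError.
if "spark" in globals():
    df = spark.read.format("avro").load(path)
```

If the load still fails, printing `path` before the read makes it easy to check exactly which URI Spark was handed.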