Tags: azure, azure-data-factory, azure-databricks, azure-data-lake, azure-data-lake-gen2

Dataset connection in Azure Databricks


I have published a dataset in Azure Data Factory, but I cannot find a way to access that dataset in Azure Databricks.

The dataset was published from a service that is connected to AWS S3. Here's the picture.

I have tried reading the Azure documentation, but most of it suggests copying this data into Azure Data Lake Storage. Is that the only way to access the data in Databricks?

Please provide any good documentation links.


Solution

  • Configure Azure Databricks to access the data directly from AWS S3 by setting up the required credentials; copying the data into Azure Data Lake Storage is not the only option.

    • In AWS, create an IAM role with permissions on the required S3 buckets and objects, then register that role in Azure Databricks so it can be assumed when reading the data (a minimal sketch follows the documentation links below).

    For details, see the Databricks documentation on external locations and storage credentials, creating an IAM role, and accessing S3 buckets with Unity Catalog volumes or external locations.
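
    As a rough sketch, once the IAM role is registered in Databricks as a Unity Catalog storage credential (via Catalog Explorer or the API), an external location can be created and checked from a notebook. The credential name, location name, group, and bucket path below are placeholders, not values from your setup:

    # Assumes a storage credential named "s3_remote_cred" already wraps the IAM role.
    spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS s3_raw_data
        URL 's3://my-bucket/external-location'
        WITH (STORAGE CREDENTIAL s3_remote_cred)
        COMMENT 'Data published to S3 by the upstream service'
    """)

    # Grant read access and sanity-check that the location is reachable.
    spark.sql("GRANT READ FILES ON EXTERNAL LOCATION s3_raw_data TO `data_engineers`")
    display(dbutils.fs.ls("s3://my-bucket/external-location"))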

    The following code shows how to read from and write to the S3 bucket:

    Reading the file from S3:

    # List the files available at the external location
    dbutils.fs.ls("s3://my-bucket/external-location/path/to/data")
    # Load the Parquet files into a DataFrame
    df = spark.read.format("parquet").load("s3://my-bucket/external-location/path/to/data")
    # Or query the files in place with SQL
    spark.sql("SELECT * FROM parquet.`s3://my-bucket/external-location/path/to/data`")
    

    Writing the file:

    # Move (rename) files within the bucket
    dbutils.fs.mv("s3://my-bucket/external-location/path/to/data", "s3://my-bucket/external-location/path/to/new-location")
    # Write a DataFrame back to the external location as Parquet
    df.write.format("parquet").save("s3://my-bucket/external-location/path/to/new-location")
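
    As a small follow-up sketch (the write mode and three-level table name are illustrative choices, not part of the original answer), the written files can also be registered as an external table so downstream jobs can query them by name:

    # Overwrite any previous output at the target path
    df.write.mode("overwrite").format("parquet").save("s3://my-bucket/external-location/path/to/new-location")

    # Register the output as an external table (the name is a placeholder)
    spark.sql("""
        CREATE TABLE IF NOT EXISTS main.raw.s3_output
        USING PARQUET
        LOCATION 's3://my-bucket/external-location/path/to/new-location'
    """)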