Tags: azure, azure-blob-storage, azure-databricks, azure-data-factory, sparkr

How to Read Append Blobs as DataFrames in Azure Databricks


My batch processing pipeline in Azure has the following scenario: I am using the copy activity in Azure Data Factory to unzip thousands of zip files stored in a blob storage container. These zip files are stored in a nested folder structure inside the container, e.g.

zipContainer/deviceA/component1/20220301.zip

The resulting unzipped files are stored in another container, preserving the hierarchy via the sink's copy behavior option, e.g.

unzipContainer/deviceA/component1/20220301.zip/measurements_01.csv

I enabled logging in the copy activity settings:

[Screenshot: logging settings of the copy activity]

And then provided the folder path to store the generated logs (in txt format), which have the following structure:

Timestamp Level OperationName OperationItem Message
2022-03-01 15:14:06.9880973 Info FileWrite "deviceA/component1/2022.zip/measurements_01.csv" "Complete writing file. File is successfully copied."

I want to read the content of these logs in an R notebook in Azure Databricks, in order to get the complete paths of these csv files for processing. The command I used, read.df, is part of the SparkR library:

Logs <- read.df(log_path, source = "csv", header = "true", delimiter = ",")

The following exception is returned:

Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.

The logs generated by the copy activity are of append blob type, whereas read.df() can read block blobs without any issue.

Given the above scenario, how can I read these logs successfully into my R session in Databricks?
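
For reference, here is the same read with an explicit schema; the column names simply follow the log structure shown above (a minimal sketch, assuming log_path points at the folder holding the txt logs):

library(SparkR)

# Column layout taken from the log sample above
logSchema <- structType(
  structField("Timestamp", "string"),
  structField("Level", "string"),
  structField("OperationName", "string"),
  structField("OperationItem", "string"),
  structField("Message", "string")
)

# Same read as before with the schema applied; against blob storage this
# still fails with the append-blob error described above
Logs <- read.df(log_path, source = "csv",
                header = "true", delimiter = ",",
                schema = logSchema)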


Solution

  • According to this Microsoft documentation, the Azure Databricks and Hadoop Azure WASB implementations do not support reading append blobs:

    https://learn.microsoft.com/en-us/azure/databricks/kb/data-sources/wasb-check-blob-types

    So when you try to read a log file of append blob type, it fails with exactly the error you saw: Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.

    In other words, you cannot read append-blob log files from a blob storage account. A solution is to use an Azure Data Lake Storage Gen2 container for logging instead. When you run the pipeline with ADLS Gen2 as the log destination, it creates log files of block blob type, which you can read from Databricks without any issue (a SparkR sketch follows after the screenshots below).

    Using blob storage for logging:

    [Screenshot: copy activity log settings pointing at a blob storage container]

    Using ADLS gen2 for logging:

    [Screenshot: copy activity log settings pointing at an ADLS Gen2 container]
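
    As a minimal sketch, assuming the logs land in a hypothetical ADLS Gen2 container logscontainer under the storage account mystorageaccount (placeholder names; Databricks is assumed to already have access to the account configured), reading the logs and extracting the successfully copied csv paths could look like this:

    library(SparkR)

    # Hypothetical abfss:// path to the copy activity logs in ADLS Gen2
    log_path <- "abfss://logscontainer@mystorageaccount.dfs.core.windows.net/copy-logs/"

    # Block blobs read fine, so this no longer raises the append-blob error
    Logs <- read.df(log_path, source = "csv", header = "true", delimiter = ",")

    # Keep only the rows that record a successfully written file
    written <- filter(
      Logs,
      Logs$OperationName == "FileWrite" & contains(Logs$Message, "successfully copied")
    )

    # Collect the relative file paths into a plain R character vector
    csv_paths <- collect(select(written, "OperationItem"))$OperationItem

    From there, csv_paths holds the complete relative paths of the unzipped csv files, ready for the downstream processing step.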