azure azure-blob-storage azure-databricks

How do you process many files from a blob storage with long paths in databricks?

I've enabled logging for an API Management service and the logs are being stored in a storage account. Now I'm trying to process them in an Azure Databricks workspace but I'm struggling with accessing the files.

The issue seems to be that the automatically generated virtual folder structure looks like this:

/insights-logs-gatewaylogs/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=*/m=*/d=*/h=*/m=00/PT1H.json

I've mounted the insights-logs-gatewaylogs container under /mnt/diags. dbutils.fs.ls('/mnt/diags') correctly lists the resourceId= folder, but dbutils.fs.ls('/mnt/diags/resourceId=') claims the file is not found.

If I create empty marker blobs along the virtual folder structure I can list each subsequent level but that strategy obviously falls down since the last part of the path is dynamically organized by year/month/day/hour.

For example, this call:

spark.read.format('json').load("dbfs:/mnt/diags/logs/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=*/m=*/d=*/h=*/m=00/PT1H.json")

yields this error:

java.io.FileNotFoundException: File/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=2019 does not exist.

So clearly the wildcard has found the first year folder but refuses to go any further down.

I set up a copy job in Azure Data Factory that successfully copies all the JSON blobs within the same blob storage account and removes the resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service> prefix, so the root folder starts with the year component. The copied blobs can be accessed all the way down without having to create empty marker blobs.

So the problem seems to be related to the long virtual folder structure, which is mostly empty.

Is there another way to process this kind of folder structure in Databricks?

Update: I've also tried providing the path as part of the source when mounting, but that doesn't help either.

Solution

I think I may have found the root cause of this. Should have tried this earlier but I provided the exact path to an existing blob like this:

spark.read.format('json').load("dbfs:/mnt/diags/logs/resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>/y=2019/m=08/d=20/h=06/m=00/PT1H.json")

And I got a more meaningful error back:

shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.

It turns out the out-of-the-box logging creates append blobs (and there doesn't seem to be a way to change this), and support for append blobs is still a work in progress by the looks of this ticket: https://issues.apache.org/jira/browse/HADOOP-13475

The FileNotFoundException could be a red herring, possibly caused by the inner exception being swallowed when the wildcards are expanded and an unsupported blob type is found.
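To check whether a container holds append blobs before pointing Spark at it, the blob types can be tallied from a listing. This is a minimal sketch, assuming the legacy azure-storage SDK, where each listed blob exposes properties.blob_type with values such as 'AppendBlob' or 'BlockBlob'. The helper name count_blob_types and the stand-in objects are hypothetical and exist only so the snippet runs without a storage account:

```python
from collections import Counter
from types import SimpleNamespace


def count_blob_types(blobs):
    """Tally blobs by type (hypothetical helper, not part of any SDK)."""
    return Counter(blob.properties.blob_type for blob in blobs)


# Stand-in objects mimicking the shape of a listing result (assumption).
# With the legacy SDK the real listing would come from something like:
#   BlockBlobService(account_name=..., account_key=...).list_blobs(
#       'insights-logs-gatewaylogs', prefix=...)
fake_listing = [
    SimpleNamespace(name='y=2019/m=08/d=20/h=06/m=00/PT1H.json',
                    properties=SimpleNamespace(blob_type='AppendBlob')),
    SimpleNamespace(name='y=2019/m=08/d=20/h=07/m=00/PT1H.json',
                    properties=SimpleNamespace(blob_type='AppendBlob')),
]

type_counts = count_blob_types(fake_listing)
print(type_counts)
```

Any non-zero count for 'AppendBlob' would point to the same StorageException as above.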

Update

Finally found a reasonable workaround. I installed the azure-storage Python package in my workspace (if you're at home with Scala, it's already installed) and did the blob loading myself. Most of the code below adds globbing support; you don't need it if you're happy to just match on a path prefix:

%python

import re
import json
from azure.storage.blob import AppendBlobService


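# NOTE: glob2re (shown further down) must be defined before this cell runs.
# Names chosen so the built-ins abs() and filter() are not shadowed.
blob_service = AppendBlobService(account_name='<account>', account_key="<access_key>")
container = 'insights-logs-gatewaylogs'

base_path = 'resourceId=/SUBSCRIPTIONS/<subscription>/RESOURCEGROUPS/<resource group>/PROVIDERS/MICROSOFT.APIMANAGEMENT/SERVICE/<api service>'
pattern = base_path + '/*/*/*/*/m=00/*.json'
name_regex = re.compile(glob2re(pattern))

# List blob names on the driver, keeping only those that match the glob.
blob_names = [blob.name for blob in blob_service.list_blobs(container, prefix=base_path) if name_regex.match(blob.name)]

# Download each blob on the workers, then parse one JSON document per line.
spark.sparkContext \
     .parallelize(blob_names) \
     .map(lambda blob_name: blob_service.get_blob_to_bytes(container, blob_name).content.decode('utf-8').splitlines()) \
     .flatMap(lambda lines: [json.loads(l) for l in lines]) \
     .collect()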
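The map and flatMap steps above treat each blob as line-delimited JSON: decode the bytes, split into lines, parse each line. A minimal local sketch of that per-blob logic, useful for testing without a cluster. The function name parse_json_lines and the sample field names are made up for illustration, and unlike the pipeline above this version skips blank lines:

```python
import json


def parse_json_lines(payload):
    """Decode a blob's bytes and parse one JSON document per non-empty line."""
    text = payload.decode('utf-8')
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical two-record payload; real gateway log records have other fields.
sample_payload = (
    b'{"category": "GatewayLogs", "durationMs": 12}\n'
    b'{"category": "GatewayLogs", "durationMs": 7}\n'
)

records = parse_json_lines(sample_payload)
print(len(records), records[0]['category'])
```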

glob2re is courtesy of https://stackoverflow.com/a/29820981/220986:

def glob2re(pat):
    """Translate a shell PATTERN to a regular expression.

    There is no way to quote meta-characters.
    """

    i, n = 0, len(pat)
    res = ''
    while i < n:
        c = pat[i]
        i = i+1
        if c == '*':
            #res = res + '.*'
            res = res + '[^/]*'
        elif c == '?':
            #res = res + '.'
            res = res + '[^/]'
        elif c == '[':
            j = i
            if j < n and pat[j] == '!':
                j = j+1
            if j < n and pat[j] == ']':
                j = j+1
            while j < n and pat[j] != ']':
                j = j+1
            if j >= n:
                res = res + '\\['
            else:
                stuff = pat[i:j].replace('\\','\\\\')
                i = j+1
                if stuff[0] == '!':
                    stuff = '^' + stuff[1:]
                elif stuff[0] == '^':
                    stuff = '\\' + stuff
                res = '%s[%s]' % (res, stuff)
        else:
            res = res + re.escape(c)
    # Inline flags must lead the pattern (an error elsewhere since Python 3.11).
    return '(?ms)' + res + r'\Z'
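The reason glob2re maps * to [^/]* rather than .* is that each wildcard should stay within one path segment. The sketch below illustrates the difference with a simplified stand-in (segment_glob_to_regex, which handles only * and ? and is not the function above) compared against the standard library's fnmatch.translate, whose * also matches slashes. The blob names are shortened, hypothetical examples:

```python
import fnmatch
import re


def segment_glob_to_regex(pattern):
    """Simplified stand-in for glob2re: '*' and '?' never cross a '/'."""
    parts = []
    for ch in pattern:
        if ch == '*':
            parts.append('[^/]*')
        elif ch == '?':
            parts.append('[^/]')
        else:
            parts.append(re.escape(ch))
    return re.compile(''.join(parts) + r'\Z')


pattern = 'service/*/*/*/*/m=00/*.json'
hourly = 'service/y=2019/m=08/d=20/h=06/m=00/PT1H.json'
nested = 'service/y=2019/m=08/d=20/h=06/extra/m=00/PT1H.json'

segment_regex = segment_glob_to_regex(pattern)
fnmatch_regex = re.compile(fnmatch.translate(pattern))

print(bool(segment_regex.match(hourly)))   # expected layout
print(bool(segment_regex.match(nested)))   # one level too deep
print(bool(fnmatch_regex.match(nested)))   # fnmatch lets '*' span '/'
```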

Not pretty, but it avoids copying the data around and can be wrapped up in a little utility class.
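If matching on a path prefix is enough, the globbing can be skipped by building the prefix for a given day from the y=/m=/d= layout shown in the question. A minimal sketch; daily_prefix is a hypothetical helper and the base path here is a shortened placeholder:

```python
from datetime import date


def daily_prefix(base_path, day):
    """Build the virtual-folder prefix for one day of logs."""
    return '{}/y={:04d}/m={:02d}/d={:02d}/'.format(
        base_path, day.year, day.month, day.day)


# Placeholder base path; substitute the full resourceId=... path.
prefix = daily_prefix('resourceId=/SUBSCRIPTIONS/<subscription>', date(2019, 8, 20))
print(prefix)
```

The resulting string can be passed as the prefix argument of list_blobs in place of base_path, which also reduces the number of blobs listed.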