azure powershell azure-storage azure-databricks

Calculate the size of a folder in Azure Data Lake Storage Gen2, excluding blobs in the Archive tier


I need to calculate the size of an ADLS folder, but blobs in the Archive tier must be excluded from the total. If I use

$Blobs = Get-AzStorageBlob -Context $ctx -Container $containerName -Prefix $folderName 

it returns the blob sizes, but I can't find a way to filter by access tier.

But if I use BlobServiceClient, the code does not scale: it runs forever when the folder holds millions of files.

from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient(account_url=account_url, credential=storage_account_key)
container_client = blob_service_client.get_container_client(container_name)
blob_list = container_client.list_blobs(name_starts_with=folder_path)
total_size = 0
for blob in blob_list:
    # One extra request per blob: this is what makes the loop slow.
    blob_client = container_client.get_blob_client(blob.name)
    blob_properties = blob_client.get_blob_properties()
    if blob_properties.blob_tier != "Archive":
        total_size += blob_properties.size

Is there an easy and scalable way to achieve this?

Thanks


Solution

  • Your code is slow because it calls blob_client.get_blob_properties() once per blob, which costs one extra request for every blob. You don't need that call: the properties, including tier and size, should already be available on each item returned when you list the blobs.

    Please try the following code:

    from azure.storage.blob import BlobServiceClient

    blob_service_client = BlobServiceClient(account_url=account_url, credential=storage_account_key)
    container_client = blob_service_client.get_container_client(container_name)
    blob_list = container_client.list_blobs(name_starts_with=folder_path)
    total_size = 0
    for blob in blob_list:
        # Tier and size come back with the listing, so no per-blob request is needed.
        if blob.blob_tier != "Archive":
            total_size += blob.size
    

    Also, according to the documentation of Get-AzStorageBlob, the cmdlet returns a list of blobs of type AzureStorageBlob, and that type has a property called AccessTier. You can loop through the blobs returned by the cmdlet, skip those whose AccessTier is Archive, and sum the sizes of the rest.
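
Both suggestions above come down to the same idea: read the tier from the listing itself and sum the sizes of everything that is not archived. A minimal sketch of that idea follows. The helper names total_size_excluding_tier and folder_size are hypothetical (they are not part of the Azure SDK), and the sample objects only imitate the size and blob_tier attributes that listed blobs are expected to carry.

```python
from types import SimpleNamespace


def total_size_excluding_tier(blobs, excluded_tier="Archive"):
    """Sum .size over blob-like objects whose .blob_tier is not excluded_tier.

    Hypothetical helper: it accepts any iterable of objects exposing
    .size and .blob_tier, so it consumes the listing lazily and never
    holds all blobs in memory.
    """
    total = 0
    for blob in blobs:
        if blob.blob_tier != excluded_tier:
            total += blob.size
    return total


def folder_size(account_url, credential, container_name, folder_path):
    """Hypothetical wrapper that wires the helper to a real container."""
    # Imported here so the rest of the sketch runs without the Azure SDK.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient(account_url=account_url, credential=credential)
    container = service.get_container_client(container_name)
    # The listing is paged by the SDK; no per-blob property call is made.
    return total_size_excluding_tier(
        container.list_blobs(name_starts_with=folder_path)
    )


# Stand-in objects (made-up names and sizes) to exercise the helper offline.
sample_blobs = [
    SimpleNamespace(name="folder/a.csv", size=100, blob_tier="Hot"),
    SimpleNamespace(name="folder/b.csv", size=250, blob_tier="Cool"),
    SimpleNamespace(name="folder/c.csv", size=4000, blob_tier="Archive"),
]
sample_total = total_size_excluding_tier(sample_blobs)
print(sample_total)  # 350: only the Hot and Cool blobs are counted
```

With millions of blobs the run time is still proportional to the number of listing pages, but that is far fewer requests than one property call per blob.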