Tags: azure-databricks, azure-data-lake, azure-data-lake-gen2

How can I determine the total storage size of data stored in both folders and subdirectories within Azure Data Lake?


I have data stored in Azure Data Lake Gen2 across different folders and subfolders, and I want to know the total size of that data.

I am looking for a function or code snippet that I can run in Azure Databricks to walk the folder structure recursively and report the size of the data.


Solution

  • The code I tried initially failed because of a small mistake; the corrected version below works properly -

        %python
        # Specify the root path to your ADLS Gen2 container
        root_path = "abfss://<container-name>@<storage-account>.dfs.core.windows.net/<Path>"

        # Recursively calculate the size of a directory and all of its subdirectories
        def calculate_directory_size(directory_path):
            total_size = 0
            for file_info in dbutils.fs.ls(directory_path):
                if file_info.isDir():
                    total_size += calculate_directory_size(file_info.path)
                else:
                    total_size += file_info.size
            return total_size

        # List all top-level directories within the root path
        directories = [f.path for f in dbutils.fs.ls(root_path) if f.isDir()]

        # Calculate and print the size of each directory and its subdirectories
        for directory in directories:
            # Paths returned by dbutils.fs.ls end with "/", so strip it before
            # taking the last path segment as the directory name
            directory_name = directory.rstrip("/").split("/")[-1]
            print("Directory:", directory)
            directory_size = calculate_directory_size(directory)
            print(f"Data volume in {directory_name}: {directory_size} bytes")

            # Convert bytes to gigabytes (GB) for readability
            directory_size_gb = directory_size / (1024 ** 3)
            print(f"Data volume in {directory_name}: {directory_size_gb:.5f} GB")