I have data stored in Azure Data Lake Storage (ADLS) across different folders and sub-folders, and I want to know the size of the data stored.
I am looking for a function/code that can be run in Azure Databricks to walk those folders recursively and report the size of the data.
The code I tried initially hit an error because of a mistake; the code below works properly:
%python
# Specify the root path to your ADLS Gen2 container
root_path = "abfss://<container-name>@<storage-account>.dfs.core.windows.net/<Path>"

# Function to calculate the size of a directory and its subdirectories recursively
def calculate_directory_size(directory_path):
    total_size = 0
    for file_info in dbutils.fs.ls(directory_path):
        if file_info.isDir():
            total_size += calculate_directory_size(file_info.path)
        else:
            total_size += file_info.size
    return total_size

# List all directories within the root path
directories = [f.path for f in dbutils.fs.ls(root_path) if f.isDir()]

# Calculate and print the size of each directory and its subdirectories
for directory in directories:
    # Directory paths end with "/", so strip it before taking the last segment
    directory_name = directory.rstrip("/").split("/")[-1]
    print("directory", directory)
    directory_size = calculate_directory_size(directory)
    print(f"Data volume in {directory_name}: {directory_size} bytes")
    # Convert bytes to gigabytes (GB) for readability
    directory_size_gb = directory_size / (1024 ** 3)
    print(f"Data volume in {directory_name}: {directory_size_gb:.5f} GB")