python, apache-spark, pyspark, databricks, azure-databricks

How to calculate a Directory size in ADLS using PySpark?


I want to calculate the size of a directory (e.g. XYZ) that contains subfolders and files, i.e. the total size of all files and everything else inside XYZ.

I can list all the folders inside a particular path, but I want the combined size of everything. Also, I see that

display(dbutils.fs.ls("/mnt/datalake/.../XYZ/.../abc.parquet"))

gives me the size of the abc file, but I want the complete size of XYZ.


Solution

  • dbutils.fs.ls doesn't have a recursive option the way cp, mv, or rm do, so you need to walk the tree yourself. Here is a snippet that does the job; run it from a Databricks notebook.

    from dbutils import FileInfo  # only needed for the type hint below
    from typing import List
    
    root_path = "/mnt/datalake/.../XYZ"
    
    def discover_size(path: str, verbose: bool = True):
      def loop_path(paths: List[FileInfo], accum_size: float):
        if not paths:
          return accum_size
        else:
          head, tail = paths[0], paths[1:]
          if head.size > 0:
            # A non-zero size means this entry is a file: count it.
            if verbose:
              print(f"{head.path}: {head.size / 1e6} MB")
            accum_size += head.size / 1e6
            return loop_path(tail, accum_size)
          else:
            # Size 0 is treated as a directory: list its children and
            # add them to the remaining work.
            extended_tail = dbutils.fs.ls(head.path) + tail
            return loop_path(extended_tail, accum_size)
    
      return loop_path(dbutils.fs.ls(path), 0.0)
    
    discover_size(root_path, verbose=True)  # Total size in megabytes at the end
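
    Because Python does not optimize tail calls, the recursive helper above can hit the interpreter's recursion limit on directories with a very large number of entries. Here is a minimal iterative sketch of the same walk using an explicit queue; instead of the size-zero heuristic it assumes dbutils.fs.ls marks directory entries with a trailing slash in their name field:

    from collections import deque

    def discover_size_iter(path: str) -> float:
      """Total size in MB under `path`, walked with an explicit queue."""
      total_bytes = 0
      queue = deque(dbutils.fs.ls(path))
      while queue:
        item = queue.popleft()
        if item.name.endswith("/"):
          # Directory entry (trailing slash): queue its children.
          queue.extend(dbutils.fs.ls(item.path))
        else:
          total_bytes += item.size
      return total_bytes / 1e6

    discover_size_iter(root_path)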
    
    

    If the location is mounted in DBFS, you could also use the du -h approach (I have not tested it); the mount is visible to the driver's shell under the /dbfs FUSE path. If you are in a notebook, create a new cell with:

    %sh
    du -h /dbfs/mnt/datalake/.../XYZ
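
    Since the shell (and any Python code running on the driver) sees DBFS mounts through the /dbfs FUSE path, an equivalent pure-Python sketch is to walk the local path with os.walk. This assumes your cluster exposes the /dbfs FUSE mount; the "..." in the path is a placeholder carried over from the question, not a real segment:

    import os

    local_root = "/dbfs/mnt/datalake/.../XYZ"  # replace "..." with the real path

    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(local_root):
      for name in filenames:
        total_bytes += os.path.getsize(os.path.join(dirpath, name))

    print(f"{total_bytes / 1e6} MB")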