List folders by size in a GCS bucket with Dataflow

Looking through the code on this question, I want to create a Dataflow pipeline that looks at all the files within a specific GCS bucket folder and reports the final subdirectories that hold the most data, measured in bytes. I would write code similar to:

import apache_beam as beam
from apache_beam.io.gcp import gcsfilesystem as gcs
from apache_beam.options.pipeline_options import PipelineOptions

class SortFiles(beam.DoFn):
    def __init__(self, gfs):
        self.gfs = gfs

    def process(self, file_metadata):
        if file_metadata.size_in_bytes > 0:
            # Sort the files here?
            yield file_metadata  # placeholder: pass the metadata through unchanged

class SortFolders(beam.DoFn):
    def __init__(self, gfs):
        self.gfs = gfs

    def process(self, file_metadata):
        if file_metadata.size_in_bytes > 0:
            # Sort the folders here based on maximum addition of a combination
            # of the file sizes and file numbers
            yield file_metadata  # placeholder: pass the metadata through unchanged

def delete_empty_files():

    pipeline_options = PipelineOptions(...)

    gfs = gcs.GCSFileSystem(pipeline_options)
    p = beam.Pipeline(options=pipeline_options)

    # match() takes a list of patterns and returns one MatchResult per pattern.
    discover_empty = (
        p
        | 'Filenames' >> beam.Create(gfs.match([gs_folder])[0].metadata_list)
        | 'Reshuffle' >> beam.Reshuffle()
        | 'SortFilesbySize' >> beam.ParDo(SortFiles(gfs))
        | 'SortFoldersbySize' >> beam.ParDo(SortFolders(gfs))
        | 'OutputFolders' >> ...
    )

I have not decided whether to rank the folders by their total size in bytes or by the total number of files within them. How would I go about solving this? A further requirement is that I only want to find the final subdirectories, not their parent folders.


  • GCSFileSystem has a function, du, that will tell you the total size under a particular path.

    Reading your question, I think you want to:

    1. first find all directories within the bucket that do not themselves contain directories (if I understand 'final subdirectories' correctly),
    2. then run du on each of them,
    3. then sort the resulting list by size.
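
The three steps above can be sketched without touching GCS, assuming the bucket listing has already been fetched as (object name, size in bytes) pairs. The function name leaf_directory_sizes and the sample listing are hypothetical. For a directory with no subdirectories, the sum of its direct objects is the same figure a du-style call would report, so this sketch adds sizes up locally rather than calling du once per directory.

```python
import posixpath
from collections import defaultdict


def leaf_directory_sizes(objects):
    """Return (directory, total_bytes) for leaf directories, largest first.

    `objects` is an iterable of (object_name, size_in_bytes) pairs, for
    example built from a bucket listing. This input shape is an assumption.
    """
    direct_bytes = defaultdict(int)  # bytes of objects directly inside each directory
    has_subdir = set()               # directories that contain at least one subdirectory

    for name, size in objects:
        if name.endswith("/"):
            # Zero-byte "folder" placeholder object: register the directory itself.
            directory = name.rstrip("/")
            direct_bytes[directory] += 0
        else:
            directory = posixpath.dirname(name)
            direct_bytes[directory] += size

        # Every ancestor of this directory contains a subdirectory.
        while directory:
            parent = posixpath.dirname(directory)
            has_subdir.add(parent)
            directory = parent

    leaves = [(d, b) for d, b in direct_bytes.items() if d not in has_subdir]
    return sorted(leaves, key=lambda pair: pair[1], reverse=True)


# Hypothetical listing, for illustration only.
listing = [
    ("a/", 0),
    ("a/b.txt", 10),
    ("a/b/c.txt", 5),
    ("a/b/d.txt", 7),
    ("x/y/z.bin", 100),
]
print(leaf_directory_sizes(listing))  # [('x/y', 100), ('a/b', 12)]
```

Here a is excluded because it contains the subdirectory a/b, even though it also holds a file of its own.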

    If you are instead trying to count the files nested under each path:

    1. list all objects; the names will be a/, a/b.txt, a/b/c.txt, etc.
    2. write a function to count the objects nested under each sub path
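
A minimal sketch of such a counting function, assuming the object names have already been listed as plain strings. The name count_nested_objects is hypothetical, and skipping names that end in "/" is an assumption about how folder placeholder objects should be treated.

```python
import posixpath
from collections import Counter


def count_nested_objects(names):
    """Count, for every sub path, the objects nested anywhere beneath it."""
    counts = Counter()
    for name in names:
        if name.endswith("/"):
            continue  # skip zero-byte folder placeholders (assumption)
        # Credit the object to its directory and to every ancestor directory.
        prefix = posixpath.dirname(name)
        while prefix:
            counts[prefix] += 1
            prefix = posixpath.dirname(prefix)
    return counts


counts = count_nested_objects(["a/", "a/b.txt", "a/b/c.txt"])
print(counts.most_common())  # [('a', 2), ('a/b', 1)]
```

Sorting with most_common() ranks folders by file count; objects at the bucket root are not attributed to any sub path in this sketch.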