python, google-cloud-platform, jupyter-notebook, google-cloud-storage, google-cloud-dataproc

How to get the list of files in the GCS Bucket using the Jupyter notebook in Dataproc?



I have recently started using GCP for my project and ran into difficulties working with a Cloud Storage bucket from a Jupyter notebook on a Dataproc cluster. At the moment I have a bucket with a bunch of files in it and a Dataproc cluster with Jupyter installed. What I am trying to do is go over all the files in the bucket and extract the data from them to create a dataframe.

I can access one file at a time with the following code: data = spark.read.csv('gs://BUCKET_NAME/PATH/FILENAME.csv'), but there are hundreds of files, and I cannot write a line of code for each of them. Usually, I would do something like this:

import os
for filename in os.listdir(directory):
    ...  # read each file and add its contents to the dataframe

but this does not work here, since the files live in a GCS bucket rather than on the cluster's local filesystem. So, I was wondering: how do I iterate over the files in a bucket from a Jupyter notebook in a Dataproc cluster?

Would appreciate any help!


Solution

  • You can list the objects in your bucket with the following code:

    from google.cloud import storage

    # On Dataproc the client picks up the VM's service account credentials.
    client = storage.Client()

    BUCKET_NAME = 'your_bucket_name'
    bucket = client.get_bucket(BUCKET_NAME)

    # list_blobs() returns an iterator over every object in the bucket.
    elements = bucket.list_blobs()
    files = [a.name for a in elements]
    

    If there are no folders in your bucket, the list called files will contain the file names; if the bucket uses folder-like prefixes, those prefixes are part of each name. The sketch below shows one way to feed these names into Spark.
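
    As a follow-up, here is a minimal sketch of how the listed names could be combined with Spark to build a single dataframe. It assumes every object of interest is a CSV file with a compatible schema, and it reuses BUCKET_NAME and files from the snippet above; the header and inferSchema options are illustrative assumptions, not part of the original answer.

    # Hypothetical follow-up: build one dataframe from all CSV files in the bucket,
    # reusing BUCKET_NAME and files from the listing above.
    csv_paths = ['gs://{}/{}'.format(BUCKET_NAME, name)
                 for name in files if name.endswith('.csv')]

    # spark.read.csv accepts a list of paths, so every file lands in one dataframe.
    data = spark.read.csv(csv_paths, header=True, inferSchema=True)

    If all the files sit under one prefix, Spark's GCS connector should also accept a wildcard path such as spark.read.csv('gs://BUCKET_NAME/PATH/*.csv'), which avoids listing the bucket explicitly.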