bash, google-cloud-storage, google-cloud-pubsub, google-cloud-shell

Replace whitespace (" ") with underscore ("_") in every filename in Google Cloud Storage programmatically


I have a number of .csv files of tabular data, imported from an external data source, stored in different folders of a Cloud Storage bucket. Every day, a new file is imported into each folder of the bucket. Each filename contains a space (" ") and ends with the ".csv" extension. I have written a Cloud Function that copies every existing file from this source bucket to a newly created cleaned bucket and replaces the space (" ") in the filename with an underscore ("_"). Is there a way, using Cloud Functions and Pub/Sub, to have the function process only the newly uploaded file instead of manually scanning which files are in both buckets? Essentially, I would like the Pub/Sub event to carry the filename and file metadata, but I do not know how to send and access this data in the event.

Thanks in advance!


Solution

  • This answer by Marc Anthony B explains how to rename objects by removing square brackets [] from their names. You can follow the same approach to replace whitespace with an underscore by changing the regex pattern, as shown below.

    The code follows these 3 steps:

    1. List the objects that you want to rename.
    2. Iterate that list.
    3. For each object, change the name. The files aren't renamed in place on the backend; each rename is performed as a copy followed by a delete of the original object.
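
    import re
    from google.cloud import storage

    bucket_name = "my_bucket"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    pattern = r"\s"  # regex matching any whitespace character

    # Materialize the listing first so renames do not affect the iteration.
    blobs = list(storage_client.list_blobs(bucket_name))
    for blob in blobs:
        # re.search finds whitespace anywhere in the name; re.match only checks the start.
        if re.search(pattern, blob.name):
            fixed_name = re.sub(pattern, "_", blob.name)
            # rename_blob copies the object to the new name, then deletes the original.
            bucket.rename_blob(blob, fixed_name)
            print(f"Renamed {blob.name!r} to {fixed_name!r}")
        else:
            print(f"No change required for {blob.name!r}")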
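
The substitution itself can be checked locally without touching a bucket. Below is a small sketch, where `clean_name` is a hypothetical helper name and the sample object names are made up. Note that `re.match` only tests the start of the string, so `re.search` is the call that detects whitespace in the middle of a name.

```python
import re

WHITESPACE = re.compile(r"\s")


def clean_name(name):
    """Return the object name with every whitespace character replaced by "_"."""
    return WHITESPACE.sub("_", name)


# Made-up object names, one with a space and one already clean.
for name in ["reports/daily export.csv", "reports/clean.csv"]:
    print(name, "->", clean_name(name))

# re.match anchors at the beginning, so it misses the space in the middle.
print(re.match(r"\s", "daily export.csv"))
# re.search scans the whole string and finds it.
print(bool(re.search(r"\s", "daily export.csv")))
```

Because `\s` also matches tabs and newlines, a literal `" "` pattern may be the safer choice if only plain spaces should be replaced.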
    

    You can also use the gsutil mv command to rename all objects with a given prefix so that they have a new prefix. Refer to the gsutil mv documentation for more information.

    gsutil mv gs://my_bucket/oldprefix gs://my_bucket/newprefix
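
For the event-driven part of the question, here is a minimal sketch. It assumes a Pub/Sub notification is configured on the source bucket and that the function is a 1st gen, Pub/Sub-triggered Cloud Function. Cloud Storage notifications carry `bucketId`, `objectId`, and `eventType` as message attributes, and with the JSON payload format the object's metadata arrives base64-encoded in the message data. The names `handle_upload`, `parse_notification`, and `my_cleaned_bucket` are placeholders and are not taken from the original answer.

```python
import base64
import json
import re

DEST_BUCKET = "my_cleaned_bucket"  # hypothetical destination bucket name


def parse_notification(event):
    """Extract bucket, object name, event type, and metadata from a Pub/Sub event."""
    attrs = event.get("attributes") or {}
    metadata = {}
    if event.get("data"):
        # With the JSON payload format, data is base64-encoded object metadata.
        metadata = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    return attrs.get("bucketId"), attrs.get("objectId"), attrs.get("eventType"), metadata


def handle_upload(event, context):
    """Pub/Sub-triggered entry point: copy one new object under a cleaned name."""
    bucket_id, object_id, event_type, metadata = parse_notification(event)
    if event_type != "OBJECT_FINALIZE" or not object_id:
        return None  # ignore deletes, metadata updates, and malformed messages

    # Imported here so the parsing helper can be exercised without the client library.
    from google.cloud import storage

    client = storage.Client()
    source_bucket = client.bucket(bucket_id)
    new_name = re.sub(r"\s", "_", object_id)
    # Copy only the object named in the event; no bucket scan is needed.
    source_bucket.copy_blob(
        source_bucket.blob(object_id), client.bucket(DEST_BUCKET), new_name
    )
    print(f"Copied {object_id} ({metadata.get('size')} bytes) to {DEST_BUCKET}/{new_name}")
    return new_name
```

A notification of this kind can typically be created with something like `gsutil notification create -t TOPIC -f json -e OBJECT_FINALIZE gs://my_bucket`, where TOPIC is your Pub/Sub topic. A function triggered directly by the bucket's finalize event would be an alternative that skips the explicit Pub/Sub topic.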