Tags: google-cloud-platform, google-cloud-storage, automl

Generate CSV import file for AutoML Vision from an existing bucket


I already have a Google Cloud Storage bucket organized by label as follows:

gs://my_bucket/dataset/label1/
gs://my_bucket/dataset/label2/
...

Each label folder has photos inside. I would like to generate the required CSV file – as explained here – but I don't know how to do it programmatically, since each folder contains hundreds of photos. The CSV file should look like this:

gs://my_bucket/dataset/label1/photo1.jpg,label1
gs://my_bucket/dataset/label1/photo12.jpg,label1
gs://my_bucket/dataset/label2/photo7.jpg,label2
...

Solution

  • You need to list all the files inside the dataset folder with their complete paths, then parse each path to obtain the name of the folder containing the file, since in your case that folder name is the label you want to use. This can be done in several different ways. I will include two examples on which you can base your code:

    The gsutil ls command lists bucket contents; you can then parse each path with a bash script:

    # Create the csv file and define the bucket path
    bucket_path="gs://my_bucket/dataset/"
    filename="labels_csv_bash.csv"
    touch "$filename"
    
    # The internal field separator has to be set to split on newlines only,
    # so the listed paths are not broken apart. Save the original value first.
    OLDIFS=$IFS
    IFS=$'\n'
    
    # List every .jpg file inside the dataset folder. ** searches recursively.
    for i in $(gsutil ls "${bucket_path}**.jpg")
    do
            # Cut the path on the / delimiter and take the second item from
            # the end, i.e. the folder containing the file (the label).
            label=$(echo "$i" | rev | cut -d'/' -f2 | rev)
            # No space after the comma, to match the format shown above
            echo "$i,$label" >> "$filename"
    done
    
    IFS=$OLDIFS # Restore the original value
    
    gsutil cp "$filename" "$bucket_path"
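
    The script writes one gs://path,label row per image and then copies the
    resulting csv back into the bucket, so it can be referenced directly when
    importing the dataset into AutoML Vision.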
    

    The same can also be done using the Google Cloud client libraries, which are provided for different languages. Here is an example using Python:

    # Imports the Google Cloud client library
    from google.cloud import storage
    
    # Instantiates a client
    storage_client = storage.Client()
    
    # The bucket and the folder containing the dataset
    bucket_name = 'my_bucket'
    path_in_bucket = 'dataset'
    
    blobs = storage_client.list_blobs(bucket_name, prefix=path_in_bucket)
    
    # Read the blobs, parse the label from each path and create the csv file
    filename = 'labels_csv_python.csv'
    with open(filename, 'w') as f:
        for blob in blobs:
            if blob.name.endswith('.jpg'):
                image_path = 'gs://{}/{}'.format(bucket_name, blob.name)
                # The label is the name of the folder containing the file
                label = blob.name.split('/')[-2]
                # No space after the comma, to match the format shown above
                f.write(','.join([image_path, label]) + '\n')
    
    # Uploading the csv file to the bucket
    bucket = storage_client.get_bucket(bucket_name)
    destination_blob_name = path_in_bucket + '/' + filename
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(filename)
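
    Once the csv file is in the bucket, it can be used to import the images
    into an AutoML Vision dataset. Below is a minimal sketch using the
    google-cloud-automl client library; project_id, dataset_id and the csv
    path are placeholders you would replace with your own values:
    
    # Minimal sketch: import the generated csv into an existing AutoML
    # Vision dataset. project_id and dataset_id below are placeholders.
    from google.cloud import automl
    
    project_id = 'my-project'      # placeholder: your GCP project id
    dataset_id = 'ICN0000000000'   # placeholder: your AutoML dataset id
    
    client = automl.AutoMlClient()
    # AutoML Vision datasets live in the us-central1 region
    dataset_full_id = client.dataset_path(project_id, 'us-central1', dataset_id)
    
    # Point the import at the csv file generated and uploaded above
    gcs_source = automl.GcsSource(
        input_uris=['gs://my_bucket/dataset/labels_csv_python.csv'])
    input_config = automl.InputConfig(gcs_source=gcs_source)
    
    # import_data returns a long-running operation; block until it completes
    response = client.import_data(name=dataset_full_id, input_config=input_config)
    print('Data imported: {}'.format(response.result()))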