Tags: google-cloud-platform, google-cloud-storage, automl

Generate CSV import file for AutoML Vision from an existing bucket


I already have a Google Cloud Storage bucket organized by label as follows:

gs://my_bucket/dataset/label1/
gs://my_bucket/dataset/label2/
...

Each label folder has photos inside. I would like to generate the required CSV file – as explained here – but I don't know how to do it programmatically, since each folder contains hundreds of photos. The CSV file should look like this:

gs://my_bucket/dataset/label1/photo1.jpg,label1
gs://my_bucket/dataset/label1/photo12.jpg,label1
gs://my_bucket/dataset/label2/photo7.jpg,label2
...

Solution

  • You need to list all the files inside the dataset folder with their complete paths, then parse each path to obtain the name of the folder containing the file, since in your case that folder name is the label you want to use. This can be done in several different ways. I will include two examples on which you can base your code:

    The gsutil ls command lists bucket contents; you can then parse each path with a bash script:

    # Create the csv file and define the bucket path
    bucket_path="gs://my_bucket/dataset/"
    filename="labels_csv_bash.csv"
    touch "$filename"
    
    # The internal field separator has to be set to split on newlines only,
    # so the listed paths are not broken apart. Save the original value first.
    OLDIFS=$IFS
    IFS=$'\n'
    
    # List every .jpg file inside the dataset folder. ** searches recursively.
    for i in $(gsutil ls "${bucket_path}**.jpg")
    do
            # Cut the path on the / delimiter and take the second item from
            # the end, i.e. the folder containing the file (the label).
            label=$(echo "$i" | rev | cut -d'/' -f2 | rev)
            # No space after the comma, to match the format shown above
            echo "$i,$label" >> "$filename"
    done
    
    IFS=$OLDIFS # Restore the original value
    
    gsutil cp "$filename" "$bucket_path"
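
    The script writes one gs://path,label row per image and then copies the
    resulting csv back into the bucket, so it can be referenced directly when
    importing the dataset into AutoML Vision.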
    

    The same can also be done using the Google Cloud client libraries, which are provided for different languages. Here is an example using Python:

    # Imports the Google Cloud client library
    from google.cloud import storage
    
    # Instantiates a client
    storage_client = storage.Client()
    
    # The bucket and the folder containing the dataset
    bucket_name = 'my_bucket'
    path_in_bucket = 'dataset'
    
    blobs = storage_client.list_blobs(bucket_name, prefix=path_in_bucket)
    
    # Read the blobs, parse the label from each path and create the csv file
    filename = 'labels_csv_python.csv'
    with open(filename, 'w') as f:
        for blob in blobs:
            if blob.name.endswith('.jpg'):
                image_path = 'gs://{}/{}'.format(bucket_name, blob.name)
                # The label is the name of the folder containing the file
                label = blob.name.split('/')[-2]
                # No space after the comma, to match the format shown above
                f.write(','.join([image_path, label]) + '\n')
    
    # Uploading the csv file to the bucket
    bucket = storage_client.get_bucket(bucket_name)
    destination_blob_name = path_in_bucket + '/' + filename
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(filename)
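
    Once the csv file is in the bucket, it can be used to import the images
    into an AutoML Vision dataset. Below is a minimal sketch using the
    google-cloud-automl client library; project_id, dataset_id and the csv
    path are placeholders you would replace with your own values:
    
    # Minimal sketch: import the generated csv into an existing AutoML
    # Vision dataset. project_id and dataset_id below are placeholders.
    from google.cloud import automl
    
    project_id = 'my-project'      # placeholder: your GCP project id
    dataset_id = 'ICN0000000000'   # placeholder: your AutoML dataset id
    
    client = automl.AutoMlClient()
    # AutoML Vision datasets live in the us-central1 region
    dataset_full_id = client.dataset_path(project_id, 'us-central1', dataset_id)
    
    # Point the import at the csv file generated and uploaded above
    gcs_source = automl.GcsSource(
        input_uris=['gs://my_bucket/dataset/labels_csv_python.csv'])
    input_config = automl.InputConfig(gcs_source=gcs_source)
    
    # import_data returns a long-running operation; block until it completes
    response = client.import_data(name=dataset_full_id, input_config=input_config)
    print('Data imported: {}'.format(response.result()))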