Tags: python, tensorflow, google-cloud-platform, google-cloud-storage, tfrecord

Output TFRecord to Google Cloud Storage from Python


I know tf.python_io.TFRecordWriter has a concept of GCS, but it doesn't seem to have permissions to write to it.

If I do the following:

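import tensorflow as tf

# object_name and record_name are defined elsewhere
output_path = 'gs://my-bucket-name/{}/{}.tfrecord'.format(object_name, record_name)
writer = tf.python_io.TFRecordWriter(output_path)  # tf.io.TFRecordWriter in newer releases
# write to writer
writer.close()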

then I get 401s saying "Anonymous caller does not have storage.objects.create access to my-bucket-name."

However, on the same machine, gsutil rsync -d -r gs://my-bucket-name bucket-backup syncs properly, so I have authenticated correctly using gcloud.

How can I give TFRecordWriter permission to write to GCS? I'm going to use Google's GCP Python API for now, but I'm sure there's a way to do this using TF alone.


Solution

  • A common way to set up credentials on a system is to use Application Default Credentials (ADC). ADC is a strategy for locating Google Cloud service account credentials.

    If the environment variable GOOGLE_APPLICATION_CREDENTIALS is set, ADC uses the file that the variable points to for service account credentials. This file is a Google Cloud service account credentials file in JSON format. The older P12 (PFX) certificates are deprecated.

    If the environment variable is not set, the default service account is used for credentials, provided the application is running on Compute Engine, App Engine, Kubernetes Engine or Cloud Functions.

    If neither of the previous two steps finds valid credentials, ADC fails with an error.

    In this question, ADC could not find credentials, so TensorFlow's writes to GCS were sent as an anonymous caller and failed.
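
The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the two steps named in this answer, not the actual ADC implementation; real client libraries may check other locations as well. The function name `adc_credential_source` and the `on_google_runtime` flag are hypothetical.

```python
import os


def adc_credential_source(environ, on_google_runtime):
    """Return which credential source the lookup described above would pick.

    environ: mapping of environment variables, for example os.environ.
    on_google_runtime: hypothetical flag, True when the code runs on
        Compute Engine, App Engine, Kubernetes Engine or Cloud Functions.
        A real client detects this on its own.
    """
    # Step 1: an explicit key file named by the environment variable wins.
    key_file = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_file:
        return "service account key file: " + key_file

    # Step 2: otherwise fall back to the runtime's default service account.
    if on_google_runtime:
        return "default service account"

    # Nothing found: this sketch returns None where a real client would fail.
    return None
```

On a workstation with the variable unset, `adc_credential_source(os.environ, on_google_runtime=False)` returns `None`, which corresponds to the "Anonymous caller" situation in the question.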

    The solution is to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the service account JSON file.

    For Linux:

    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
    

    For Windows:

    set GOOGLE_APPLICATION_CREDENTIALS=C:\path\to\service-account.json
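
If changing the shell environment is inconvenient, the same variable can be set from inside the Python process. The following is a minimal sketch, assuming the variable is set before TensorFlow first accesses GCS and that the installed TensorFlow exposes `tf.io.TFRecordWriter` (older 1.x releases use `tf.python_io.TFRecordWriter`, as in the question). The helper names `use_service_account` and `write_tfrecord` are hypothetical.

```python
import os


def use_service_account(key_path, environ=None):
    """Point GOOGLE_APPLICATION_CREDENTIALS at key_path for this process."""
    environ = os.environ if environ is None else environ
    # Fail early with a clear message instead of a 401 at write time.
    if not os.path.isfile(key_path):
        raise FileNotFoundError("service account key not found: " + key_path)
    environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path


def write_tfrecord(output_path, serialized_records):
    """Write already-serialized records (bytes) to a local or gs:// path."""
    # Imported here so the credentials helper works without TensorFlow.
    import tensorflow as tf

    with tf.io.TFRecordWriter(output_path) as writer:
        for record in serialized_records:
            writer.write(record)
```

Typical use would be to call `use_service_account('/path/to/service-account.json')` first and then `write_tfrecord(output_path, records)` with the `gs://` path from the question. The service account also needs permission to create objects in the bucket.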
    

    I wrote an article that goes into more detail on ADC.

    Google Cloud Application Default Credentials