google-bigquery, google-cloud-pubsub, google-cloud-ml-engine, google-iam

Google ML Engine - Unable to log objective metric due to exception <HttpError 403>


I am running a TensorFlow application on the Google ML Engine with hyper-parameter tuning and I've been running into some strange authentication issues.

My Data and Permissions Setup

My trainer code supports two ways of obtaining input data for my model (a simplified sketch of both paths follows the list):

  1. Getting a table from BigQuery.
  2. Reading from a .csv file.
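
For illustration, here is a minimal sketch of the two input paths. This is not the exact trainer code; it assumes the google-cloud-bigquery client and tf.data, and all project/dataset/table names are placeholders:

    # Simplified sketch of the two input paths; not the exact trainer code.
    # Assumes the google-cloud-bigquery client and tf.data; all IDs are placeholders.
    import tensorflow as tf
    from google.cloud import bigquery

    def read_from_csv(csv_path):
        """Input path 1: read training rows from a .csv file."""
        return tf.data.TextLineDataset(csv_path).skip(1)  # skip the header row

    def read_from_bigquery(project_id, dataset_id, table_name):
        """Input path 2: fetch a table from BigQuery.

        client.get_table() issues the GET .../datasets/.../tables/... request
        that returns the 403 described below.
        """
        client = bigquery.Client(project=project_id)
        table_ref = client.dataset(dataset_id).table(table_name)
        table = client.get_table(table_ref)
        return client.list_rows(table)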

For my IAM permissions, I have two members set up (how each member's credentials reach the Google Cloud clients is sketched after this list):

  1. My user account:

    • Assigned to the following IAM roles:
      1. Project Owner (roles/owner)
      2. BigQuery Admin (roles/bigquery.admin)
    • Credentials were created automatically when I used gcloud auth application-default login
  2. A service account:

    • Assigned to the following IAM roles:
      1. BigQuery Admin (roles/bigquery.admin)
      2. Storage Admin (roles/storage.admin)
      3. PubSub Admin (roles/pubsub.admin)
    • Credentials were downloaded to a .json file when I created it in the Google Cloud Platform interface.
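
For reference, a sketch of how each credential type is typically picked up by the Google Cloud clients. The paths and the explicit key loading are illustrative assumptions about a typical setup, not details from my actual configuration:

    # Sketch of how each credential type typically reaches a Google Cloud client.
    # The key-file paths and explicit loading are illustrative, not the exact setup.
    import os
    from google.cloud import bigquery
    from google.oauth2 import service_account

    # 1. User account: Application Default Credentials created by
    #    `gcloud auth application-default login` are picked up automatically
    #    when no explicit credentials are passed.
    client = bigquery.Client(project="MY-PROJECT-ID")

    # 2. Service account: either point the environment variable at the .json key...
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"
    client = bigquery.Client(project="MY-PROJECT-ID")

    # ...or load the key explicitly and pass the credentials to the client.
    creds = service_account.Credentials.from_service_account_file(
        "/path/to/service-account.json")
    client = bigquery.Client(project="MY-PROJECT-ID", credentials=creds)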

The Problem

When I run my trainer code on the Google ML Engine using my user account credentials and reading from a .csv file, everything works fine.

However, if I try to get my data from BigQuery, I get the following error:

    Forbidden: 403 Insufficient Permission (GET https://www.googleapis.com/bigquery/v2/projects/MY-PROJECT-ID/datasets/MY-DATASET-ID/tables/MY-TABLE-NAME)

This is why I created the service account, but the service account has its own set of issues. When using the service account, I am able to read both from a .csv file and from BigQuery, but in both cases I get the following error at the end of each trial:

    Unable to log objective metric due to exception <HttpError 403 when requesting https://pubsub.googleapis.com/v1/projects/MY-PROJECT-ID/topics/ml_MY-JOB-ID:publish?alt=json returned "User not authorized to perform this action.">.

This doesn't cause the job to fail, but it prevents the objective metric from being recorded, so the hyper-parameter tuning does not provide any helpful output.

The Question

I'm not sure why I'm getting these permission errors when my IAM members are assigned to what I'm pretty sure are the correct roles.

My trainer code works in every case when I run it locally (although PubSub is obviously not being used when running locally), so I'm fairly certain it's not a bug in the code.

Any suggestions?

Notes

There was one point at which my service account was getting the same error as my user account when trying to access BigQuery. The fix I stumbled upon was a strange one: I removed all roles from the service account and added them back, and that resolved the BigQuery permission issue for that member.


Solution

  • Thanks for the very detailed question.

    To explain what happened here: in the first case, Cloud ML Engine used an internal service account (the one that is added to your project with the Cloud ML Service Agent role). Due to some internal security considerations, that service account is restricted from accessing BigQuery, hence the first 403 error that you saw.

    Now, when you replaced the machine credentials with your own service account via the .json credentials file, that restriction went away. However, your service account doesn't have access to all of the internal systems, such as the PubSub service used internally by the hyperparameter tuning mechanism. Hence the PubSub error in the second case.

    There are a few possible solutions to this problem:

    • on the Cloud ML Engine side, we're working on better BigQuery support out-of-the-box, although we don't have an ETA at this point.

    • your approach with a custom service account might work as a short-term solution as long as you don't use hyperparameter tuning. However, this is obviously fragile because it depends on implementation details of Cloud ML Engine, so I wouldn't recommend relying on it long-term.

    • finally, consider exporting the data from BigQuery to GCS first and reading the training data from GCS (a sketch of this approach follows below). This scenario is well supported in Cloud ML Engine. Besides, you'll see performance gains on large datasets compared to reading from BigQuery directly: the current implementation of BigQueryReader in TensorFlow has suboptimal performance characteristics, which we're also working to improve.
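
    To make the last option concrete, here is a rough sketch of the export-and-read flow. The bucket, project, dataset, and table names are placeholders, and the exact extract and input-pipeline code will depend on your setup:

        # Rough sketch: export a BigQuery table to CSV shards on GCS, then read
        # those shards in the trainer. All names and paths are placeholders.
        import tensorflow as tf
        from google.cloud import bigquery

        def export_table_to_gcs(project_id, dataset_id, table_name, gcs_uri):
            """One-off export step, run before submitting the training job."""
            client = bigquery.Client(project=project_id)
            table_ref = client.dataset(dataset_id).table(table_name)
            # gcs_uri e.g. "gs://MY-BUCKET/training-data/rows-*.csv"
            extract_job = client.extract_table(table_ref, gcs_uri)
            extract_job.result()  # block until the extract job completes

        def make_input_fn(gcs_pattern, batch_size=128):
            """Inside the trainer, read the exported CSV shards from GCS."""
            def input_fn():
                files = tf.data.Dataset.list_files(gcs_pattern)
                dataset = files.interleave(
                    lambda f: tf.data.TextLineDataset(f).skip(1),  # skip each shard's header
                    cycle_length=4)
                return dataset.batch(batch_size)
            return input_fn

    One practical upside: with the data on GCS, you can drop the custom service account and go back to the default machine credentials, which should also avoid the PubSub error during hyperparameter tuning.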