I am running a TensorFlow application on the Google ML Engine with hyper-parameter tuning and I've been running into some strange authentication issues.
My trainer code supports two ways of obtaining input data for my model: reading it directly from BigQuery, or reading it from a .csv file.

For my IAM permissions, I have two members set up:
My user account:
- roles/owner
- roles/bigquery.admin

I authenticate my user account with gcloud auth application-default login.
A service account:
- roles/bigquery.admin
- roles/storage.admin
- roles/pubsub.admin

I downloaded the service account's .json credentials file when I created it in the Google Cloud Platform interface.

When I run my trainer code on the Google ML Engine using my user account credentials and reading from a .csv file, everything works fine.
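For reference, this is roughly how I switch between the two credential setups when testing locally (the key file path here is a placeholder, not my actual path):

```shell
# Option 1: use my user account's application-default credentials,
# which the Google client libraries pick up automatically.
gcloud auth application-default login

# Option 2: point the client libraries at the service account's
# downloaded key instead (placeholder path).
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
```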
However, if I try to get my data from BigQuery, I get the following error:
Forbidden: 403 Insufficient Permission (GET https://www.googleapis.com/bigquery/v2/projects/MY-PROJECT-ID/datasets/MY-DATASET-ID/tables/MY-TABLE-NAME)
This is why I created the service account, but the service account has a separate set of issues. When using it, I am able to read from both a .csv file and from BigQuery, but in both cases I get the following error at the end of each trial:
Unable to log objective metric due to exception <HttpError 403 when requesting https://pubsub.googleapis.com/v1/projects/MY-PROJECT-ID/topics/ml_MY-JOB-ID:publish?alt=json returned "User not authorized to perform this action.">.
This doesn't cause the job to fail, but it prevents the objective metric from being recorded, so the hyper-parameter tuning does not provide any helpful output.
I'm not sure why I'm getting these permission errors when my IAM members are assigned to what I'm pretty sure are the correct roles.
My trainer code works in every case when I run it locally (although PubSub is obviously not being used when running locally), so I'm fairly certain it's not a bug in the code.
Any suggestions?
There was one point at which my service account was getting the same error as my user account when trying to access BigQuery. The fix I stumbled upon was a strange one: I removed all roles from my service account and added them again, and this resolved the BigQuery permission issue for that member.
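In case it helps anyone reproduce that workaround, removing and re-granting a role can also be done from the command line (the service account name here is a placeholder):

```shell
# Placeholder service account address.
SA="my-trainer@MY-PROJECT-ID.iam.gserviceaccount.com"

# Remove the role binding from the project...
gcloud projects remove-iam-policy-binding MY-PROJECT-ID \
    --member="serviceAccount:${SA}" \
    --role="roles/bigquery.admin"

# ...then grant the same role again.
gcloud projects add-iam-policy-binding MY-PROJECT-ID \
    --member="serviceAccount:${SA}" \
    --role="roles/bigquery.admin"
```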
Thanks for the very detailed question.
To explain what happened here: in the first case, Cloud ML Engine used an internal service account (the one added to your project with the Cloud ML Service Agent role). Due to internal security considerations, that service account is restricted from accessing BigQuery, hence the first 403 error you saw.
Now, when you replaced the machine credentials with your own service account using the .json credentials file, that restriction went away. However, your service account didn't have access to all of the internal systems, such as the Pub/Sub topic used internally by the hyperparameter tuning mechanism. Hence the Pub/Sub error in the second case.
There are a few possible solutions to this problem:
- On the Cloud ML Engine side, we're working on better BigQuery support out of the box, although we don't have an ETA at this point.
- Your approach with a custom service account might work as a short-term solution as long as you don't use hyperparameter tuning. However, this is fragile because it depends on implementation details in Cloud ML Engine, so I wouldn't recommend relying on it long-term.
- Finally, consider exporting the data from BigQuery to GCS first and reading the training data from GCS. This scenario is well supported in Cloud ML Engine. Besides, you'll get performance gains on large datasets compared to reading from BigQuery directly: the current implementation of BigQueryReader in TensorFlow has suboptimal performance characteristics, which we're also working to improve.
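As a sketch of that last option, a one-off export can be done with the bq command-line tool (the bucket name below is a placeholder):

```shell
# Export the BigQuery table to sharded CSV files on GCS.
# The wildcard lets BigQuery split large tables across shards.
bq extract \
    --destination_format=CSV \
    'MY-PROJECT-ID:MY-DATASET-ID.MY-TABLE-NAME' \
    'gs://my-training-bucket/training-data-*.csv'
```

Your trainer can then read the exported gs:// paths the same way it reads any other .csv input.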