I have an image which will run my training job. The training data is in a Cloud SQL database. When I run the cloud_sql_proxy on my local machine, the container can connect just fine.
❯ docker run --rm us.gcr.io/myproject/trainer:latest mysql -uroot -h"'172.17.0.2'" -e"'show databases;'"
Running: `mysql -uroot -h'172.17.0.2' -e'show databases;'`
Database
information_schema
mytrainingdatagoeshere
mysql
performance_schema
I'm using mysql just to test the connection; the actual training command is elsewhere in the container. When I try this via AI Platform, I can't connect.
❯ gcloud ai-platform jobs submit training firsttry3 \
--region us-west2 \
--master-image-uri us.gcr.io/myproject/trainer:latest \
-- \
mysql -uroot -h"'34.94.1.2'" -e"'show tables;'"
Job [firsttry3] submitted successfully.
Your job is still active. You may view the status of your job with the command
$ gcloud ai-platform jobs describe firsttry3
or continue streaming the logs with the command
$ gcloud ai-platform jobs stream-logs firsttry3
jobId: firsttry3
state: QUEUED
❯ gcloud ai-platform jobs stream-logs firsttry3
INFO 2019-12-16 22:58:23 -0700 service Validating job requirements...
INFO 2019-12-16 22:58:23 -0700 service Job creation request has been successfully validated.
INFO 2019-12-16 22:58:23 -0700 service Job firsttry3 is queued.
INFO 2019-12-16 22:58:24 -0700 service Waiting for job to be provisioned.
INFO 2019-12-16 22:58:26 -0700 service Waiting for training program to start.
ERROR 2019-12-16 22:59:32 -0700 master-replica-0 Entered Slicetool Container
ERROR 2019-12-16 22:59:32 -0700 master-replica-0 Running: `mysql -uroot -h'34.94.1.2' -e'show tables;'`
ERROR 2019-12-16 23:01:44 -0700 master-replica-0 ERROR 2003 (HY000): Can't connect to MySQL server on '34.94.1.2'
It seems like the host isn't accessible from wherever the job gets run. How can I grant AI Platform access to Cloud SQL?
I have considered including the Cloud SQL proxy in the training container and then injecting service account credentials as user args, but since they're both in the same project I was hoping there would be no need for this step. Are these hopes misplaced?
Unfortunately, not all Cloud products are placed in the same network, so they can't automatically connect to each other just because they're in the same project. The issue you're having is that AI Platform can't reach the Cloud SQL instance at the 34.xx.x.x IP address.
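Before changing any settings, it can help to confirm that this really is a network-level problem rather than a MySQL credentials problem. In the log above the error appears a bit over two minutes after the command starts, which looks more like packets being silently dropped than a refused connection, though that is only an inference from the timestamps. A minimal sketch of a TCP probe you could run as the job command instead of mysql; the function name `can_connect` is hypothetical, and it assumes bash and the coreutils `timeout` command are available in the image:

```shell
#!/usr/bin/env bash
# Hypothetical helper: check whether a TCP port is reachable, without
# needing a MySQL client or credentials.
# Usage: can_connect HOST [PORT] [TIMEOUT_SECONDS]
# Returns 0 if the TCP connection opens, non-zero otherwise.
can_connect() {
  local host="$1" port="${2:-3306}" timeout_s="${3:-5}"
  # bash's /dev/tcp pseudo-device opens a TCP connection; `timeout` bounds
  # the wait so a dropped packet does not hang for minutes.
  timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (the IP is the one from the question):
# if can_connect 34.94.1.2 3306; then echo "reachable"; else echo "unreachable"; fi
```

With GNU `timeout`, an exit status of 124 means the wait expired (consistent with traffic being dropped), while another non-zero status means the connection failed right away (for example, it was refused).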
There are a couple of ways you can look into fixing it, with the caveat that I don't know AI Platform's networking setup well (I'll have to try it and blog about it here soonish).
First, you can see whether you can connect AI Platform to a VPC (Virtual Private Cloud) network and put your Cloud SQL instance into the same VPC. That will allow them to talk to each other over a private IP (likely different from the IP you have now). In the connection details for the Cloud SQL instance you should see whether you have a private IP; if not, you can enable it in the instance settings (this requires a shutdown and restart).
Otherwise, you can make sure a public IP address is set up, which might be the 34.xx.x.x IP, and then allowlist (whitelist, but I'm trying to change the terminology) the Cloud IP address for AI Platform.
You can read about the way GCP handles IP ranges here: https://cloud.google.com/compute/docs/ip-addresses/
Once those ranges are added to the Authorized Networks in the Cloud SQL connection settings you should be able to connect directly from AI Platform.
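The same checks can be done from the command line. This is a rough sketch, not a tested recipe for AI Platform: the function names are hypothetical wrappers, `my-instance` stands in for your Cloud SQL instance name, and `203.0.113.0/24` is a documentation-only placeholder range, not a real AI Platform range.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around gcloud; the instance name and CIDR ranges
# passed in are placeholders you would replace with your own values.

# Show the instance's IP addresses and their types
# (PRIMARY is the public IP, PRIVATE is the VPC-internal one).
show_sql_ips() {
  gcloud sql instances describe "$1" --format="yaml(ipAddresses)"
}

# Set the instance's authorized networks to a comma-separated CIDR list.
# Note: this replaces the existing list, so include ranges you already use.
authorize_networks() {
  gcloud sql instances patch "$1" --authorized-networks="$2"
}

# Example:
# show_sql_ips my-instance
# authorize_networks my-instance 203.0.113.0/24
```

Because `--authorized-networks` overwrites rather than appends, it is worth checking the current list in the instance description before patching.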
Original response:
Where's the proxy running when you're trying to connect to it from AI Platform? Still on your local machine? So basically, in scenario 1, you're running the container locally with docker run and connecting to your local IP, 172.17.0.2, and then when you shift up to AI Platform, you're connecting to your local machine at 34.xx.x.x? First, you probably want to remove your actual home IP address from your original question. People are rude on the internet, and that could end badly if that's really your home IP. Second, how sure are you that you've opened a hole in your firewall to allow traffic in from AI Platform? Generally speaking, that's where I'd assume the issue is: the connection to your local machine is being refused, and the resulting error is the "can't connect" you're seeing.