google-cloud-platform kops google-anthos

GKE Connect successfully starts but the cluster is not shown in the GCP console


Good morning!

I've been playing around with GKE Connect lately, trying to register my "remote" kops-generated clusters, running on both GCP and AWS VMs, so that I can monitor them from the GCP console.

If you've not read about GKE Connect, you can find the official documentation here.

Now the issue is that, after following multiple tutorials and trying everything out, the GKE Connect agent seems to be running properly on my k8s clusters, but they never show up in my GCP console as remote clusters. The steps I've taken are documented in this repository.

Basically the steps I've taken are as follows:

  1. Enable the required GCP APIs
  2. Create a service account for the target cluster
  3. Assign the gkehub.connect role to the created SA
  4. Generate the SA's private key
  5. Start the agent up using the following command:
gcloud alpha container hub register-cluster ${CLUSTER_NAME} \
  --context=${CLUSTER_NAME} \
  --service-account-key-file=/var/lib/jenkins/gke-connect/${SERVICE_ACC}-gke-connect-creds.json \
  --project=${CLOUD_PROJECT}
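
Steps 1 to 4 above can be sketched with standard gcloud commands. This is a minimal sketch, not the exact commands used in the post: the default values for CLOUD_PROJECT, SERVICE_ACC and KEY_FILE are hypothetical placeholders, and the API list only covers the two services whose hostnames appear in this question (the official documentation may require more). The run helper prints each command by default so the sequence can be reviewed before anything is executed.

```shell
#!/bin/sh
# Placeholder values (assumptions, not from the original post).
CLOUD_PROJECT="${CLOUD_PROJECT:-my-project}"
SERVICE_ACC="${SERVICE_ACC:-gke-connect-sa}"
KEY_FILE="${KEY_FILE:-./${SERVICE_ACC}-gke-connect-creds.json}"
SA_EMAIL="${SERVICE_ACC}@${CLOUD_PROJECT}.iam.gserviceaccount.com"

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

setup_connect_sa() {
  # Step 1: enable the APIs (assumed list; check the official docs).
  run gcloud services enable \
    gkeconnect.googleapis.com gkehub.googleapis.com \
    --project="$CLOUD_PROJECT"

  # Step 2: create a service account for the target cluster.
  run gcloud iam service-accounts create "$SERVICE_ACC" \
    --project="$CLOUD_PROJECT"

  # Step 3: grant the gkehub.connect role to that service account.
  run gcloud projects add-iam-policy-binding "$CLOUD_PROJECT" \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/gkehub.connect"

  # Step 4: generate the service account's private key.
  run gcloud iam service-accounts keys create "$KEY_FILE" \
    --iam-account="$SA_EMAIL" \
    --project="$CLOUD_PROJECT"
}
```

Calling setup_connect_sa prints the four commands; calling it with DRY_RUN=0 set in the environment would actually run them against the project.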

The agent is deployed on my cluster, and its container logs show the following:

2019/12/13 08:57:03.403373 dialer.go:261: dialer: dial: connecting to gkeconnect.googleapis.com:443...
2019/12/13 08:57:03.515452 dialer.go:272: dialer: dial: connected to gkeconnect.googleapis.com:443
2019/12/13 08:57:03.515483 tunnel.go:314: serve: opening egress stream...
2019/12/13 08:57:03.515545 tunnel.go:322: serve: registering project_number="681949624886", connection_id="db3fb4d9-1d7f-11ea-927b-0218619c9f84" connection_class="DEFAULT" agent_version="20191206-03-00" ...
2019/12/13 08:57:03.515592 dialer.go:222: Dial successful, current connections: 3
2019/12/13 08:57:08.515779 tunnel.go:374: serve: serving requests...

As a side note, API requests seem to be taking very long: GCP's API console displays an average response time of 8 minutes. Has anyone experienced anything similar?

Thanks!

Edit 1: Adding further information

Not sure if this is how it's meant to work, since it's not documented anywhere, but the GKE Connect agent seems to maintain 3 connections, each of which disconnects after 5 to 8 minutes with the following trace pattern:

2019/12/13 11:04:30.519779 dialer.go:277: dialer: dial: connection to gkeconnect.googleapis.com:443 closed after 8m1.174074486s
2019/12/13 11:04:30.519831 dialer.go:204: dialer: connection done: <nil>
2019/12/13 11:04:30.519839 dialer.go:305: dialer: backoff: reset
2019/12/13 11:04:30.519847 dialer.go:236: dialer: dial interval was 5m0.950672921s
2019/12/13 11:04:30.519859 dialer.go:180: dialer: waiting for next event, outstanding connections=2
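
To check whether the disconnects really follow a regular 5 to 8 minute pattern, the connection lifetimes can be pulled out of the agent logs. The filter below is a small sketch that relies only on the "closed after" wording visible in the trace above; the function name closed_durations is made up for illustration, and the kubectl namespace and label in the comment are assumptions about how the agent is deployed.

```shell
#!/bin/sh
# Read agent log lines on stdin and print one connection lifetime per line,
# e.g. "8m1.174074486s", taken from the "closed after <duration>" messages.
closed_durations() {
  grep -o 'closed after [0-9hms.]*' | awk '{print $3}'
}

# Possible usage (namespace and label are assumptions, adjust to your cluster):
#   kubectl logs -n gke-connect -l app=gke-connect-agent | closed_durations
```

If every lifetime is close to the same value, the closes are more likely a server-side stream lifetime than a network fault, although the logs alone cannot confirm that.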

Edit 2: Connectivity

Connectivity to the required endpoints also seems to be fine from within a container deployed on my cluster:

/usr/src/app # ping oauth2.googleapis.com
PING oauth2.googleapis.com (172.217.21.234): 56 data bytes
64 bytes from 172.217.21.234: seq=0 ttl=48 time=1.169 ms
64 bytes from 172.217.21.234: seq=1 ttl=48 time=1.165 ms

--- oauth2.googleapis.com ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 1.165/1.167/1.169 ms

/usr/src/app # ping gkeconnect.googleapis.com
PING gkeconnect.googleapis.com (172.217.22.42): 56 data bytes
64 bytes from 172.217.22.42: seq=0 ttl=48 time=1.115 ms
64 bytes from 172.217.22.42: seq=1 ttl=48 time=1.201 ms

--- gkeconnect.googleapis.com ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 1.115/1.158/1.201 ms
/usr/src/app # ping gkehub.googleapis.com
PING gkehub.googleapis.com (216.58.206.10): 56 data bytes
64 bytes from 216.58.206.10: seq=0 ttl=48 time=1.374 ms
64 bytes from 216.58.206.10: seq=1 ttl=48 time=1.428 ms

--- gkehub.googleapis.com ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 1.374/1.401/1.428 ms

/usr/src/app # ping www.googleapis.com
PING www.googleapis.com (172.217.16.202): 56 data bytes
64 bytes from 172.217.16.202: seq=0 ttl=48 time=1.357 ms
64 bytes from 172.217.16.202: seq=1 ttl=48 time=1.382 ms

--- www.googleapis.com ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 1.357/1.369/1.382 ms

/usr/src/app # ping accounts.google.com
PING accounts.google.com (172.217.23.141): 56 data bytes
64 bytes from 172.217.23.141: seq=0 ttl=48 time=1.447 ms
64 bytes from 172.217.23.141: seq=1 ttl=48 time=1.400 ms

--- accounts.google.com ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 1.400/1.423/1.447 ms

/usr/src/app # ping gcr.io
PING gcr.io (173.194.76.82): 56 data bytes
64 bytes from 173.194.76.82: seq=0 ttl=32 time=10.311 ms
64 bytes from 173.194.76.82: seq=1 ttl=32 time=10.386 ms

--- gcr.io ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 10.311/10.348/10.386 ms
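
Note that ping only exercises ICMP, while the agent talks to these endpoints over HTTPS on port 443, so a proxy or firewall could still interfere even when ping succeeds. The sketch below repeats the check over HTTPS for the same hosts. It assumes curl is available in the container; the function names and the PROBE override are illustrative only.

```shell
#!/bin/sh
# Hosts taken from the ping tests above.
ENDPOINTS="oauth2.googleapis.com gkeconnect.googleapis.com gkehub.googleapis.com www.googleapis.com accounts.google.com gcr.io"

# Succeeds if a TLS connection is established and any HTTP response arrives
# (the status code is ignored because --fail is not used).
probe() {
  curl -sS -o /dev/null --connect-timeout 5 "https://$1"
}

# Print "OK <host>" or "FAIL <host>" for every endpoint.
# PROBE can name an alternative probe function or command.
check_endpoints() {
  for host in $ENDPOINTS; do
    if "${PROBE:-probe}" "$host"; then
      echo "OK $host"
    else
      echo "FAIL $host"
    fi
  done
}
```

Running check_endpoints from a pod on the cluster gives one line per host. A FAIL here alongside a working ping would point at port 443 being filtered or intercepted.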

Edit 3: Further testing

Thanks to Armando in the comments, I took a second look at the official Anthos workshop. I also found these codelabs, which basically tell the same story.

They seem to claim that a whitelisted service account is required for cluster registration, but they never state what the "whitelisting" process involves.

Checking out the GKE Connect scripts, this one does pretty much what I'm doing myself: it creates the service account, grants the required permissions, registers the cluster, and generates a KSA whose key I can use to access the cluster from the GCP console.

Now there's that sketchy line about the whitelisting process, which may be the key to fixing this issue, but I'm surprised I've not been able to find any reference to said process.


Solution

  • Anthos by Google Cloud requires a paid subscription. The documents you're reviewing assume an existing subscription. To try or buy Anthos, you'll need to contact sales; the links are on the main Anthos page: https://cloud.google.com/anthos/