Search code examples
socket.iokubernetesgoogle-cloud-platformgoogle-kubernetes-engineuwebsockets

websockets on GKE with istio gives 'no healthy upstream' and 'CrashLoopBackOff'


I am on GKE using Istio version 1.0.3 . I try to get my express.js with socket.io (and uws engine) backend working with websockets and had this backend running before on a 'non kubernetes server' with websockets without problems.

When I simply enter the external_gke_ip as url I get my backend html page, so http works. But when my client-app makes socketio authentication calls from my client-app I get 503 errors in the browser console:

WebSocket connection to 'ws://external_gke_ip/socket.io/?EIO=3&transport=websocket' failed: Error during WebSocket handshake: Unexpected response code: 503

And when I enter the external_gke_ip as url while socket calls are made I get: no healthy upstream in the browser. And the pod gives: CrashLoopBackOff.

I find somewhere: 'in node.js land, socket.io typically does a few non-websocket Handshakes to the Server before eventually upgrading to Websockets. If you don't have sticky-sessions, the upgrade never works.' So maybe I need sticky sessions? Or not... as I just have one replica of my app? It seems to be done by setting sessionAffinity: ClientIP, but with istio I do not know how to do this and in the GUI I can edit some values of the loadbalancers, but Session affinity shows 'none' and I can not edit it.

Other settings that might be relevant and that I am not sure of (how to set using istio) are:

  • externalTrafficPolicy=Local
  • Ttl

My manifest config file:

apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    app: myapp
  ports:
  - port: 8089
    targetPort: 8089
    protocol: TCP
    name: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: gcr.io/myproject/firstapp:v1
          imagePullPolicy: Always
          ports:
          - containerPort: 8089
          env:
          - name: POSTGRES_DB_HOST
            value: 127.0.0.1:5432
          - name: POSTGRES_DB_USER
            valueFrom:
              secretKeyRef:
                name: mysecret
                key: username
          - name: POSTGRES_DB_PASSWORD
            valueFrom:
              secretKeyRef:
                name: mysecret
                key: password
          readinessProbe:
            httpGet:
              path: /healthz
              scheme: HTTP
              port: 8089
            initialDelaySeconds: 10
            timeoutSeconds: 5
        - name: cloudsql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.11
          command: ["/cloud_sql_proxy",
                    "-instances=myproject:europe-west4:osm=tcp:5432",
                    "-credential_file=/secrets/cloudsql/credentials.json"]
          securityContext:
            runAsUser: 2
            allowPrivilegeEscalation: false
          volumeMounts:
            - name: cloudsql-instance-credentials
              mountPath: /secrets/cloudsql
              readOnly: true
      volumes:
        - name: cloudsql-instance-credentials
          secret:
            secretName: cloudsql-instance-credentials
---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: myapp-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
      - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - "*"
  gateways:
  - myapp-gateway
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        host: myapp
      weight: 100
    websocketUpgrade: true
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: google-apis
spec:
  hosts:
    - "*.googleapis.com"
  ports:
    - number: 443
      name: https
      protocol: HTTPS
  location: MESH_EXTERNAL
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: cloud-sql-instance
spec:
  hosts:
  - 35.204.XXX.XX # ip of cloudsql database
  ports:
  - name: tcp
    number: 3307
    protocol: TCP
  location: MESH_EXTERNAL

Various output (while making socket calls, when I stop these the deployment restarts and READY returns to 3/3):

kubectl get pods
NAME                    READY   STATUS             RESTARTS   AGE
myapp-8888  2/3     CrashLoopBackOff   11         1h

$ kubectl describe pod/myapp-8888 gives:

Name:           myapp-8888
Namespace:      default
Node:           gke-standard-cluster-1-default-pool-888888-9vtk/10.164.0.36
Start Time:     Sat, 19 Jan 2019 14:33:11 +0100
Labels:         app=myapp
  pod-template-hash=207157
Annotations:
kubernetes.io/limit-ranger:
LimitRanger plugin set: cpu request for container app; cpu request for container cloudsql-proxy
sidecar.istio.io/status:
{"version":"3c9617ff82c9962a58890e4fa987c69ca62487fda71c23f3a2aad1d7bb46c748","initContainers":["istio-init"],"containers":["istio-proxy"]...
Status:         Running
IP:             10.44.0.5
Controlled By:  ReplicaSet/myapp-64c59c94dc
Init Containers:
  istio-init:
    Container ID:  docker://a417695f99509707d0f4bfa45d7d491501228031996b603c22aaf398551d1e45
    Image:         gcr.io/gke-release/istio/proxy_init:1.0.2-gke.0
    Image ID:      docker-pullable://gcr.io/gke-release/istio/proxy_init@sha256:e30d47d2f269347a973523d0c5d7540dbf7f87d24aca2737ebc09dbe5be53134
    Port:          <none>
    Host Port:     <none>
    Args:
      -p
      15001
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      8089,
  -d

State:          Terminated
  Reason:       Completed
  Exit Code:    0
  Started:      Sat, 19 Jan 2019 14:33:19 +0100
Finished:     Sat, 19 Jan 2019 14:33:19 +0100
Ready:          True
  Restart Count:  0
  Environment:    <none>
  Mounts:         <none>
Containers:
  app:
    Container ID:   docker://888888888888888888888888
    Image:          gcr.io/myproject/firstapp:v1
    Image ID:       docker-pullable://gcr.io/myproject/firstapp@sha256:8888888888888888888888888
    Port:           8089/TCP
    Host Port:      0/TCP
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 19 Jan 2019 14:40:14 +0100
Finished:     Sat, 19 Jan 2019 14:40:37 +0100
Last State:     Terminated
  Reason:       Completed
  Exit Code:    0
  Started:      Sat, 19 Jan 2019 14:39:28 +0100
Finished:     Sat, 19 Jan 2019 14:39:46 +0100
Ready:          False
  Restart Count:  3
  Requests:
    cpu:      100m
  Readiness:  http-get http://:8089/healthz delay=10s timeout=5s period=10s #success=1 #failure=3
  Environment:
    POSTGRES_DB_HOST:      127.0.0.1:5432
    POSTGRES_DB_USER:      <set to the key 'username' in secret 'mysecret'>  Optional: false
    POSTGRES_DB_PASSWORD:  <set to the key 'password' in secret 'mysecret'>  Optional: false
  Mounts:
    /var/run/secrets/kubernetes.io/serviceaccount from default-token-rclsf (ro)
  cloudsql-proxy:
    Container ID:  docker://788888888888888888888888888
    Image:         gcr.io/cloudsql-docker/gce-proxy:1.11
    Image ID:      docker-pullable://gcr.io/cloudsql-docker/gce-proxy@sha256:5c690349ad8041e8b21eaa63cb078cf13188568e0bfac3b5a914da3483079e2b
    Port:          <none>
    Host Port:     <none>
    Command:
      /cloud_sql_proxy
      -instances=myproject:europe-west4:osm=tcp:5432
      -credential_file=/secrets/cloudsql/credentials.json
    State:          Running
      Started:      Sat, 19 Jan 2019 14:33:40 +0100
Ready:          True
  Restart Count:  0
  Requests:
    cpu:        100m
  Environment:  <none>
  Mounts:
    /secrets/cloudsql from cloudsql-instance-credentials (ro)
    /var/run/secrets/kubernetes.io/serviceaccount from default-token-rclsf (ro)
  istio-proxy:
    Container ID:  docker://f3873d0f69afde23e85d6d6f85b1f
    Image:         gcr.io/gke-release/istio/proxyv2:1.0.2-gke.0
    Image ID:      docker-pullable://gcr.io/gke-release/istio/proxyv2@sha256:826ef4469e4f1d4cabd0dc846
    Port:          <none>
    Host Port:     <none>
    Args:
      proxy
      sidecar
      --configPath
      /etc/istio/proxy
      --binaryPath
      /usr/local/bin/envoy
      --serviceCluster
      myapp
      --drainDuration
      45s
      --parentShutdownDuration
      1m0s
      --discoveryAddress
      istio-pilot.istio-system:15007
      --discoveryRefreshDelay
      1s
      --zipkinAddress
      zipkin.istio-system:9411
      --connectTimeout
      10s
      --statsdUdpAddress
      istio-statsd-prom-bridge.istio-system:9125
      --proxyAdminPort
      15000
      --controlPlaneAuthPolicy
      NONE
    State:          Running
      Started:      Sat, 19 Jan 2019 14:33:54 +0100
Ready:          True
  Restart Count:  0
  Requests:
    cpu:  10m
  Environment:
    POD_NAME:                      myapp-64c59c94dc-8888 (v1:metadata.name)
    POD_NAMESPACE:                 default (v1:metadata.namespace)
    INSTANCE_IP:                    (v1:status.podIP)
    ISTIO_META_POD_NAME:           myapp-64c59c94dc-8888 (v1:metadata.name)
    ISTIO_META_INTERCEPTION_MODE:  REDIRECT
  Mounts:
    /etc/certs/ from istio-certs (ro)
    /etc/istio/proxy from istio-envoy (rw)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  cloudsql-instance-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cloudsql-instance-credentials
    Optional:    false
  default-token-rclsf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rclsf
    Optional:    false
  istio-envoy:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  Memory
  istio-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  istio.default
    Optional:    true
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
  node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                 Age                   From                                                        Message
  ----     ------                 ----                  ----                                                        -------
  Normal   Scheduled              7m31s                 default-scheduler                                           Successfully assigned myapp-64c59c94dc-tdb9c to gke-standard-cluster-1-default-pool-65b9e650-9vtk
  Normal   SuccessfulMountVolume  7m31s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  MountVolume.SetUp succeeded for volume "istio-envoy"
  Normal   SuccessfulMountVolume  7m31s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  MountVolume.SetUp succeeded for volume "cloudsql-instance-credentials"
  Normal   SuccessfulMountVolume  7m31s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  MountVolume.SetUp succeeded for volume "default-token-rclsf"
  Normal   SuccessfulMountVolume  7m31s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  MountVolume.SetUp succeeded for volume "istio-certs"
  Normal   Pulling                7m30s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  pulling image "gcr.io/gke-release/istio/proxy_init:1.0.2-gke.0"
  Normal   Pulled                 7m25s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Successfully pulled image "gcr.io/gke-release/istio/proxy_init:1.0.2-gke.0"
  Normal   Created                7m24s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Created container
  Normal   Started                7m23s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Started container
  Normal   Pulling                7m4s                  kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  pulling image "gcr.io/cloudsql-docker/gce-proxy:1.11"
  Normal   Pulled                 7m3s                  kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Successfully pulled image "gcr.io/cloudsql-docker/gce-proxy:1.11"
  Normal   Started                7m2s                  kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Started container
  Normal   Pulling                7m2s                  kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  pulling image "gcr.io/gke-release/istio/proxyv2:1.0.2-gke.0"
  Normal   Created                7m2s                  kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Created container
  Normal   Pulled                 6m54s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Successfully pulled image "gcr.io/gke-release/istio/proxyv2:1.0.2-gke.0"
  Normal   Created                6m51s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Created container
  Normal   Started                6m48s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Started container
  Normal   Pulling                111s (x2 over 7m22s)  kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  pulling image "gcr.io/myproject/firstapp:v3"
  Normal   Created                110s (x2 over 7m4s)   kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Created container
  Normal   Started                110s (x2 over 7m4s)   kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Started container
  Normal   Pulled                 110s (x2 over 7m7s)   kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Successfully pulled image "gcr.io/myproject/firstapp:v3"
  Warning  Unhealthy              99s                   kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  BackOff                85s                   kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Back-off restarting failed container

And:

$ kubectl logs myapp-8888 myapp

> [email protected] start /usr/src/app
> node src/

info: Feathers application started on http://localhost:8089

And the database logs (which looks ok, as some 'startup script entries' from app can be retrieved using psql):

    $ kubectl logs myapp-8888 cloudsql-proxy
  2019/01/19 13:33:40 using credential file for authentication; [email protected]
    2019/01/19 13:33:40 Listening on 127.0.0.1:5432 for myproject:europe-west4:osm
    2019/01/19 13:33:40 Ready for new connections
    2019/01/19 13:33:54 New connection for "myproject:europe-west4:osm"
    2019/01/19 13:33:55 couldn't connect to "myproject:europe-west4:osm": Post https://www.googleapis.com/sql/v1beta4/projects/myproject/instances/osm/createEphemeral?alt=json: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp 74.125.143.95:443: getsockopt: connection refused
    2019/01/19 13:39:06 New connection for "myproject:europe-west4:osm"
    2019/01/19 13:39:06 New connection for "myproject:europe-west4:osm"
    2019/01/19 13:39:06 Client closed local connection on 127.0.0.1:5432
    2019/01/19 13:39:13 New connection for "myproject:europe-west4:osm"
    2019/01/19 13:39:14 New connection for "myproject:europe-west4:osm"
    2019/01/19 13:39:14 New connection for "myproject:europe-west4:osm"
    2019/01/19 13:39:14 New connection for "myproject:europe-west4:osm"

EDIT: Here is the serverside log of the 503 of websocket calls to my app:

{
  insertId:  "465nu9g3xcn5hf"
    jsonPayload: {
      apiClaims:  ""
        apiKey:  ""
        clientTraceId:  ""
        connection_security_policy:  "unknown"
        destinationApp:  "myapp"
        destinationIp:  "10.44.XX.XX"
        destinationName:  "myapp-888888-88888"
        destinationNamespace:  "default"
        destinationOwner:  "kubernetes://apis/extensions/v1beta1/namespaces/default/deployments/myapp"
        destinationPrincipal:  ""
        destinationServiceHost:  "myapp.default.svc.cluster.local"
        destinationWorkload:  "myapp"
        httpAuthority:  "35.204.XXX.XXX"
        instance:  "accesslog.logentry.istio-system"
        latency:  "1.508885ms"
        level:  "info"
        method:  "GET"
        protocol:  "http"
        receivedBytes:  787
        referer:  ""
        reporter:  "source"
        requestId:  "bb31d922-8f5d-946b-95c9-83e4c022d955"
        requestSize:  0
        requestedServerName:  ""
        responseCode:  503
        responseSize:  57
        responseTimestamp:  "2019-01-18T20:53:03.966513Z"
        sentBytes:  164
        sourceApp:  "istio-ingressgateway"
        sourceIp:  "10.44.X.X"
        sourceName:  "istio-ingressgateway-8888888-88888"
        sourceNamespace:  "istio-system"
        sourceOwner:  "kubernetes://apis/extensions/v1beta1/namespaces/istio-system/deployments/istio-ingressgateway"
        sourcePrincipal:  ""
        sourceWorkload:  "istio-ingressgateway"
        url:  "/socket.io/?EIO=3&transport=websocket"
        userAgent:  "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
        xForwardedFor:  "10.44.X.X"
    }
    logName:  "projects/myproject/logs/stdout"
    metadata: {
      systemLabels: {
        container_image:  "gcr.io/gke-release/istio/mixer:1.0.2-gke.0"
          container_image_id:  "docker-pullable://gcr.io/gke-release/istio/mixer@sha256:888888888888888888888888888888"
          name:  "mixer"
          node_name:  "gke-standard-cluster-1-default-pool-88888888888-8887"
          provider_instance_id:  "888888888888"
          provider_resource_type:  "gce_instance"
          provider_zone:  "europe-west4-a"
          service_name: [
            0:  "istio-telemetry"
          ]
          top_level_controller_name:  "istio-telemetry"
          top_level_controller_type:  "Deployment"
      }
        userLabels: {
          app:  "telemetry"
            istio:  "mixer"
            istio-mixer-type:  "telemetry"
            pod-template-hash:  "88888888888"
        }
    }
    receiveTimestamp:  "2019-01-18T20:53:08.135805255Z"
    resource: {
      labels: {
        cluster_name:  "standard-cluster-1"
          container_name:  "mixer"
          location:  "europe-west4-a"
          namespace_name:  "istio-system"
          pod_name:  "istio-telemetry-8888888-8888888"
          project_id:  "myproject"
      }
        type:  "k8s_container"
    }
    severity:  "INFO"
    timestamp:  "2019-01-18T20:53:03.965100Z"
}

In the browser at first it properly seems to switch protocol but then causes a repeated 503 response and subsequent health issues cause a repeating restart. The protocol switch websocket call:

General:

Request URL: ws://localhost:8080/sockjs-node/842/s4888/websocket
Request Method: GET
Status Code: 101 Switching Protocols [GREEN]

Response headers:

Connection: Upgrade
Sec-WebSocket-Accept: NS8888888888888888888
Upgrade: websocket

Request headers:

Accept-Encoding: gzip, deflate, br
Accept-Language: nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7
Cache-Control: no-cache
Connection: Upgrade
Cookie: _ga=GA1.1.1118102238.18888888; hblid=nSNQ2mS8888888888888; olfsk=ol8888888888
Host: localhost:8080
Origin: http://localhost:8080
Pragma: no-cache
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Sec-WebSocket-Key: b8zkVaXlEySHasCkD4aUiw==
Sec-WebSocket-Version: 13
Upgrade: websocket
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1

Its frames: enter image description here

Following the above I get multiple of these:

Chrome output regarding websocket call:

general:

Request URL: ws://35.204.210.134/socket.io/?EIO=3&transport=websocket
Request Method: GET
Status Code: 503 Service Unavailable

response headers:

connection: close
content-length: 19
content-type: text/plain
date: Sat, 19 Jan 2019 14:06:39 GMT
server: envoy

request headers:

Accept-Encoding: gzip, deflate
Accept-Language: nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7
Cache-Control: no-cache
Connection: Upgrade
Host: 35.204.210.134
Origin: http://localhost:8080
Pragma: no-cache
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Sec-WebSocket-Key: VtKS5xKF+GZ4u3uGih2fig==
Sec-WebSocket-Version: 13
Upgrade: websocket
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1

The frames:

Data: (Opcode -1)
Length: 63
Time: 15:06:44.412

Solution

  • Using uws (uWebSockets) as websocket engine causes these errors. When I swap in my backend app this code:

    app.configure(socketio({
      wsEngine: 'uws',
      timeout: 120000,
      reconnect: true
    }))
    

    for this:

    app.configure(socketio())
    

    Everything works as expected.

    EDIT: Now it also works with uws. I used alpine docker container which is based on node 10, which does not work with uws. After switching to container based on node 8 it works.