
Is it normal for bokeh serve on Kubernetes to restart periodically?


I have a Bokeh dashboard served from a Docker container running on Kubernetes. I can access the dashboard remotely with no problems, but I noticed that the pod containing the bokeh serve code restarts a lot: 14 times in the past 2 hours. Sometimes the status comes back as 'CrashLoopBackOff' and sometimes as 'Running' normally.

My question is: is there something about the way bokeh serve works that requires Kubernetes to restart it this frequently? Or is it something to do with memory (OOMKilled)?

Here is a section of the output of kubectl describe pod:

Name:               bokeh-744d4bc9d-5pkzq
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               10.183.226.51/10.183.226.51
Start Time:         Tue, 18 Feb 2020 11:55:44 +0000
Labels:             name=bokeh
                    pod-template-hash=744d4bc9d
Annotations:        kubernetes.io/psp: xyz-privileged-psp
Status:             Running
IP:                 172.30.255.130
Controlled By:      ReplicaSet/bokeh-744d4bc9d
Containers:
  dashboard-application:
    Container ID:   containerd://16d10dc5dd89235b0xyz2b5b31f8e313f3f0bb7efe82a12e00c1f01708e2f894
    Image:          us.icr.io/oss-data-science-np-dal/bokeh:118
    Image ID:       us.icr.io/oss-data-science-np-dal/bokeh@sha256:037a5b52a6e7c792fdxy80b01e29772dbfc33b10e819774462bee650cf0da
    Port:           5006/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 18 Feb 2020 14:25:36 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 18 Feb 2020 14:15:26 +0000
      Finished:     Tue, 18 Feb 2020 14:23:54 +0000
    Ready:          True
    Restart Count:  17
    Limits:
      cpu:     800m
      memory:  600Mi
    Requests:
      cpu:        600m
      memory:     400Mi
    Liveness:     http-get http://:5006/ delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:5006/ delay=10s timeout=1s period=3s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-cjhfk (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-cjhfk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-cjhfk
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 600s
                 node.kubernetes.io/unreachable:NoExecute for 600s
Events:
  Type     Reason     Age                    From                    Message
  ----     ------     ----                   ----                    -------
  Warning  Unhealthy  36m (x219 over 150m)   kubelet, 10.183.226.51  Liveness probe failed: Get http://172.30.255.130:5006/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff    21m (x34 over 134m)    kubelet, 10.183.226.51  Back-off restarting failed container
  Warning  Unhealthy  10m (x72 over 150m)    kubelet, 10.183.226.51  Readiness probe failed: Get http://172.30.255.130:5006/RCA: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  6m4s (x957 over 150m)  kubelet, 10.183.226.51  Readiness probe failed: Get http://172.30.255.130:5006/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  68s (x23 over 147m)    kubelet, 10.183.226.51  Liveness probe failed: Get http://172.30.255.130:5006/RCA: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I'm new to k8s, so any information you have to spare on this kind of issue will be much appreciated!


Solution

  • If a Container allocates more memory than its limit, the Container becomes a candidate for termination. If the Container continues to consume memory beyond its limit, the Container is terminated. If a terminated Container can be restarted, the kubelet restarts it, as with any other type of runtime failure. This is exactly what your describe output shows: Last State: Terminated with Reason: OOMKilled and Exit Code: 137. The behaviour is documented in the official Kubernetes docs on assigning memory resources: https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/

    You may have to increase the limits and requests in your pod spec, along the lines of the sketch below. Check the official docs on managing resources for containers for the details.

    The other way to look at it is to optimize your code so that it stays within the memory specified in the limits; the tracemalloc sketch at the end shows one way to find out where the memory goes.
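
    For example, here is a minimal sketch of the container's resources section in the Deployment's pod template, with the values raised above the 600Mi limit from your describe output. The new numbers are assumptions and have to be tuned against the dashboard's real memory profile:

    spec:
      containers:
      - name: dashboard-application
        image: us.icr.io/oss-data-science-np-dal/bokeh:118
        ports:
        - containerPort: 5006
        resources:
          requests:
            memory: "800Mi"   # was 400Mi
            cpu: "600m"
          limits:
            memory: "1Gi"     # was 600Mi; headroom above the observed peak usage
            cpu: "800m"

    Note that as long as requests stay lower than limits the pod keeps the Burstable QoS class you see in the describe output; setting requests equal to limits for every container would make it Guaranteed instead.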
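
    To see where the memory actually goes, one option is a quick pass with Python's built-in tracemalloc module inside the app. This is only a sketch; the list comprehension is a stand-in for whatever your dashboard's data-loading callback really does:

    import tracemalloc

    tracemalloc.start()

    # stand-in for the dashboard's data-loading step; replace with the real callback
    data = [list(range(1000)) for _ in range(1000)]

    # report the ten source lines responsible for the most allocated memory
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:10]:
        print(stat)

    If a single data-loading step dominates, downsampling or streaming the data before it reaches the plot is usually the first thing to try.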