I have a Bokeh dashboard served from a Docker container running on Kubernetes. I can access the dashboard remotely with no problems, but I noticed that the pod running the bokeh serve code restarts a lot, e.g. 14 times in the past 2 hours. Sometimes its status comes back as 'CrashLoopBackOff' and sometimes it is 'Running' normally.
My question is: is there something about the way bokeh serve works that requires Kubernetes to restart it so frequently, or is it something to do with memory (OOMKilled)?
Here is a section of the output of kubectl describe pod:
Name:               bokeh-744d4bc9d-5pkzq
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               10.183.226.51/10.183.226.51
Start Time:         Tue, 18 Feb 2020 11:55:44 +0000
Labels:             name=bokeh
                    pod-template-hash=744d4bc9d
Annotations:        kubernetes.io/psp: xyz-privileged-psp
Status:             Running
IP:                 172.30.255.130
Controlled By:      ReplicaSet/bokeh-744d4bc9d
Containers:
  dashboard-application:
    Container ID:   containerd://16d10dc5dd89235b0xyz2b5b31f8e313f3f0bb7efe82a12e00c1f01708e2f894
    Image:          us.icr.io/oss-data-science-np-dal/bokeh:118
    Image ID:       us.icr.io/oss-data-science-np-dal/bokeh@sha256:037a5b52a6e7c792fdxy80b01e29772dbfc33b10e819774462bee650cf0da
    Port:           5006/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 18 Feb 2020 14:25:36 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 18 Feb 2020 14:15:26 +0000
      Finished:     Tue, 18 Feb 2020 14:23:54 +0000
    Ready:          True
    Restart Count:  17
    Limits:
      cpu:     800m
      memory:  600Mi
    Requests:
      cpu:     600m
      memory:  400Mi
    Liveness:     http-get http://:5006/ delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:5006/ delay=10s timeout=1s period=3s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-cjhfk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-cjhfk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-cjhfk
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 600s
                 node.kubernetes.io/unreachable:NoExecute for 600s
Events:
  Type     Reason     Age                     From                    Message
  ----     ------     ----                    ----                    -------
  Warning  Unhealthy  36m (x219 over 150m)    kubelet, 10.183.226.51  Liveness probe failed: Get http://172.30.255.130:5006/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff    21m (x34 over 134m)     kubelet, 10.183.226.51  Back-off restarting failed container
  Warning  Unhealthy  10m (x72 over 150m)     kubelet, 10.183.226.51  Readiness probe failed: Get http://172.30.255.130:5006/RCA: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  6m4s (x957 over 150m)   kubelet, 10.183.226.51  Readiness probe failed: Get http://172.30.255.130:5006/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  68s (x23 over 147m)     kubelet, 10.183.226.51  Liveness probe failed: Get http://172.30.255.130:5006/RCA: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I'm new to k8s, so any information you have to spare on this kind of issue will be much appreciated!
Yes, this is memory related. Your Last State shows Terminated with Reason: OOMKilled and Exit Code: 137, which means the container was killed for exceeding its 600Mi memory limit. If a Container allocates more memory than its limit, the Container becomes a candidate for termination. If the Container continues to consume memory beyond its limit, the Container is terminated. If a terminated Container can be restarted, the kubelet restarts it, as with any other type of runtime failure. This is documented here.
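As a quick check, you can ask the API server directly for the last termination reason of the container (a minimal sketch using the pod name from your output):

    # Print the last termination reason of the first container in the pod;
    # for an OOM kill this prints "OOMKilled"
    kubectl get pod bokeh-744d4bc9d-5pkzq \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'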
You may have to increase the memory limits and requests in your pod spec. Check the official doc here.
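For example, assuming the pod is managed by a Deployment called bokeh (inferred from the ReplicaSet name bokeh-744d4bc9d above) and that 1Gi gives your app enough headroom, a rough sketch would be:

    # Keep the current CPU values but raise the memory request/limit;
    # the right numbers depend on how much memory the app actually needs
    kubectl set resources deployment bokeh -c dashboard-application \
      --requests=cpu=600m,memory=600Mi \
      --limits=cpu=800m,memory=1Gi

Equivalently, you can edit the resources: section of the Deployment manifest and re-apply it; either way the pods are recreated with the new values.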
The other way to look at it is to optimize your code so that it does not exceed the memory specified in limits.
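To see how much memory the app actually consumes while you exercise the dashboard (this assumes the metrics-server add-on is installed in your cluster), you can watch the live usage and size your limits accordingly:

    # Show current CPU and memory usage of the pod (requires metrics-server)
    kubectl top pod bokeh-744d4bc9d-5pkzq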