
What happens when a Kubernetes liveness probe returns false?


What happens when Kubernetes liveness-probe returns false? Does Kubernetes restart that pod immediately?


Solution

  • First, please note that livenessProbe concerns containers in the pod, not the pod itself. So if you have multiple containers in one pod, only the affected container will be restarted.

    Note also the failureThreshold parameter, which defaults to 3. So a container is restarted only after 3 consecutive failed probes:

    failureThreshold: When a probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in case of liveness probe means restarting the container. In case of readiness probe the Pod will be marked Unready. Defaults to 3. Minimum value is 1.
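
    To show where this parameter lives, here is a sketch of a probe with its timing fields written out explicitly. The container name is a placeholder; the values are the documented defaults, so omitting these fields should give the same behaviour.

    ```yaml
    containers:
    - name: app                  # hypothetical container name
      image: nginx
      livenessProbe:
        exec:
          command:
          - cat
          - /tmp/healthy
        initialDelaySeconds: 0   # default: start probing as soon as the container starts
        periodSeconds: 10        # default: run the probe every 10 seconds
        timeoutSeconds: 1        # default: each probe attempt times out after 1 second
        failureThreshold: 3      # default: restart after 3 consecutive failures
    ```

    With these values, a container that starts failing its probe is restarted only after about 20 to 30 seconds of consecutive failures, before any termination time is counted.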

    OK, so a container is restarted after 3 failed probes. But what does a restart actually involve?

    I found a good article about how Kubernetes terminates pods: Kubernetes best practices: terminating with grace. A container restart triggered by a liveness probe seems to work similarly; I will share my experience below.

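    So basically, when a container is terminated because of a failed liveness probe, the steps are: Kubernetes sends SIGTERM to the container, waits up to terminationGracePeriodSeconds for it to exit, and sends SIGKILL if it is still running.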

    So if the app in your container handles the SIGTERM signal properly, the container shuts down and is started again. This typically happens fast; in my test with the NGINX image it was almost immediate.

    The situation is different when your application does not handle SIGTERM. In that case Kubernetes waits for the full terminationGracePeriodSeconds period and then sends SIGKILL, which forcibly kills the container.

    The example below is a modified version of the example from this doc, with failureThreshold: 1 set.

    I have the following pod definition:

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        test: liveness
      name: liveness-exec
    spec:
      containers:
      - name: liveness
        image: nginx
        livenessProbe:
          exec:
            command:
            - cat
            - /tmp/healthy
          periodSeconds: 10
          failureThreshold: 1
    

    Of course, there is no /tmp/healthy file, so the livenessProbe will fail. The NGINX image handles the SIGTERM signal properly, so the container is restarted almost immediately after every failed probe. Let's check it:

    user@shell:~/liveness-test-short $ kubectl get pods
    NAME                                   READY   STATUS             RESTARTS   AGE
    liveness-exec                          0/1     CrashLoopBackOff   3          36s
    

    So after ~30 sec the container has already been restarted a few times and its status is CrashLoopBackOff, as expected. I created the same pod without the livenessProbe and measured the time needed to shut it down:

    user@shell:~/liveness-test-short $ time kubectl delete pod liveness-exec
    pod "liveness-exec" deleted
    
    real    0m1.474s
    

    So it's pretty fast.

    A similar example, but with a sleep 3000 command added:

    ...
    image: nginx
    args:
    - /bin/sh
    - -c
    - sleep 3000
    ...
    

    Let's apply it and check...

    user@shell:~/liveness-test-short $ kubectl get pods
    NAME                                   READY   STATUS    RESTARTS   AGE
    liveness-exec                          1/1     Running   5          3m37s
    

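    So after ~4 min there are only 5 restarts. Why? Because every restart has to wait for the full terminationGracePeriodSeconds period (30 seconds by default). Each cycle takes roughly 10 seconds until the probe fails plus 30 seconds of grace period, so about 40 seconds per restart, which fits 5 restarts in 3m37s. Let's measure the time needed to shut the pod down: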

    user@shell:~/liveness-test-short $ time kubectl delete pod liveness-exec
    pod "liveness-exec" deleted
    
    real    0m42.418s
    

    It's much longer.
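
    For comparison, here is a sketch of how the sleep variant could be made to react to SIGTERM. It was not part of the test above and assumes the image's /bin/sh supports trap. The shell installs a handler, runs sleep in the background and waits, so the handler can run as soon as the signal arrives:

    ```yaml
    containers:
    - name: liveness
      image: nginx
      args:
      - /bin/sh
      - -c
      # Install a SIGTERM handler, then sleep in the background and wait,
      # so the shell can exit right away instead of waiting for SIGKILL.
      - "trap 'exit 0' TERM; sleep 3000 & wait"
    ```

    If the handler works as intended, termination should no longer depend on the grace period running out.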

    To sum up:

    What happens when a Kubernetes liveness probe returns false? Does Kubernetes restart that pod immediately?

    The short answer is: by default no. Why?

    • Kubernetes restarts a container in a pod only after failureThreshold consecutive failed probes. The default is 3, so after 3 failed probes.
    • Depending on how your container is configured, the time needed to terminate it can vary a lot.
    • You can adjust both the failureThreshold and terminationGracePeriodSeconds parameters so that the container is restarted almost immediately after every failed probe.
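
    A minimal sketch of the last point, reusing the pod from the example above with a shortened grace period. The value 5 is an arbitrary illustration, not a recommendation; choose one that still gives your application time to shut down cleanly.

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        test: liveness
      name: liveness-exec
    spec:
      terminationGracePeriodSeconds: 5   # pod-level setting; the default is 30
      containers:
      - name: liveness
        image: nginx
        livenessProbe:
          exec:
            command:
            - cat
            - /tmp/healthy
          periodSeconds: 10
          failureThreshold: 1            # restart after the first failed probe
    ```

    With this configuration, a container that ignores SIGTERM would be killed about 5 seconds after a failed probe rather than 30.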