amazon-web-services docker kubernetes devops lifecycle

Lifecycle hooks failing with error 126

I am launching Jobs and I'm trying to use the lifecycle hooks to launch a script at start and another one at shutdown of the container.

I am also specifying resource limits, and they look like this:

resources:
    requests:
        memory: 1Gi
        cpu: 1
    limits:
        memory: 1Gi
        cpu: 1

My cluster currently has 4 nodes with 1 CPU and 4 GB of RAM each, and is running on EC2 machines.

The postStart script is at the moment very simple, and looks like this:

export SOME_VAR=some_value
node someScript.js

The only thing the Node script does is update a value on a database, so it's not an especially intensive task.

After launching the job, the following events happen:

As you can see, the postStart hook fails with error 126 and gives no error message.

Any help for solving this issue is highly welcome and appreciated.

Edit 1

Since the first answer pointed out that the command executed for the hook might not be built correctly, I think it's important to say that I build the jobs using the API that Kubernetes exposes through kubectl proxy.

This is how I specify the lifecycle instructions:

"lifecycle": {
    "postStart": {
        "exec": {
            "command": [
                "/bin/sh",
                "postStart.sh"
             ]
        }
    },
    "preStop": {
        "exec": {
            "command": [
                "/bin/sh",
                "preStop.sh"
            ]
        }
    }
}

I think this translates to YAML the way it's supposed to; please correct me if I am wrong on this.

Solution

You have 2 problems, so you get 2 answers :-)

Problem 1: too high cpu requirement

Your pod specifies a requirement of cpu: 1, which means 1 CPU core. Your nodes have 1 CPU core each in total, but are already running some pods, such as kube-proxy. None of them has a full core left for your application, so scheduling fails.

The error message No nodes are available that match all of the predicates: Insufficient cpu (4), PodToleratesNodeTaints (1) means:

- Scheduling is not possible at the moment.
- Of all nodes, 4 do not have enough CPU to schedule this pod.
  - You can verify this by executing kubectl describe node nameofyournode and looking at the Allocatable: and Allocated resources: parts of the output. Under Non-terminated Pods: you will see what is taking up some of your CPU, possibly a kube-proxy pod.
- Of all nodes, 1 has a taint that is not tolerated by the pod (this is the master, I imagine).
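
If you want to script that check, here is a minimal sketch. The function names are made up for illustration; only the kubectl invocations themselves are real, and nameofyournode remains a placeholder.

```shell
#!/bin/sh
# Hypothetical helper names; only the kubectl commands are real.

# Print the CPU a node can hand out to pods (its "Allocatable" value).
node_allocatable_cpu() {
    kubectl get node "$1" -o jsonpath='{.status.allocatable.cpu}'
}

# Print the "Allocated resources" section of `kubectl describe node`,
# which lists the CPU and memory already requested on that node.
node_allocated_resources() {
    kubectl describe node "$1" | sed -n '/^Allocated resources:/,/^Events:/p'
}

# Usage (node name is a placeholder):
#   node_allocatable_cpu nameofyournode
#   node_allocated_resources nameofyournode
```

Comparing the two outputs shows how much CPU is still free for new pods on that node.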

The solution is to lower the CPU request for your pod (500m means 500 millicores, or 0.5 cores):

resources:
  requests:
    memory: 1Gi
    cpu: 500m
  limits:
    memory: 1Gi
    cpu: 500m

... or resize your machines so they have 2 cores instead of 1.
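
To make the arithmetic concrete, here is a rough sketch of the fit check the scheduler is effectively doing for CPU. The 100m already requested by system pods is an assumed number, not something taken from your cluster, and the helper names are invented for this example.

```shell
#!/bin/sh
# Convert a Kubernetes CPU quantity ("1", "2", "500m") to millicores.
# Only whole cores and the "m" suffix are handled in this sketch.
to_millicores() {
    case "$1" in
        *m) echo "${1%m}" ;;
        *)  echo $(( $1 * 1000 )) ;;
    esac
}

# Succeed if a new request fits on a node.
# Arguments: ALLOCATABLE ALREADY_REQUESTED NEW_REQUEST
fits() {
    free=$(( $(to_millicores "$1") - $(to_millicores "$2") ))
    [ "$(to_millicores "$3")" -le "$free" ]
}

# Assumed numbers: a 1-core node with 100m already requested by system pods.
fits 1 100m 1    && echo "cpu: 1 fits"    || echo "cpu: 1 does not fit"
fits 1 100m 500m && echo "cpu: 500m fits" || echo "cpu: 500m does not fit"
```

With those assumed numbers the first request does not fit and the second one does, which matches the behaviour described above.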

Problem 2: bad postStart command

What is most curious is that the pod did get scheduled in the end, but was killed afterwards. Exit code 126 means "command invoked cannot execute", so the postStart command is probably invalid. You did not post the full YAML file, but from the error message it looks like you have specified something like:

lifecycle:
  postStart:
    exec:
      command: ["/bin/sh postStart.sh"]

Please check whether that is the case. If so, it is incorrect. Each argument must be a separate element of the command array, like so:

lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "postStart.sh"]
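
The reason is that an exec hook runs the array directly as the program and its arguments, without a shell splitting a string on spaces. The following sketch imitates that with shell quoting; the temporary script is a stand-in for your postStart.sh, and the exit status a shell reports here may differ from what the container runtime reports.

```shell
#!/bin/sh
# Throwaway stand-in for postStart.sh; the path is a placeholder.
workdir=$(mktemp -d)
script="$workdir/postStart.sh"
printf 'echo hook ran\n' > "$script"

# One array element: the whole string is taken as the program name,
# so there is no such program to run.
one_element_status=0
"/bin/sh $script" 2>/dev/null || one_element_status=$?

# Two array elements: /bin/sh is the program, the script is its argument.
two_element_status=0
two_element_output=$("/bin/sh" "$script") || two_element_status=$?

echo "one element:  exit status $one_element_status"
echo "two elements: exit status $two_element_status, output: $two_element_output"
rm -rf "$workdir"
```

The single-element form fails before the script is ever read, while the two-element form runs it.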

Alternatively, make sure that postStart.sh is marked executable in the container image and starts with a shebang line (for example #!/bin/bash). If you do that, you can define the postStart hook like this:

lifecycle:
  postStart:
    exec:
      command: ["/path/to/postStart.sh"]
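
You can see the effect of the executable bit outside Kubernetes with a small experiment. This is only a sketch run in a local shell with a throwaway script, assuming the temporary directory allows executing files; it is not your actual image.

```shell
#!/bin/sh
# Throwaway script; the path is a placeholder, not taken from the question.
workdir=$(mktemp -d)
script="$workdir/postStart.sh"
printf '#!/bin/sh\necho hook ran\n' > "$script"
chmod a-x "$script"

# Without the executable bit the script cannot be invoked directly.
status_without_x=0
"$script" 2>/dev/null || status_without_x=$?

# With the executable bit and a shebang line it can.
chmod +x "$script"
status_with_x=0
"$script" >/dev/null || status_with_x=$?

echo "without +x: exit status $status_without_x"
echo "with +x:    exit status $status_with_x"
rm -rf "$workdir"
```

In a POSIX shell the first call fails with status 126 ("found but not executable") and the second succeeds, so it is worth checking the file mode inside the image with ls -l.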