
Trying to create an NSQ PetSet; pods keep terminating shortly after their containers launch


Full YAML file is here (not embedded in the question because it's rather long, and because most of the important bits are covered by the describe output below):

https://gist.github.com/sporkmonger/46a820f9a1ed8a73d89a319dffb24608

I'm using a public container image I created: sporkmonger/nsq-k8s:0.3.8

The container is identical to the official NSQ image, but uses Debian Jessie instead of Alpine/musl to work around the DNS issues that tend to affect Alpine on Kubernetes.

Here's what happens when I describe one of the pods:

❯ kubectl describe pod nsqd-0
Name:               nsqd-0
Namespace:          default
Node:               minikube/192.168.99.100
Start Time:         Sun, 04 Dec 2016 20:58:06 -0800
Labels:             app=nsq
Status:             Terminating (expires Sun, 04 Dec 2016 21:02:31 -0800)
Termination Grace Period:   60s
IP:             172.17.0.8
Controllers:            PetSet/nsqd
Containers:
  nsqd:
    Container ID:   docker://381e4a1313e4e13a63b8a17004d79a6e828a8bc1c9e20419b319f8a9757f266b
    Image:      sporkmonger/nsq-k8s:0.3.8
    Image ID:       docker://sha256:01691a91cee3e1a6992b33a10e99baa57c5b8ce7b765849540a830f0b554e707
    Ports:      4150/TCP, 4151/TCP
    Command:
      /bin/sh
      -c
    Args:
      /usr/local/bin/nsqd
      -data-path
      /data
      -broadcast-address
      $(hostname -f)
      -lookupd-tcp-address
      nsqlookupd-0.nsqlookupd.default.svc.cluster.local:4160
      -lookupd-tcp-address
      nsqlookupd-1.nsqlookupd.default.svc.cluster.local:4160
      -lookupd-tcp-address
      nsqlookupd-2.nsqlookupd.default.svc.cluster.local:4160
    State:      Running
      Started:      Sun, 04 Dec 2016 20:58:11 -0800
    Ready:      True
    Restart Count:  0
    Liveness:       http-get http://:http/ping delay=5s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/ping delay=1s timeout=1s period=10s #success=1 #failure=3
    Volume Mounts:
      /data from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-k6ufj (ro)
    Environment Variables:  <none>
Conditions:
  Type      Status
  Initialized   True 
  Ready     True 
  PodScheduled  True 
Volumes:
  datadir:
    Type:   PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir-nsqd-0
    ReadOnly:   false
  default-token-k6ufj:
    Type:   Secret (a volume populated by a Secret)
    SecretName: default-token-k6ufj
QoS Class:  BestEffort
Tolerations:    <none>
Events:
  FirstSeen LastSeen    Count   From            SubobjectPath       Type        Reason      Message
  --------- --------    -----   ----            -------------       --------    ------      -------
  4m        4m      1   {default-scheduler }                Normal      Scheduled   Successfully assigned nsqd-0 to minikube
  4m        4m      1   {kubelet minikube}  spec.containers{nsqd}   Normal      Pulling     pulling image "sporkmonger/nsq-k8s:0.3.8"
  4m        4m      1   {kubelet minikube}  spec.containers{nsqd}   Normal      Pulled      Successfully pulled image "sporkmonger/nsq-k8s:0.3.8"
  4m        4m      1   {kubelet minikube}  spec.containers{nsqd}   Normal      Created     Created container with docker id 381e4a1313e4; Security:[seccomp=unconfined]
  4m        4m      1   {kubelet minikube}  spec.containers{nsqd}   Normal      Started     Started container with docker id 381e4a1313e4
  0s        0s      1   {kubelet minikube}  spec.containers{nsqd}   Normal      Killing     Killing container with docker id 381e4a1313e4: Need to kill pod.

A fairly representative watch of about 30 seconds of cluster activity:

❯ kubectl get pods -w
NAME           READY     STATUS        RESTARTS   AGE
nsqadmin-0     1/1       Running       3          33m
nsqadmin-1     1/1       Running       0          32m
nsqd-0         1/1       Running       0          6m
nsqd-1         1/1       Running       0          4m
nsqd-2         1/1       Terminating   0          1m
nsqd-3         1/1       Running       0          30s
nsqlookupd-0   1/1       Running       0          30s
NAME           READY     STATUS    RESTARTS   AGE
nsqlookupd-1   0/1       Pending   0          0s
nsqlookupd-1   0/1       Pending   0         0s
nsqlookupd-1   0/1       ContainerCreating   0         0s
nsqlookupd-1   0/1       Running   0         4s
nsqlookupd-1   1/1       Running   0         8s
nsqlookupd-2   0/1       Pending   0         0s
nsqlookupd-2   0/1       Pending   0         0s
nsqlookupd-2   0/1       ContainerCreating   0         0s
nsqlookupd-2   0/1       Terminating   0         0s
nsqd-2    0/1       Terminating   0         2m
nsqd-2    0/1       Terminating   0         2m
nsqd-2    0/1       Terminating   0         2m
nsqlookupd-2   0/1       Terminating   0         4s
nsqlookupd-2   0/1       Terminating   0         5s
nsqlookupd-2   0/1       Terminating   0         5s
nsqlookupd-2   0/1       Terminating   0         5s
nsqlookupd-1   1/1       Terminating   0         29s
nsqlookupd-1   0/1       Terminating   0         30s
nsqlookupd-1   0/1       Terminating   0         30s
nsqlookupd-1   0/1       Terminating   0         30s
nsqlookupd-0   1/1       Terminating   0         1m
nsqd-2    0/1       Pending   0         0s
nsqd-2    0/1       Pending   0         0s
nsqd-2    0/1       ContainerCreating   0         0s
nsqlookupd-0   0/1       Terminating   0         1m
nsqlookupd-0   0/1       Terminating   0         1m
nsqlookupd-0   0/1       Terminating   0         1m
nsqlookupd-0   0/1       Pending   0         0s
nsqlookupd-0   0/1       Pending   0         0s
nsqlookupd-0   0/1       ContainerCreating   0         0s
nsqd-2    0/1       Running   0         4s
nsqlookupd-0   0/1       Running   0         4s
nsqd-2    1/1       Running   0         6s
nsqlookupd-0   1/1       Running   0         10s
nsqlookupd-0   1/1       Terminating   0         10s
nsqlookupd-0   0/1       Terminating   0         11s
nsqlookupd-0   0/1       Terminating   0         11s
nsqlookupd-0   0/1       Terminating   0         11s
nsqd-2    1/1       Terminating   0         12s
nsqlookupd-0   0/1       Pending   0         0s
nsqlookupd-0   0/1       Pending   0         0s
nsqlookupd-0   0/1       ContainerCreating   0         0s
nsqlookupd-0   0/1       Running   0         3s
nsqlookupd-0   1/1       Running   0         10s

Typical container logs:

❯ kubectl logs nsqd-0
[nsqd] 2016/12/05 05:21:34.666963 nsqd v0.3.8 (built w/go1.6.2)
[nsqd] 2016/12/05 05:21:34.667170 ID: 794
[nsqd] 2016/12/05 05:21:34.667200 NSQ: persisting topic/channel metadata to nsqd.794.dat
[nsqd] 2016/12/05 05:21:34.669232 TCP: listening on [::]:4150
[nsqd] 2016/12/05 05:21:34.669284 HTTP: listening on [::]:4151
[nsqd] 2016/12/05 05:21:35.896901 200 GET /ping (172.17.0.1:51322) 1.511µs
[nsqd] 2016/12/05 05:21:40.290550 200 GET /ping (172.17.0.1:51392) 2.167µs
[nsqd] 2016/12/05 05:21:40.304599 200 GET /ping (172.17.0.1:51394) 1.856µs
[nsqd] 2016/12/05 05:21:50.289018 200 GET /ping (172.17.0.1:51452) 1.865µs
[nsqd] 2016/12/05 05:21:50.299567 200 GET /ping (172.17.0.1:51454) 1.951µs
[nsqd] 2016/12/05 05:22:00.296685 200 GET /ping (172.17.0.1:51548) 2.029µs
[nsqd] 2016/12/05 05:22:00.300842 200 GET /ping (172.17.0.1:51550) 1.464µs
[nsqd] 2016/12/05 05:22:10.295596 200 GET /ping (172.17.0.1:51698) 2.206µs

I'm scratching my head over why Kubernetes keeps killing these pods. The containers themselves don't appear to be misbehaving, and Kubernetes itself seems to be the one terminating them...


Solution

  • Figured it out.

    My services all share the same selector (they all match app=nsq, the only label the describe output shows on the pods), so every service matches every pod. That makes Kubernetes think it has too many of each kind running at once, and it kills the "extras" at random. Giving each component its own label and selecting on it fixes the problem; a rough sketch of that change is below.
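
    For reference, here's a trimmed-down sketch of what that change looks like (args, probes, and volumeClaimTemplates omitted). The component label name is just an illustrative example, not the exact wording from my manifests; the real files are in the gist above, and the image, ports, and replica count come from the output in the question.

    apiVersion: v1
    kind: Service
    metadata:
      name: nsqd
    spec:
      clusterIP: None            # headless governing service for the PetSet
      selector:
        app: nsq
        component: nsqd          # unique per component, so this service only matches nsqd pods
      ports:
      - name: tcp
        port: 4150
      - name: http
        port: 4151
    ---
    apiVersion: apps/v1alpha1    # PetSet API group in Kubernetes 1.4/1.5
    kind: PetSet
    metadata:
      name: nsqd
    spec:
      serviceName: nsqd
      replicas: 4
      template:
        metadata:
          labels:
            app: nsq
            component: nsqd      # matches the selector above and no other service's
        spec:
          containers:
          - name: nsqd
            image: sporkmonger/nsq-k8s:0.3.8
            ports:
            - name: tcp
              containerPort: 4150
            - name: http
              containerPort: 4151

    You can check which pods a service actually selects with kubectl get endpoints nsqd or kubectl get pods -l app=nsq,component=nsqd; with the original shared selector, every service's endpoint list contained every pod.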