Tags: kubernetes, kubernetes-helm, kubectl, persistent-volumes, cockroachdb

CockroachDB Cluster on Kubernetes Pods Crashing


I'm trying to install a CockroachDB Helm chart on a 2 node Kubernetes cluster using this command:

helm install my-release --set statefulset.replicas=2 stable/cockroachdb

I have already created 2 persistent volumes:

NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                          STORAGECLASS   REASON   AGE
pv00001   100Gi      RWO            Recycle          Bound    default/datadir-my-release-cockroachdb-0                           11m
pv00002   100Gi      RWO            Recycle          Bound    default/datadir-my-release-cockroachdb-1                           11m

I'm getting a weird error, and since I'm new to Kubernetes I'm not sure what I'm doing wrong. I tried creating a StorageClass and using it with my PVs, but then the CockroachDB PVCs wouldn't bind to them. Could there be something wrong with my PV setup?

I've tried using kubectl logs but the only error I'm seeing is this:

standard_init_linux.go:211: exec user process caused "exec format error"

and the pods are crashing over and over:

NAME                                    READY   STATUS             RESTARTS   AGE
my-release-cockroachdb-0            0/1     Pending            0          11m
my-release-cockroachdb-1            0/1     CrashLoopBackOff   7          11m
my-release-cockroachdb-init-tfcks   0/1     CrashLoopBackOff   5          5m29s

Any idea why the pods are crashing?

Here's kubectl describe for the init pod:

Name:         my-release-cockroachdb-init-tfcks
Namespace:    default
Priority:     0
Node:         axon/192.168.1.7
Start Time:   Sat, 04 Apr 2020 00:22:19 +0100
Labels:       app.kubernetes.io/component=init
              app.kubernetes.io/instance=my-release
              app.kubernetes.io/name=cockroachdb
              controller-uid=54c7c15d-eb1c-4392-930a-d9b8e9225a45
              job-name=my-release-cockroachdb-init
Annotations:  <none>
Status:       Running
IP:           10.44.0.1
IPs:
  IP:           10.44.0.1
Controlled By:  Job/my-release-cockroachdb-init
Containers:
  cluster-init:
    Container ID:  docker://82a062c6862a9fd5047236feafe6e2654ec1f6e3064fd0513341a1e7f36eaed3
    Image:         cockroachdb/cockroach:v19.2.4
    Image ID:      docker-pullable://cockroachdb/cockroach@sha256:511b6d09d5bc42c7566477811a4e774d85d5689f8ba7a87a114b96d115b6149b
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      while true; do initOUT=$(set -x; /cockroach/cockroach init --insecure --host=my-release-cockroachdb-0.my-release-cockroachdb:26257 2>&1); initRC="$?"; echo $initOUT; [[ "$initRC" == "0" ]] && exit 0; [[ "$initOUT" == *"cluster has already been initialized"* ]] && exit 0; sleep 5; done
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 04 Apr 2020 00:28:04 +0100
      Finished:     Sat, 04 Apr 2020 00:28:04 +0100
    Ready:          False
    Restart Count:  6
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-cz2sn (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-cz2sn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-cz2sn
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  <unknown>             default-scheduler  Successfully assigned default/my-release-cockroachdb-init-tfcks to axon
  Normal   Pulled     5m9s (x5 over 6m45s)  kubelet, axon      Container image "cockroachdb/cockroach:v19.2.4" already present on machine
  Normal   Created    5m8s (x5 over 6m45s)  kubelet, axon      Created container cluster-init
  Normal   Started    5m8s (x5 over 6m44s)  kubelet, axon      Started container cluster-init
  Warning  BackOff    92s (x26 over 6m42s)  kubelet, axon      Back-off restarting failed container

Solution

  • When Pods crash, the most important things to check are their descriptions (kubectl describe) and their logs.

    The log of the failed Pod shows that the architecture of the cockroach image doesn't match the architecture of the node it runs on: an "exec format error" from the container runtime typically means the binary inside the image was built for a different CPU architecture than the node's (for example, an amd64 image on an arm node).

    Run kubectl get po -o wide to find the nodes where the cockroach Pods run, then check those nodes' architecture.
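
    A minimal sketch of how to verify the mismatch. The pod name is taken from the output above; the last command assumes Docker with manifest support is available on your workstation:

    ```shell
    # Show which node each pod was scheduled to
    kubectl get po -o wide

    # Show the CPU architecture reported by each node's kubelet
    kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture

    # Read the log of the previous (crashed) container instance
    kubectl logs my-release-cockroachdb-init-tfcks --previous

    # List the platforms the image actually provides
    docker manifest inspect cockroachdb/cockroach:v19.2.4
    ```

    If the node's architecture (e.g. arm) does not appear in the image's manifest list, you need either nodes of a supported architecture or an image built for yours.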