Tags: kubernetes, linux-kernel, ceph, cephfs, rook-storage

kubernetes nodes keep rebooting when using rook volumes


Several days ago I ran into a problem where my nodes kept rebooting constantly.

My stack:

  • 1 master, 2 worker k8s cluster built with kubeadm (v1.17.1-00)

  • Ubuntu 18.04 x86_64 4.15.0-74-generic

  • Flannel CNI plugin (v0.11.0)

  • Rook (v1.2) CephFS for storage. Ceph was deployed in the same cluster where my application lives

I was able to bring up the Ceph cluster, but when I tried to deploy my application, which used the Rook volumes, my pods suddenly started to die.

I got this message when I ran the kubectl describe pods/name command:

Pod sandbox changed, it will be killed and re-created

In the k8s events I got:

<Node name> has been rebooted
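
For reference, this is roughly how I inspected the symptoms (the pod, namespace, and node names below are placeholders):

    # Describe the failing pod; the sandbox message shows up under Events
    kubectl describe pod <pod-name> -n <namespace>

    # Cluster-wide events, sorted by time; this is where the reboot message appeared
    kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp

    # Watch node status to catch a node going NotReady
    kubectl get nodes -w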

After some time the node comes back to life, but it dies again within 2-3 minutes.

I tried to drain the node and rejoin it to the cluster, but after that another node started getting the same error.
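
A minimal sketch of the drain / rejoin steps (the node name is a placeholder; on newer kubectl the --delete-local-data flag has been renamed to --delete-emptydir-data):

    # Evict workloads from the failing node
    kubectl drain <node-name> --ignore-daemonsets --delete-local-data

    # Once the node is back, allow scheduling on it again
    kubectl uncordon <node-name>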

I looked into the system error logs of a failed node with the journalctl -p 3 command.

I found that the logs were flooded with this message: kernel: cache_from_obj: Wrong slab cache. inode_cache but object is from ceph_inode_info.
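
If you want to run the same check on an affected node, a minimal sketch:

    # Only error-priority (and worse) messages from the current boot
    journalctl -p 3 -b

    # Or grep the kernel ring buffer for the slab message directly
    dmesg | grep cache_from_obj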

After googling this problem, I found this issue: https://github.com/coreos/bugs/issues/2616

It turned out that CephFS just doesn't work with some Linux kernel versions! For me, neither of these worked:

  • Ubuntu 19.04 x86_64 5.0.0-32-generic
  • Ubuntu 18.04 x86_64 4.15.0-74-generic
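
To check which kernel a node is running:

    # On the node itself
    uname -r

    # Or from the cluster side (see the KERNEL-VERSION column)
    kubectl get nodes -o wide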

Solution

CephFS doesn't work with some versions of the Linux kernel. Upgrade your kernel. I finally got it working on Ubuntu 18.04 x86_64 5.0.0-38-generic.
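
One way to get a newer kernel on Ubuntu 18.04 is the HWE (hardware enablement) stack; a minimal sketch, to be run on every node (the exact kernel version you end up with depends on your point release, so verify it matches a version known to work):

    # Install the HWE kernel stack and reboot into the new kernel
    sudo apt update
    sudo apt install --install-recommends linux-generic-hwe-18.04
    sudo reboot

    # After the reboot, verify the running kernel
    uname -r    # e.g. 5.0.0-38-generic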

The GitHub issue that helped me: https://github.com/coreos/bugs/issues/2616

This is indeed a tricky issue; I struggled to find a solution and spent a lot of time trying to understand what was happening. I hope this information helps someone, because there is not much about it on Google.