
Corrupted file in etcd causes Kubernetes not to start


After rebooting the node (which is also the master), a file in the etcd data directory became corrupted:

my_node: ~ # cd /var/lib/etcd/member/snap/
my_node: snap # ls -lsa
ls: could not access 0000000000000005-00000000008cb33c.snap: input/output error
totale 5040
   4 drwx------ 2 root root     4096  3 apr 11.20 .
   4 drwx------ 4 root root     4096  3 apr 11.20 ..
   8 -rw-r--r-- 1 root root     8177  2 apr 14.14 0000000000000005-00000000008c3e09.snap
   8 -rw-r--r-- 1 root root     8177  2 apr 16.31 0000000000000005-00000000008c651a.snap
   8 -rw-r--r-- 1 root root     8177  2 apr 18.48 0000000000000005-00000000008c8c2b.snap
   ? -????????? ? ?    ?           ?            ? 0000000000000005-00000000008cb33c.snap
   8 -rw-r--r-- 1 root root     8177  1 apr 20.01 0000000000000005-00000000008cda4d.snap.broken
5000 -rw------- 1 root root 16805888  2 apr 07.20 db

The etcd container shows a panic:

2018-04-03 09:20:23.578267 W | snap: cannot rename broken snapshot file /var/lib/etcd/member/snap/0000000000000005-00000000008cb33c.snap to /var/lib/etcd/member/snap/0000000000000005-00000000008cb33c.snap.broken: rename /var/lib/etcd/member/snap/0000000000000005-00000000008cb33c.snap /var/lib/etcd/member/snap/0000000000000005-00000000008cb33c.snap.broken: input/output error
2018-04-03 09:20:23.579220 I | etcdserver: recovered store from snapshot at index 9210923
2018-04-03 09:20:23.579250 I | etcdserver: name = default
2018-04-03 09:20:23.579257 I | etcdserver: data dir = /var/lib/etcd
2018-04-03 09:20:23.579263 I | etcdserver: member dir = /var/lib/etcd/member
2018-04-03 09:20:23.579269 I | etcdserver: heartbeat = 100ms
2018-04-03 09:20:23.579273 I | etcdserver: election = 1000ms
2018-04-03 09:20:23.579278 I | etcdserver: snapshot count = 10000
2018-04-03 09:20:23.579294 I | etcdserver: advertise client URLs = http://127.0.0.1:2379
2018-04-03 09:20:23.579714 I | etcdserver: restarting member 0 in cluster 0 at commit index 0
panic: cannot use none as id

goroutine 1 [running]: ...

I am running a single node cluster.

What's the best strategy for dealing with this problem? Any suggestions are welcome.


Solution

  • This is not a problem with Kubernetes or etcd itself. It can happen to any application that is writing a file while the server is rebooting.

    The root cause is a damaged file in the filesystem. I do not know which filesystem you are using, but in most cases this kind of error is repaired by the system on the next boot. If it cannot be repaired, the problem was serious.

    What I can suggest:

    1. Do not reboot a VM or server until all important software has stopped, especially something like etcd, which requires strong data consistency. If you need to reboot a "single node" Kubernetes cluster, stop the nodes first, then the master. Do not force it; give them time to shut down. During a reboot, every application has a limited time to shut down. If it does not finish in time, the OS simply kills it, which may be the root of your problem.

    2. Use a journaling filesystem such as ext4 or ReiserFS. After an unclean shutdown it can replay its journal and automatically recover some corrupted files and metadata.

    3. Run applications in clustered mode where possible. For example, with a 3-node etcd cluster, you will not lose your data if one node has a problem.
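
    To make the last point concrete, here is a minimal sketch of the majority (quorum) arithmetic that a Raft-based store such as etcd relies on. The function names `quorum` and `fault_tolerance` are made up for this illustration and are not part of etcd or etcdctl; the script only does the arithmetic and does not talk to a cluster.

```shell
#!/usr/bin/env bash
# Quorum arithmetic for a majority-based cluster such as etcd.
# NOTE: quorum and fault_tolerance are hypothetical helper names used
# only for this illustration; they are not etcd commands.

# A cluster of N members needs floor(N/2) + 1 members to agree on a write.
quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# The cluster keeps working while at most N - quorum members are down.
fault_tolerance() {
  local members=$1
  echo $(( members - (members / 2 + 1) ))
}

# Print the numbers for a few common cluster sizes.
for n in 1 2 3 5; do
  printf 'members=%d quorum=%d tolerated_failures=%d\n' \
    "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
done
```

    A single member tolerates zero failures, which is the situation in the question. Two members also tolerate zero, so three is the smallest size that survives the loss of one node.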