
Ceph Monitor out of quorum


We're experiencing a problem with one of our Ceph monitors. The cluster uses 3 monitors and they are all up and running. They can communicate with each other, and each of them returns a sensible ceph -s output. However, the quorum shows the second monitor as down. The ceph -s output from the supposedly down monitor is below:

cluster:
    id:     bb1ab46a-d282-4530-bf5c-021e9c940958
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            noout flag(s) set
            9 large omap objects
            47 pgs not deep-scrubbed in time
            application not enabled on 2 pool(s)
            1/3 mons down, quorum mon1,mon3

  services:
    mon:        3 daemons, quorum mon1,mon3 (age 3d), out of quorum: mon2
    mgr:        mon1(active, since 3d)
    mds:        filesystem:1 {0=mon1=up:active}
    osd:        77 osds: 77 up (since 3d), 77 in (since 2w)
                flags noout
    rbd-mirror: 1 daemon active (12512649)
    rgw:        1 daemon active (mon1)

  data:
    pools:   13 pools, 1500 pgs
    objects: 65.36M objects, 23 TiB
    usage:   85 TiB used, 701 TiB / 785 TiB avail
    pgs:     1500 active+clean

  io:
    client:   806 KiB/s wr, 0 op/s rd, 52 op/s wr

systemctl status ceph-mon@2.service shows:

ceph-mon@2.service - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Tue 2020-12-08 12:12:58 +03; 28s ago
  Process: 2681 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
 Main PID: 2681 (code=exited, status=1/FAILURE)

Dec 08 12:12:48 mon2 systemd[1]: Unit ceph-mon@2.service entered failed state.
Dec 08 12:12:48 mon2 systemd[1]: ceph-mon@2.service failed.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon@2.service holdoff time over, scheduling restart.
Dec 08 12:12:58 mon2 systemd[1]: Stopped Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: start request repeated too quickly for ceph-mon@2.service
Dec 08 12:12:58 mon2 systemd[1]: Failed to start Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: Unit ceph-mon@2.service entered failed state.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon@2.service failed.

Restarting, stopping/starting, and enabling/disabling the monitor daemon did not work. The docs mention the monitor asok file in /var/run/ceph; I don't have it in that directory, yet the other monitors have their asok files right in place. I'm now at a point where I can't even stop the monitor daemon on the second monitor, it just stays in the failed state. No logs show up in the monitor logs under /var/log/ceph. What am I supposed to do? I don't have much experience with Ceph, so I don't want to change anything without being absolutely sure, to avoid messing up the cluster.
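In case it helps, the restart attempts and the asok check were roughly along these lines (I'm assuming the monitor ID is 2, matching the unit name, so the socket would be ceph-mon.2.asok):

    systemctl restart ceph-mon@2.service
    systemctl stop ceph-mon@2.service && systemctl start ceph-mon@2.service
    systemctl disable ceph-mon@2.service && systemctl enable ceph-mon@2.service
    # admin socket the docs mention; it exists on mon1 and mon3 but is missing here
    ls /var/run/ceph/ceph-mon.2.asok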


Solution

  • Try to start the service manually on MON2 (assuming the default cluster name ceph) with:

    /usr/bin/ceph-mon -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
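
  • If the manual start prints nothing useful, running it with -d instead of -f keeps the daemon in the foreground and logs to stderr, which usually shows the failure reason directly on the terminal.

  • Once the underlying cause is fixed, clear systemd's start-limit (the "start request repeated too quickly" message above) before starting the unit again, roughly:

    systemctl reset-failed ceph-mon@2.service
    systemctl start ceph-mon@2.service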