Search code examples
ceph

Monitor daemon running but not in quorum


I'm currently testing OS and version upgrades for a ceph cluster. Starting info: The cluster is currently on Centos 7 and Ceph version Nautilus. I'm trying to change OS with ubuntu 20.04 and version with Octopus. I started with upgrading mon1 first. I will write down the things done in order.

First of I stopped monitor service - systemctl stop ceph-mon@mon1

Then I removed the monitor from cluster - ceph mon remove mon1

Then installed ubuntu 20.04 on mon1. Updated the system and configured ufw.

Installed ceph octopus packages.

Copied ceph.client.admin.keyring and ceph.conf to mon1 /etc/ceph/

Copied ceph.mon.keyring to mon1 to a temporary folder and changed ownership to ceph:ceph

Got the monmap ceph mon getmap -o ${MONMAP} - The thing is i did this after removing the monitor.

Created /var/lib/ceph/mon/ceph-mon1 folder and changed ownership to ceph:ceph

Created the filesystem for monitor - sudo -u ceph ceph-mon --mkfs -i mon1 --monmap /folder/monmap --keyring /folder/ceph.mon.keyring

After noticing I got the monmap after the monitors removal I added it manually - ceph mon add mon1 <ip> --fsid <fsid>

After starting manually and checking cluster state with ceph -s I can see mon1 is listed but is not in quorum. The monitor daemon runs fine on the said mon1 node. I noticed on logs that mon1 is stuck in "probe" state and on other monitor logs there is an output such as mon1 (rank 2) addr [v2:<ip>:3300/0,v1:<ip>:6789/0] is down (out of quorum) , as i said the the monitor daemon is running on mon1 without any visible errors just stuck in probe state.

I wondered if it was caused by os&version change so i first tried out configuring manager, mds and radosgw daemons by creating the respective folders in /var/lib/ceph/... and copying keyrings. All these services work fine, i was able to reach to my buckets, was able to open the Octopus version dashboard, and metadata server is listed as active in ceph -s. So evidently my problem is only with monitor configuration.

After doing some checking found this on red hat ceph documantation:

If the Ceph Monitor is in the probing state longer than expected, it cannot find the other Ceph Monitors. This problem can be caused by networking issues, or the Ceph Monitor can have an outdated Ceph Monitor map (monmap) and be trying to reach the other Ceph Monitors on incorrect IP addresses. Alternatively, if the monmap is up-to-date, Ceph Monitor’s clock might not be synchronized.

There is no network error on the monitor, I can reach all the other machines in the cluster. The clocks are synchronized. If this problem is caused by the monmap situation how can I fix this?


Solution

  • Ok so as a result, directly from centos7-Nautilus to ubuntu20.04-Octopus is not possible for monitor services only, apparently the issue is about hostname resolution with different Operating systems. The rest of the services is fine. There is a longer way to do this without issue and is the correct solution. First change os from centos7 to ubuntu18.04 and install ceph-nautilus packages and add the machines to cluster (no issues at all). Then update&upgrade the system and apply "do-release-upgrade". Works like a charm. I think what eblock mentioned was this.