Tags: hadoop, bigdata, ceph

Why Ceph turns status to Err when there is still available storage space


I recently built a 3-node Ceph cluster. Each node has seven 1 TB HDDs used as OSDs, so in total I have 21 TB of raw storage space for Ceph.

However, when I ran a workload that kept writing data to Ceph, the cluster went into the Err state and no more data could be written to it.

The output of ceph -s is:

  cluster:
    id:     06ed9d57-c68e-4899-91a6-d72125614a94
    health: HEALTH_ERR
            1 full osd(s)
            4 nearfull osd(s)
            7 pool(s) full

  services:
    mon: 1 daemons, quorum host3
    mgr: admin(active), standbys: 06ed9d57-c68e-4899-91a6-d72125614a94
    osd: 21 osds: 21 up, 21 in
    rgw: 4 daemons active

  data:
    pools:   7 pools, 1748 pgs
    objects: 2.03M objects, 7.34TiB
    usage:   14.7TiB used, 4.37TiB / 19.1TiB avail
    pgs:     1748 active+clean

My understanding is that, since there is still 4.37 TiB of free space, Ceph should balance the workload itself and keep every OSD out of the full and nearfull states. But it didn't work out that way: 1 OSD is full, 4 are nearfull, and the cluster health is HEALTH_ERR.

I can no longer access Ceph via hdfs or s3cmd, so my questions are:
1. What explains the current issue?
2. How can I recover from it? Should I delete the data directly on the Ceph nodes with ceph-admin and relaunch Ceph?


Solution

  • Ceph requires free disk space to move storage chunks, called placement groups (PGs), between different disks. Because this free space is critical to the underlying functionality, Ceph goes into HEALTH_WARN once any OSD reaches the nearfull ratio (85% full by default), and stops write operations on the cluster by entering the HEALTH_ERR state once an OSD reaches the full ratio (95% by default).
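    The threshold logic can be sketched as follows. The 0.85/0.95 values are the usual defaults (mon_osd_nearfull_ratio and mon_osd_full_ratio) and are assumptions here; your cluster may be configured differently:

```python
# Sketch of how Ceph classifies OSD fullness. The thresholds below are the
# usual defaults (mon_osd_nearfull_ratio=0.85, mon_osd_full_ratio=0.95);
# they are assumptions here, not read from a real cluster.
NEARFULL_RATIO = 0.85
FULL_RATIO = 0.95

def classify_osd(used, total, nearfull=NEARFULL_RATIO, full=FULL_RATIO):
    """Return the health label Ceph would assign to a single OSD."""
    utilization = used / total
    if utilization >= full:
        return "full"        # cluster enters HEALTH_ERR, writes stop
    if utilization >= nearfull:
        return "nearfull"    # cluster enters HEALTH_WARN
    return "ok"

# Illustrative utilizations: one full OSD is enough to block writes
# cluster-wide, no matter how empty the remaining OSDs are.
osd_utilizations = [0.95, 0.87, 0.86, 0.88, 0.90, 0.60, 0.57]
states = [classify_osd(u, 1.0) for u in osd_utilizations]
print(states)
```

    This mirrors the situation in the question: a single OSD over the full ratio is enough to put the whole cluster into HEALTH_ERR, even with terabytes free elsewhere.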

    However, unless your cluster is perfectly balanced across all OSDs there is likely much more capacity available, as OSDs are typically unevenly utilized. To check overall utilization and available capacity you can run ceph osd df.

    Example output:

    ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL    %USE  VAR  PGS STATUS 
     2   hdd 2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB  72 MiB 3.6 GiB  742 GiB 73.44 1.06 406     up 
     5   hdd 2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB 119 MiB 3.3 GiB  726 GiB 74.00 1.06 414     up 
    12   hdd 2.72849  1.00000 2.7 TiB 2.2 TiB 2.2 TiB  72 MiB 3.7 GiB  579 GiB 79.26 1.14 407     up 
    14   hdd 2.72849  1.00000 2.7 TiB 2.3 TiB 2.3 TiB  80 MiB 3.6 GiB  477 GiB 82.92 1.19 367     up 
     8   ssd 0.10840        0     0 B     0 B     0 B     0 B     0 B      0 B     0    0   0     up 
     1   hdd 2.72849  1.00000 2.7 TiB 1.7 TiB 1.7 TiB  27 MiB 2.9 GiB 1006 GiB 64.01 0.92 253     up 
     4   hdd 2.72849  1.00000 2.7 TiB 1.7 TiB 1.7 TiB  79 MiB 2.9 GiB 1018 GiB 63.55 0.91 259     up 
    10   hdd 2.72849  1.00000 2.7 TiB 1.9 TiB 1.9 TiB  70 MiB 3.0 GiB  887 GiB 68.24 0.98 256     up 
    13   hdd 2.72849  1.00000 2.7 TiB 1.8 TiB 1.8 TiB  80 MiB 3.0 GiB  971 GiB 65.24 0.94 277     up 
    15   hdd 2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB  58 MiB 3.1 GiB  793 GiB 71.63 1.03 283     up 
    17   hdd 2.72849  1.00000 2.7 TiB 1.6 TiB 1.6 TiB 113 MiB 2.8 GiB  1.1 TiB 59.78 0.86 259     up 
    19   hdd 2.72849  1.00000 2.7 TiB 1.6 TiB 1.6 TiB 100 MiB 2.7 GiB  1.2 TiB 56.98 0.82 265     up 
     7   ssd 0.10840        0     0 B     0 B     0 B     0 B     0 B      0 B     0    0   0     up 
     0   hdd 2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB 105 MiB 3.0 GiB  734 GiB 73.72 1.06 337     up 
     3   hdd 2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB  98 MiB 3.0 GiB  781 GiB 72.04 1.04 354     up 
     9   hdd 2.72849        0     0 B     0 B     0 B     0 B     0 B      0 B     0    0   0     up 
    11   hdd 2.72849  1.00000 2.7 TiB 1.9 TiB 1.9 TiB  76 MiB 3.0 GiB  817 GiB 70.74 1.02 342     up 
    16   hdd 2.72849  1.00000 2.7 TiB 1.8 TiB 1.8 TiB  98 MiB 2.7 GiB  984 GiB 64.80 0.93 317     up 
    18   hdd 2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB  79 MiB 3.0 GiB  792 GiB 71.65 1.03 324     up 
     6   ssd 0.10840        0     0 B     0 B     0 B     0 B     0 B      0 B     0    0   0     up 
                        TOTAL  47 TiB  30 TiB  30 TiB 1.3 GiB  53 GiB   16 TiB 69.50                 
    MIN/MAX VAR: 0.82/1.19  STDDEV: 6.64
    

    As you can see in the above output, OSD utilization varies from 56.98% (OSD 19) to 82.92% (OSD 14), which is a significant spread.
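    The VAR column is simply each OSD's %USE divided by the cluster-wide average %USE; recomputing it from the example output above (82.92% and 56.98% against the 69.50% average in the TOTAL row) reproduces the MIN/MAX VAR line:

```python
# Recompute the VAR column from the example `ceph osd df` output above:
# VAR = per-OSD %USE / cluster-average %USE. Values taken from that output.
use_pct = {14: 82.92, 19: 56.98}   # most- and least-utilized OSDs
avg_use = 69.50                    # %USE from the TOTAL row

var = {osd: round(pct / avg_use, 2) for osd, pct in use_pct.items()}
print(var)  # {14: 1.19, 19: 0.82} -- matching MIN/MAX VAR: 0.82/1.19
```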

    As only a single OSD is full, and only 4 of your 21 OSDs are nearfull, you likely have a significant amount of storage still available in your cluster, which means that it is time to perform a rebalance operation. This can be done manually by reweighting OSDs, or you can have Ceph do a best-effort rebalance by running the command ceph osd reweight-by-utilization. Once the rebalance is complete (i.e. ceph status reports no misplaced objects) you can check the variation again with ceph osd df and trigger another rebalance if required.
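    The idea behind reweight-by-utilization can be sketched roughly as follows: OSDs whose utilization sits well above the average get their override reweight scaled down, which makes CRUSH migrate some PGs off them. This is a deliberate simplification; the real command applies additional guards (maximum change per step, maximum number of OSDs touched, and so on):

```python
# Simplified sketch of the idea behind `ceph osd reweight-by-utilization`:
# scale down the override reweight of OSDs well above average utilization
# so that CRUSH moves some PGs off them. The threshold and scaling rule
# here are illustrative assumptions, not Ceph's exact algorithm.
def reweight_by_utilization(utilizations, reweights, threshold=1.10):
    """utilizations: per-OSD %USE values; reweights: current overrides."""
    avg = sum(utilizations) / len(utilizations)
    new = []
    for use, w in zip(utilizations, reweights):
        if use > avg * threshold:
            # Pull the reweight toward avg/use so expected usage drops
            # toward the cluster average.
            new.append(round(w * avg / use, 4))
        else:
            new.append(w)
    return new

# Only the clearly overloaded OSD (82.92% vs. a 70.85% average) is touched.
print(reweight_by_utilization([82.92, 74.00, 69.50, 56.98], [1.0] * 4))
```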

    If you are on Luminous or newer, you can enable the Balancer plugin to handle OSD reweighting automatically (ceph balancer on).