Aerospike cluster does not reclaim available blocks

We use Aerospike in our projects and have run into a strange problem. We have a 3-node cluster, and after some node restarts it stopped working. We set up a test to reproduce and explain the problem.

We built a test cluster with 3 nodes and a replication factor of 2.

Here is our namespace config:

namespace test {
    replication-factor 2
    memory-size 100M
    high-water-memory-pct 90
    high-water-disk-pct 90
    stop-writes-pct 95
    single-bin true
    default-ttl 0
    storage-engine device {
        cold-start-empty true
        file /tmp/test.dat
        write-block-size 1M
    }
}

We wrote 100 MB of test data. After that, available pct was about 66% and disk usage about 34%.

All good so far.

Then we stopped one node. After migration, available pct was 49% and disk usage 50%.

We returned the node to the cluster. After migration, disk usage went back to roughly its previous level (about 32%), but available pct on the old nodes stayed at 49%.

We stopped the node one more time:

available pct = 31%

Repeating this once more, we ended up with available pct = 0%.

Our cluster crashed, and clients now get AerospikeException: Error Code 8: Server memory error

So how can we reclaim available pct?

Solution

If your defrag-q is empty (you can check this by grepping the logs), then the likely cause is that your namespace is smaller than your post-write-queue. Blocks on the post-write-queue are not eligible for defragmentation, so you would see avail-pct trending down with no defragmentation to reclaim the space. By default the post-write-queue is 256 blocks, which with your write-block-size of 1M equates to 256 MB. If your namespace is smaller than that, avail-pct will continue to drop until you hit stop-writes. You can reduce the size of the post-write-queue dynamically (i.e. no restart needed) using the following command; here I suggest 8 blocks:

asinfo -v 'set-config:context=namespace;id=<NAMESPACE>;post-write-queue=8'
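To make the arithmetic behind that suggestion concrete, here is a rough Python sketch. It only multiplies the queue length by the write-block-size and compares the result with the device size. The 200 MiB device size is a hypothetical value (the posted config does not state a file size), and the helper names are made up for illustration; none of this is an Aerospike API.

```python
MIB = 1024 * 1024

def post_write_queue_bytes(queue_blocks: int, write_block_size: int) -> int:
    """Bytes held by write blocks sitting on the post-write-queue."""
    return queue_blocks * write_block_size

def max_ineligible_pct(queue_blocks: int, write_block_size: int, device_bytes: int) -> float:
    """Upper bound (capped at 100) on the share of the device that the
    post-write-queue can keep out of reach of defragmentation."""
    held = post_write_queue_bytes(queue_blocks, write_block_size)
    return min(100.0, 100.0 * held / device_bytes)

# Assumption: a single 200 MiB storage file; the real size is not in the post.
DEVICE_BYTES = 200 * MIB
WRITE_BLOCK_SIZE = 1 * MIB  # write-block-size 1M from the namespace config

print(post_write_queue_bytes(256, WRITE_BLOCK_SIZE) // MIB)      # 256 (MiB, default queue)
print(max_ineligible_pct(256, WRITE_BLOCK_SIZE, DEVICE_BYTES))   # 100.0
print(max_ineligible_pct(8, WRITE_BLOCK_SIZE, DEVICE_BYTES))     # 4.0
```

Under these assumptions the default queue could cover the entire device, while 8 blocks would hold back at most a few percent of it.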

If you are happy with this value you should amend your aerospike.conf to include it so that it persists after a node restart.
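As a sketch of what that could look like, assuming the namespace from the question and the value of 8 suggested above, the setting goes inside the storage-engine block. Only the added line is new; the rest is copied from the posted config, and you should pick the value that fits your own device size.

```
namespace test {
    # ... other namespace settings unchanged ...
    storage-engine device {
        cold-start-empty true
        file /tmp/test.dat
        write-block-size 1M
        post-write-queue 8    # number of write blocks kept on the queue (default 256)
    }
}
```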