architecture timeout aerospike key-value-store

Getting batch-read timeouts in Aerospike: batch_queue piles up

Background:

I am using an Aerospike cluster with 9 nodes. The cluster seems to work fine, but some of the batch reads intermittently time out. The timeouts occur on the server side and, interestingly, on only 2 of the 9 nodes. I suspected key hotspotting, but that does not seem to be the case.

On checking the server statistics, what stands out is the correlation between the batch_queue size and the timeouts.

Command: asadm -e "watch 1 100 show stat like batch"

[ 2017-09-07 20:56:10 'show stat like batch' sleep: 1.0s iteration: 47 of 100 ]

batch_queue : 586
batch_timeout : 81709

[ 2017-09-07 20:56:11 'show stat like batch' sleep: 1.0s iteration: 48 of 100 ]

batch_queue : 545
batch_timeout : 84357

[ 2017-09-07 20:56:12 'show stat like batch' sleep: 1.0s iteration: 49 of 100 ]

batch_queue : 0
batch_timeout : 88544

[ 2017-09-07 20:56:13 'show stat like batch' sleep: 1.0s iteration: 50 of 100 ]

batch_queue : 0
batch_timeout : 88544

[ 2017-09-07 20:56:14 'show stat like batch' sleep: 1.0s iteration: 51 of 100 ]

batch_queue : 0
batch_timeout : 88544

There seems to be a clear correlation between the batch_queue piling up and requests timing out: batch_timeout climbs while batch_queue is non-zero and stops climbing once the queue has drained.
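To make the correlation easier to see, here is a minimal sketch that turns the samples above into per-second timeout counts. It assumes batch_timeout is a cumulative counter (it never decreases in the output above); the variable and function names are my own, not part of any Aerospike tool.

```python
# (batch_queue, batch_timeout) pairs copied from the five asadm iterations above.
samples = [
    (586, 81709),
    (545, 84357),
    (0, 88544),
    (0, 88544),
    (0, 88544),
]

def timeout_deltas(samples):
    """Return the new timeouts recorded between consecutive samples.

    Assumes batch_timeout is a cumulative counter.
    """
    return [curr[1] - prev[1] for prev, curr in zip(samples, samples[1:])]

deltas = timeout_deltas(samples)

# Pair each interval's starting queue depth with the timeouts added during it.
for (queue, _), delta in zip(samples, deltas):
    print(f"batch_queue at start of interval: {queue:4d}  new timeouts: {delta}")
```

With these numbers the two intervals that start with a non-empty queue add 2648 and 4187 timeouts, and the intervals that start with an empty queue add none.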

Questions

What exactly is this batch queue, and why does it pile up on only a couple of the Aerospike nodes?
How can I fix this?

Thanks

Edit:

http://www.aerospike.com/docs/guide/batch.html answers the first question reasonably well.

Solution

I would suggest, if possible (depending on the client), moving over to batch-index. Timeouts on only some of the nodes may indicate a few different things:

some nodes getting more records per batch than others
some difference in those nodes (CPU, kernel version, storage, config) causing them to be slower
other activity on those nodes causing them to slow down (hot keys on other read/write transactions)

Basically, anything that slows those nodes down will cause the batch queue to pile up and some batch transactions to time out.
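The first possibility in the list (some nodes getting more records per batch than others) can be checked on the client side by tallying where each key of a batch lands. The sketch below is only an illustration: node_for_key is a hypothetical stand-in for however your client resolves a key to its owning node, and the toy mapping used here is not Aerospike's partition logic.

```python
from collections import Counter

def records_per_node(batch_keys, node_for_key):
    """Count how many keys of one batch are owned by each node.

    node_for_key is a hypothetical callable (key -> node name) that you
    would have to supply from your own client or partition map.
    """
    return Counter(node_for_key(key) for key in batch_keys)

def skew(counts):
    """Busiest node's count divided by the mean count (1.0 means even).

    Only nodes that received at least one key are considered.
    """
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean

# Toy example: the first character of the key stands in for the owning node.
batch = ["a1", "a2", "a3", "a4", "b1", "c1"]
counts = records_per_node(batch, lambda key: "node-" + key[0])
print(counts.most_common())
print("skew:", skew(counts))
```

If the same one or two nodes keep showing up with a much larger share across many batches, uneven key distribution within batches is a plausible cause; if the shares are even, the node-specific causes in the list are more likely.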

Finally, you could try increasing batch-threads and batch-priority if you haven't done so yet.