In the Aerospike set we have four bins: userId, adId, timestamp, and eventType. The primary key is userId:timestamp, and a secondary index is created on userId to fetch all the records for a particular user; the resulting records are passed to a stream UDF. On our client side the Aerospike query latency is reasonable up to 500 QPS, with mean latency in microseconds, but as soon as we push the QPS above 500 the query latency shoots up to around 10 ms.
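For context, here is roughly how the query is issued from the Java client (a sketch only: the host, namespace, and set names are placeholders, and the UDF module is assumed to be registered as ad_count.lua):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.ResultSet;
import com.aerospike.client.query.Statement;

public class AdCountQuery {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000); // placeholder host
        try {
            Statement stmt = new Statement();
            stmt.setNamespace("test");  // placeholder namespace
            stmt.setSetName("events");  // placeholder set name
            // The secondary index on userId drives the query...
            stmt.setFilter(Filter.equal("userId", "user-123"));
            // ...and matching records are streamed through the ad_count UDF
            stmt.setAggregateFunction("ad_count", "ad_count");

            ResultSet rs = client.queryAggregate(null, stmt);
            try {
                while (rs.next()) {
                    // Each result is a map of adId -> list of timestamps
                    System.out.println(rs.getObject());
                }
            } finally {
                rs.close();
            }
        } finally {
            client.close();
        }
    }
}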
The client-side thread dump we see during the slowdown is below:
Name: Aerospike-13780
State: WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@41aa29a4
Total blocked: 0 Total waited: 554,450
Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
com.aerospike.client.lua.LuaInputStream.read(LuaInputStream.java:38)
com.aerospike.client.lua.LuaStreamLib$read.call(LuaStreamLib.java:60)
org.luaj.vm2.LuaClosure.execute(Unknown Source)
org.luaj.vm2.LuaClosure.onInvoke(Unknown Source)
org.luaj.vm2.LuaClosure.invoke(Unknown Source)
org.luaj.vm2.LuaClosure.execute(Unknown Source)
org.luaj.vm2.LuaClosure.onInvoke(Unknown Source)
org.luaj.vm2.LuaClosure.invoke(Unknown Source)
org.luaj.vm2.LuaValue.invoke(Unknown Source)
com.aerospike.client.lua.LuaInstance.call(LuaInstance.java:128)
com.aerospike.client.query.QueryAggregateExecutor.runThreads(QueryAggregateExecutor.java:104)
com.aerospike.client.query.QueryAggregateExecutor.run(QueryAggregateExecutor.java:77)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Below is the Lua file:
function ad_count(stream)
    -- Keep only the bins the aggregation needs
    local function map_function(record)
        local result = map()
        result["adId"] = record["adId"]
        result["timestamp"] = record["timestamp"]
        return result
    end

    -- Build a map of adId -> list of timestamps
    local function add_fn(aggregate, record)
        local ad_id = record["adId"]
        local map_result = aggregate[ad_id]
        if map_result == nil then
            map_result = list()
        end
        list.append(map_result, record["timestamp"])
        aggregate[ad_id] = map_result
        return aggregate
    end

    return stream:map(map_function):aggregate(map(), add_fn)
end
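For completeness, the UDF module has to be registered with the cluster before queries can reference it; a sketch with the Java client (the local path and host are placeholders):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Language;
import com.aerospike.client.task.RegisterTask;

public class RegisterAdCount {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000); // placeholder host
        try {
            // Upload the local Lua module to the cluster as ad_count.lua
            RegisterTask task = client.register(null, "udf/ad_count.lua",
                    "ad_count.lua", Language.LUA);
            task.waitTillComplete(); // block until every node has the module
        } finally {
            client.close();
        }
    }
}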
There are 2 nodes, and the servers are hosted in AWS on t2.large instances. The server configuration is as follows:
transaction-queues=8;transaction-threads-per-queue=8;transaction-duplicate-threads=0;transaction-pending-limit=20;migrate-threads=1;migrate-xmit-priority=40;migrate-xmit-sleep=500;migrate-read-priority=10;migrate-read-sleep=500;migrate-xmit-hwm=10;migrate-xmit-lwm=5;migrate-max-num-incoming=256;migrate-rx-lifetime-ms=60000;proto-fd-max=15000;proto-fd-idle-ms=60000;proto-slow-netio-sleep-ms=1;transaction-retry-ms=1000;transaction-max-ms=1000;transaction-repeatable-read=false;dump-message-above-size=134217728;ticker-interval=10;microbenchmarks=false;storage-benchmarks=false;ldt-benchmarks=false;scan-max-active=100;scan-max-done=100;scan-max-udf-transactions=32;scan-threads=4;batch-index-threads=4;batch-threads=4;batch-max-requests=5000;batch-max-buffers-per-queue=255;batch-max-unused-buffers=256;batch-priority=200;nsup-delete-sleep=100;nsup-period=120;nsup-startup-evict=true;paxos-retransmit-period=5;paxos-single-replica-limit=1;paxos-max-cluster-size=32;paxos-protocol=v3;paxos-recovery-policy=manual;write-duplicate-resolution-disable=false;respond-client-on-master-completion=false;replication-fire-and-forget=false;info-threads=16;allow-inline-transactions=true;use-queue-per-device=false;snub-nodes=false;fb-health-msg-per-burst=0;fb-health-msg-timeout=200;fb-health-good-pct=50;fb-health-bad-pct=0;auto-dun=false;auto-undun=false;prole-extra-ttl=0;max-msgs-per-type=-1;service-threads=40;fabric-workers=16;pidfile=/var/run/aerospike/asd.pid;memory-accounting=false;udf-runtime-gmax-memory=18446744073709551615;udf-runtime-max-memory=18446744073709551615;sindex-builder-threads=4;sindex-data-max-memory=18446744073709551615;query-threads=6;query-worker-threads=15;query-priority=10;query-in-transaction-thread=0;query-req-in-query-thread=0;query-req-max-inflight=100;query-bufpool-size=256;query-batch-size=100;query-priority-sleep-us=1;query-short-q-max-size=500;query-long-q-max-size=500;query-rec-count-bound=18446744073709551615;query-threshold=10;query-untracked-time-ms=1000;pre-reserve-qnodes=false;service-address=0.0.0.0;service-port=3000;mesh-address=10.0.1.80;mesh-port=3002;reuse-address=true;fabric-port=3001;fabric-keepalive-enabled=true;fabric-keepalive-time=1;fabric-keepalive-intvl=1;fabric-keepalive-probes=10;network-info-port=3003;enable-fastpath=true;heartbeat-mode=mesh;heartbeat-protocol=v2;heartbeat-address=10.0.1.80;heartbeat-port=3002;heartbeat-interval=150;heartbeat-timeout=10;enable-security=false;privilege-refresh-period=300;report-authentication-sinks=0;report-data-op-sinks=0;report-sys-admin-sinks=0;report-user-admin-sinks=0;report-violation-sinks=0;syslog-local=-1;enable-xdr=false;xdr-namedpipe-path=NULL;forward-xdr-writes=false;xdr-delete-shipping-enabled=true;xdr-nsup-deletes-enabled=false;stop-writes-noxdr=false;reads-hist-track-back=1800;reads-hist-track-slice=10;reads-hist-track-thresholds=1,8,64;writes_master-hist-track-back=1800;writes_master-hist-track-slice=10;writes_master-hist-track-thresholds=1,8,64;proxy-hist-track-back=1800;proxy-hist-track-slice=10;proxy-hist-track-thresholds=1,8,64;udf-hist-track-back=1800;udf-hist-track-slice=10;udf-hist-track-thresholds=1,8,64;query-hist-track-back=1800;query-hist-track-slice=10;query-hist-track-thresholds=1,8,64;query_rec_count-hist-track-back=1800;query_rec_count-hist-track-slice=10;query_rec_count-hist-track-thresholds=1,8,64
We even changed the following config parameters, but doing so increased the latency further:
query-batch-size=1000
query-short-q-max-size=100000
query-long-q-max-size=100000
query-threads=28
query-worker-threads=400
query-req-max-inflight=1000
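These can be applied at runtime over the info protocol rather than by editing the config file; a sketch with the Java client (assuming the parameter is dynamically settable on this server version, and the host is a placeholder):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Info;
import com.aerospike.client.cluster.Node;

public class TuneQuery {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000); // placeholder host
        try {
            // Apply a dynamic service-context parameter on every node;
            // the same request string works with "asinfo -v"
            for (Node node : client.getNodes()) {
                String response = Info.request(node,
                        "set-config:context=service;query-threads=28");
                System.out.println(node.getName() + ": " + response);
            }
        } finally {
            client.close();
        }
    }
}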
The same question, with the answer below, is on the Aerospike community forum: https://discuss.aerospike.com/t/high-query-latency/2769/3
You're mainly bumping against the limits of that specific instance. We do not recommend using the t2 family. Please review the recommendations section of Aerospike's Amazon deployment guide. You should consider instances in the m3, c3, c4, r3, or i2 instance families.
A quick note on your configuration:
- transaction-queues should be tuned to the number of cores. You have it set to 8, and a t2.large has 2 cores.
- transaction-threads-per-queue is set too high for this instance type.
- query-threads and query-worker-threads are both set too high for this instance.
Further, your queries involve a stream UDF, so they will be slower than a regular query. UDFs are useful, but you need to consider the contexts for which they're a good fit.
My suggestion is that you can probably model this differently and skip the secondary index and UDF; one possible shape is sketched below.
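A sketch of that alternative (not from the forum thread; the namespace, set, bin, and host names are placeholders, and the nested list/map operation assumes a server and Java client recent enough for CDT contexts): keep one record per user with a map bin of adId -> list of timestamps, append to it on every event, and read it back with a single primary-key get.

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.Value;
import com.aerospike.client.cdt.CTX;
import com.aerospike.client.cdt.ListOperation;
import com.aerospike.client.cdt.MapOrder;

public class AdEventsByUser {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000); // placeholder host
        try {
            // One record per user; namespace and set names are placeholders
            Key key = new Key("test", "user_ads", "user-123");

            // On each ad event, append the timestamp to the list stored under
            // the adId key of the map bin, creating the map entry if absent
            client.operate(null, key,
                    ListOperation.append("ads", Value.get(1462765488L),
                            CTX.mapKeyCreate(Value.get("ad-42"), MapOrder.UNORDERED)));

            // Fetching all (adId -> timestamps) for a user is now one
            // primary-key read: no secondary index query, no stream UDF
            Record record = client.get(null, key, "ads");
            System.out.println(record.getValue("ads"));
        } finally {
            client.close();
        }
    }
}

This keeps each user's activity in a single record addressed by its primary key, so read latency no longer depends on the query subsystem or on client-side Lua execution.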