java multithreading performance configuration aerospike

Slow queryAggregate in Aerospike with multithreading

We're creating a Java client to write data directly into memory in Aerospike, and another Java client to read data from memory. Both clients are multi-threaded.

Our read client runs several queryAggregate operations, which are implemented as UDFs.

We're facing the following issue:

If we allocate only 1 thread for write operations and 2 threads for read operations, we get ~25K TPS for reads.

If we allocate 2 threads for write operations and keep the same number of read threads, read throughput drops below 10K TPS.

The Aerospike server runs on a machine with 24 physical CPU cores. Both the writing and reading clients run on this machine at the same time. The machine runs almost nothing besides the Aerospike server, so CPU resources are otherwise free.

Below is our current Aerospike server configuration:

paxos-single-replica-limit=1;pidfile=null;proto-fd-max=15000;advertise-ipv6=false;auto-pin=none;batch-threads=4;batch-max-buffers-per-queue=255;batch-max-requests=5000;batch-max-unused-buffers=256;batch-priority=200;batch-index-threads=24;clock-skew-max-ms=1000;cluster-name=null;enable-benchmarks-fabric=false;enable-benchmarks-svc=false;enable-hist-info=false;hist-track-back=300;hist-track-slice=10;hist-track-thresholds=null;info-threads=16;log-local-time=false;migrate-max-num-incoming=4;migrate-threads=1;min-cluster-size=1;node-id-interface=null;nsup-delete-sleep=100;nsup-period=120;nsup-startup-evict=true;proto-fd-idle-ms=60000;proto-slow-netio-sleep-ms=1;query-batch-size=100;query-buf-size=2097152;query-bufpool-size=256;query-in-transaction-thread=false;query-long-q-max-size=500;query-microbenchmark=false;query-pre-reserve-partitions=false;query-priority=10;query-priority-sleep-us=1;query-rec-count-bound=18446744073709551615;query-req-in-query-thread=false;query-req-max-inflight=100;query-short-q-max-size=500;query-threads=6;query-threshold=10;query-untracked-time-ms=1000;query-worker-threads=15;run-as-daemon=true;scan-max-active=100;scan-max-done=100;scan-max-udf-transactions=32;scan-threads=4;service-threads=24;sindex-builder-threads=4;sindex-gc-max-rate=50000;sindex-gc-period=10;ticker-interval=10;transaction-max-ms=1000;transaction-pending-limit=20;transaction-queues=4;transaction-retry-ms=1002;transaction-threads-per-queue=4;work-directory=/opt/aerospike;debug-allocations=none;fabric-dump-msgs=false;max-msgs-per-type=-1;prole-extra-ttl=0;service.port=3000;service.address=any;service.access-port=0;service.alternate-access-port=0;service.tls-port=0;service.tls-access-port=0;service.tls-alternate-access-port=0;service.tls-name=null;heartbeat.mode=multicast;heartbeat.multicast-group=239.1.99.222;heartbeat.port=9918;heartbeat.interval=150;heartbeat.timeout=10;heartbeat.mtu=1500;heartbeat.protocol=v3;fabric.port=3001;fabric.tls-port=0;fabric.tls-name=null;fabric.channel-bulk-fds=2;fabric.channel-bulk-recv-threads=4;fabric.channel-ctrl-fds=1;fabric.channel-ctrl-recv-threads=4;fabric.channel-meta-fds=1;fabric.channel-meta-recv-threads=4;fabric.channel-rw-fds=8;fabric.channel-rw-recv-threads=16;fabric.keepalive-enabled=true;fabric.keepalive-intvl=1;fabric.keepalive-probes=10;fabric.keepalive-time=1;fabric.latency-max-ms=5;fabric.recv-rearm-threshold=1024;fabric.send-threads=8;info.port=3003;enable-security=false;privilege-refresh-period=300;report-authentication-sinks=0;report-data-op-sinks=0;report-sys-admin-sinks=0;report-user-admin-sinks=0;report-violation-sinks=0;syslog-local=-1
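A dump like the one above is hard to scan by eye. As a rough sketch, the semicolon-separated key=value pairs can be split into a map so that individual settings (for example batch-index-threads or query-threads) are easy to look up. The class name InfoConfigParser and the shortened sample string are only illustrative; nothing here talks to a server.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class InfoConfigParser {
    /** Splits "k1=v1;k2=v2" into an ordered map; entries without '=' are skipped. */
    static Map<String, String> parse(String dump) {
        Map<String, String> settings = new LinkedHashMap<>();
        for (String pair : dump.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                settings.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return settings;
    }

    public static void main(String[] args) {
        // Shortened sample taken from the dump above.
        String dump = "batch-index-threads=24;query-threads=6;query-priority=10;"
                + "query-in-transaction-thread=false";
        Map<String, String> settings = parse(dump);
        for (String name : new String[] {"batch-index-threads", "query-threads", "query-priority"}) {
            System.out.println(name + " = " + settings.get(name));
        }
    }
}
```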

Below is the aerospike.conf file:

# Aerospike database configuration file for use with systemd.

service {
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    proto-fd-max 15000
}

logging {
    console {
        context any info
    }
}

network {
    service {
        address any
        port 3000
    }

    heartbeat {
        mode multicast
        multicast-group 239.1.99.222
        port 9918

        # To use unicast-mesh heartbeats, remove the 3 lines above, and see
        # aerospike_mesh.conf for alternative.

        interval 150
        timeout 10
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace test {
    replication-factor 2
    memory-size 4G
    default-ttl 30d # 30 days, use 0 to never expire/evict.

    storage-engine memory
}

namespace bar {
    replication-factor 2
    memory-size 4G
    default-ttl 30d # 30 days, use 0 to never expire/evict.

    storage-engine memory

    # To use file storage backing, comment out the line above and use the
    # following lines instead.
#   storage-engine device {
#       file /opt/aerospike/data/bar.dat
#       filesize 16G
#       data-in-memory true # Store data in memory in addition to file.
#   }
}

Could someone please let us know where our current bottleneck is? How can we increase the read throughput when increasing the number of writing threads?

The above configuration is the default; we haven't changed anything yet.

Solution

What's unclear in your question

First, I'm not sure what you mean by reading with 1 thread or 2 threads. You say you use 2 instances of AerospikeClient. Are these split across different client machines, or are they both on the same machine?

Next, the Java client is multithreaded (it isn't limited to 1 thread or 2 threads as you wrote). If you're using the synchronous client, each operation runs in a thread, and that thread waits for the response. Please look at the introduction to the Java client on the Aerospike site.
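To make the threading point concrete, here is a minimal sketch of the usual pattern: one shared client object, many worker threads, each thread blocking on its own synchronous call. The SyncClient interface is a hypothetical stand-in, not the Aerospike API; in a real application the shared object would be your single AerospikeClient instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

final class SharedClientDemo {
    /** Hypothetical stand-in for a thread-safe synchronous client. */
    interface SyncClient {
        long query(int requestId);
    }

    /** Runs opsPerThread blocking calls on each of `threads` workers sharing one client. */
    static long runWorkers(SyncClient client, int threads, int opsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong completed = new AtomicLong();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                pending.add(pool.submit(() -> {
                    for (int i = 0; i < opsPerThread; i++) {
                        client.query(i);             // the calling thread waits here
                        completed.incrementAndGet();
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get();                             // wait for every worker to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return completed.get();
    }
}
```

With a synchronous client, the number of in-flight requests is bounded by the number of threads calling it, so the worker count is the knob that matters, not the number of client instances.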

Is your Aerospike cluster just a single node? It can't do replication factor 2 with just one node.

Predicate Filtering vs. UDF based logic

Whatever logic you're doing in the filter of your stream UDF, try to move it to predicate filtering instead. In the Java client this is implemented in the PredExp class (see the examples for it).

Configuration Tuning

You're doing writes and queries, with no single-record reads or batch reads. You should tune the batch-index threads down and the query threads up.

You have two in-memory namespaces that are configured identically. Kill both test and bar and let's define a different one:

service {
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    proto-fd-max 15000
    batch-index-threads 2 # you don't need 24 batch threads, you're not using them
    query-threads 24 # setting it to #cpu
    query-in-transaction-thread true # because you query an in-memory namespace
    query-priority 40
    # auto-pin cpu # uncomment this if you have kernel >= 3.19
}

logging {
    console {
        context any info
    }
}

network {
    service {
        address any
        port 3000
    }

    heartbeat {
        mode multicast
        multicast-group 239.1.99.222
        port 9918

        # To use unicast-mesh heartbeats, remove the 3 lines above, and see
        # aerospike_mesh.conf for alternative.

        interval 150
        timeout 10
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace demo {
    replication-factor 2
    memory-size 10G
    partition-tree-sprigs 4096 # maximize these for in-memory, you have plenty of DRAM
    default-ttl 30d

    storage-engine memory
}

I believe you should:

Lower the number of batch-index threads (batch-index-threads).
Increase query-threads to one per CPU core.
Raise query-priority.
Set query-in-transaction-thread to true, because you're working with an in-memory namespace.
Maximize partition-tree-sprigs.
Use auto-pin cpu if your kernel is >= 3.19.

See: What's New In Aerospike 3.12, What's New in Aerospike 3.13 & 3.14

What Else?

This remains to be seen, based on the results you get with the adjusted configuration. Later, for capacity planning, you'll need to figure out how many objects you have in the system and what their average object size is.
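As a rough illustration of that capacity arithmetic, the sketch below multiplies the object count by the replication factor and adds 64 bytes of primary index per record copy. It deliberately ignores per-record and per-bin overhead, sprig memory, and secondary indexes, so treat the result as a lower bound. The class name and the input numbers are made up for the example; they are not measurements from your system.

```java
final class CapacityEstimate {
    // Each record copy uses a 64-byte primary index entry.
    static final long PRIMARY_INDEX_BYTES_PER_RECORD = 64L;

    /** Cluster-wide primary index memory for all record copies. */
    static long primaryIndexBytes(long records, int replicationFactor) {
        return records * replicationFactor * PRIMARY_INDEX_BYTES_PER_RECORD;
    }

    /** Cluster-wide data memory, ignoring per-record and per-bin overhead. */
    static long dataBytes(long records, long avgRecordBytes, int replicationFactor) {
        return records * avgRecordBytes * replicationFactor;
    }

    /** Lower-bound estimate of total memory across the cluster. */
    static long totalBytes(long records, long avgRecordBytes, int replicationFactor) {
        return primaryIndexBytes(records, replicationFactor)
                + dataBytes(records, avgRecordBytes, replicationFactor);
    }

    public static void main(String[] args) {
        long records = 10_000_000L;   // illustrative object count
        long avgRecordBytes = 500L;   // illustrative average object size
        int replicationFactor = 2;    // as in the namespace config
        System.out.println("index bytes: " + primaryIndexBytes(records, replicationFactor));
        System.out.println("data bytes:  " + dataBytes(records, avgRecordBytes, replicationFactor));
        System.out.println("total bytes: " + totalBytes(records, avgRecordBytes, replicationFactor));
    }
}
```

Divide the total by the number of nodes to compare it against the per-node memory-size of the namespace.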