what is the advantage of using btree in aerospike for primary index?

I was going through the documentation of Aerospike. and found that for storing primary key, Aerospike uses hashing and hash is pointed to a BTree and bTree contains pointer to actual record. As far as i know Redis only uses hashing (for collision resolution they maintain a list for hash).and hash is pointed to actual record.

what is the advantage of Btree use by aerospike? doesn't it mean that to access a record by its primary key aerospike would take O(logn) ? while redis would take only O(1).

I may be wrong but that's all i understood from documentation. can some one please throw more light on this topic.?

Solution

I'm not sure the point of the question, but here goes:

Actually Aerospike's primary index is a distributed hash of red-black trees, between 1 and 4096 sprigs per partition (see the partition-tree-sprigs config param).

There are 4096 logical partitions which are evenly distributed across the nodes of the cluster. The key identifying any record is a 20-byte digest produced by passing the (namespace, set, PK) 3-tuple through RIPEMD-160 (the client does that automatically for you). The record is consistently hashed to a specific partition, as bits in this digest are used to calculate the partition ID.

As opposed to Redis, which was designed to be a single core, single-threaded application running on a single node, Aerospike was built as a distributed database. It's true that users can ad-hoc cluster Redis using application-side solutions or middleware. In the case of Aerospike all the nodes in the cluster, and all the clients share a partition map.

Since the client is aware of the cluster's partition map, it is always one hop away from the node holding the master partition (or a node holding the replica partition - this is controlled by the replica read policy). So, it's O(1) to the correct node in the cluster. Within that node it's O(1) to find the partition's rbTree, and then all operations are O(log n).

When there isn't a lot of data in a hash table (assuming you're right about the data structure used by Redis), finding a record should be O(1). But, once there are more elements than slots in the hash table it switches to a linked list, which is O(n). For the rbTree it's O(log n) for both average and worst case. Aerospike is intended to handle large data sets with predictable low latency, so the rbTree was more appropriate. The cost of getting a record will be the same regardless of the amount of data in the cluster.

Addition: as of Aerospike DB version 4.2, sprigs became much cheaper in terms of memory, and the limit off 4096 sprigs per partition has been removed. You can effectively turn sprigs into a depth 1 red-black tree by allocating enough sprigs, so O(log n) can be made to virtually be the same as O(1). For example, if you wanted an average tree depth of 1 for the sprigs, and you had a billion objects in your database, you’d set partition-tree-sprigs to 262144 (has to be a power of 2), which would cost 848MB evenly distributed to the nodes (283MB in a 3 node cluster).