Search code examples
aerospikeclustered-indexdistributed-database

Aerospike: How Primary & Secondary Index works internally


We are using Aerospike DB and was going through the documentation.
I could not find good explanation of algorithm explaining how Primary & Secondary index works.
The documentation says it uses some sort of distributed hash + B Tree.

Could someone please explain it.


Solution

  • The primary index is a mix of a distributed hash and distributed trees. It holds the metadata for every record in the Aerospike cluster.

    Each namespace has 4096 partitions that are evenly distributed to the nodes of the cluster, by way of the partition map. Within the node, the primary index is an in-memory structure that indexes only the partitions assigned to the node.

    The primary index has a hash table that leads to sprigs. Each sprig is a red-black tree that holds a portion of the metadata. The number of sprigs per-partition is configurable through partition-tree-sprigs.

    Therefore, to find any record in the cluster, the client first uses the record's digest to find the correct node with one lookup against the partition map. Then, the node holding the master partition for the record will look up its metadata in the primary index. If this namespace stores data on SSD, the metadata includes the device, block ID and byte offset of the record, so it can be read with a single read operation. The records are stored contiguously, whether on disk or in memory.

    The primary index is used for operations against a single record (identified by its key), or batch operations against multiple records (identified by a list of keys). It's also used by scans.

    Secondary indexes are optional in-memory structures within each node of the cluster, that also only index the records of the partitions assigned to each node. They're used for query operations, which are intended to return many records based on a non-key predicate.

    Because Aerospike is a distributed database, a query must go to all the nodes. The concurrency level (how many nodes are queried at a time) is controlled through a query policy in the client. Each node receiving the query has to lookup the criteria of the predicate against the appropriate secondary index. This returns zero to many records. At this point the optional predicate filter can be applied. The records found by secondary index query are then streamed back to the client. See the documentation on managing indexes.