Search code examples
cassandranosqldistributed-databasesecondary-indexes

Why secondary indexes are less efficient in Cassandra?


I read in Cassandra documentation that creating secondary index is less efficient as because in worst case it need to touch all nodes in order to find out the data of that non-key column.

But my doubt is even if we do not create secondary index, then also it will have to touch all nodes (in worst case) and find out where that particular row with this non-key column value resides.

Note: Yeah, I understand that it is possible that if the cardinality is high then the secondary index will contain(store) index for mostly all rows and in this way it is bad in terms of storage. But I want to know how not creating secondary index is efficient than creating secondary index?


Solution

  • Secondary indexes should be used only in specific cases, like, when you use them together with condition on partition key column, you have correct cardinality for data, etc.

    For example, if we have following table:

    create table test.test (
      pk int,
      c1 int,
      val1 int,
      val2 int,
      primary key(pk, c1));
    

    and you created a secondary index on the column val2, then following query will be very effective:

    select * from test.test where pk = 123 and val2 = 10
    

    because you restricted the execution of query only to the nodes that are replicas for pk with value 123.

    But if you do

    select * from test.test where val2 = 10
    

    then Cassandra will need to go to the every node, and ask for data there - it will be much slower, and put a pressure to coordinating node.

    Standard secondary indexes have other limitations, such as, search only for specific values, problems when column has very low or very high cardinality, etc. SASI indexes are better from design standpoint, although they are still experimental, and have problems with implementation.

    You can find technical details about implementation of secondary indexes in the following blog post.

    DataStax has other implementations in the commercial offering:

    • DSE Search that is based on the Apache Solr, so you get a lot of flexibility (full text search, range queries, etc.)
    • new implementation called SSTable Attached Indexes (SAI) - they are currently marked as beta, but they provide more flexibility than standard secondary indexes, with less overhead than DSE Search