
How Stratio Lucene Index works for Cassandra


I have just started looking into Stratio, and I have some basic questions that are confusing me:

  • I have heard that using secondary indexes in Cassandra is not recommended, but Stratio looks like a Lucene-based implementation of Cassandra's secondary index. Do I have to compromise Cassandra's performance if I use Stratio? Will it add latency to normal CQL queries?

  • How does it index data internally? Will it duplicate all of my existing data?

  • Is it advisable to use Stratio in production? How stable is it?

  • We can already query columns that are not partition or clustering keys by creating secondary indexes, and using Stratio feels like doing the same thing. How does Stratio's custom index really differ from Cassandra's secondary index?


Solution

  • I have heard that using secondary indexes in Cassandra is not recommended, but Stratio looks like a Lucene-based implementation of Cassandra's secondary index. Do I have to compromise Cassandra's performance if I use Stratio? Will it add latency to normal CQL queries?

    Stratio’s Cassandra Lucene Index is just another implementation of Cassandra's secondary indexes. The performance cost of indexing will probably be no worse with Stratio’s Cassandra Lucene Index than with the default indexes. The advantage, as I see it, is that Stratio's solution gives you Lucene's near-real-time free-text search capabilities, whereas Cassandra's default indexing solution is based on an exact field match. Read more here: Cassandra lucene performance question, and here: Stratio’s Cassandra Lucene Index GitHub

  • How does it index data internally? Will it duplicate all of my existing data?

    An index by definition does not duplicate the data; it is a kind of reverse lookup. Each indexed field value is stored with pointers to the actual records that contain it, like the index of terms at the back of a book. Say your records have a field that stores the "country of origin", and 50% of your records have that country set to USA and the other 50% to Canada. In the index, USA is stored only once and Canada once, each with references to half of the records. This means that the more distinct values the indexed fields hold, the more storage the index needs. This is also where Lucene handles free-text search well: it tokenizes text into individual words and applies a scoring mechanism based on how often each word occurs in the different texts. More here: Full Text Search of Dialogues with Apache Lucene: A Tutorial, and Lucene Basic Concepts
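    The "country of origin" example can be sketched in a few lines of Python. This is only a toy illustration of the reverse-lookup idea, not how Lucene or Stratio's index is actually implemented; the rows, the field names, and the `build_index` helper are all made up for the example.

```python
from collections import defaultdict

# Hypothetical rows: primary key -> record (illustrative data only).
rows = {
    1: {"name": "Ann", "country": "USA"},
    2: {"name": "Bob", "country": "Canada"},
    3: {"name": "Cy", "country": "USA"},
    4: {"name": "Di", "country": "Canada"},
}


def build_index(rows, field):
    """Map each distinct value of `field` to the keys of the rows holding it."""
    index = defaultdict(list)
    for key, record in rows.items():
        index[record[field]].append(key)
    return dict(index)


country_index = build_index(rows, "country")

# Each country is stored once, with pointers (row keys) to the records.
print(country_index)  # {'USA': [1, 3], 'Canada': [2, 4]}
```

    The index holds two entries for four rows, and the records themselves are never copied into it. A field with four distinct values would need four entries, which is why more distinct values mean a larger index.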

  • Is it advisable to use Stratio in production? How stable is it?

    This is hard to answer. It depends heavily on what you are going to use it for in production, and how. I would advise you to build a proof of concept or prototype and try it out.

  • We can already query columns that are not partition or clustering keys by creating secondary indexes, and using Stratio feels like doing the same thing. How does Stratio's custom index really differ from Cassandra's secondary index?

    As mentioned above, Lucene is good at free-text search: it offers a multitude of query types, and it is fast and flexible. On the other hand, if your search requirements are limited to a few exact-match fields, then the standard Cassandra index solution might be the way to go.
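    To make the difference concrete, here is a small Python sketch contrasting an exact field match with a tokenized, scored text search. It is a simplified illustration under my own assumptions: the documents and the helper names (`exact_match`, `tokenize`, `build_text_index`, `search`) are hypothetical, and ranking by raw word count only stands in for Lucene's real scoring, which is more sophisticated.

```python
import re
from collections import Counter, defaultdict

# Hypothetical text column: row key -> stored text (illustrative data only).
docs = {
    1: "Cassandra stores data in partitions",
    2: "Lucene indexes text; Lucene scores matching text",
    3: "lucene",
}


def exact_match(docs, value):
    """Exact field match: the whole stored value must equal the search value."""
    return [key for key, text in docs.items() if text == value]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_text_index(docs):
    """Inverted index: word -> Counter of {row key: occurrences of the word}."""
    index = defaultdict(Counter)
    for key, text in docs.items():
        for token in tokenize(text):
            index[token][key] += 1
    return index


def search(index, word):
    """Return keys of rows containing `word`, most occurrences first."""
    postings = index.get(word.lower(), Counter())
    ranked = sorted(postings.items(), key=lambda item: (-item[1], item[0]))
    return [key for key, _ in ranked]


text_index = build_text_index(docs)

print(exact_match(docs, "lucene"))   # [3]
print(search(text_index, "Lucene"))  # [2, 3]
```

    The exact match only finds the row whose whole value is "lucene", while the tokenized search also finds the row that mentions the word inside a longer sentence and ranks it first because the word occurs there twice.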

    Good luck, Teo