Search code examples
databasecassandranosqlquery-optimization

Querying a High Cardinality Field


I am designing a data model for our orders for our upcoming Cassandra migration. An order has an orderId (arcane UUID field) and an orderNumber (user-friendly number). A getOrder query can be done by using any of the two.

My partition key is the orderId, so getByOrderId is not a problem. By getByOrderNumber is - there's a one-to-one mapping b/w the orderId and the orderNumber (high-cardinality field), so creating a local secondary index on each node would slow down my queries.

What I was wondering was that I could create a new table with the orderNumber as the partition key and the orderId as the only column (kind of a secondary index but maintained by me). So now, a getByOrderNumber query can be resolved in two calls.

Bear with me if the above solution is egregiously wrong, I am extremely new to Cassandra. As I understand, for such a column, if I used local secondary indices, Cassandra would have to query each node for a single order. So I thought why not create another table that stores the mapping.

What would I be missing on by managing this index myself? One thing I can see if for every write, I'll now have to update two tables. Anything else?


Solution

  • I thought why not create another table that stores the mapping.

    That's okay. From Cassandra documentation:

    Do not use an index in these situations:

    On high-cardinality columns because you then query a huge volume of records for a small number of results. See Problems using a high-cardinality column index below.

    Problems using a high-cardinality column index

    If you create an index on a high-cardinality column, which has many distinct values, a query between the fields incurs many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their recording artist is likely to be very inefficient..

    It would probably be more efficient to manually maintain the table as a form of an index instead of using the built-in index. For columns containing unique data, it is sometimes fine performance-wise to use an index for convenience, as long as the query volume to the table having an indexed column is moderate and not under constant load.

    Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense. Each value in the index becomes a single row in the index, resulting in a huge row for all the false values, for example. Indexing a multitude of indexed columns having foo = true and foo = false is not useful.

    It's normal for Cassandra data modelling to have a denormalized data.