cassandra data-modeling bigtable compound-key

Why use a compound clustered key in Cassandra tables?


Why might one want to use a clustered index in a Cassandra table?

For example, in a table like this:

CREATE TABLE blah (
  key text,
  a text,
  b timestamp,
  c double,
  PRIMARY KEY ((key), a, b, c)
)

The clustered part is the a, b, c part of the PRIMARY KEY.

What are the benefits? What considerations are there?


Solution

  • Clustering keys do three main things.

    1) They affect the available query pattern of your table.

    2) They determine the on-disk sort order of your table.

    3) They determine the uniqueness of your primary key.
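
    The third point is easy to overlook, so here is a minimal sketch of it. The sensor_readings table and its values are hypothetical and exist only for this illustration. Because an INSERT in Cassandra is an upsert, writing a second row with the same partition key and the same clustering key replaces the first one rather than adding a duplicate.

    ```cql
    -- Hypothetical table, used only to illustrate primary-key uniqueness.
    CREATE TABLE IF NOT EXISTS sensor_readings (
      sensorid text,
      readingtime timestamp,
      value double,
      PRIMARY KEY (sensorid, readingtime)
    );

    INSERT INTO sensor_readings (sensorid, readingtime, value)
    VALUES ('s1', '2015-06-01 12:00:00+0000', 20.5);

    -- Same partition key and same clustering key: overwrites the row above.
    INSERT INTO sensor_readings (sensorid, readingtime, value)
    VALUES ('s1', '2015-06-01 12:00:00+0000', 21.0);

    -- Different clustering key: stored as a second row in the same partition.
    INSERT INTO sensor_readings (sensorid, readingtime, value)
    VALUES ('s1', '2015-06-01 12:05:00+0000', 21.3);

    -- Returns two rows for sensor 's1', not three.
    SELECT * FROM sensor_readings WHERE sensorid = 's1';
    ```

    If you need to keep every write, the clustering key has to contain something that differs between writes, such as a timestamp or a timeuuid.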

    Let's say that I run an ordering system and want to store product data on my website. Additionally, I have several distribution centers, as well as customer-contracted pricing. So when a certain customer is on my site, they can only access products that are:

    • Available in a distribution center (DC) in their geographic area.

    • Defined in their contract (so they may not necessarily have access to all products in a DC).

    To keep track of those products, I'll create a table that looks like this:

    CREATE TABLE customerDCProducts (
      customerid text,
      dcid text,
      productid text,
      productname text,
      productPrice int,
      PRIMARY KEY (customerid, dcid, productid));
    

    For this example, if I want to see product 123, in DC 1138, for customer B-26354, I can use this query:

    SELECT * FROM customerDCProducts
    WHERE customerid='B-26354' AND dcid='1138' AND productid='123';
    

    Maybe I want to see products available in DC 1138 for customer B-26354:

    SELECT * FROM customerDCProducts 
    WHERE customerid='B-26354' AND dcid='1138';
    

    And maybe I just want to see all products in all DCs for customer B-26354:

    SELECT * FROM customerDCProducts 
    WHERE customerid='B-26354';
    

    As you can see, the clustering keys of dcid and productid allow me to run high-performing queries on my partition key (customerid) that are as focused as I may need.

    The drawback? If I want to query all products for a single DC, regardless of customer, I cannot. I'll need to build a different query table to support that. Even if I want to query just one product, I can't unless I also provide a customerid and dcid, because Cassandra expects the partition key first and then the clustering columns in the order they were declared.
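
    As a rough sketch of what such a query table could look like, the example below partitions the same data by dcid instead. The table name productsByDC and its key layout are my own assumptions for illustration; the right layout depends on the queries you actually need to serve.

    ```cql
    -- Hypothetical query table: same columns as customerDCProducts,
    -- but partitioned by distribution center instead of by customer.
    CREATE TABLE productsByDC (
      dcid text,
      productid text,
      customerid text,
      productname text,
      productPrice int,
      -- customerid stays in the key so that each customer's contracted
      -- row for a product remains a distinct row.
      PRIMARY KEY (dcid, productid, customerid)
    );

    -- All products in DC 1138, regardless of customer.
    SELECT * FROM productsByDC
    WHERE dcid='1138';

    -- One product in DC 1138, across all customers.
    SELECT * FROM productsByDC
    WHERE dcid='1138' AND productid='123';
    ```

    The trade-off is that the application now has to write each product row to both tables and keep them in sync.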

    What if I want my data ordered a certain way? For this example, I'll take a cue from Patrick McFadin's article on Getting Started With Time Series Data Modeling, and build a table to keep track of the latest temperatures for weather stations.

    CREATE TABLE latestTemperatures (
      weatherstationid text,
      eventtime timestamp,
      temperature text,
      PRIMARY KEY (weatherstationid, eventtime)
    ) WITH CLUSTERING ORDER BY (eventtime DESC);
    

    By clustering on eventtime, and specifying a DESCending ORDER BY, I can query the recorded temperatures for a particular station like this:

    SELECT * FROM latestTemperatures 
    WHERE weatherstationid='1234ABCD';
    

    When those values are returned, they will be in DESCending order by eventtime.
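
    A few variations on that query show what the clustering order buys you. This is only a sketch against the latestTemperatures table above; the timestamp values are made up for illustration.

    ```cql
    -- Most recent reading only: with DESC clustering, the newest row comes first.
    SELECT * FROM latestTemperatures
    WHERE weatherstationid='1234ABCD'
    LIMIT 1;

    -- A time slice within the partition (illustrative timestamps).
    SELECT * FROM latestTemperatures
    WHERE weatherstationid='1234ABCD'
      AND eventtime >= '2013-04-03 07:00:00+0000'
      AND eventtime <  '2013-04-03 08:00:00+0000';

    -- Reverse the stored order at query time, oldest first.
    SELECT * FROM latestTemperatures
    WHERE weatherstationid='1234ABCD'
    ORDER BY eventtime ASC;
    ```

    All three stay inside a single partition, which is why Cassandra can answer them from data that is already sorted on disk.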

    Of course, the one question that everyone with an RDBMS background (so yes, everyone) asks is how to query all results ordered by eventtime. And again, you cannot. You can query for all rows by omitting the WHERE clause, but the rows will come back ordered by the hashed token of the partition key, not by eventtime. It's important to remember that Cassandra can only enforce clustering order within a partition. If you don't restrict the query to a partition key, your data will not be ordered (at least, not in the way that you want it to be).
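
    One common way around this, sketched here under my own assumptions, is to pick a partition key that groups together the rows you want sorted. The table below uses a hypothetical day bucket as the partition key, so all readings for one day, from every station, share a partition and are ordered by eventtime inside it. The table name, the day column, and its string format are illustrative choices, not something from the original schema.

    ```cql
    -- Hypothetical table: one partition per day, ordered by eventtime within it.
    CREATE TABLE temperaturesByDay (
      day text,                 -- assumed bucket format, e.g. '2013-04-03'
      eventtime timestamp,
      weatherstationid text,
      temperature text,
      -- weatherstationid is part of the key so that two stations reporting
      -- at the same instant do not overwrite each other.
      PRIMARY KEY (day, eventtime, weatherstationid)
    ) WITH CLUSTERING ORDER BY (eventtime DESC, weatherstationid ASC);

    -- All readings for one day, newest first, across all stations.
    SELECT * FROM temperaturesByDay
    WHERE day='2013-04-03';
    ```

    The ordering still only holds within each day's partition, and a busy bucket can grow into a large, hot partition, so the bucket size needs to match your write volume.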

    Let me know if you have any additional questions, and I'll be happy to explain.