
Are there any major disadvantages to having multiple clustering columns in cassandra?


I'm designing a Cassandra table where I need to be able to retrieve rows by their geohash. I have something that works, but I'd like to avoid range queries more than I'm currently able to.

The current table schema is below, with geo_key containing the first five characters of the geohash string. I query using the geo_key, then range-filter on the full geohash, which lets me prefix-search on any geohash prefix of five or more characters:

CREATE TABLE georecords (geo_key text, geohash text, data text, PRIMARY KEY (geo_key, geohash))
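
For example, a prefix search for everything starting with a (hypothetical) 6-character prefix 'dp89vc' currently looks roughly like this, where the upper bound just bumps the last character of the prefix:

SELECT data FROM georecords
 WHERE geo_key = 'dp89v'
   AND geohash >= 'dp89vc' AND geohash < 'dp89vd';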

My idea is that I could instead store the characters of the geohash as separate columns, allowing me to specify as many characters as I want when doing a prefix match on the geohash. My concern is what impact using multiple clustering columns might have:

CREATE TABLE georecords (g1 text, g2 text, g3 text, g4 text, g5 text, g6 text, g7 text, g8 text, geohash text, pid int, data text, PRIMARY KEY (g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid))
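
With that layout, a prefix match of any length becomes a chain of equality filters on the leading columns. For example, a 3-character prefix search (hypothetical values) would be:

SELECT data FROM georecords
 WHERE g1 = 'd' AND g2 = 'p' AND g3 = '8';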

(I'm not really concerned about the cardinality of the partition key - g1 would have a minimum of 30 values, and I have other workarounds for it as well.)

Other than the cardinality of the partition key and the extra storage requirements, what should I be aware of if I used the many-clustering-column approach?


Solution

  • Other than the cardinality of the partition key and the extra storage requirements, what should I be aware of if I used the many-clustering-column approach?

    This seemed like an interesting problem to help out with, so I built a few CQL tables of differing PRIMARY KEY structure and options. I then used http://geohash.org/ to come up with a few endpoints, and inserted them.
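
    The inserts looked something like this (one row shown; I'm assuming an int for pid):

    INSERT INTO georecords3 (g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid, data)
    VALUES ('d', 'p', '8', '9', 'v', 'c', 'n', 'e', 'dp89vcnem4n', 1001, 'Beloit, WI');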

    aploetz@cqlsh:stackoverflow> SELECT g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid, data FROm georecords3;
    
     g1 | g2 | g3 | g4 | g5 | g6 | g7 | g8 | geohash      | pid  | data
    ----+----+----+----+----+----+----+----+--------------+------+---------------
      d |  p |  8 |  9 |  v |  c |  n |  e |  dp89vcnem4n | 1001 |    Beloit, WI
      d |  p |  8 |  c |  p |  w |  g |  v |    dp8cpwgv3 | 1003 |   Harvard, IL
      d |  p |  c |  8 |  g |  e |  k |  t | dpc8gektg8w7 | 1002 | Sheboygan, WI
      9 |  x |  j |  6 |  5 |  j |  5 |  1 |    9xj65j518 | 1004 |    Denver, CO
    
    (4 rows)
    

    As you know, Cassandra is designed to return data quickly when it is given a specific, precise key. Using multiple clustering columns fits that approach well, because you are helping Cassandra quickly identify the data you wish to retrieve.
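
    For example, pulling back exactly one row from the table above means specifying every key column (values taken from the Beloit row, again assuming an int pid):

    SELECT data FROM georecords3
     WHERE g1='d' AND g2='p' AND g3='8' AND g4='9' AND g5='v' AND g6='c' AND g7='n' AND g8='e'
       AND geohash='dp89vcnem4n' AND pid=1001;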

    The only thing I would think about changing is to see if you can do without either geohash or pid in the PRIMARY KEY. My gut says to get rid of pid, as it really isn't anything that you would query by. The only value it provides is uniqueness, which you will need if you plan on storing the same geohash multiple times.

    Including pid in your PRIMARY KEY leaves you with one non-key column, and that allows you to use the WITH COMPACT STORAGE directive. Really, the only true edge that gets you is in saving disk space, as the clustering column names are not stored with the value. This becomes apparent when looking at the table from within the cassandra-cli tool:

    Without compact storage:

    [default@stackoverflow] list georecords3;
    Using default limit of 100
    Using default cell limit of 100
    -------------------
    RowKey: d
    => (name=p:8:9:v:c:n:e:dp89vcnem4n:1001:, value=, timestamp=1428766191314431)
    => (name=p:8:9:v:c:n:e:dp89vcnem4n:1001:data, value=42656c6f69742c205749, timestamp=1428766191314431)
    => (name=p:8:c:p:w:g:v:dp8cpwgv3:1003:, value=, timestamp=1428766191382903)
    => (name=p:8:c:p:w:g:v:dp8cpwgv3:1003:data, value=486172766172642c20494c, timestamp=1428766191382903)
    => (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002:, value=, timestamp=1428766191276179)
    => (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002:data, value=536865626f7967616e2c205749, timestamp=1428766191276179)
    -------------------
    RowKey: 9
    => (name=x:j:6:5:j:5:1:9xj65j518:1004:, value=, timestamp=1428766191424701)
    => (name=x:j:6:5:j:5:1:9xj65j518:1004:data, value=44656e7665722c20434f, timestamp=1428766191424701)
    
    2 Rows Returned.
    Elapsed time: 217 msec(s).
    

    With compact storage:

    [default@stackoverflow] list georecords2;
    Using default limit of 100
    Using default cell limit of 100
    -------------------
    RowKey: d
    => (name=p:8:9:v:c:n:e:dp89vcnem4n:1001, value=Beloit, WI, timestamp=1428765102994932)
    => (name=p:8:c:p:w:g:v:dp8cpwgv3:1003, value=Harvard, IL, timestamp=1428765717512832)
    => (name=p:c:8:g:e:k:t:dpc8gektg8w7:1002, value=Sheboygan, WI, timestamp=1428765102919171)
    -------------------
    RowKey: 9
    => (name=x:j:6:5:j:5:1:9xj65j518:1004, value=Denver, CO, timestamp=1428766022126266)
    
    2 Rows Returned.
    Elapsed time: 39 msec(s).
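
    For reference, a COMPACT STORAGE version of the same table (like georecords2 above) would be declared something like this; the column types here are my assumption:

    CREATE TABLE georecords2 (
        g1 text, g2 text, g3 text, g4 text, g5 text, g6 text, g7 text, g8 text,
        geohash text, pid int, data text,
        PRIMARY KEY (g1, g2, g3, g4, g5, g6, g7, g8, geohash, pid))
    WITH COMPACT STORAGE;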
    

    But, I would recommend against using WITH COMPACT STORAGE for the following reasons:

    • You cannot add or remove columns after table creation.
    • It prevents you from having multiple non-key columns in the table.
    • It was really intended to be used in the old (deprecated) thrift-based approach to column family (table) modeling, and really shouldn't be used/needed anymore.
    • Yes, it saves you disk space, but disk space is cheap so I'd consider this a very small benefit.

    I know you said "other than cardinality of the partition key", but I am going to mention it here anyway. You'll notice in my sample data set, that almost all of my rows are stored with the d partition key value. If I were to create an application like this for myself, tracking geohashes in the Wisconsin/Illinois stateline area, I would definitely have the problem of most of my data being stored in the same partition (creating a hotspot in my cluster). So knowing my use case and potential data, I would probably combine the first three or so columns into a single partition key.
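
    If I did combine them, the definition would look something like this (just a sketch; the table name is hypothetical and the column types are assumed):

    CREATE TABLE georecords_by_prefix (
        g1 text, g2 text, g3 text, g4 text, g5 text, g6 text, g7 text, g8 text,
        geohash text, pid int, data text,
        PRIMARY KEY ((g1, g2, g3), g4, g5, g6, g7, g8, geohash, pid));

    The trade-off is that you would then always have to supply at least the first three characters to query the table.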

    The other issue with storing everything in the same partition is that each partition can store a maximum of about 2 billion columns. So it would also make sense to put some thought into whether or not your data could ever eclipse that mark. And obviously, the higher the cardinality of your partition key, the less likely you are to run into this issue.

    By looking at your question, it appears to me that you have looked at your data and you understand this...definite "plus." And 30 unique values in a partition key should provide sufficient distribution. I just wanted to spend some time illustrating how big of a deal that could be.

    Anyway, I also wanted to add a "nicely done," as it sounds like you are on the right track.

    Edit

    The still unresolved question for me is which approach will scale better, in which situations.

    Scalability is tied more to how many replicas (R) you have across your N nodes. Cassandra scales linearly: the more nodes you add, the more transactions your application can handle. Purely from a data-distribution standpoint, your first model will have a higher-cardinality partition key, so it will distribute much more evenly than the second. However, the first model is much more restrictive in terms of query flexibility.

    Additionally, if you are doing range queries within a partition (which I believe you said you are), then the second model will allow for that in a very performant manner. All data within a partition is stored on the same node, so querying for multiple results with g1='d' AND g2='p' (and so on) will perform extremely well.

    I may just have to play with the data more and run test cases.

    That is a great idea. I think you will find that the second model is the way to go (in terms of query flexibility and querying for multiple rows). If there is a performance difference between the two when it comes to single row queries, my suspicion is that it should be negligible.