cassandra tombstone

Do Cassandra inserts differing in only a cluster key generate tombstones?


How do Cassandra's handling of updates and cluster keys interact?

  • Cassandra never really updates records once written. It marks the old version as deleted using a tombstone and keeps both the old and new versions until the old version is eventually removed by a housekeeping process: a form of garbage collection.
  • Cluster keys are implemented using some magic that records the data in a "real" record that has only a partition key.

It strikes me that these two features might interact badly, causing generation of excessive garbage.

Consider this schema:

 CREATE TABLE t (
    p int,
    c int,
    d text,
    PRIMARY KEY ((p), c)
 );

After execution of the following insertions:

 INSERT INTO t (p, c, d) VALUES (1, 1, 'text-1');
 INSERT INTO t (p, c, d) VALUES (1, 2, 'text-2');

is there a tombstone-marked record holding the (1, 1, "text-1") data and a new record holding both the (1, 1, "text-1") and (1, 2, "text-2") data? That is, has the second insert been implemented as an update of the "real" record that has a partition key (p) of 1?


Solution

  • Your assumption is incorrect: the second INSERT does not rewrite the partition and does not generate a tombstone. In your schema, p is the partition (or "row") key, and c is a clustering column. Cassandra is a wide-column store, so a write is essentially a collection of sparse, ordered columns attached to a partition. Additional nesting is achieved by building composite row keys and column names, which in your case translates to a storage model that looks like this:

    Row Key: 1 =>
      1:d => "text-1"
      2:d => "text-2" 
    

    If you were to insert another partition key, like this:

    INSERT INTO t (p, c, d) VALUES (2, 1, 'text-1');
    

    your storage model would look like this:

    Row Key: 1 =>
      1:d => "text-1"
      2:d => "text-2" 
    Row Key: 2 =>
      1:d => "text-1"
    

    So you can see that these column values (1:d, 2:d, etc.) are treated independently: writing one does not touch the others. Suppose you then delete one of those values:

    DELETE FROM t WHERE p = 1 AND c = 1;
    

    your result would be:

    Row Key: 1 =>
      1:d => "text-1" + [tombstone]
      2:d => "text-2" 
    Row Key: 2 =>
      1:d => "text-1"
    

    where the tombstone has a later timestamp than the original value and therefore "covers" it on reads, until compaction removes both. Exactly when that happens depends on a number of factors (the value of gc_grace_seconds, the compaction strategy, the workload, etc.).
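
The storage model described above can be mimicked with a small Python sketch. This is a toy model under stated assumptions, not how Cassandra is implemented: the ToyTable and Cell names are hypothetical, a logical counter stands in for write timestamps, the delete is modeled as a tombstone on the d cell as in the diagram above, and compact() ignores gc_grace_seconds and the compaction strategy entirely. It only illustrates that inserts with different clustering values create independent cells, and that a tombstone appears only on an explicit delete.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Cell:
    """One written version of a column (hypothetical name, toy model)."""
    value: Optional[str]  # None marks a tombstone
    timestamp: int        # logical write time; the newest cell wins on read

    @property
    def is_tombstone(self) -> bool:
        return self.value is None


class ToyTable:
    """Toy model of table t: partition key p -> {(c, column name): [cells]}."""

    def __init__(self) -> None:
        self.partitions: Dict[int, Dict[Tuple[int, str], List[Cell]]] = {}
        self._clock = 0  # assumption: a counter replaces real timestamps

    def _write(self, p: int, c: int, column: str, value: Optional[str]) -> None:
        # Every write is an append; existing cells are never modified.
        self._clock += 1
        cells = self.partitions.setdefault(p, {}).setdefault((c, column), [])
        cells.append(Cell(value, self._clock))

    def insert(self, p: int, c: int, d: str) -> None:
        self._write(p, c, "d", d)

    def delete(self, p: int, c: int) -> None:
        # A delete is itself a write: it appends a tombstone cell.
        self._write(p, c, "d", None)

    def read(self, p: int, c: int) -> Optional[str]:
        cells = self.partitions.get(p, {}).get((c, "d"), [])
        if not cells:
            return None
        # The cell with the greatest timestamp shadows all older ones.
        return max(cells, key=lambda cell: cell.timestamp).value

    def tombstone_count(self) -> int:
        return sum(
            cell.is_tombstone
            for partition in self.partitions.values()
            for cells in partition.values()
            for cell in cells
        )

    def compact(self) -> None:
        """Keep only the newest cell; drop it too if it is a tombstone."""
        for partition in self.partitions.values():
            for key in list(partition):
                newest = max(partition[key], key=lambda cell: cell.timestamp)
                if newest.is_tombstone:
                    del partition[key]
                else:
                    partition[key] = [newest]


if __name__ == "__main__":
    t = ToyTable()
    t.insert(1, 1, "text-1")
    t.insert(1, 2, "text-2")
    t.insert(2, 1, "text-1")
    print("tombstones after inserts:", t.tombstone_count())
    t.delete(1, 1)
    print("tombstones after delete:", t.tombstone_count())
    print("read (1, 1):", t.read(1, 1))
    t.compact()
    print("tombstones after compaction:", t.tombstone_count())
```

In this model the three inserts from the example leave the tombstone count at zero, because each one lands in its own (clustering value, column) slot. Only the explicit delete appends a tombstone, and it stays until compact() discards it together with the value it shadows.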