How do Cassandra's handling of updates and cluster keys interact?
It strikes me that these two features might interact badly, causing generation of excessive garbage.
Consider this schema:
CREATE TABLE t (
  p int,
  c int,
  d text,
  PRIMARY KEY ((p), c)
);
After execution of the following insertions:
INSERT INTO t (p, c, d) VALUES (1, 1, 'text-1');
INSERT INTO t (p, c, d) VALUES (1, 2, 'text-2');
is there a tombstone-marked record holding the (1, 1, 'text-1') data and a new record holding both the (1, 1, 'text-1') and (1, 2, 'text-2') data? That is, has the second insert been implemented as an update of the "real" record whose partition key (p) is 1?
Your assumption is incorrect. In your schema, p is the partition (or "row") key, and c is a clustering column. Cassandra is a wide-column store, so a write is essentially a collection of sparse, ordered columns attached to a partition. Additional nesting can be achieved by creating composite row keys and column names, which in your case translates to a storage model that looks like this:
Row Key: 1 =>
1:d => "text-1"
2:d => "text-2"
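The layout above can be sketched as a toy model in Python (plain dictionaries; the names are illustrative, not Cassandra internals):

```python
# Toy model of a partition's layout: each partition key maps to a set
# of cells keyed by (clustering value, column name).
table = {}

def insert(p, c, d):
    # An INSERT just adds (or overwrites) a single cell; it does not
    # rewrite or tombstone the partition's existing cells.
    table.setdefault(p, {})[(c, "d")] = d

insert(1, 1, "text-1")
insert(1, 2, "text-2")

# Partition 1 now holds two independent cells:
# {(1, 'd'): 'text-1', (2, 'd'): 'text-2'}
```

This is why the second insert generates no garbage: it lands as a new cell alongside the first, not as an update of a pre-existing row.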
If you were to insert another partition key, like this:
INSERT INTO t (p, c, d) VALUES (2, 1, 'text-1');
your storage model would look like this:
Row Key: 1 =>
1:d => "text-1"
2:d => "text-2"
Row Key: 2 =>
1:d => "text-1"
So you can observe that these column values (1:d, 2:d, etc.) are treated independently. Suppose you then delete one of those values:
DELETE FROM t WHERE p = 1 AND c = 1;
your result would be:
Row Key: 1 =>
1:d => "text-1" + [tombstone]
2:d => "text-2"
Row Key: 2 =>
1:d => "text-1"
where the tombstone has a greater timestamp and therefore "covers" the original value until compaction cleans it up. When exactly that happens depends on a number of factors (the value of gc_grace_seconds, the compaction strategy, the workload, etc.).
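The timestamp-resolution and compaction behavior can be sketched with another toy model (again, illustrative names only, not Cassandra internals). Each cell key holds a list of versions, as if they lived in different SSTables; reads resolve by highest timestamp, and the tombstone wins until compaction purges it:

```python
GC_GRACE_SECONDS = 864000  # Cassandra's default gc_grace_seconds

# (clustering, column) -> list of (timestamp, value, is_tombstone)
cells = {}

def write(c, d, ts):
    cells.setdefault((c, "d"), []).append((ts, d, False))

def delete(c, ts):
    # A delete appends a tombstone with its own timestamp; the older
    # live value is not removed yet.
    cells.setdefault((c, "d"), []).append((ts, None, True))

def read(c):
    versions = cells.get((c, "d"), [])
    if not versions:
        return None
    ts, value, dead = max(versions)  # last write wins
    return None if dead else value

def compact(now):
    # Compaction merges versions: everything shadowed by the newest
    # version is dropped, and the tombstone itself is purged only
    # after gc_grace_seconds has elapsed.
    for key, versions in list(cells.items()):
        ts, value, dead = max(versions)
        if dead and now - ts > GC_GRACE_SECONDS:
            del cells[key]  # tombstone (and shadowed data) purged
        else:
            cells[key] = [(ts, value, dead)]

write(1, "text-1", ts=100)
write(2, "text-2", ts=100)
delete(1, ts=200)
assert read(1) is None          # tombstone covers 'text-1'
assert read(2) == "text-2"      # untouched cell is unaffected
compact(now=200 + GC_GRACE_SECONDS + 1)
assert (1, "d") not in cells    # deleted cell gone after compaction
```

Note that only the explicitly deleted cell carries a tombstone; the other cells in the partition are never rewritten or covered.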