Tags: performance, cassandra, where-clause, cassandra-3.0, tombstone

Do tombstones in Cassandra slow down queries even when not selected by the WHERE clause?


If I have a single partition with 100'000 deleted rows under one clustering key, followed by a second clustering key in the same partition with no deleted rows, will the performance of SELECT * FROM example_table WHERE partition=that_partition AND cluster=the_second_cluster be affected by the tombstones under the_first_cluster?

I'm expecting that if retrieval of a row set with a WHERE clause takes constant time, then Cassandra will just jump past all of the tombstones to the second cluster. However, I don't understand how the WHERE clause locates the correct rows, so I don't know whether that is actually the case, and I didn't manage to find anything online that could enlighten me.

// Example table
CREATE TABLE example_table (
  partition TEXT,
  cluster TEXT,
  value BLOB,

  PRIMARY KEY (partition, cluster)
);

// Example layout of rows in a table
partition      |cluster            |value
that_partition |the_first_cluster  |some_value1 // Deleted, a tombstone
that_partition |the_first_cluster  |some_value2 // Deleted, a tombstone
... 99'997 more similar tombstone rows
that_partition |the_first_cluster  |some_value  // Deleted, a tombstone
that_partition |the_second_cluster |some_valueA // Not a tombstone
that_partition |the_second_cluster |some_valueB // Not a tombstone
... no tombstones in the_second_cluster

Solution

  • A lot of tombstones in a partition will impact performance significantly IF they are included in the read. A good write-up, https://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets, talks about it. Depending on the query, Cassandra may end up reading all 100,000 tombstones, and possibly also the original data if it lives in a different sstable, just to satisfy the query. That generates a lot of garbage on the heap, which affects the JVM's GC, and costs a significant amount of CPU and IO for a single query.

    However, if the tombstones are point deletes rather than range tombstones, and your query goes directly to the partition + clustering key of a row that was not deleted, you will be OK. It's a fine line, though, and I would recommend not relying on it (what if someone reads the whole partition outside the app as an ops/test task? It could cause long GC pauses and negatively impact the cluster). Range tombstones kept in the partition index are deserialized as part of working out where to jump to within the column index of the row, so even if you are not directly reading them they can still significantly impact the allocation rate, depending on how the tombstone was inserted. (See the delete sketch below for what a point delete versus a range delete looks like.)

    There are tombstone warn and failure thresholds in cassandra.yaml (tombstone_warn_threshold and tombstone_failure_threshold) that will let you know when your queries are hitting tombstones, but since the warning is only reported in the logs it can be hard to notice until you reach the failure threshold and queries start dying.

    I would recommend you time-box your partitions (for example by including a time bucket in the partition key) to limit the number of tombstones in each one; see the time-boxed schema sketch below.
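
    As a rough sketch against the question's example_table (the string literals are just placeholders), this is the difference between a point delete and a range delete referred to above:

    // Point delete: the full primary key (partition + clustering) is
    // specified, so a single row tombstone is written
    DELETE FROM example_table
    WHERE partition = 'that_partition' AND cluster = 'the_first_cluster';

    // Range delete (Cassandra 3.0+): only a slice of clustering values is
    // specified, so a range tombstone covering that slice is written
    DELETE FROM example_table
    WHERE partition = 'that_partition'
      AND cluster >= 'the_first_cluster' AND cluster < 'the_second_cluster';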
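
    And one possible way to time-box the partitions, again just a sketch (the day bucket column and its granularity are assumptions, pick whatever matches your delete pattern):

    // Time-boxed variant of the table: a day bucket is part of the
    // partition key, so each day's tombstones stay in their own partition
    CREATE TABLE example_table_by_day (
      day DATE,
      partition TEXT,
      cluster TEXT,
      value BLOB,
      PRIMARY KEY ((day, partition), cluster)
    );

    // Reads then address one time box at a time
    SELECT * FROM example_table_by_day
    WHERE day = '2017-01-01'
      AND partition = 'that_partition'
      AND cluster = 'the_second_cluster';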