cassandra

Why does my read operation go to SSTable when updated data is in Memtable?


I have data in the format of (id, data), such as (1, "someDataS").

  • Initially, when I insert data, it is stored in the Memtable, and reads pull directly from the Memtable.
  • After more data is inserted, it flushes to the SSTable. At this point, reads start retrieving the data from the SSTable, which makes sense.

However, I’m confused about what happens after updating older data that is already in the SSTable.

For example, if I update a data item that is currently in the SSTable, I expect the Memtable to hold the new version, while the older version remains in the SSTable. But when I perform a read after this update, it still checks the SSTable, even though a newer version should be in the Memtable.

Question: Why doesn’t the read operation return the updated data directly from the Memtable, where the latest version is stored? Is there a reason it still checks the SSTable?

I used the query tracing feature to debug this. It led me to believe the relevant code is in the following file: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java

More specifically, the "queryMemtableAndSSTablesInTimestampOrder" method. To me, it looks like it always checks the SSTable.


Solution

  • In the general case, it is not possible to tell from the memtable alone that the sstable contains nothing you need.

    Examples:

    • The memtable contains a subset of the columns, the others were set in a previous operation.
    • The memtable could contain part of an unfrozen collection, and the sstable has the rest of the collection.
    • The sstable can hold data with a future write timestamp that supersedes the data in the memtable (written using the USING TIMESTAMP syntax).
    • The sstable can hold a tombstone with a future timestamp, which marks the data in the memtable as deleted (again written using the USING TIMESTAMP syntax).

    The last two in particular rule out any micro-optimisation based on the table schema: even if the schema eliminates the first two scenarios, a read still cannot safely skip the sstable.
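
To make the scenarios above concrete, here is a minimal Python sketch of per-column, last-write-wins reconciliation. It is a toy model, not Cassandra's actual code: the names Cell, merge_rows and visible, the "name" column, and the timestamps are all made up for illustration, a tombstone is modelled as a cell whose value is None, and ties between equal timestamps are ignored.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Cell:
    value: Optional[str]  # None models a tombstone (deletion marker)
    timestamp: int        # write timestamp; larger means newer


Row = Dict[str, Cell]


def merge_rows(sources: Iterable[Row]) -> Row:
    """Keep, per column, the cell with the highest write timestamp."""
    merged: Row = {}
    for source in sources:
        for column, cell in source.items():
            current = merged.get(column)
            if current is None or cell.timestamp > current.timestamp:
                merged[column] = cell
    return merged


def visible(row: Row) -> Dict[str, str]:
    """Drop tombstoned cells, leaving only live column values."""
    return {col: cell.value for col, cell in row.items() if cell.value is not None}


# The memtable holds the recent update to "data" (hypothetical timestamp 20).
memtable = {"data": Cell("newData", 20)}

# Subset of columns: "name" (hypothetical column) lives only in the sstable.
sstable = {"data": Cell("someDataS", 10), "name": Cell("alice", 10)}
partial_update = visible(merge_rows([memtable, sstable]))

# Future write timestamp: the sstable cell is newer than the memtable cell.
sstable = {"data": Cell("fromTheFuture", 100)}
future_write = visible(merge_rows([memtable, sstable]))

# Future tombstone: the sstable deletion shadows the memtable cell.
sstable = {"data": Cell(None, 100)}
future_delete = visible(merge_rows([memtable, sstable]))
```

In this model, reading only the memtable would return "newData" in all three cases, which is incomplete in the first case and wrong in the other two. The result is only known after both sources have been merged.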