Search code examples
cassandrastoragerocksdbleveldblsm-tree

How exactly does the memtable flush to an SSTable on disk in LSM-trees?


Implementation wise, how exactly does the memtable (in Cassandra, RocksDB, LevelDB, or any LSM-tree) flush to an SSTable?

I get that a memtable is some sorted data structured, like a red-black tree, but how do we turn that into a file of sorted key/value pairs? Do we iterate through the tree from the smallest key to the largest tree in a for-loop and insert the data one by one into an memory buffer (in SSTable format), and then write that to disk? Do we use some sort of tree-serialize method (if so, how is that still in SSTable format)? Can we just use a min-heap for the memtable and when flushing, keep getting the min-element and adding it to our array to flush?

I'm trying to understand the super specific details. I was looking at this file but was having a hard time understanding it: https://github.com/facebook/rocksdb/blob/fbfcf5cbcd3b09b6de0924d3c52a744a626135c0/db/flush_job.cc


Solution

  • You are correct.
    The memtable is looped over from smallest to largest and written out to file.

    In practicality there are other things as well written to the file but the foundation of the file is the section that has all the keys that were previously in the memtable. Such as bloom filters, seek sparse indices, and other metadata such as count, max key, min key

    You don't need a minheap. As the data is already sorted in the skiplist