Search code examples
cassandrarediswal

Redis AOF fsync (ALWAYS) vs. LSM tree


My understanding of log structured merge trees (LSM trees) is that it takes advantage of the fact that appending to disk is very fast (since it requires no seeks) by just appending the update to a write-ahead log and returning to the client. My understanding was that this still provides immediate persistence, while still being extremely fast.

Redis, which I don't think uses LSM trees, seems to have a mode where you can AOF+fsync on every write. https://redis.io/topics/latency . The documentation says:

AOF + fsync always: this is very slow, you should use it only if you know what you are doing.

I'm confused why this would be very slow, since in principle you are still only appending to a file on every update, just like LSM-tree databases like Cassandra are doing.


Solution

  • LSM is AOF that you want to actually read sometimes. You do some overhead work so you can read it faster later. Redis is designed so you never or only in a special case read it. On the other hand, Cassandra often reads it to serve requests.

    And what Redis calls slow is actually very very fast for a db like Cassandra.

    ============================ UPDATE

    It turns out, I jumped into conclusions too early. From design standpoint everything above is true, but implementation differ so much. Despite Cassandra claiming absolute durability, it does not fsync on each transaction and there is no way to force it do so (but each transaction could be fsynced). The best I could do is 'fsync in batch mode at least 1ms after previous fsync'. It means for 4 thread benchmark I was using it was doing 4 writes per fsync and threads was waiting for fsync to be done. On the other hand, Redis did fsync on every write, so 4 times more often. With addition of more threads and more partitions of the table, Cassandra could win even bigger. But note, that the use case you described is not typical. And other architectural differences (Cassandra is good at partitioning, Redis is good at counters, LUA and other) still apply.

    Numbers:

    Redis command: set(KEY + (tstate.i++), TEXT);

    Cassandra command: execute("insert into test.test (id,data) values (?,?)", state.i++, TEXT)

    Where TEXT = "Wake up, Neo. We have updated our privacy policy."

    Redis fsync every sec, HDD

    Benchmark              (address)   Mode  Cnt      Score      Error  Units
    LettuceThreads.shared  localhost  thrpt   15  97535.900 ± 2188.862  ops/s
    
      97535.900 ±(99.9%) 2188.862 ops/s [Average]
      (min, avg, max) = (94460.868, 97535.900, 100983.563), stdev = 2047.463
      CI (99.9%): [95347.038, 99724.761] (assumes normal distribution)
    

    Redis fsync every write, HDD

    Benchmark              (address)   Mode  Cnt   Score   Error  Units
    LettuceThreads.shared  localhost  thrpt   15  48.862 ± 2.237  ops/s
    
      48.862 ±(99.9%) 2.237 ops/s [Average]
      (min, avg, max) = (47.912, 48.862, 56.351), stdev = 2.092
      CI (99.9%): [46.625, 51.098] (assumes normal distribution)
    

    Redis, fsync every write, NVMe (Samsung 960 PRO 1tb)

    Benchmark              (address)   Mode  Cnt    Score   Error  Units
    LettuceThreads.shared     remote  thrpt   15  449.248 ± 6.475  ops/s
    
      449.248 ±(99.9%) 6.475 ops/s [Average]
      (min, avg, max) = (441.206, 449.248, 462.817), stdev = 6.057
      CI (99.9%): [442.773, 455.724] (assumes normal distribution)
    

    Cassandra, fsync every sec,HDD

    Benchmark                  Mode  Cnt      Score     Error  Units
    CassandraBenchMain.write  thrpt   15  12016.250 ± 601.811  ops/s
    
      12016.250 ±(99.9%) 601.811 ops/s [Average]
      (min, avg, max) = (10237.077, 12016.250, 12496.275), stdev = 562.935
      CI (99.9%): [11414.439, 12618.062] (assumes normal distribution)
    

    Cassandra, fsync every batch, but wait at least 1ms, HDD

    Benchmark                  Mode  Cnt    Score   Error  Units
    CassandraBenchMain.write  thrpt   15  195.331 ± 3.695  ops/s
    
      195.331 ±(99.9%) 3.695 ops/s [Average]
      (min, avg, max) = (186.963, 195.331, 199.312), stdev = 3.456
      CI (99.9%): [191.637, 199.026] (assumes normal distribution)