cassandra, datastax-enterprise, cassandra-3.0

Which is better to use in Cassandra: TTL or DELETE?


I want to remove records from a Cassandra cluster after a particular time. Should I use a TTL or delete them manually?


Solution

  • It depends on your data model. Fortunately, because TTLs expire predictably, you can build your data model to accommodate them.

    Let's say I want to track user requests to a REST service, for example. Suppose that I really only care about the last week's worth of data, so I'll set a TTL of 604800 seconds (7 days × 86400 seconds). The query I need to support is basically this (querying transactions for user 'Bob' for the prior 7 days):

    SELECT * FROM rest_transactions_by_user 
      WHERE username='Bob' AND transaction_time > '2018-05-28 13:41';
    

    To support that query, I'll build this table:

    CREATE TABLE rest_transactions_by_user (
      username TEXT,
      transaction_time TIMESTAMP,
      service_name TEXT,
      HTTP_result BIGINT,
      PRIMARY KEY (username,transaction_time))
      WITH CLUSTERING ORDER BY (transaction_time DESC)
      AND gc_grace_seconds = 864000      
      AND default_time_to_live = 604800;
    

    A few things to note:

    • I am leaving gc_grace_seconds at the default of 864000 (ten days). This will ensure that the TTL tombstones will have adequate time to be propagated throughout the cluster.
    • Rows will TTL at 7 days (as mentioned above). After that, they remain as tombstones for at least an additional 10 days (gc_grace_seconds), until compaction can purge them.
    • I am clustering by transaction_time in DESCending order. This puts the rows I care about (the ones that haven't TTL'd) at the "top" of my partition (sequentially).
    • By querying for a transaction_time of the prior 7 days, I am ignoring anything older than that. As my TTL tombstones will exist for 10 days afterward, they will be at the "bottom" of my partition.
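
    To see the TTL at work, here is a minimal sketch of writing to and inspecting this table. The service name 'getUser', the result code 200, and the timestamps are made-up values for illustration; the 1-day override in the second INSERT is also just an assumption to show that a per-write TTL can replace the table default.

    ```cql
    -- No USING TTL clause: the row inherits default_time_to_live (604800 s).
    INSERT INTO rest_transactions_by_user
      (username, transaction_time, service_name, HTTP_result)
      VALUES ('Bob', '2018-06-04 13:41', 'getUser', 200);

    -- Hypothetical override: this row expires after 1 day instead of 7.
    INSERT INTO rest_transactions_by_user
      (username, transaction_time, service_name, HTTP_result)
      VALUES ('Bob', '2018-06-04 13:42', 'getUser', 200)
      USING TTL 86400;

    -- TTL() reports the seconds remaining on a non-primary-key column.
    SELECT transaction_time, TTL(service_name)
      FROM rest_transactions_by_user
      WHERE username = 'Bob';
    ```

    Note that TTL() only works on regular columns, which is why the sketch asks for service_name rather than transaction_time.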

    In this way, limiting my query to the last 7 days ensures that Cassandra will never have to deal with the tombstones, as my query will never find them. So in this case, I have built a data model where a TTL is "better" than a random delete.
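
    For comparison, this is roughly what the manual alternative would look like. It is only a sketch: the cutoff timestamp is an example value, and the application would have to schedule these statements itself. As far as I know, range deletes on a clustering column are supported from Cassandra 3.0 onward.

    ```cql
    -- Single-row delete: writes one row tombstone immediately.
    DELETE FROM rest_transactions_by_user
      WHERE username = 'Bob' AND transaction_time = '2018-05-28 13:41';

    -- Range delete on the clustering column: writes a range tombstone
    -- covering everything older than the cutoff in Bob's partition.
    DELETE FROM rest_transactions_by_user
      WHERE username = 'Bob' AND transaction_time < '2018-05-28 13:41';
    ```

    Either way the delete is an extra write that produces a tombstone, whereas with the TTL the expiry is attached to the original write and needs no follow-up work from the application.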