Tags: apache-kafka, cassandra, apache-storm, cassandra-3.0

Copying data from one Cassandra table to another with TTL


We are changing the partition key of one of our tables by removing one column from it. Every record in this table also has a TTL. We want to preserve the data in that table along with its TTL. How can we do it?

We can create a new table with the desired schema and then copy the data from the old table to the new one. However, we lose the TTL in this process.

For further context: this Cassandra table is populated by an Apache Storm application that reads events from Kafka. We could re-hydrate the Kafka messages, but the topic also contains some unwanted messages that we don't want to process.

NOTE - the TTL is derived from a date column value that never changes. Because of this, the TTL is always the same for all columns of a row.


Solution

  • Before going to a specific implementation, it's important to understand that TTL exists at the level of the individual cell, so different cells of the same row may carry different TTLs, or all of them may share one. When you perform an INSERT or UPDATE operation, you can apply only one TTL value to all columns specified in the query, so if you have 2 columns with different TTLs, you'll need to perform 2 queries - one per column, each with its own TTL (see the CQL sketch after this list).

    Regarding the tooling, there are 2 more-or-less ready-to-use options here:

    • Use DSBulk. This approach is described in detail in example 30.1 of this blog post. Basically, you unload the data to disk using a query that extracts the column values and their TTLs, and then load the data by generating batches that set a separate TTL for each column. From the example:
    dsbulk unload -h localhost -query \
      "SELECT id, petal_length, WRITETIME(petal_length) AS w_petal_length, TTL(petal_length) AS l_petal_length, .... FROM dsbulkblog.iris_with_id" \
      -url /tmp/dsbulkblog/migrate
    dsbulk load -h localhost -query \
      "BEGIN BATCH INSERT INTO dsbulkblog.iris_with_id(id, petal_length) VALUES (:id, :petal_length) USING TIMESTAMP :w_petal_length AND TTL :l_petal_length; ... APPLY BATCH;" \
      -url /tmp/dsbulkblog/migrate --batch.mode DISABLED
    
    • Use the Spark Cassandra Connector - it supports reading & writing data with TTL & WriteTime, but you'll need to develop the code that does this and correctly handle things such as collections, static columns, etc. (or wait for SPARKC-596 to be implemented). A minimal sketch is shown below.
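
To illustrate the point from the first bullet about TTL granularity, here is a minimal CQL sketch (the table and column names are hypothetical): a single INSERT or UPDATE applies one TTL to every column it sets, so preserving two different TTLs takes two statements.

    -- Hypothetical table: ks.tbl (pk int PRIMARY KEY, col_a text, col_b text)
    -- One statement applies a single TTL to all columns it sets:
    INSERT INTO ks.tbl (pk, col_a, col_b) VALUES (1, 'a', 'b') USING TTL 3600;

    -- To keep different TTLs on col_a and col_b, split the write in two:
    INSERT INTO ks.tbl (pk, col_a) VALUES (1, 'a') USING TTL 3600;
    UPDATE ks.tbl USING TTL 7200 SET col_b = 'b' WHERE pk = 1;

    -- TTL(column) returns the remaining time-to-live of a cell:
    SELECT pk, TTL(col_a), TTL(col_b) FROM ks.tbl WHERE pk = 1;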
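
For the Spark Cassandra Connector option, below is a minimal Scala sketch, not a definitive implementation. It leans on the simplification from the question's NOTE (all columns of a row share one TTL), assumes a hypothetical schema (ks.old_table and ks.new_table with pk, date and value columns), and does not handle collections or static columns. It reads the remaining TTL of one regular column and writes it back with TTLOption.perRow:

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.writer.{TTLOption, WriteConf}
    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical shape of the new table, plus the carried-over TTL.
    case class NewRow(pk: String, date: String, value: String, ttl: Int)

    object MigrateWithTtl {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("migrate-with-ttl"))

        // Read the old table together with the remaining TTL of one regular
        // column; per the question's NOTE, all columns share the same TTL,
        // and every record is assumed to actually carry one (getInt would
        // fail on a null TTL).
        val rows = sc.cassandraTable("ks", "old_table")
          .select("pk", "date", "value", "value".ttl as "ttl")
          .map(r => NewRow(r.getString("pk"), r.getString("date"),
                           r.getString("value"), r.getInt("ttl")))

        // Write into the new table, applying each row's original TTL.
        // TTLOption.perRow("ttl") takes the value from the case class field;
        // that field is not itself written as a table column.
        rows.saveToCassandra("ks", "new_table",
          writeConf = WriteConf(ttl = TTLOption.perRow("ttl")))
      }
    }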