Tags: solr, cassandra, opscenter

Deleting Huge Amounts of Data in a Cassandra Cluster


I have a Cassandra cluster with three nodes, holding close to 7 TB of data accumulated over the last 4 years. Because the servers are running low on disk space, we would like to keep only the last 2 years of data. However, we don't want to delete everything older than 2 years: some specific data must be kept even though it is older. Currently I can think of one approach: a Java client using a MutationBatch object. I can fetch all the record keys that fall within the date range, exclude the rows we want to keep, and then delete the remaining records in batches, as in the sketch below. But this solution raises performance concerns, as the volume of data is huge.
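
Roughly, the sketch below is what I have in mind. It assumes Astyanax (where MutationBatch comes from); the events column family, the keysToDelete list (already filtered to exclude the rows we keep), and the batch size are all placeholders:

    import java.util.List;

    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.MutationBatch;
    import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.serializers.StringSerializer;

    public class OldDataCleaner {

        // Hypothetical column family holding the time-series data.
        private static final ColumnFamily<String, String> CF_EVENTS =
                new ColumnFamily<>("events", StringSerializer.get(), StringSerializer.get());

        // Flush in chunks so a single mutation batch doesn't grow unbounded.
        private static final int BATCH_SIZE = 100;

        public static void deleteInBatches(Keyspace keyspace, List<String> keysToDelete)
                throws ConnectionException {
            MutationBatch batch = keyspace.prepareMutationBatch();
            int pending = 0;
            for (String rowKey : keysToDelete) {
                batch.withRow(CF_EVENTS, rowKey).delete(); // whole-row deletion
                if (++pending >= BATCH_SIZE) {
                    batch.execute();                       // send this chunk
                    batch = keyspace.prepareMutationBatch();
                    pending = 0;
                }
            }
            if (pending > 0) {
                batch.execute();                           // send the remainder
            }
        }
    }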

Is it possible to handle this at the server level (OpsCenter)? I read about TTL, but how can I apply it to existing data, and also exempt the specific data that I want to keep even if it is older than 2 years?

Please help me find the best solution.


Solution

  • The main thing that you need to understand is that when you delete data in Cassandra, you don't free space immediately; you actually add data by writing a tombstone, and the removal of the actual data happens later, during compaction.

    So it's very important to perform the deletion correctly. There are different types of deletes: individual cells, rows, ranges of rows, and whole partitions (from least efficient to most efficient in terms of the number of tombstones generated). The best option for you is to delete by partition; the second best is to delete by ranges inside a partition (see the sketch below). There are good articles describing in great detail how data is removed in Cassandra.

    You may need to perform the deletion in several steps so that you don't write too many tombstones at once. You also need to check that you have enough free disk space for compaction.
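
    To make the difference concrete, here is a minimal sketch of a partition delete and a range delete using the DataStax Java driver 4.x. The ks.events table, its sensor_id partition key, and its event_time clustering column are hypothetical, and range deletes of this form require Cassandra 3.0+:

        import java.time.Instant;

        import com.datastax.oss.driver.api.core.CqlSession;
        import com.datastax.oss.driver.api.core.cql.SimpleStatement;

        public class TombstoneFriendlyDeletes {
            public static void main(String[] args) {
                try (CqlSession session = CqlSession.builder().build()) {
                    // Partition delete: a single partition tombstone
                    // covers everything in the partition.
                    session.execute(SimpleStatement.newInstance(
                            "DELETE FROM ks.events WHERE sensor_id = ?",
                            "sensor-42"));

                    // Range delete: a single range tombstone covers a
                    // slice of clustering rows inside the partition.
                    session.execute(SimpleStatement.newInstance(
                            "DELETE FROM ks.events WHERE sensor_id = ? AND event_time < ?",
                            "sensor-42", Instant.parse("2017-01-01T00:00:00Z")));
                }
            }
        }

    Both statements produce a constant number of tombstones no matter how many rows they cover, which is why they scale much better than deleting cells or rows one by one.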