I have a very large Cassandra table with about 13 million entries. This table serves as a kind of lookup table, which means there are no writes, only reads. I use DataStax Enterprise 4.8 (which includes Cassandra 2.1).
So the content is very static, but from time to time (every few months) the content gets updated. The problem is that old entries become outdated while new ones appear, and the old data is not overwritten (it stays in the table). It is necessary to remove the old data to keep the database clean.
I have one requirement: the database must remain available during the update. It is okay to have a short period (a few minutes) where old and new data exist side by side.
I have already thought about the following solutions:
Which of these solutions is the best one? Or better yet, is there a solution that solves my problem more elegantly?
Okay, after a lot of testing, here are my findings. All the measurements mentioned below are based on 13 million records.
I have written a small Java tool that creates SSTables. Here you can find a good example of how to do this with the CQLSSTableWriter. After creating the SSTables, I used the sstableloader command-line tool (which comes with Cassandra) to import them into Cassandra.
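For reference, a minimal sketch of such a generator, assuming a hypothetical lookup table lookup_ks.lookup (key text PRIMARY KEY, value text); the schema, output directory and dummy rows are placeholders for your own data:

```java
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

import java.io.File;

public class SSTableGenerator {

    // Hypothetical lookup table; replace keyspace, table and columns with your own.
    private static final String SCHEMA =
            "CREATE TABLE lookup_ks.lookup (key text PRIMARY KEY, value text)";
    private static final String INSERT =
            "INSERT INTO lookup_ks.lookup (key, value) VALUES (?, ?)";

    public static void main(String[] args) throws Exception {
        // sstableloader expects the directory layout <output>/<keyspace>/<table>
        File outputDir = new File("/tmp/sstables/lookup_ks/lookup");
        outputDir.mkdirs();

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(outputDir)
                .forTable(SCHEMA)
                .using(INSERT)
                .withBufferSizeInMB(128)   // start a new SSTable roughly every 128 MB
                .build();

        // In the real tool the rows come from the source system; dummy data here.
        for (int i = 0; i < 1_000; i++) {
            writer.addRow("key-" + i, "value-" + i);
        }
        writer.close();

        // Then: sstableloader -d <contact point> /tmp/sstables/lookup_ks/lookup
    }
}
```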
Conclusion
I have written a small Java tool that executes CQL commands to insert the records into Cassandra. Additionally, I added a version column so that I can remove the old data after the import. The downside is that my only partition key is the version itself; this makes it easy to remove old records, but it also means the data is not distributed across the nodes and cannot be queried by its actual lookup key with plain CQL. To work around this, I indexed the table with Solr and use Solr queries to search in that table. The fact that the data is not distributed between single nodes is okay for us; the search still works like a charm. At least the data is replicated between several nodes. A sketch of this approach follows below.
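A minimal sketch of the versioned import, assuming a hypothetical table lookup_by_version where version is the partition key and the real lookup key is a clustering column; keyspace, table and version numbers are placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class VersionedImport {

    public static void main(String[] args) {
        // Assumed table: CREATE TABLE lookup_by_version (
        //     version int, key text, value text, PRIMARY KEY (version, key))
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("lookup_ks")) {

            int newVersion = 2;
            int oldVersion = 1;

            PreparedStatement insert = session.prepare(
                    "INSERT INTO lookup_by_version (version, key, value) VALUES (?, ?, ?)");

            // In the real tool the rows come from the source system; dummy data here.
            for (int i = 0; i < 1_000; i++) {
                session.execute(insert.bind(newVersion, "key-" + i, "value-" + i));
            }

            // Once the new version is fully imported, one partition delete
            // drops the complete old dataset.
            session.execute("DELETE FROM lookup_by_version WHERE version = ?", oldVersion);
        }
    }
}
```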
Conclusion
I will use the second solution because it is faster and you don't have to take care of matching the correct library versions. In all my tools I use threading, so this is a big tuning knob for finding the right balance between concurrency and threading overhead. In the end I use a low number of threads in my tool (~8) combined with the executeAsync method of the DataStax Java driver.
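To illustrate that balance, here is a minimal single-threaded sketch that caps the number of in-flight executeAsync requests with a semaphore; in the actual tool the same pattern runs on the ~8 threads mentioned above, and the keyspace, table and the limit of 256 are assumptions:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

import java.util.concurrent.Semaphore;

public class ThrottledLoader {

    private static final int MAX_IN_FLIGHT = 256;

    public static void main(String[] args) throws InterruptedException {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("lookup_ks")) {

            PreparedStatement insert = session.prepare(
                    "INSERT INTO lookup_by_version (version, key, value) VALUES (?, ?, ?)");

            // Limit the number of outstanding async requests instead of adding more threads.
            final Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);

            for (int i = 0; i < 1_000_000; i++) {
                inFlight.acquire();
                ResultSetFuture future =
                        session.executeAsync(insert.bind(2, "key-" + i, "value-" + i));
                Futures.addCallback(future, new FutureCallback<ResultSet>() {
                    @Override public void onSuccess(ResultSet rs) { inFlight.release(); }
                    @Override public void onFailure(Throwable t)  { inFlight.release(); t.printStackTrace(); }
                });
            }

            // Block until every outstanding request has finished.
            inFlight.acquire(MAX_IN_FLIGHT);
        }
    }
}
```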