I have a very large Cassandra table with about 13 million entries. This table serves as a kind of lookup table, which means there are no writes, only reads. I use DataStax Enterprise 4.8 (which includes Cassandra 2.1).
So the content is very static, but from time to time (every few months) the content gets updated. The problem is that old entries become outdated while new ones appear, and the old data is not overwritten (it stays in the table). It is necessary to remove the old data to keep the database clean.
I have one requirement: the database must remain available during the update. It is okay to have a short period (a few minutes) where old and new data exist side by side.
I have already thought about the following solutions:
Which of these solutions is the best one? Or better yet, is there a solution that solves my problem more elegantly?
Okay, after a lot of testing, here are my findings. All the measurements mentioned below are based on 13 million records.
I have written a small Java tool that creates SSTables. Here you can find a good example of how to do this with the CQLSSTableWriter. After creating the SSTables, I used the sstableloader command-line tool (which comes with Cassandra) to import them into Cassandra.
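For reference, a minimal sketch of such a generator, assuming a hypothetical lookup table lookup_ks.lookup (key text PRIMARY KEY, value text); the schema, output directory and dummy rows are placeholders for your own data:

```java
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

import java.io.File;

public class SSTableGenerator {

    // Hypothetical lookup table; replace keyspace, table and columns with your own.
    private static final String SCHEMA =
            "CREATE TABLE lookup_ks.lookup (key text PRIMARY KEY, value text)";
    private static final String INSERT =
            "INSERT INTO lookup_ks.lookup (key, value) VALUES (?, ?)";

    public static void main(String[] args) throws Exception {
        // sstableloader expects the directory layout <output>/<keyspace>/<table>
        File outputDir = new File("/tmp/sstables/lookup_ks/lookup");
        outputDir.mkdirs();

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(outputDir)
                .forTable(SCHEMA)
                .using(INSERT)
                .withBufferSizeInMB(128)   // start a new SSTable roughly every 128 MB
                .build();

        // In the real tool the rows come from the source system; dummy data here.
        for (int i = 0; i < 1_000; i++) {
            writer.addRow("key-" + i, "value-" + i);
        }
        writer.close();

        // Then: sstableloader -d <contact point> /tmp/sstables/lookup_ks/lookup
    }
}
```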
Conclusion
I have written a small Java tool that executes CQL commands to insert the records into Cassandra. Additionally, I added a version column so that I can remove the old data after the import. The downside is that my only partition key is the version itself; this makes it easy to remove old records, but it also means the data is not distributed across the nodes and cannot be queried by its actual lookup key with plain CQL. To work around this, I indexed the table with Solr and use Solr queries to search in that table. The fact that the data is not distributed between single nodes is okay for us; the search still works like a charm. At least the data is replicated between several nodes. A sketch of this approach follows below.
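A minimal sketch of the versioned import, assuming a hypothetical table lookup_by_version where version is the partition key and the real lookup key is a clustering column; keyspace, table and version numbers are placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class VersionedImport {

    public static void main(String[] args) {
        // Assumed table: CREATE TABLE lookup_by_version (
        //     version int, key text, value text, PRIMARY KEY (version, key))
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("lookup_ks")) {

            int newVersion = 2;
            int oldVersion = 1;

            PreparedStatement insert = session.prepare(
                    "INSERT INTO lookup_by_version (version, key, value) VALUES (?, ?, ?)");

            // In the real tool the rows come from the source system; dummy data here.
            for (int i = 0; i < 1_000; i++) {
                session.execute(insert.bind(newVersion, "key-" + i, "value-" + i));
            }

            // Once the new version is fully imported, one partition delete
            // drops the complete old dataset.
            session.execute("DELETE FROM lookup_by_version WHERE version = ?", oldVersion);
        }
    }
}
```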
Conclusion
I will use the second solution because it is faster and you don't have to take care of matching the correct library versions. In all my tools I use threading, so this is a big tuning knob for finding the right balance between concurrency and threading overhead. In the end I use a low number of threads in my tool (~8) combined with the executeAsync method of the DataStax Java driver.
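To illustrate that balance, here is a minimal single-threaded sketch that caps the number of in-flight executeAsync requests with a semaphore; in the actual tool the same pattern runs on the ~8 threads mentioned above, and the keyspace, table and the limit of 256 are assumptions:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

import java.util.concurrent.Semaphore;

public class ThrottledLoader {

    private static final int MAX_IN_FLIGHT = 256;

    public static void main(String[] args) throws InterruptedException {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("lookup_ks")) {

            PreparedStatement insert = session.prepare(
                    "INSERT INTO lookup_by_version (version, key, value) VALUES (?, ?, ?)");

            // Limit the number of outstanding async requests instead of adding more threads.
            final Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);

            for (int i = 0; i < 1_000_000; i++) {
                inFlight.acquire();
                ResultSetFuture future =
                        session.executeAsync(insert.bind(2, "key-" + i, "value-" + i));
                Futures.addCallback(future, new FutureCallback<ResultSet>() {
                    @Override public void onSuccess(ResultSet rs) { inFlight.release(); }
                    @Override public void onFailure(Throwable t)  { inFlight.release(); t.printStackTrace(); }
                });
            }

            // Block until every outstanding request has finished.
            inFlight.acquire(MAX_IN_FLIGHT);
        }
    }
}
```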