Tags: java, cassandra, datastax-java-driver

How to read all rows from a very large table in Cassandra?


I have a Cassandra cluster with two nodes and replication_factor=2 in the same datacenter. The table holds ~150 million rows and is continuously growing. Once a day I need to read every row, process it, and update the corresponding row in Cassandra.

  • Is there a better approach to doing this?

  • Is there any way to divide all rows into parallel chunks and have each chunk processed by a separate thread?

  • Cassandra version: 2.2.1

  • Java version: OpenJDK 1.7


Solution

  • You should have a look at Spark. The Spark Cassandra Connector lets you read data from Cassandra in parallel from multiple Spark workers, which can be co-located on the Cassandra nodes or run in a separate cluster. Data is read, processed, and written back in parallel by a Spark job, which can also be scheduled for daily execution (a minimal batch sketch follows this answer).

    As your data size is constantly growing, it would probably also make sense to look into Spark Streaming, which lets you continually process and update your data based only on the new data coming in. That would avoid reprocessing the same data over and over, though whether it's an option for you depends on your use case (a streaming sketch follows the batch example below).
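
As a rough illustration of the batch approach (not part of the original answer), here is a minimal sketch using the Spark Cassandra Connector's Java API (connector 1.x, which still works with Java 7, hence the anonymous class instead of a lambda). The keyspace and table names (`my_keyspace`, `my_table`), the `TableRow` bean, the contact point, and the `toUpperCase` "processing" step are all placeholders you would adapt to your schema:

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import java.io.Serializable;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class DailyCassandraJob {

    // Hypothetical bean matching the table schema; adjust the fields
    // (and getters/setters) to your actual column names and types.
    public static class TableRow implements Serializable {
        private String id;
        private String payload;
        public TableRow() { }
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getPayload() { return payload; }
        public void setPayload(String payload) { this.payload = payload; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("daily-cassandra-job")
                // Assumed contact point; use one of your two nodes.
                .set("spark.cassandra.connection.host", "127.0.0.1");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The connector splits the full-table scan by token range, so the
        // partitions are read in parallel across the Spark executors.
        JavaRDD<TableRow> rows = javaFunctions(sc)
                .cassandraTable("my_keyspace", "my_table", mapRowTo(TableRow.class));

        // Java 7: anonymous Function instead of a lambda.
        JavaRDD<TableRow> processed = rows.map(new Function<TableRow, TableRow>() {
            @Override
            public TableRow call(TableRow row) {
                // Stand-in for your per-row processing logic.
                row.setPayload(row.getPayload().toUpperCase());
                return row;
            }
        });

        // Write the processed rows back; rows with the same primary key
        // simply overwrite the existing ones.
        javaFunctions(processed)
                .writerBuilder("my_keyspace", "my_table", mapToRow(TableRow.class))
                .saveToCassandra();

        sc.stop();
    }
}
```

Because both the scan and the write are partitioned by token range, this effectively answers the "parallel chunks" question: Spark handles the chunking and thread/executor distribution for you.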
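The streaming variant could look roughly like the sketch below, reusing the `TableRow` bean from the batch example. The socket source, the `"id,payload"` line format, and the table names are again stand-ins; in practice the source would be whatever actually delivers your new data (Kafka, a queue, etc.):

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.javaFunctions;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingCassandraJob {

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("streaming-cassandra-job")
                .set("spark.cassandra.connection.host", "127.0.0.1");
        // Micro-batches every 30 seconds; tune to your ingest rate.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));

        // Stand-in source: a text socket emitting one "id,payload" line per update.
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // Parse each incoming line into the bean used by the batch job.
        JavaDStream<DailyCassandraJob.TableRow> updates =
                lines.map(new Function<String, DailyCassandraJob.TableRow>() {
            @Override
            public DailyCassandraJob.TableRow call(String line) {
                String[] parts = line.split(",", 2);
                DailyCassandraJob.TableRow row = new DailyCassandraJob.TableRow();
                row.setId(parts[0]);
                row.setPayload(parts[1]);
                return row;
            }
        });

        // Each micro-batch is written straight back to the table, so only
        // new data is processed instead of rescanning all ~150M rows.
        javaFunctions(updates)
                .writerBuilder("my_keyspace", "my_table",
                        mapToRow(DailyCassandraJob.TableRow.class))
                .saveToCassandra();

        ssc.start();
        ssc.awaitTermination();
    }
}
```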