Tags: apache-spark, cassandra, datastax-enterprise, cassandra-3.0, spark-cassandra-connector

Why does Spark internally use batch writes to Cassandra?


I am new to Spark, and I am trying to understand why Spark writes to Cassandra in batches (e.g. the saveToCassandra operation) when batches are not efficient for all use cases. Apart from tuning the spark.cassandra properties, what should be taken care of on the Cassandra side or the Spark side when running a Spark job that reads from Cassandra and writes back to Cassandra?

Are these logged or unlogged batch writes?


Solution

  • This is not specific to Spark writing to Cassandra; it applies to any process writing to a remote service:

    1. Spark writes to Cassandra through the driver API, not by copying files.
    2. Batching speeds up puts because a single API call carries multiple rows, amortizing the per-request network round trip.
    3. Batching makes exactly-once semantics harder: if a batch fails partway through, you have to reason about which rows were actually applied.
    4. You can always write your own Spark task that issues one put at a time.
    5. Single vs. batch is configurable: the Spark Cassandra Connector exposes properties such as spark.cassandra.output.batch.size.rows and spark.cassandra.output.batch.size.bytes to control batch size.

  • As for the second question: the connector issues UNLOGGED batches, and by default it groups rows by partition key (spark.cassandra.output.batch.grouping.key = partition), so each batch targets a single partition rather than the multi-partition batch anti-pattern.
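For tuning, a PySpark sketch of the relevant connector properties might look like the following. This assumes a running Cassandra cluster and the spark-cassandra-connector on the classpath; the host address and the keyspace/table names ("ks", "kv") are placeholders.

```python
# Sketch only: requires a live Cassandra cluster and the
# spark-cassandra-connector package; names below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-write-tuning")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         # Rows per unlogged batch; lowering this (even to 1)
         # effectively moves toward single-row writes.
         .config("spark.cassandra.output.batch.size.rows", "10")
         # Group rows into batches by partition key (the default),
         # so each batch targets a single partition.
         .config("spark.cassandra.output.batch.grouping.key", "partition")
         # Cap on concurrent in-flight write batches.
         .config("spark.cassandra.output.concurrent.writes", "5")
         .getOrCreate())

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="ks", table="kv")
      .load())

# ... transform df ...

(df.write.format("org.apache.spark.sql.cassandra")
   .options(keyspace="ks", table="kv")
   .mode("append")
   .save())
```

Keeping batch grouping on the partition key is what keeps these batches cheap: a batch confined to one partition is a single mutation on one replica set, unlike a logged multi-partition batch.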
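Point 2 above can be illustrated with a toy model (illustrative only, not connector code): a fake service where every API call counts as one network round trip, so batching N rows per call divides the round-trip count by N.

```python
# Toy model: each API call represents one network round trip.
# Batching many rows into one call is what amortizes that cost.

class ToyService:
    def __init__(self):
        self.api_calls = 0   # number of round trips made
        self.rows = []       # rows the "service" has stored

    def put(self, row):
        self.api_calls += 1
        self.rows.append(row)

    def put_batch(self, batch):
        self.api_calls += 1          # one round trip for the whole batch
        self.rows.extend(batch)

def write_single(service, rows):
    # One API call per row.
    for r in rows:
        service.put(r)

def write_batched(service, rows, batch_size):
    # One API call per batch of rows.
    for i in range(0, len(rows), batch_size):
        service.put_batch(rows[i:i + batch_size])

rows = list(range(1000))

single = ToyService()
write_single(single, rows)

batched = ToyService()
write_batched(batched, rows, batch_size=100)

print(single.api_calls)    # 1000 round trips
print(batched.api_calls)   # 10 round trips
assert single.rows == batched.rows  # same data lands either way
```

The data written is identical in both cases; only the number of round trips changes, which is exactly the trade the connector makes by default.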