I'm using Cassandra 2.1.5 (.469) with Spark 1.2.1.
I performed a migration job with Spark on a big C* table (2,034,065,959 rows), migrating it to a table with a new schema (new_table), using:
some_mapped_rdd.saveToCassandra("keyspace", "new_table", writeConf=WriteConf(parallelismLevel = 50))
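For reference, a minimal sketch of how that write is wired up (the RDD and keyspace/table names are just the placeholders from the snippet above); saveToCassandra and WriteConf need these connector imports in scope:

import com.datastax.spark.connector._                // adds saveToCassandra to RDDs
import com.datastax.spark.connector.writer.WriteConf // write tuning options

// cap the number of batches executed in parallel by a single Spark task
some_mapped_rdd.saveToCassandra("keyspace", "new_table",
  writeConf = WriteConf(parallelismLevel = 50))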
I can see in OpsCenter/Activities that C* is running compaction tasks on new_table, and they have been going on for a few days.
In addition, I'm trying to run another job while the compaction tasks are still running, using:
//join with cassandra
val rdd = some_array.map(x => SomeClass(x._1,x._2)).joinWithCassandraTable(keyspace, some_table)
//get only the jsons and create rdd temp table
val jsons = rdd.map(_._2.getString("this"))
val jsonSchemaRDD = sqlContext.jsonRDD(jsons)
jsonSchemaRDD.registerTempTable("this_json")
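Just to make the flow concrete, the temp table can then be queried with Spark SQL; the column name below is only a hypothetical placeholder, since the actual JSON structure isn't shown:

// query the registered temp table (column name is hypothetical)
val results = sqlContext.sql("SELECT some_field FROM this_json LIMIT 10")
results.collect().foreach(println)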
and it takes much longer than usual to finish (usually I don't perform huge migration tasks).
So, do the compaction processes in C* influence Spark jobs?
EDIT:
My table is configured with the default SizeTieredCompactionStrategy and I have ~2882 SSTable files of ~20 MB (and smaller, on 1 node out of 3), so I guess I should raise the compaction_throughput_mb_per_sec parameter and switch to DateTieredCompactionStrategy, since my data is time-series data.
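If I go that route, something like the following should switch the strategy (just a sketch; the keyspace/table names are placeholders, and compaction throughput itself is a cassandra.yaml setting that can also be raised live with nodetool setcompactionthroughput):

import com.datastax.spark.connector.cql.CassandraConnector

// switch the target table to DateTieredCompactionStrategy via CQL
CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute(
    "ALTER TABLE keyspace.new_table WITH compaction = " +
    "{'class': 'DateTieredCompactionStrategy'}")
}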
Since compaction can use a lot of system resources, it can influence your Spark jobs from a performance standpoint. You can control how much throughput compactions are allowed to use via compaction_throughput_mb_per_sec.
On the other hand, reducing compaction throughput will make your compactions take longer to complete.
Additionally, the fact that compaction is happening could mean that how your data is distributed among sstables is not optimal. So it could be that compaction is a symptom of the issue, but not the actual issue. In fact it could be the solution to your problem (over time as it makes more progress).
I'd recommend taking a look at the cfhistograms output for the tables you are querying to see how many SSTables are being hit per read. That can be a good indicator that something is suboptimal, such as needing to change your configuration (e.g. memtable flush rates) or to optimize or change your compaction strategy.
This answer provides a good explanation on how to read cfhistograms output.