apache-spark, cassandra, spark-cassandra-connector

What happens with Spark partitions when using Spark-Cassandra-Connector


I have a 16-node cluster in which every node runs both Spark and Cassandra, with a replication factor of 3 and spark.sql.shuffle.partitions set to 96. I am using Spark-Cassandra Connector 3.0.0 to do a repartitionByCassandraReplica followed by joinWithCassandraTable, after which some Spark ML analysis takes place. My question is: what eventually happens with the Spark partitions?

1st scenario

I set the partitionsPerHost parameter of repartitionByCassandraReplica to the number of selected Cassandra partition keys, so if I choose 4 partition keys I get 4 partitions per host. With 16 hosts, this gives me 64 Spark partitions.

2nd scenario

But according to the Spark Cassandra Connector documentation, information from the system.size_estimates table should be used to calculate the number of Spark partitions. For example, from my system.size_estimates:

estimated_table_size = mean_partition_size x number_of_partitions
                 = (24416287.87/1000000) MB x 332
                 = 8106.2 MB

spark_partitions = estimated_table_size / input.split.size_in_mb
             = 8106.2 MB / 64 MB
             = 126.6593 partitions
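
The arithmetic above can be reproduced with a short sketch. The function name and its parameters are hypothetical and only mirror the formula in this post; the connector's real split calculation may group token ranges differently, so treat the result as an approximation. Rounding up to a whole partition is also an assumption made here.

```python
import math

def estimated_spark_partitions(mean_partition_size_bytes, partitions_count, split_size_mb):
    """Approximate the size-based Spark partition count (hypothetical helper)."""
    # Same unit conversion as in the post: bytes / 1,000,000 = MB
    estimated_table_size_mb = mean_partition_size_bytes / 1_000_000 * partitions_count
    # Estimated table size divided by the configured split size
    return estimated_table_size_mb, estimated_table_size_mb / split_size_mb

# Figures taken from the system.size_estimates example above
table_mb, partitions = estimated_spark_partitions(24416287.87, 332, 64)
print(round(table_mb, 1))     # 8106.2
print(math.ceil(partitions))  # 127 (rounding up is an assumption)
```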

So when does the 1st scenario take place, and when the 2nd? Am I calculating something wrong? Are there specific cases where the 1st scenario applies and others where the 2nd does?


Solution

  • Those are two completely different paths by which the number of Spark partitions is calculated.

    If you're calling repartitionByCassandraReplica(), the number of Spark partitions is determined by both partitionsPerHost and the number of Cassandra nodes in the local DC.

    Otherwise, the connector will use input.split.size_in_mb to determine the number of Spark partitions based on the estimated table size. Cheers!
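
As a rough illustration of the repartitionByCassandraReplica() path, the count from the question can be worked out as a simple product. This is a minimal sketch: the helper name is hypothetical, and it assumes all 16 nodes from the question sit in the local DC.

```python
def replica_based_partitions(nodes_in_local_dc, partitions_per_host):
    """Partition count on the repartitionByCassandraReplica path (hypothetical helper)."""
    # One group of partitions_per_host Spark partitions for each Cassandra node
    return nodes_in_local_dc * partitions_per_host

# 16 nodes (assumed to all be in the local DC) and partitionsPerHost = 4, as in the question
print(replica_based_partitions(16, 4))  # 64
```

With these inputs the result matches the 64 partitions reported for the 1st scenario, independent of the table size estimate.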