As Apache Spark is the recommended distributed processing engine for Cassandra, I know that it is possible to run Spark executors alongside Cassandra nodes. My question is whether the driver and the Spark connector are smart enough to understand partitioning and replica placement, so that data are processed in a hyper-converged manner.
Simply put: do the executors read data from partitions hosted on the nodes where they are running, so that no unnecessary data is transferred across the network, the way Spark does when it runs over HDFS?
Yes, the Spark Cassandra Connector is able to do this. From the source code:
The `getPreferredLocations` method tells Spark the preferred nodes to fetch a partition from, so that the data for the partition are at the same node the task was sent to. If Cassandra nodes are collocated with Spark nodes, the queries are always sent to the Cassandra process running on the same node as the Spark Executor process, hence data are not transferred between nodes. If a Cassandra node fails or gets overloaded during read, the queries are retried to a different node.