apache-spark kubernetes cassandra spark-cassandra-connector

Can we use the repartitionByCassandraReplica functionality of spark-cassandra-connector in a Kubernetes environment?

I am trying to understand how to use the repartitionByCassandraReplica functionality of spark-cassandra-connector in a Kubernetes environment.

My initial thought is that hosting the executor on the same host on which the Cassandra pod is running will solve my problem. Am I right in my thinking?

Solution

Data locality can only be achieved with repartitionByCassandraReplica if both the Spark worker/executor and Cassandra JVMs run in the same operating system instance (OSI). This applies to physical servers, VMs, containers, pods, etc.

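Unless you have a way of running both the Spark and Cassandra images in the same container/pod, it won't be possible to achieve data locality. Placing the executor pod on the same Kubernetes node as the Cassandra pod is not enough on its own, because the two pods are still separate OSIs.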
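
To make the point concrete, here is a toy model in Python of why separate pods do not count as "local". It is only a sketch, not Spark's or the connector's actual code. It assumes locality is decided by comparing the address an executor registers under with the address of the Cassandra replica, and that each pod gets its own IP. The Endpoint class, the locality_level function, and all IP addresses are made up for illustration; only the level names NODE_LOCAL and ANY are borrowed from Spark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """A process as a scheduler would see it (hypothetical model)."""
    name: str
    address: str  # hostname or IP the process registers under


def locality_level(executor: Endpoint, replica: Endpoint) -> str:
    """Return "NODE_LOCAL" only when both processes report the same address."""
    return "NODE_LOCAL" if executor.address == replica.address else "ANY"


# Same OS instance: both JVMs share one network identity (made-up IP).
vm_executor = Endpoint("spark-executor", "10.0.0.5")
vm_cassandra = Endpoint("cassandra", "10.0.0.5")

# Separate pods on the same Kubernetes node: each pod has its own IP (made-up IPs).
pod_executor = Endpoint("spark-executor-pod", "10.244.1.17")
pod_cassandra = Endpoint("cassandra-pod", "10.244.1.23")

same_os = locality_level(vm_executor, vm_cassandra)
separate_pods = locality_level(pod_executor, pod_cassandra)
print(same_os, separate_pods)
```

In this model the shared-OS case is treated as local, while the two pods are not, even though they sit on the same physical node.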

For what it's worth, there's an open spark-cassandra-connector ticket to look into how this can be achieved (SPARKC-655). It's just a stub right now and there has not been any work done on it yet. Cheers!