I am trying to understand how to use the repartitionByCassandraReplica functionality of the spark-cassandra-connector in a Kubernetes environment.
My initial thought is that hosting the executor on the same host as the one running the Cassandra pod will solve my problem. Am I right in my thinking?
Data locality can only be achieved with repartitionByCassandraReplica if both the Spark worker/executor and Cassandra JVMs run in the same operating system instance (OSI). This applies to physical servers, VMs, containers, pods, etc.
Unless you have a way of running both the Spark and Cassandra images in the same container/pod, it won't be possible to achieve data locality.
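For reference, here is a minimal sketch of what the call itself looks like, so it is clear which step depends on co-location. The keyspace `shop`, the table `customers`, its partition key column `customer_id`, and the contact point `cassandra.default.svc` are all hypothetical names picked for illustration. The code runs the same way wherever the executors are; it only gives you local reads when each executor shares an OSI with a Cassandra replica.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Field name must match the partition key column of the (hypothetical) table.
case class CustomerKey(customer_id: Int)

object ReplicaLocalJoin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("replica-local-join")
      // Hypothetical contact point; replace with your Cassandra service address.
      .set("spark.cassandra.connection.host", "cassandra.default.svc")
    val sc = new SparkContext(conf)

    // Keys we want to look up in Cassandra.
    val keys = sc.parallelize(1 to 1000).map(CustomerKey(_))

    // Group the keys into Spark partitions by the replica that owns each key.
    // Spark can only schedule those partitions "locally" if an executor
    // reports the same address as the replica.
    val byReplica = keys.repartitionByCassandraReplica(
      "shop", "customers", partitionsPerHost = 10)

    // Fetch the matching rows, one partition-key lookup per key.
    val joined = byReplica.joinWithCassandraTable("shop", "customers")

    joined.take(5).foreach(println)
    sc.stop()
  }
}
```

If the executors and the Cassandra nodes are in separate pods, the repartition step still shuffles the keys, but the subsequent reads go over the network like any other query.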
For what it's worth, there's an open spark-cassandra-connector ticket to look into how this can be achieved (SPARKC-655). It's just a stub right now and there has not been any work done on it yet. Cheers!