Scenario: Cassandra is hosted on a server a.b.c.d
and spark runs on server say w.x.y.z
.
Assume i want to transform the data from a table(say table)casssandra and rewrite the same to other table(say tableNew) in cassandra using Spark,The code that i write looks something like this
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "a.b.c.d")
.set("spark.cassandra.auth.username", "<UserName>")
.set("spark.cassandra.auth.password", "<Password>")
val spark = SparkSession.builder().master("yarn")
.config(conf)
.getOrCreate()
val dfFromCassandra = spark.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "<table>", "keyspace" -> "<Keyspace>")).load()
val filteredDF = dfFromCassandra.filter(filterCriteria).write.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "<tableNew>", "keyspace" -> "<Keyspace>")).save
Here filterCriteria
represents the transformation/filtering that I do. Iam not sure how Spark cassandra connector works in this case internally.
This is the Confusion that I have:
1: Does spark load the data from Cassandra source table to the memory and then filter the same and reload the same to the Target table Or
2: Does Spark cassandra connector convert the filter criteria to Where
clause and loads only the relevant data to form RDD and writes the same back to target table in Cassandra Or
3:Does the entire operation happens as a cql operation where the query is converted to sqllike query and is executed in cassandra itself?(I am almost sure that this is not what happens)
It is either 1. or 2. depending on your filterCriteria
. Naturally Spark itself can't do any CQL filtering but custom datasources can implement it using predicate pushdown. In case if Cassandra driver, it is implemented here and the answer depends if that covers the used filterCriteria
.