According to the documentation, the Cassandra partitioner can reduce shuffles and so improve overall performance. To take advantage of the partitioner, I should use the keyBy method. Given the table:
CREATE TABLE data_storage.dummy (
id text,
value bigint,
PRIMARY KEY (id)
)
I can query the table using either the RDD API or the DataFrame API:
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext
import org.apache.spark.sql.DataFrame

val keySpace = "data_storage"
val table = "dummy"

// option 1: DataFrame API
private val df: DataFrame = session.read.format("org.apache.spark.sql.cassandra")
  .option("keyspace", keySpace)
  .option("table", table)
  .load
println(df.rdd.partitioner) // prints None

// option 2: RDD API
val rdd = session.sparkContext.cassandraTable(keySpace, table).keyBy("id")
println(rdd.partitioner) // prints Some(CassandraPartitioner)
Is there any way to tell the DataFrame reader how the data should be queried (something like a keyBy() method for DataFrames)?
You don't need to specify a partitioner when using a DataFrame. You just need to make sure pushdown is set to true for the Cassandra DataFrame.
Check the doc "Automatic Predicate Pushdown and Column Pruning".
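
For illustration, here is a minimal sketch of reading the table from the question with pushdown set explicitly. The connection host, the application name, and the filter value "some-id" are assumptions made up for this example, and as far as I know pushdown is already enabled by default, so the extra option mainly documents the intent.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object PushdownExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a Cassandra node is reachable at 127.0.0.1
    // and the Spark Cassandra Connector is on the classpath.
    val session = SparkSession.builder()
      .appName("pushdown-example")
      .master("local[*]")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    // Same reader as option 1 in the question, with pushdown requested explicitly.
    val df = session.read
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "data_storage")
      .option("table", "dummy")
      .option("pushdown", "true")
      .load()

    // "some-id" is a hypothetical partition key value.
    val filtered = df.filter(col("id") === "some-id")

    // Print the physical plan so you can inspect which filters reach Cassandra.
    filtered.explain()

    session.stop()
  }
}
```

If the predicate is eligible for pushdown, the plan printed by explain() should show it being handled by the Cassandra scan rather than by a separate Spark filter over the whole table.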