Tags: scala, apache-spark, cassandra, apache-spark-sql, spark-cassandra-connector

How to take advantage of Cassandra partitioner using DataFrames?


According to the documentation, the Cassandra partitioner can help reduce shuffles and thereby improve overall performance. To take advantage of the partitioner I should use the keyBy method. Given the table:

CREATE TABLE data_storage.dummy (
id text,
value bigint,
PRIMARY KEY (id)
) 

I can query the table using either the RDD API or the DataFrame API:

  val keySpace = "data_storage"
  val table = "dummy"

  //option 1
  private val df: DataFrame = session.read.format("org.apache.spark.sql.cassandra")
    .option("keyspace", keySpace)
    .option("table", table)
    .load
  println(df.rdd.partitioner) //prints None

  //option 2
  val rdd = session.sparkContext.cassandraTable(keySpace, table).keyBy("id")
  println(rdd.partitioner) //prints Some(CassandraPartitioner)
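
For context, here is a hedged sketch of why the partitioner matters in the RDD case: because the keyed RDD reports Some(CassandraPartitioner), key-oriented operations such as groupByKey can reuse the existing partitioning instead of triggering a shuffle. The explicit Tuple1[String] key type is an assumption based on the connector's keyBy[K] signature; the rest mirrors the setup above.

```scala
import com.datastax.spark.connector._

// Same keyspace/table as above; keyBy[K] takes the key columns and a key
// type (assumed Tuple1[String] here for the single text partition key).
val keyed = session.sparkContext
  .cassandraTable(keySpace, table)
  .keyBy[Tuple1[String]]("id")

// Since keyed already has a partitioner, groupByKey reuses it and
// avoids a shuffle; the result keeps the same partitioner.
val grouped = keyed.groupByKey()
println(grouped.partitioner) // Some(CassandraPartitioner)
```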

Is there any way to tell the DataFrame reader how the data should be partitioned (something like the keyBy() method, but for DataFrames)?


Solution

  • You don't need to specify a partitioner when working with DataFrames. You just need to make sure pushdown is enabled (it is true by default) for the Cassandra DataFrame source, so that predicates on partition-key columns are pushed down to Cassandra instead of being applied after a full scan. See the doc on Automatic Predicate Pushdown and Column Pruning.
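
One hedged way to check this from the DataFrame side, assuming the same session, keySpace, and table as in the question: filter on the partition key and inspect the physical plan, where a pushed predicate should appear under PushedFilters rather than as a post-scan Filter.

```scala
// Sketch: read the table with pushdown enabled (the default, shown
// explicitly here for clarity) and filter on the partition key "id".
val df = session.read.format("org.apache.spark.sql.cassandra")
  .option("keyspace", keySpace)
  .option("table", table)
  .option("pushdown", "true")
  .load()

// explain() prints the physical plan; with pushdown working, the id
// predicate is handed to Cassandra and listed under PushedFilters.
df.filter(df("id") === "some-key").explain()
```

This requires a live Spark session connected to a Cassandra cluster, so treat it as a pattern to adapt rather than a standalone program.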