I have a Cassandra table XYX with columns (id uuid, insert timestamp, header text), where id and insert form a composite primary key.
I'm using DataFrames, and in my Spark shell I'm fetching the id and header columns. I want distinct rows based on the id and header columns.
I'm seeing a lot of shuffles, which should not be the case since the Spark Cassandra Connector ensures that all rows for a given Cassandra partition are in the same Spark partition.
After fetching, I'm using dropDuplicates to get distinct records.
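For reference, this is roughly what I'm running (the keyspace and table names below are placeholders):

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "test"))
  .load()
  .select("id", "header")
  .dropDuplicates(Seq("id", "header"))   // this is the step that triggers the shuffles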
The Spark DataFrame API does not support custom partitioners yet, so the Connector cannot expose the C* partitioner to the DataFrame engine. The RDD API, on the other hand, does support custom partitioners, so you can load your data into an RDD and then convert it to a DataFrame. Here is the Connector doc about C* partitioner usage: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
The keyBy() function allows you to define the key columns to use for grouping.
Here is a working example. It is not short, so I expect someone could improve it:
import com.datastax.spark.connector._

// load the data into an RDD and key it by the "id" column;
// the uuid column is read as String, so the key type must be String as well
val rdd = sc.cassandraTable[(String, String)]("test", "test")
  .select("id" as "_1", "header" as "_2")
  .keyBy[Tuple1[String]]("id")

// check that the partitioner is CassandraPartitioner,
// so the groupByKey below will not shuffle
rdd.partitioner

// deduplicate within each group, flatten, and build a two-column DataFrame
val df = rdd.groupByKey
  .flatMap { case (key, group) => group.toSeq.distinct }
  .toDF("id", "header")
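One small simplification, since toSet already removes duplicates within each group (same keyspace/table assumptions as above):

val df = rdd.groupByKey
  .flatMap { case (_, group) => group.toSet }   // a Set is distinct by construction
  .toDF("id", "header")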