apache-spark, cassandra, rdd, spark-cassandra-connector

Spark CassandraTableScanRDD KeyBy not retaining all columns


CASSANDRA_TABLE has (some_other_column, itemid) as its primary key.

import com.datastax.spark.connector._  // cassandraTable implicit and ColumnName "as"
import com.datastax.spark.connector.rdd.CassandraTableScanRDD

val cassandraRdd: CassandraTableScanRDD[CassandraRow] = sparkSession.sparkContext
  .cassandraTable(cassandraKeyspace, cassandraTable)

cassandraRdd.take(10).foreach(println)

This cassandraRdd contains all of the columns read from my Cassandra table.

val temp1: CassandraTableScanRDD[(String, CassandraRow)] = cassandraRdd
  .select("itemid", "column2", "column3")
  .keyBy[String]("itemid")
val temp2: CassandraTableScanRDD[(String, CassandraRow)] = cassandraRdd
  .keyBy[String]("itemid")

temp1.take(10).foreach(println)
temp2.take(10).foreach(println)

Neither temp1 nor temp2 retains all of the columns after the keyBy operation:

(988230014,CassandraRow{itemid: 988230014})

How can I keyBy on a certain column and still have the CassandraRow retain all of its columns?
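
For context, keying with a plain map keeps the full CassandraRow but returns an ordinary RDD, so the table's partitioner (which the solution below retains) is lost; a minimal sketch, assuming the cassandraRdd defined above (mappedRdd is just an illustrative name):

import org.apache.spark.rdd.RDD

// map keeps every column in the CassandraRow value, but an RDD produced
// by map never carries a partitioner, so later joins against the same
// table would shuffle
val mappedRdd: RDD[(String, CassandraRow)] =
  cassandraRdd.map(row => (row.getString("itemid"), row))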


Solution

  • To retain the partitioner and still get the selected columns, I had to read the Cassandra rows something like this:

    val cassandraRdd: CassandraTableScanRDD[((String, String), (String, String, String))] = {
      sparkSession.sparkContext
        .cassandraTable[(String, String, String)](cassandraKeyspace, cassandraTable)
        // alias the value columns into tuple positions _1.._3, and also
        // select the key columns so keyBy can build the composite key
        .select("some_other_column" as "_1", "itemid" as "_2", "column3" as "_3", "some_other_column", "itemid")
        .keyBy[(String, String)]("some_other_column", "itemid")
    }
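
  • As a quick sanity check (partitioner and take are standard Spark RDD methods; the exact partitioner printed depends on the connector version):

    // should print Some(...) rather than None when the connector can map
    // the key columns onto the table's partition key
    println(cassandraRdd.partitioner)
    cassandraRdd.take(10).foreach(println)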