Tags: scala, cassandra, apache-spark, spark-cassandra-connector

Spark-cassandra connector: select list of keys


Cassandra 2.1, Spark 1.1, spark-cassandra-connector 1.1

I have a very tall column family of key/value pairs, and I also have an RDD of keys that I'd like to select from that CF.

What I'd like to do is something like

import com.datastax.spark.connector._

val ids = ...

val pairs = ids.map {
  id => sc.cassandraTable("cf", "tallTable")
          .select("the_key", "the_val")
          .where("the_key = ?", id)
}

However, referring to the SparkContext inside the map causes an NPE: the SparkContext lives only on the driver and is not serializable, so it cannot be used inside tasks running on the executors. I could make an RDD out of the full tallTable and then join it against the ids, but that is a very slow operation and I'd like to avoid it.

Is there a way to read a set of keys from Cassandra like this?


Solution

  • The spark-cassandra connector offers an optimized method for joining an RDD of keys with a Cassandra table:

    import com.datastax.spark.connector._

    // Given a collection of ids
    val ids = Seq(id,...)
    // Make an RDD out of it
    val idRdd = sc.parallelize(ids)
    // Join the ids with the Cassandra table to obtain the data specific to those ids
    val data = idRdd.joinWithCassandraTable("cf", "tallTable")
    

    This functionality is available from spark-cassandra connector v1.2 onwards, so I'd recommend upgrading.
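
    For completeness, here is a sketch of how the joined result might be narrowed and unpacked. This assumes the table's partition key is the `the_key` column and that both columns are text; `Tuple1` wrapping is used because the connector maps RDD elements onto the table's key columns, and a bare primitive may not have an implicit writer:

    ```scala
    import com.datastax.spark.connector._

    // Assumption: ids is a local Seq of key values and the partition
    // key of cf.tallTable is the single column "the_key".
    // Wrap each id in a Tuple1 so the connector can map it onto the key.
    val idRdd = sc.parallelize(ids.map(Tuple1(_)))

    val pairs = idRdd
      .joinWithCassandraTable("cf", "tallTable")
      // Fetch only the columns we need
      .select("the_key", "the_val")
      // Each element is (joinKey, CassandraRow); unpack to plain tuples.
      // getString assumes text columns; adjust the getter to the real types.
      .map { case (Tuple1(id), row) => (id, row.getString("the_val")) }
    ```

    Because the connector issues targeted reads for each key's partition, this avoids the full-table scan that a plain RDD join would require.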