Cassandra 2.1, Spark 1.1, spark-cassandra-connector 1.1
I have a very, very tall column family of key/value pairs, and I also have an RDD of keys that I'd like to select from that CF.
What I'd like to do is something like this:
import com.datastax.spark.connector._

val ids = ...   // an RDD of keys
val pairs = ids.map { id =>
  sc.cassandraTable("cf", "tallTable")
    .select("the_key", "the_val")
    .where("the_key = ?", id)
}
However, referring to the Spark context inside the map causes an NPE (the SparkContext only lives on the driver and isn't serializable to the executors). I could make an RDD out of the full tallTable and then join it with ids, but that is a very slow operation and I'd like to avoid it; a sketch of that approach is below.
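For reference, the full-table join I'm trying to avoid looks roughly like this (assuming the_key and the_val are text columns; the getString calls would change for other types):

import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.1
import com.datastax.spark.connector._

// Scan the whole table and turn it into a (key, value) pair RDD
val fullTable = sc.cassandraTable("cf", "tallTable")
  .select("the_key", "the_val")
  .map(row => (row.getString("the_key"), row.getString("the_val")))

// Join the RDD of keys against the full scan -- this reads the entire CF
val pairs = ids.map(id => (id, ())).join(fullTable).mapValues(_._2)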
Is there a way to read a set of keys from Cassandra along the lines of my first snippet?
The spark-cassandra-connector offers an optimized method for joining an RDD of keys with a Cassandra table:
import com.datastax.spark.connector._

// Given a collection of ids
val ids = Seq(id, ...)

// Make an RDD out of it
val idRdd = sc.parallelize(ids)

// Join the ids with the Cassandra table to fetch only the rows for those ids
val data = idRdd.joinWithCassandraTable("cf", "tallTable")
This functionality is available from spark-cassandra-connector v1.2 onwards, so I'd recommend upgrading.
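If you want to name the join column and restrict the fetched columns explicitly, the call can be spelled out a bit more fully. This is a minimal sketch assuming the_key is the sole partition key of tallTable and the ids are plain values (the Tuple1 wrapping maps each id onto that key column):

import com.datastax.spark.connector._

val data = sc.parallelize(ids)
  .map(Tuple1(_))                          // each id becomes a one-column "row" keyed on the_key
  .joinWithCassandraTable("cf", "tallTable")
  .on(SomeColumns("the_key"))              // join on the partition key column
  .select("the_key", "the_val")            // pull back only the needed columns

// data is an RDD of (Tuple1(id), CassandraRow) pairs

Because the join queries only the partitions for the supplied ids, the full table is never scanned.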