Tags: scala, cassandra, apache-spark, spark-cassandra-connector

Spark-cassandra connector: select list of keys


Cassandra 2.1, Spark 1.1, spark-cassandra-connector 1.1

I have a very tall column family of key/value pairs, and I also have an RDD of keys that I'd like to select from that CF.

What I'd like to do is something like

import com.datastax.spark.connector._

val ids = ...

val pairs = ids.map {
  id => sc.cassandraTable("cf", "tallTable")
          .select("the_key", "the_val")
          .where("the_key = ?", id)
}

However, referring to the SparkContext inside the map causes an NPE: the SparkContext lives only on the driver and is not serializable, so it cannot be used inside tasks running on the executors. I could make an RDD out of the full tallTable and then join it against the ids, but that is a very slow operation and I'd like to avoid it.

Is there a way to read a set of keys from Cassandra like this?


Solution

  • The spark-cassandra connector offers an optimized method for joining an RDD of keys with a Cassandra table:

    import com.datastax.spark.connector._

    // Given a collection of ids
    val ids = Seq(id,...)
    // Make an RDD out of it
    val idRdd = sc.parallelize(ids)
    // Join the ids with the Cassandra table to obtain the data specific to those ids
    val data = idRdd.joinWithCassandraTable("cf", "tallTable")
    

    This functionality is available from spark-cassandra connector v1.2 onwards, so I'd recommend upgrading.
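
    For completeness, here is a sketch of how the joined result might be narrowed and unpacked. This assumes the table's partition key is the `the_key` column and that both columns are text; `Tuple1` wrapping is used because the connector maps RDD elements onto the table's key columns, and a bare primitive may not have an implicit writer:

    ```scala
    import com.datastax.spark.connector._

    // Assumption: ids is a local Seq of key values and the partition
    // key of cf.tallTable is the single column "the_key".
    // Wrap each id in a Tuple1 so the connector can map it onto the key.
    val idRdd = sc.parallelize(ids.map(Tuple1(_)))

    val pairs = idRdd
      .joinWithCassandraTable("cf", "tallTable")
      // Fetch only the columns we need
      .select("the_key", "the_val")
      // Each element is (joinKey, CassandraRow); unpack to plain tuples.
      // getString assumes text columns; adjust the getter to the real types.
      .map { case (Tuple1(id), row) => (id, row.getString("the_val")) }
    ```

    Because the connector issues targeted reads for each key's partition, this avoids the full-table scan that a plain RDD join would require.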