apache-spark cassandra spark-cassandra-connector

RDD joinWithCassandraTable

Can anyone please help me with the query below? I have an RDD with 5 columns and I want to join it with a table in Cassandra. I know that this can be done using "joinWithCassandraTable".

I saw the following syntax for it somewhere: RDD.joinWithCassandraTable(KEYSPACE, tablename, SomeColumns("cola","colb")) .on(SomeColumns("colc"))

Can anyone please send me the correct syntax?

In particular, I would like to know where to specify the table column that serves as the join key.

Solution

joinWithCassandraTable works by pulling from Cassandra (C*) only the partitions whose keys match your RDD entries, so the join has to be on the table's partition key columns.

The documentation is here https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable

and the API doc is here

http://datastax.github.io/spark-cassandra-connector/ApiDocs/1.6.0-M2/spark-cassandra-connector/#com.datastax.spark.connector.RDDFunctions

The joinWithCassandraTable (jWCT) method can be used without the fluent API by specifying all the arguments in the method call:

def joinWithCassandraTable[R](
  keyspaceName: String, 
  tableName: String, 
  selectedColumns: ColumnSelector = AllColumns, 
  joinColumns: ColumnSelector = PartitionKeyColumns)

But the fluent API can also be used:

joinWithCassandraTable[R](keyspace, tableName).select(AllColumns).on(PartitionKeyColumns)

These two calls are equivalent.

Your example

RDD.joinWithCassandraTable(KEYSPACE, tablename, SomeColumns("cola","colb")) .on(SomeColumns("colc"))

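This uses each object in the RDD as the join key against colc of tablename, and returns only cola and colb from the matching rows. So the join column goes in .on(...) (or in the joinColumns argument), and the columns you want back go in the selectedColumns argument (or in .select(...)).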
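
For a fuller picture, here is a minimal sketch of how the whole thing might look in an application. It assumes that colc is the partition key of the table and is a text column, that the keyspace and table are called my_keyspace and my_table, and that Cassandra is reachable on 127.0.0.1. The JoinKey case class and the sample values are hypothetical; in your case you would map each 5-column record to the value that should match colc.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical key class: the field name matches the join column (colc).
case class JoinKey(colc: String)

object JoinWithCassandraTableExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("jwct-example")
      // Assumed contact point; the master is expected to come from spark-submit.
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Stand-in for your RDD, reduced to the value that should match colc.
    val keys = sc.parallelize(Seq("k1", "k2", "k3")).map(JoinKey(_))

    // Join on colc and fetch only cola and colb from the table.
    // my_keyspace and my_table are placeholder names.
    val joined = keys
      .joinWithCassandraTable("my_keyspace", "my_table", SomeColumns("cola", "colb"))
      .on(SomeColumns("colc"))

    // Each result is a pair: (entry from the RDD, matching CassandraRow).
    joined.collect().foreach { case (key, row) =>
      println(s"${key.colc} -> $row")
    }

    sc.stop()
  }
}
```

The result is an RDD of pairs, so the original RDD entry stays available next to the Cassandra row it matched. RDD entries with no matching partition simply produce no output pair.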