apache-spark, cassandra, spark-cassandra-connector

RDD joinWithCassandraTable


Can anyone please help me with the question below? I have an RDD with 5 columns that I want to join with a table in Cassandra. I know this can be done with "joinWithCassandraTable".

I saw this syntax for it somewhere: RDD.joinWithCassandraTable(KEYSPACE, tablename, SomeColumns("cola","colb")).on(SomeColumns("colc"))

Can anyone please send me the correct syntax?

Specifically, I would like to know where to specify the table column that serves as the join key.


Solution

  • joinWithCassandraTable works by pulling from Cassandra (C*) only the partitions whose keys match your RDD entries, so it only works on partition keys.

    The documentation is here:

    https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable

    and the API doc is here:

    http://datastax.github.io/spark-cassandra-connector/ApiDocs/1.6.0-M2/spark-cassandra-connector/#com.datastax.spark.connector.RDDFunctions

    The joinWithCassandraTable (jWCT) method can be used without the fluent API by passing all the arguments to the method directly:

    def joinWithCassandraTable[R](
      keyspaceName: String, 
      tableName: String, 
      selectedColumns: ColumnSelector = AllColumns, 
      joinColumns: ColumnSelector = PartitionKeyColumns)
    

    The fluent API can also be used:

    joinWithCassandraTable[R](keyspace, tableName).select(AllColumns).on(PartitionKeyColumns)
    

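    These two calls are equivalent.

    Your example

    RDD.joinWithCassandraTable(KEYSPACE, tablename, SomeColumns("cola","colb")).on(SomeColumns("colc"))
    

    uses each object in the RDD as the key to join against colc of tablename, and returns only cola and colb from the matching rows. So the join key goes in .on(...) (or in the joinColumns argument), and because the join works on partition keys, colc has to be the partition key of tablename.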
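
    To make this concrete, here is a minimal sketch of the whole flow. It is not taken from the connector docs. The keyspace "my_keyspace", the table "my_table", the connection host, the JoinKey case class and the sample rows are all hypothetical stand-ins, and it assumes colc is a text column and the partition key of the table. It needs Spark, the spark-cassandra-connector and a reachable Cassandra node to run.

    ```scala
    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical key type. Its field name matches the join column "colc",
    // which is assumed to be the partition key of the Cassandra table.
    case class JoinKey(colc: String)

    object JoinWithCassandraTableExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("jwct-example")
          .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host
        val sc = new SparkContext(conf)

        // Stand-in for the 5-column RDD from the question (made-up data).
        val rows = sc.parallelize(Seq(
          ("k1", 1, 2, 3, 4),
          ("k2", 5, 6, 7, 8)))

        // Keep only the value that should be matched against colc.
        val keys = rows.map { case (k, _, _, _, _) => JoinKey(k) }

        // Third argument: columns to return. on(...): column(s) to join on.
        val joined = keys
          .joinWithCassandraTable("my_keyspace", "my_table", SomeColumns("cola", "colb"))
          .on(SomeColumns("colc"))

        // Each result is a pair: (element of the left RDD, matching Cassandra row).
        joined.collect().foreach { case (key, row) =>
          println(s"${key.colc} -> $row")
        }

        sc.stop()
      }
    }
    ```

    Going by the signature above, the same join could presumably also be written without .on by passing SomeColumns("colc") as the fourth (joinColumns) argument.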