I'm using a Databricks notebook and I'm trying to use joinWithCassandraTable to join an RDD with a Cassandra table.
The RDD is just a list of primary keys, and I connect to Cassandra using the CassandraConnector as follows:
sparkConf.set("spark.cassandra.connection.localDC", dc)
sparkConf.set("spark.cassandra.connection.host", ip)
sparkConf.set("spark.cassandra.auth.username", cassandraUsername)
sparkConf.set("spark.cassandra.auth.password", cassandraPassword)
val cassandraConnector = CassandraConnector(sparkConf)
However, when I try to use joinWithCassandraTable, the host I set in the SparkConf does not seem to take effect; it still tries to connect to "localhost:9042".
Maybe that's because joinWithCassandraTable uses the shared Spark context of the notebook, and config changes made to the shared context after startup take no effect. Please advise on how to make joinWithCassandraTable use the correct config.
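For reference, the join call looks roughly like this (keyspace, table name, and keys are placeholders, and I'm assuming the table's partition key is a single text column):

import com.datastax.spark.connector._

// RDD of primary-key tuples to look up in the Cassandra table
val keysRdd = sc.parallelize(Seq(Tuple1("id-1"), Tuple1("id-2")))
val joined = keysRdd.joinWithCassandraTable("my_keyspace", "my_table")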
That's because joinWithCassandraTable always uses the built-in Spark context of the Databricks notebook, which is initialized before the notebook is created. There is no way to change the Spark conf after that.
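A SparkConf is only read when the Spark context is constructed, so mutating a conf object afterwards does not reconfigure the running context. You can see this from the notebook (illustrative; "localhost" is the connector's default):

// The running context still holds the conf it was started with, so the
// host set after startup is absent and the connector falls back to localhost
println(sc.getConf.get("spark.cassandra.connection.host", "localhost"))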
The correct way to use this function is to follow the guide at https://docs.databricks.com/data/data-sources/cassandra.html: create a /databricks/init/$sparkClusterName/cassandra.sh script and put all your Cassandra-related configs in it. This script runs before the notebook is created, so the built-in Spark context picks up your settings.
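A minimal sketch of such an init script, written from a notebook cell with dbutils.fs.put. The driver conf path and the [driver] block format follow the pattern from the linked guide; the cluster name, host, DC, and credentials are placeholders you must replace:

// Assumption: cluster-scoped init scripts under /databricks/init/<clusterName>/
// run before the driver JVM starts, so the conf lands in the built-in context.
val sparkClusterName = "my-cluster" // placeholder: your cluster's name

dbutils.fs.put(s"/databricks/init/$sparkClusterName/cassandra.sh",
  """#!/usr/bin/env bash
    |# Write the Cassandra connector settings into the driver's Spark conf.
    |# Path and [driver] syntax are assumptions based on the linked guide.
    |cat > /databricks/driver/conf/00-cassandra.conf << 'EOF'
    |[driver] {
    |  "spark.cassandra.connection.host" = "<cassandra-ip>"
    |  "spark.cassandra.connection.localDC" = "<dc>"
    |  "spark.cassandra.auth.username" = "<username>"
    |  "spark.cassandra.auth.password" = "<password>"
    |}
    |EOF
    |""".stripMargin, true)

After writing the script, restart the cluster so the init script runs and the new context starts with these settings.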