I am trying to use Spark to do some simple computations on Cassandra tables, but I am quite lost.
I am trying to follow: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md
So I'm starting the PySpark shell with:
./bin/pyspark \
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3
But I am not sure how to set things up from here. How do I let Spark know where my Cassandra cluster is? I've seen that CassandraSQLContext can be used for this, but I have also read that it is deprecated.
I have read this: How to connect spark with cassandra using spark-cassandra-connector?
But if I use
import com.datastax.spark.connector._
Python says that it can't find the module. Can someone maybe point me in the right direction on how to set things up properly?
The Cassandra connector doesn't provide any Python modules. All functionality is exposed through the Data Source API, and as long as the required jars are present, everything should work out of the box.
How do I let Spark know where my Cassandra cluster is?
Use the spark.cassandra.connection.host property. You can, for example, pass it as an argument to spark-submit / pyspark:
pyspark ... --conf spark.cassandra.connection.host=x.y.z.v
or set it in your configuration:
(SparkSession.builder
    .config("spark.cassandra.connection.host", "x.y.z.v"))
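As a rough sketch of the configuration route in a standalone script, the properties can be collected in a plain dictionary and applied to the builder. The helper names `cassandra_conf` and `build_session` and the application name are made up for illustration. The port setting `spark.cassandra.connection.port` is included on the assumption that the cluster listens on 9042, Cassandra's default native port. The connector jar still has to be on the classpath, for example via `--packages` as in the question.

```python
def cassandra_conf(host, port=9042):
    """Return the Spark properties that point the connector at a cluster."""
    return {
        "spark.cassandra.connection.host": host,
        # Assumption: 9042 is the cluster's native transport port.
        "spark.cassandra.connection.port": str(port),
    }


def build_session(host, app_name="cassandra-example"):
    """Create (or reuse) a SparkSession configured for the given host."""
    # Imported lazily so cassandra_conf() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in cassandra_conf(host).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In the PySpark shell a session already exists as `spark`, so there the `--conf` flag is probably the simpler choice; `build_session("x.y.z.v")` is meant for scripts run with spark-submit.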
Settings such as the table name or keyspace can be set directly on the reader:
(spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="kv", keyspace="test", cluster="cluster")
    .load())
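The reader options can be wrapped in a small helper, sketched below. `load_table` is a hypothetical name, and the `test` keyspace and `kv` table are just the placeholders from the snippet above. The function relies only on the standard `DataFrameReader` methods `format`, `options` and `load`.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def load_table(spark, keyspace, table):
    """Load one Cassandra table as a DataFrame through the Data Source API."""
    return (
        spark.read
        .format(CASSANDRA_FORMAT)
        .options(table=table, keyspace=keyspace)
        .load()
    )
```

With a configured session, `df = load_table(spark, "test", "kv")` should give an ordinary DataFrame, so calls such as `df.printSchema()` or `df.count()` behave as they would for any other source.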
From there you can follow the regular DataFrame documentation.
As a side note
import com.datastax.spark.connector._
is Scala syntax; it happens to be syntactically valid Python only by accident, which is why Python reports a missing module rather than a syntax error.
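A quick way to see this behaviour, assuming no unrelated Python package named `com` is installed, is to parse the statement and then attempt the import separately. The variable names here are illustrative only.

```python
import ast
import importlib

SCALA_IMPORT = "import com.datastax.spark.connector._"

# The statement parses, because `_` is an ordinary Python identifier.
tree = ast.parse(SCALA_IMPORT)
parsed_ok = isinstance(tree.body[0], ast.Import)

# Importing fails: the connector is a JVM library and installs no Python
# package called `com` (assumption: nothing else on the path provides one).
try:
    importlib.import_module("com.datastax.spark.connector._")
    import_failed = False
except ImportError:
    import_failed = True

print(parsed_ok, import_failed)
```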