Tags: apache-spark, cassandra, pyspark, spark-cassandra-connector

How to create a Spark DataFrame from a Cassandra keyspace?


I have a local installation of Cassandra. I have to work with Spark in Google Colab, and I can already run queries against my local database. However, I understand that Spark and Cassandra can be connected more efficiently. I would like to create a DataFrame from the data in a Cassandra keyspace. How do I do that?

My keyspace is called yelp_data. It contains the "reviews" and "business" tables.

In my project I would like a DataFrame df = (data from my Cassandra keyspace). I use PySpark.


Solution

  • Follow the documentation for the Spark Cassandra Connector and use spark.read with the connector's format and the table and keyspace options. Each table is loaded into its own DataFrame, like this:

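    # `spark` is an existing SparkSession with the connector on its classpath.
    # One DataFrame per Cassandra table in the yelp_data keyspace.
    reviews_df = spark.read.format("org.apache.spark.sql.cassandra")\
      .options(table="reviews", keyspace="yelp_data").load()
    business_df = spark.read.format("org.apache.spark.sql.cassandra")\
      .options(table="business", keyspace="yelp_data").load()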
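
The snippet above assumes that `spark` already exists and can find both the connector and the Cassandra node. A minimal sketch of that setup might look like the following. The helper names (`cassandra_conf`, `build_session`, `read_table`) are hypothetical, and the connector version, host, and port are assumptions that need to match your own Spark/Scala build and Cassandra instance. The host must be reachable from wherever Spark actually runs.

```python
# Assumed connector coordinates: pick the release that matches your Spark and Scala versions.
CONNECTOR_PACKAGE = "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1"


def cassandra_conf(host="127.0.0.1", port=9042):
    """Return the Spark settings that load the connector and point it at Cassandra.

    The default host and port are assumptions (a node on the same machine as Spark).
    """
    return {
        "spark.jars.packages": CONNECTOR_PACKAGE,
        "spark.cassandra.connection.host": host,
        "spark.cassandra.connection.port": str(port),
    }


def build_session(app_name="yelp", host="127.0.0.1", port=9042):
    """Create (or reuse) a SparkSession configured for the Cassandra connector."""
    # Imported here so the helpers above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in cassandra_conf(host, port).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def read_table(spark, keyspace, table):
    """Load one Cassandra table into a DataFrame through the connector."""
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .load()
    )


# Example usage (needs pyspark and a reachable Cassandra node):
#   spark = build_session()
#   reviews_df = read_table(spark, "yelp_data", "reviews")
#   business_df = read_table(spark, "yelp_data", "business")
```

Keeping the settings in a plain dictionary makes it easy to check or override the host before a session is created, since `getOrCreate()` may return an already running session that does not pick up new package settings.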