Tags: python-3.x, apache-spark, pyspark, cassandra, spark-cassandra-connector

Cassandra with PySpark and Python >=3.6


I'm new to Cassandra and PySpark. Initially I installed Cassandra 3.11.1, OpenJDK 1.8, PySpark 3.x, and Scala 2.12. After running my Python server I was getting a lot of errors, as shown below:

raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.NoClassDefFoundError: scala/Product$class
        at com.datastax.spark.connector.util.ConfigParameter.<init>(ConfigParameter.scala:7)
        at com.datastax.spark.connector.rdd.ReadConf$.<init>(ReadConf.scala:33)
        at com.datastax.spark.connector.rdd.ReadConf$.<clinit>(ReadConf.scala)
        at org.apache.spark.sql.cassandra.DefaultSource$.<init>(DefaultSource.scala:134)
        at org.apache.spark.sql.cassandra.DefaultSource$.<clinit>(DefaultSource.scala)
        at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:55)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        ... 23 more
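
The telltale frame is "Caused by: java.lang.ClassNotFoundException: scala.Product$class". That class only exists in Scala 2.11 bytecode (Scala 2.12 compiles traits to default methods instead), so a connector built for Scala 2.11 is being loaded into a Spark runtime built on Scala 2.12. A quick way to check which Scala version Spark actually runs is the sketch below (spark._jvm is PySpark's internal py4j gateway, not a public API, so treat it as a diagnostic hack):

    from pyspark.sql import SparkSession

    # Plain local session with no connector attached, just to inspect versions.
    spark = SparkSession.builder.master("local[*]").appName("version-check").getOrCreate()

    print("Spark:", spark.version)
    # Scala version the Spark JVM was built with; the connector artifact
    # suffix (_2.11 vs _2.12) has to match this.
    print("Scala:", spark.sparkContext._jvm.scala.util.Properties.versionString())
    spark.stop()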

At first I didn't know what this error meant, but after some research I realized that the PySpark-Cassandra connection was the problem, so I checked the versions as well. During my research I read that Cassandra versions other than 4.x are not compatible with Python 3.9. So I uninstalled Cassandra and tried to install the Cassandra 4 distribution instead, but that throws another set of errors after running the command:

wget http://mirror.cogentco.com/pub/apache/cassandra/4.0-beta2/apache-cassandra-4.0-beta2-bin.tar.gz

Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 cassandra : Depends: python3 (>= 3.6) but 3.5.1-3 is to be installed
             Recommends: ntp but it is not going to be installed or
                         time-daemon
E: Unable to correct problems, you have held broken packages.

Can someone help me understand the issue? How can I install Cassandra and PySpark along with Python 3.9? Is there a version incompatibility here?

Updating the question based on the answer below:

I have updated my versions on another machine:

Currently, I'm using the following versions: PySpark 3.0.1, Cassandra 4.0, cqlsh 5.0.1, Python 3.6, Scala 2.12.

I tried using connector versions 3.0.0 and 3.1.0; both give me errors:

UNRESOLVED DEPENDENCY: com.datastax.spark#spark-cassandra-connector_2.12;3.0.0: not found


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.datastax.spark#spark-cassandra-connector_2.12;3.0.0: not found]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1389)
        at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


.......
        raise Exception("Java gateway process exited before sending its port number")
    Exception: Java gateway process exited before sending its port number

Submit arguments used (via PYSPARK_SUBMIT_ARGS): --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell, since the PySpark version is now 3.0.1.
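
For context, this is a minimal sketch of how those arguments are wired up when PySpark is launched from a standalone script (the app name is a placeholder). Note that if the --packages coordinate cannot be resolved, spark-submit exits and PySpark reports exactly the "Java gateway process exited before sending its port number" error shown above:

    import os
    from pyspark.sql import SparkSession

    # PYSPARK_SUBMIT_ARGS must be set before the first SparkSession is created,
    # because PySpark hands it to spark-submit when launching the JVM gateway.
    # If the --packages coordinate cannot be resolved, spark-submit exits and
    # PySpark raises "Java gateway process exited before sending its port number".
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 "
        "--conf spark.cassandra.connection.host=127.0.0.1 "
        "pyspark-shell"
    )

    spark = SparkSession.builder.appName("cassandra-app").getOrCreate()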


Solution

  • You are using the wrong version of the Cassandra connector - if you are using pyspark 3.x, then you need the corresponding connector version - 3.0 or 3.1. The version you specified is for old versions of Spark:

    pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0
    

    P.S. Cassandra 4.0 has also been released already - there is no point in using beta2.
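
    Once the matching connector resolves, a read from Cassandra typically looks like the sketch below. spark.jars.packages is the in-code equivalent of --packages; test_ks and test_table are placeholder names for your own keyspace and table:

        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("cassandra-read")
            # In-code equivalent of --packages: resolved from Maven Central at startup.
            .config("spark.jars.packages",
                    "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0")
            .config("spark.cassandra.connection.host", "127.0.0.1")
            .getOrCreate()
        )

        # test_ks / test_table are placeholders for an existing keyspace and table.
        df = (
            spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="test_ks", table="test_table")
            .load()
        )
        df.show()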