I'm new to Cassandra and PySpark. Initially I installed Cassandra 3.11.1, OpenJDK 1.8, PySpark 3.x, and Scala 2.12. After running my Python server I was getting a lot of errors, as shown below.
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.NoClassDefFoundError: scala/Product$class
at com.datastax.spark.connector.util.ConfigParameter.<init>(ConfigParameter.scala:7)
at com.datastax.spark.connector.rdd.ReadConf$.<init>(ReadConf.scala:33)
at com.datastax.spark.connector.rdd.ReadConf$.<clinit>(ReadConf.scala)
at org.apache.spark.sql.cassandra.DefaultSource$.<init>(DefaultSource.scala:134)
at org.apache.spark.sql.cassandra.DefaultSource$.<clinit>(DefaultSource.scala)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:55)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 23 more
I didn't know exactly what this error meant, but after some research I realized that the PySpark–Cassandra connection was the problem, so I also checked the versions. During my research I read that Cassandra versions other than 4.x are not compatible with Python 3.9. I uninstalled Cassandra and tried to install a Cassandra 4 distribution instead, but that threw another set of errors after running the command:
wget http://mirror.cogentco.com/pub/apache/cassandra/4.0-beta2/apache-cassandra-4.0-beta2-bin.tar.gz
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
cassandra : Depends: python3 (>= 3.6) but 3.5.1-3 is to be installed
Recommends: ntp but it is not going to be installed or
time-daemon
E: Unable to correct problems, you have held broken packages.
Can someone help me understand the issue? How can I install Cassandra and PySpark along with Python 3.9? Is there a version incompatibility here?
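For reference, here is a minimal sketch of how the versions in play can be checked from Python, assuming PySpark itself starts and a local session can be created. The scala.Product$class class only exists up to Scala 2.11, so the traceback above is the typical symptom of a connector JAR built for a different Scala version than the one Spark itself was built with:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

print("Python:", sys.version.split()[0])
print("Spark:", spark.version)
# Scala version the Spark JVM was built against; the connector's
# artifact suffix (_2.11 vs _2.12) must match this.
print("Scala:", spark.sparkContext._jvm.scala.util.Properties.versionString())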
Update (based on the answer below):
I have updated my versions on another machine:
Currently I'm using the following versions: PySpark 3.0.1, Cassandra 4.0, cqlsh 5.0.1, Python 3.6, Scala 2.12.
I tried using connector 3.0.0 as well as 3.1.0; both give me errors:
UNRESOLVED DEPENDENCY: com.datastax.spark#spark-cassandra-connector_2.12;3.0.0: not found
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.datastax.spark#spark-cassandra-connector_2.12;3.0.0: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1389)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
.......
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
Submit arguments used (since the PySpark version is now 3.0.1): --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell
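The trailing "pyspark-shell" token suggests these arguments are being passed through the PYSPARK_SUBMIT_ARGS environment variable. A minimal sketch of that approach (the host value is whatever your local node uses):

import os

# Must be set before the first SparkSession/SparkContext is created;
# otherwise the already-started JVM gateway ignores --packages.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 "
    "--conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell"
)

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()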
You are using the wrong version of the Cassandra connector: if you are using pyspark 3.x, then you need the corresponding connector version, 3.0 or 3.1. Your version is for old versions of Spark:
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0
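A minimal sketch of reading a Cassandra table with that connector from a standalone script (my_keyspace and my_table are placeholder names for your own schema); the packages coordinate can also be set on the session builder instead of the command line:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-read")
    # Pulls the connector from Maven Central at startup
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# my_keyspace / my_table are placeholders
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")
      .load())
df.show()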
P.S. Cassandra 4.0 has also been released already, so it makes no sense to use beta2.