Tags: scala, apache-spark, cassandra, classnotfoundexception, spark-cassandra-connector

Spark running on Cassandra fails due to ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition (details inside)


I'm using Spark 2.0.0 (local standalone) and spark-cassandra-connector 2.0.0-M1 with Scala 2.11. I am working on a project in an IDE, and every time I run Spark commands I get

ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

My build.sbt file

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-M1"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"

So essentially it comes down to this error message:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 13, 192.168.0.12): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition

The thing is, if I run the spark-shell with the spark-cassandra-connector loaded via

$ ./spark-shell --jars /home/Applications/spark-2.0.0-bin-hadoop2.7/spark-cassandra-connector-assembly-2.0.0-M1-22-gab4eda2.jar

I can work with Spark and Cassandra with no error messages getting in the way.
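
For reference, here is a minimal sketch of the kind of code that works fine in that spark-shell session but triggers the exception when run from the IDE (the keyspace, table, and connection host below are just placeholders, not my real ones):

import com.datastax.spark.connector._

// connection host set via --conf spark.cassandra.connection.host=127.0.0.1 (placeholder)
val rdd = sc.cassandraTable("my_keyspace", "my_table")
println(rdd.count())  // forces tasks that deserialize CassandraPartition on the executors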

Any clue on how to troubleshoot this strange incompatibility?

Edit:

This is interesting: from the worker node's perspective, when I run a program the connector gives

`java.io.InvalidClassException: com.datastax.spark.connector.rdd.CassandraTableScanRDD; local class incompatible: stream classdesc serialVersionUID = 1517205208424539072, local class serialVersionUID = 6631934706192455668`

That's what ultimately produces the ClassNotFoundException (the class doesn't bind because of the clash). But the project has only ever used Spark 2.0, connector 2.0, and Scala 2.11, so there shouldn't be a version incompatibility anywhere.


Solution

  • In Spark, just because you build against a library does not mean it will be included on the runtime classpath. If you added

    --jars  /home/Applications/spark-2.0.0-bin-hadoop2.7/spark-cassandra-connector-assembly-2.0.0-M1-22-gab4eda2.jar
    

    to your spark-submit for your application, it would include all of those necessary libraries at runtime and on all of the remote JVMs (see the sketch below).

    So basically, what you are seeing is that in the first example none of the connector libraries are on the runtime classpath, while in the spark-shell example they are.
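
    For illustration, a rough sketch of that spark-submit invocation (the main class, master URL, and application jar path are placeholders; the --jars path is the assembly jar from the question):

    $ ./spark-submit \
        --class com.example.MyApp \
        --master spark://192.168.0.12:7077 \
        --jars /home/Applications/spark-2.0.0-bin-hadoop2.7/spark-cassandra-connector-assembly-2.0.0-M1-22-gab4eda2.jar \
        target/scala-2.11/my-app_2.11-0.1.jar

    Another common approach is to build the application as an assembly ("fat") jar so that the connector classes travel inside the application jar itself.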