Tags: apache-spark, pyspark, amazon-redshift, amazon-emr, apache-zeppelin

pyspark code working in console but not in zeppelin


I have an EMR cluster (emr-5.28.0) with Spark 2.4.4 and Python 2.7.16.

If I ssh to the cluster and execute pyspark like this:

pyspark --jars /home/hadoop/jar/spark-redshift_2.11-2.0.1.jar,/home/hadoop/jar/spark-avro_2.11-4.0.0.jar,/home/hadoop/jar/minimal-json-0.9.5.jar,/usr/share/aws/redshift/jdbc/RedshiftJDBC.jar --packages org.apache.spark:spark-avro_2.11:2.4.4

and execute this code:

url = "jdbc:redshift://my.cluster:5439/my_db?user=my_user&password=my_password"
query = "select * from schema.table where trunc(timestamp)='2019-09-10'"
df = sqlContext.read.format('com.databricks.spark.redshift')\
.option("url", url)\
.option("tempdir", "s3a://bucket/tmp_folder")\
.option("query", query)\
.option("aws_iam_role", "arn_iam_role")\
.load()

Everything works fine and I can work with that df. But if I open a Zeppelin notebook on the same EMR cluster, with the same versions of everything, and execute a cell with:

%dep
z.load("/home/hadoop/jar/spark-redshift_2.11-2.0.1.jar")
z.load("/home/hadoop/jar/spark-avro_2.11-4.0.0.jar")
z.load("/home/hadoop/jar/minimal-json-0.9.5.jar")
z.load("/usr/share/aws/redshift/jdbc/RedshiftJDBC.jar")
z.load("org.apache.spark:spark-avro_2.11:2.4.4")

and then run the same piece of code in the next cell (starting with %pyspark), when I try to do a df.count() I get the following error:

java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

I've tried restarting the interpreter several times, and I've tried adding the --jars options I use in the console over SSH to the interpreter args, but no luck. Any ideas?


Solution

  • I think this is an issue with how z.load works (or rather, doesn't work) for PySpark queries.

    Instead of loading your dependencies this way, go to Settings -> Interpreters, find the pyspark interpreter, add your dependencies there, then restart the interpreter. This is the 'Zeppelin version' of --jars; an equivalent config-file approach is sketched after this answer.

    Here's the official docs link: https://zeppelin.apache.org/docs/0.6.2/manual/dependencymanagement.html

    I know that for Spark SQL z.deps doesn't work, so this may be the same issue.
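
    If you prefer to keep the dependencies in a config file rather than the interpreter UI, a minimal sketch of an alternative is to pass the same --jars/--packages flags through SPARK_SUBMIT_OPTIONS in zeppelin-env.sh and restart the Zeppelin service (the file path below is an assumption for a default EMR install; adjust it for your setup):

    # /etc/zeppelin/conf/zeppelin-env.sh (path may differ on your cluster)
    export SPARK_SUBMIT_OPTIONS="--jars /home/hadoop/jar/spark-redshift_2.11-2.0.1.jar,/home/hadoop/jar/spark-avro_2.11-4.0.0.jar,/home/hadoop/jar/minimal-json-0.9.5.jar,/usr/share/aws/redshift/jdbc/RedshiftJDBC.jar --packages org.apache.spark:spark-avro_2.11:2.4.4"

    Either way the jars get shipped to the executors as well as the driver. That ClassCastException on count() usually points to the classes being visible on the driver but not on the executors, which would fit the %dep/z.load route failing here.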