apache-spark pyspark apache-sedona

Receiving "scala.MatchError" when running an SQL query in a PySpark application with Apache Sedona, possibly caused by incompatible versions


I am creating a PySpark application to do the following:

  • Read a collection of points from a CSV file
  • Read a collection of polygons from a CSV file
  • Run an SQL query using ST_Intersects to determine where they overlap

Unfortunately, I am getting an error that looks something like this:

Py4JJavaError: An error occurred while calling o1539.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 56 in stage 255.0 failed 4 times, most recent failure: Lost task 56.3 in stage 255.0 (TID 874) (node5 executor 2): scala.MatchError: 40.74207644618617 (of class org.apache.spark.unsafe.types.UTF8String)
...
Caused by: scala.MatchError: 40.74207644618617 (of class org.apache.spark.unsafe.types.UTF8String)
    at org.apache.spark.sql.sedona_sql.expressions.ST_Point.eval(Constructors.scala:315)

After a lot of searching, the best suggestion I could find was that the versions of Spark, PySpark and Sedona that I am using may be incompatible with each other. I have been trying to use the latest versions of all these applications:

  • Hadoop 3.3.6 (on all nodes)
  • Java JDK 11.0.1 (This is the latest version supported by Hadoop)
  • Spark 3.5.0
  • Anaconda3-2023.09-0
  • PySpark 3.4.1
  • Apache Sedona 1.5
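One thing the list above makes visible is that the Spark and PySpark entries differ in their minor version (3.5.0 versus 3.4.1). A minimal sketch of a check for that, using only the standard library, is shown here. The function name `same_minor_version` is hypothetical, and the version strings are copied from the list rather than read from a live installation:

```python
def same_minor_version(a: str, b: str) -> bool:
    """Return True when two dotted version strings share the same major.minor."""
    return a.split(".")[:2] == b.split(".")[:2]


# Versions copied from the list above (assumed, not queried from a cluster).
spark_version = "3.5.0"
pyspark_version = "3.4.1"

if not same_minor_version(spark_version, pyspark_version):
    print(f"Spark {spark_version} and PySpark {pyspark_version} differ in minor version")
```

In a live session the two strings could instead come from `spark.version` and `pyspark.__version__`.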

Solution

  • It turns out that PySpark 3.5 uses Scala 2.12 (unless you specifically install the Scala 2.13 build), and I was using the Sedona jar file built for Scala 2.13.

    To clarify: when you download Sedona, you will see jar files for both Scala versions. Make sure that you copy the jar file matching your Scala version into your $SPARK_HOME/jars directory. Your Python code should look something like this when setting up your SparkSession:

    .config('spark.jars.packages',
            'org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.0,'
            'org.datasyslab:geotools-wrapper:1.5.0-28.2')
    

    Unfortunately, I am now getting a different error when running an SQL query with ST_Intersects, but I will ask a separate question since it seems unrelated to the Scala version.
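The jar-matching step described above can be checked mechanically. The following is a rough sketch, not an official Sedona or Spark tool: it assumes the Spark distribution keeps a `scala-library-<version>.jar` file in its jars directory and that the Sedona Maven artifact name ends in `_<scala version>`. The helper names `scala_binary_version` and `artifact_scala_suffix` are made up for illustration.

```python
import os
import re
from pathlib import Path
from typing import Optional


def scala_binary_version(jars_dir: str) -> Optional[str]:
    """Read the Scala binary version (e.g. '2.12') from scala-library-*.jar."""
    for jar in Path(jars_dir).glob("scala-library-*.jar"):
        match = re.match(r"scala-library-(\d+\.\d+)\.", jar.name)
        if match:
            return match.group(1)
    return None


def artifact_scala_suffix(coordinate: str) -> Optional[str]:
    """Extract the Scala suffix from a Maven coordinate such as group:name_2.12:version."""
    parts = coordinate.split(":")
    if len(parts) < 2:
        return None
    match = re.search(r"_(\d+\.\d+)$", parts[1])
    return match.group(1) if match else None


# Coordinate taken from the answer above; SPARK_HOME is assumed to be set.
coordinate = "org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.0"
spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    spark_scala = scala_binary_version(os.path.join(spark_home, "jars"))
    sedona_scala = artifact_scala_suffix(coordinate)
    if spark_scala and sedona_scala and spark_scala != sedona_scala:
        print(f"Mismatch: Spark uses Scala {spark_scala}, Sedona jar targets {sedona_scala}")
```

If the two values differ, switching to the Sedona artifact with the other Scala suffix is the fix described in this answer.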