apache-spark pyspark amazon-emr apache-sedona

Apache Sedona on EMR version > 6.9.0: JavaPackage object is not callable

I am trying to run Apache Sedona 1.5.3 for Spark 3.4, on AWS EMR.

After following the instructions, I get this error:

  File "/usr/local/lib64/python3.7/site-packages/sedona/sql/dataframe_api.py", line 156, in validated_function
    return f(*args, **kwargs)
  File "/usr/local/lib64/python3.7/site-packages/sedona/sql/st_constructors.py", line 125, in ST_GeomFromWKT
    return _call_constructor_function("ST_GeomFromWKT", args)
  File "/usr/local/lib64/python3.7/site-packages/sedona/sql/dataframe_api.py", line 65, in call_sedona_function
    jc = jfunc(*args)
TypeError: 'JavaPackage' object is not callable

This generally means that the jar can't be found or loaded, or is the wrong version.
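For context, py4j returns a JavaPackage placeholder for any dotted name it cannot resolve to a JVM class, so the error can come from a class that is missing from the classpath or from a short name that was never imported into the JVM view. A minimal sketch of a check that helps tell the two apart: the helper name is hypothetical, and the class in the usage comment is the Kryo registrator from my Spark config, looked up by its fully qualified name.

```python
def is_unresolved_jvm_name(jvm_obj):
    """Return True if py4j handed back a JavaPackage placeholder.

    py4j yields a JavaPackage for names it cannot resolve to a class and a
    JavaClass for names it can, so comparing the type name is enough and
    avoids importing py4j here.
    """
    return type(jvm_obj).__name__ == "JavaPackage"


# Usage inside a PySpark job (not run here; needs a live SparkSession):
#   jvm = spark.sparkContext._jvm
#   cls = jvm.org.apache.sedona.core.serde.SedonaKryoRegistrator
#   print(is_unresolved_jvm_name(cls))
# True for a fully qualified class name would suggest the jar is not loaded;
# False would suggest the jar is present and the problem lies elsewhere.
```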

The instructions linked above were tested on EMR 6.9.0 with Spark 3.3.0; I am using EMR 6.14 in order to get Spark 3.4. However, the instructions do note:

If you are using Spark 3.4+ and Scala 2.12, please use sedona-spark-shaded-3.4_2.12. Please pay attention to the Spark version postfix and Scala version postfix.

So it seems like Spark 3.4 should be OK. (Failing a response here, I'll try reverting to 3.3.0.)

In the Spark config, I specified,

"spark.yarn.dist.jars": "/jars/sedona-spark-shaded-3.4_2.12-1.5.3.jar,/jars/geotools-wrapper-1.5.3-28.2.jar",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
"spark.sql.extensions": "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"

which I think are the correct versions. I may be confused, since the instructions for Spark 3.3.0 use sedona-spark-shaded-3.0_2.12-1.5.3.jar. Perhaps I don't understand what the 3.0/3.4 postfixes refer to. But based on this answer from the Sedona developers, I think I am setting this correctly.
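To make the naming concrete, here is a rough sketch that splits a Sedona jar file name into its Spark postfix, Scala postfix, and Sedona version. The function names are hypothetical, and reading the Spark postfix as a minimum Spark version is my assumption, based on the "Spark 3.4+" wording quoted above.

```python
import re

# Matches names such as sedona-spark-shaded-3.4_2.12-1.5.3.jar
_JAR_RE = re.compile(
    r"sedona-spark-(?P<shaded>shaded-)?"
    r"(?P<spark>\d+\.\d+)_(?P<scala>\d+\.\d+)-(?P<sedona>[\w.]+)\.jar$"
)


def parse_sedona_jar(name):
    """Split a Sedona jar file name (or path) into its version parts."""
    match = _JAR_RE.search(name)
    if match is None:
        raise ValueError(f"not a recognised Sedona jar name: {name!r}")
    return {
        "shaded": match.group("shaded") is not None,
        "spark": match.group("spark"),    # Spark version postfix
        "scala": match.group("scala"),    # Scala version postfix
        "sedona": match.group("sedona"),  # Sedona release
    }


def postfix_allows(jar_spark, cluster_spark):
    """Assumption: the postfix is a minimum major.minor Spark version."""
    def as_tuple(version):
        return tuple(int(part) for part in version.split(".")[:2])
    return as_tuple(cluster_spark) >= as_tuple(jar_spark)


info = parse_sedona_jar("/jars/sedona-spark-shaded-3.4_2.12-1.5.3.jar")
print(info)
# "3.4.0" is only an example value; in a job, pass spark.version instead.
print(postfix_allows(info["spark"], "3.4.0"))
```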

My bootstrap script downloads the jars and pip-installs the required packages. At the end, I ran ls -lh to confirm that they are there. I also copied them to the hadoop user's home directory, to try them with hadoop rather than root as the owner (see below):

-rw-r--r-- 1 root root 29M May 10 15:49 /jars/geotools-wrapper-1.5.3-28.2.jar
-rw-r--r-- 1 root root 21M May 10 15:49 /jars/sedona-spark-shaded-3.4_2.12-1.5.3.jar
-rw-rw-r-- 1 hadoop hadoop 29M May 10 15:49 /home/hadoop/custom/geotools-wrapper-1.5.3-28.2.jar
-rw-rw-r-- 1 hadoop hadoop 21M May 10 15:49 /home/hadoop/custom/sedona-spark-shaded-3.4_2.12-1.5.3.jar

In the pyspark script, I logged sedona.version to confirm that it is 1.5.3, and spark.sparkContext._conf.getAll() to confirm that it includes

('spark.yarn.dist.jars', 'file:/jars/sedona-spark-shaded-3.4_2.12-1.5.3.jar,file:/jars/geotools-wrapper-1.5.3-28.2.jar')
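The same check can be scripted rather than eyeballed. A small sketch, assuming the input is the list of (key, value) pairs returned by getAll(); the function name is hypothetical, and it only verifies that local paths exist on the host where it runs (the driver), not that YARN shipped them to the executors.

```python
import os
from urllib.parse import urlparse


def missing_dist_jars(conf_items, key="spark.yarn.dist.jars"):
    """Return local paths listed under `key` that do not exist on this host."""
    value = dict(conf_items).get(key, "")
    missing = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        parsed = urlparse(entry)
        if parsed.scheme not in ("", "file"):
            # Remote entries (for example s3 or hdfs) cannot be checked locally.
            continue
        if not os.path.exists(parsed.path):
            missing.append(parsed.path)
    return missing


# Usage inside a PySpark job (not run here; needs a live SparkSession):
#   print(missing_dist_jars(spark.sparkContext._conf.getAll()))
```

An empty list means every local jar named in the setting was found on the driver host.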

Other stuff I tried.

Since the files in /jars are owned by root rather than hadoop, I also copied the jars to a local directory under /home/hadoop/. In previous versions of Sedona, I had used spark.driver.extraClassPath and spark.executor.extraClassPath to point to the downloaded jars, so I tried that too. Finally, I tried just using --jars, even though this post said

spark.jars property is ignored for EMR on EC2 since it uses Yarn to deploy jars. See SEDONA-183

None of these things worked: they all gave the same 'JavaPackage' object is not callable error.

Semi-relevant tests.

Local tests with

from sedona.spark import SedonaContext

packages = [
    "org.apache.sedona:sedona-spark-3.4_2.12:1.5.3",
    "org.datasyslab:geotools-wrapper:1.5.3-28.2",
]
conf = (
    SedonaContext.builder()
    .config("spark.jars.packages", ",".join(packages))
    [...]
    .config(
        "spark.jars.repositories",
        "https://artifacts.unidata.ucar.edu/repository/unidata-all",
    )
    .getOrCreate()
)
spark = SedonaContext.create(conf)

all run fine. Note the unshaded jar, in that context, as directed:

If you run Sedona in an IDE or a local Jupyter notebook, use the unshaded jar.

Further, in the EMR version I don't create the Spark session myself, so I assumed that the SedonaContext setup "works out" on its own. I removed the SedonaRegistrator.registerAll(spark) call, which is now deprecated and is not needed in the local version above.

Solution

I believe that you need to replace SedonaRegistrator, not simply remove it. So you can just run

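from sedona.spark import SedonaContext

# Reuse (or create) the session, then register Sedona's functions on it
sedona_conf = SedonaContext.builder().getOrCreate()
spark = SedonaContext.create(sedona_conf)
test = spark.sql("SELECT ST_Point(double(1.2345), 2.3456)")
test.show()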

in your EMR job, as you did locally.