Tags: python, amazon-web-services, apache-spark, amazon-emr, geospark

Getting GeoSpark error with upload_jars function


I'm trying to run GeoSpark on an AWS EMR cluster. The code is:

#  coding=utf-8

from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
from geospark.register import GeoSparkRegistrator
from geospark.utils import GeoSparkKryoRegistrator
from geospark.register import upload_jars

import config as cf

import yaml


if __name__ == "__main__":
    # Read files
    with open("/tmp/param.yml", 'r') as ymlfile:
        param = yaml.load(ymlfile, Loader=yaml.SafeLoader)
    
    # Register jars
    upload_jars()

    # Creation of spark session
    print("Creating Spark session")
    spark = SparkSession \
        .builder \
        .getOrCreate()
    
    GeoSparkRegistrator.registerAll(spark)

I get the following error in the upload_jars() function:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/findspark.py", line 143, in init
    py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "geo_processing.py", line 21, in <module>
    upload_jars()
  File "/usr/local/lib/python3.7/site-packages/geospark/register/uploading.py", line 39, in upload_jars
    findspark.init()
  File "/usr/local/lib/python3.7/site-packages/findspark.py", line 146, in init
    "Unable to find py4j, your SPARK_HOME may not be configured correctly"
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly

How can I solve this error?


Solution

You should remove upload_jars() from your code and load the jars another way: either copy them to SPARK_HOME (/usr/lib/spark as of emr-4.0.0) as part of an EMR bootstrap action, or pass them to your spark-submit command with the --jars option (see the sketch below).
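For illustration, here is a minimal sketch of the driver script with upload_jars() removed. The jar paths below are placeholders, not something GeoSpark prescribes; if the jars are copied into SPARK_HOME/jars by a bootstrap action, or passed with --jars at submit time, the spark.jars line can be dropped entirely:

# coding=utf-8

from pyspark.sql import SparkSession
from geospark.register import GeoSparkRegistrator

if __name__ == "__main__":
    # No upload_jars() call: the GeoSpark jars must already be available,
    # either copied onto the cluster by a bootstrap action or supplied via
    # spark-submit --jars. Alternatively, point spark.jars at them here;
    # the paths below are placeholders.
    spark = SparkSession \
        .builder \
        .config("spark.jars", "/home/hadoop/jars/geospark.jar,/home/hadoop/jars/geospark-sql.jar") \
        .getOrCreate()

    GeoSparkRegistrator.registerAll(spark)

Whichever option you choose, the jar distribution then happens outside the Python script, so the job no longer depends on findspark locating SPARK_HOME.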

Explanation

I haven't been able to get the upload_jars() function to work on a multi-node EMR cluster. According to the geospark documentation, upload_jars():

    uses findspark Python package to upload jar files to executor and nodes. To avoid copying all the time, jar files can be put in directory SPARK_HOME/jars or any other path specified in Spark config files.

Spark is installed in YARN mode on EMR, which means it is only installed on the master node, not on the core/task nodes. findspark therefore can't find Spark on the core/task nodes, and that is why you get the error Unable to find py4j, your SPARK_HOME may not be configured correctly.